Act I of VII

Information

Everything is a bit pattern, and the meaning is always an agreement.

A computer holds nothing but bits. A bit means something only when two parties agree what it means.

The bit

The smallest decision a machine can make: on or off, present or absent, 1 or 0.

A bit holds one of two states01off · 0 · falseon · 1 · true
Every richer idea — numbers, letters, images — is built by stacking bits and agreeing what the stack means.

Bit patterns and agreements

Eight bits is a byte. The same byte can be a number, a letter, or a pixel — only the agreement decides which.

One byte read three ways01000001As a number65unsigned 8-bit intAs a letterAASCII codepointAs a pixel25% gray (65/255)
The bits are identical. The interpretation is a contract between sender and reader.

Numbers from bits

Place values

Binary works like decimal, but each column is worth twice the one to its right.

Place values for an 8-bit binary numberposition76543210weight1286432168421bits0100000164 + 1 = 65
Lit columns add their weights. 01000001 = 64 + 1 = 65.

Addition

Add column by column, right to left. When a column overflows, carry the 1.

Adding 0011 + 0101 in binarycarry00100011(3)+0101(5)1000(8)
Same mechanic as decimal long-addition; only the base differs.

Negative numbers

A 4-bit register holds 16 patterns. Two's complement parks half on each side of zero so addition just works — no special hardware for subtraction.

4-bit two's complement number line1000100110101011110011011110111100000001001000110100010101100111-8-7-6-5-4-3-2-101234567leading bit = 1 → negativeleading bit = 0 → non-negative
Subtraction reduces to addition; the same adder hardware handles both. Adding past 0111 wraps to 1000 — overflow.

Floats

Real numbers don't fit in finite bits. IEEE 754 standardizes a compromise: binary scientific notation with a sign bit, a biased exponent, and a fractional mantissa.

IEEE 754 single-precision float layoutSEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM1 bit8 bits23 bitssignexponentmantissavalue = (-1)S × 1.M × 2E − 127
The formula above covers normal values; ±0, subnormals, ±∞, and NaN use reserved exponents. Most decimal fractions (even 0.1) round on the way in.

Text from bits

ASCII

Seven bits, 128 codepoints — every English letter, digit, and control code, all in one table.

A slice of the ASCII tablecodepoint(hex)0x200x300x410x610x0A0x7F·0AaLFDELspacedigit '0'letter 'A'letter 'a'newlinedelete'A' is 0x41 = 65. 'a' is 0x61 = 97. The gap of 32 (one bit) is the case toggle.
ASCII was the first widespread agreement on what text bytes mean. It still anchors the bottom of every modern encoding.

Unicode

ASCII covers English. Unicode covers everything: ~1.1 million codepoints organized into 17 planes of 65,536 each.

The 17 Unicode planes; the BMP holds most living textPlane 0 — BMPmost living scriptsU+0000 … U+FFFFsupplementary planes (emoji, historic scripts, CJK extensions)17 × 65,536 = 1,114,112 codepoints (U+0000 … U+10FFFF)
A codepoint is a number; a glyph is a drawing. Unicode standardizes the numbers; fonts decide the drawings.

UTF-8

A codepoint can be up to 21 bits. UTF-8 packs them into 1–4 bytes, self-synchronizing, with ASCII as the no-op case.

UTF-8 byte patterns1 byte2 bytes3 bytes4 bytes0xxxxxxx110xxxxx10xxxxxx1110xxxx10xxxxxx10xxxxxx11110xxx10xxxxxx10xxxxxx10xxxxxxU+0000 … U+007F (ASCII)U+0080 … U+07FFU+0800 … U+FFFFU+10000 … U+10FFFFleading 1s = total bytes; payload = the x's; continuations always begin 10.ASCII bytes (0xxxxxxx) are valid UTF-8 unchanged — that's why UTF-8 won.
Self-synchronizing: lose your place in the stream and the next leading byte tells you how to recover.

Beyond text

Images, audio, video — same trick, different sampling. Pick a grid, pick a rate, pick a bit depth. Agree.

Three media as bit-pattern agreementsImageAudioVideogrid of pixelssamples over timeframes over time8 × 8 pxe.g. 44,100 samples / stimee.g. 24 frames / s
Sampling rate and bit depth trade fidelity for size. Information not captured at sample time cannot be recovered later.

Compression

Most data has redundancy. Compression rewrites it in fewer bits — losslessly when the original must be recoverable, lossily when perception (or the use case) can absorb the error.

Run-length encoding: a tiny lossless trickAAAAAABBBC10 bytesrun-length encoding6 A3 B1 C3 pairs · 6 byteslossless: original recovers exactly
Lossless compression exploits patterns. Lossy compression (JPEG, MP3) goes further by also discarding what perception won't catch.

Serialization

Programs hold structured data in memory. To send it across a wire or write it to disk, the structure must be flattened to bytes — both sides agreeing on the layout.

The same record encoded as JSON and as Protocol Buffersid: 65name: "A"in-memory recordJSONProtocol Buffers{"id":65,"name":"A"}20 bytes · self-describing · human-readable0841120141tag(1)65tag(2)len'A'5 bytes · schema-required · binary
Trade-offs: human-readable vs compact, self-describing vs schema-bound, easy to debug vs fast to parse. Different formats win different fights.

Standards

Every agreement on this page has a canonical specification. Cite these, not blog posts.

  • ASCII — ANSI X3.4-1986 / ISO 646. 7-bit, 128 codepoints.
  • Two's complement integers — universal in modern hardware; no separate spec, but documented in every CPU ISA reference (e.g. Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 1).
  • Floating-point — IEEE 754-2019 (revision of IEEE 754-1985 / -2008). Defines binary32, binary64, rounding modes, NaN/Inf semantics.
  • UnicodeThe Unicode Standard, current major version. Code charts, normalization, bidi, collation. Maintained by the Unicode Consortium; tracked by ISO/IEC 10646.
  • UTF-8 — RFC 3629 (obsoletes RFC 2279). Defines the byte sequences shown in the encoding diagram, including the prohibition on surrogate codepoints and overlong forms.
  • JSON — RFC 8259 / ECMA-404. Defines the grammar; JSON.stringify semantics live in ECMA-262 (JavaScript).
  • Protocol BuffersEncoding (developers.google.com/protocol-buffers/docs/encoding) and the .proto language reference. Wire format is stable across proto2/proto3; the schema language is not.
  • Run-length and other lossless schemes — RFC 1951 (DEFLATE) underpins gzip, zlib, PNG; RFC 7932 (Brotli) is the modern successor.
Going deeper

Branches that earn their own article.

  • Deep dive on IEEE 754 floating point.
  • Unicode specification and encoding details.
  • Image format internals (PNG, JPEG, WebP).
  • Audio/video codecs.
  • Information theory (Shannon, entropy).
  • Compression algorithms (Huffman, LZ family).
  • Serialization formats compared (ASN.1, MessagePack, Avro, Parquet).