Information — Act 1 | Digital Systems

A bit means nothing on its own. Meaning is an agreement layered on top: this byte holds an integer, that one holds a letter, those four hold a colour. Stack the agreements and you get numbers, text, images, audio, files, protocols. This act walks the stack from one switch up to bytes on a wire.

The same byte sits at the bottom of every interpretation. Compression and serialization don't care which interpretation you chose — they operate on bytes regardless.

The bit

A computer needs to record decisions in physical matter, but matter is noisy. The smallest distinction that survives noise is a two-state one: voltage above or below a threshold, magnetic domain pointing one way or the other, optical pit present or absent. A bit is that one binary decision, written 0 or 1. The substrate doesn't matter; the distinction does.

One bit alone is rarely useful. Stack n bits and you get 2ⁿ patterns: 8 bits → 256, 32 bits → about 4.3 billion, 64 bits → about 1.8 × 10¹⁹. Doubling the bit count squares the state space. Almost every storage and addressing decision in computing reduces to "how many bits do I spend here, and what does that buy?"

Every richer idea — numbers, letters, images — is built by stacking bits and agreeing what the stack means.

Bytes and agreements

A program needs a unit larger than the bit but small enough to handle individually. Eight bits is the universal answer: the byte. It holds 256 values — enough for every English letter and digit, and exactly the unit most CPUs and memory buses address directly.

A byte by itself still has no meaning. The bits 01000001 are the number 65, the letter A, a 25%-grey pixel, or a CPU opcode — depending on which decoder reads them. The agreement lives in file headers, type systems, and protocol specs.

Mismatch the agreement and the bytes don't fail loudly; they decode wrong. A JPEG opened as text shows mojibake. An integer field read as a float gives a nonsense number with no warning. Most "corrupted file" symptoms are agreement mismatches in disguise.

The bits are identical. The interpretation is a contract — encoded in file headers, type systems, network protocols. Mismatched contracts produce garbage: text fed to a JPEG decoder, integers read as floats.

Numbers from bits

Place values

The first thing we want from bits is to count with them. Binary works exactly like decimal — each column has a place value — but the base is 2 instead of 10. The rightmost column is worth 1, the next 2, then 4, 8, 16, doubling outward. Sum the lit columns and you have the number.

An n-bit unsigned integer covers 0 to 2ⁿ − 1: a byte is 0–255, a 16-bit short is 0–65,535, a 32-bit word is 0–4,294,967,295. The upper bound is fixed at compile time. Hit it and the value wraps — addition past the top rolls back to zero, which is how integer overflow bugs are born.

Lit columns add their weights. 01000001 = 64 + 1 = 65.

The same value can be written in any base. Decimal is what humans count in; binary is what hardware stores; hex is the readable shorthand for binary because each hex digit corresponds to exactly four bits. Two hex digits cover one byte. That convenience is why hex dominates byte-level work — debuggers, protocol dumps, colour codes, addresses — while binary itself almost never appears outside diagrams.

Same value, four notations. Hex is the lingua franca of byte-level work because 0x00–0xFF covers exactly one byte.

Addition

Once values are stored as bits, the CPU needs a way to add them. The mechanic is decimal long-addition with the base swapped to 2: process columns right to left, and when a column produces 2, write 0 and carry 1 into the next column. A circuit called a full-adder does this for one column; chain n of them and you have an n-bit adder. Multiplication and division reduce to repeated shifted addition and subtraction, so this single circuit underpins all integer arithmetic.

Same mechanic as decimal long-addition; only the base differs.

Bit operations

Sometimes you don't want arithmetic on a number — you want to inspect or flip individual bits, treating the byte as a row of independent flags. The CPU offers four bitwise primitives that act column by column with no carry: AND, OR, XOR, and NOT. They map directly to the four logic gates that make up every digital circuit, and they're how packed flags, masks, and bitfields are read and written.

The pattern is small but pervasive: AND clears bits not in the mask (flags & 0x0F keeps the low nibble), OR sets bits (flags | READY), XOR toggles them, NOT inverts every bit. Almost every flag-manipulation routine in a kernel or protocol parser uses some combination of these four.

Four operations cover almost every flag pattern: AND tests, OR sets, XOR toggles, NOT inverts.

Shifts

A shift slides every bit left or right by a fixed number of positions, dropping bits that fall off one end and filling new ones at the other. Left-shift by 1 doubles an unsigned value (each bit moves to a higher weight, a zero feeds in at the bottom); right-shift by 1 halves it. The CPU does either in a single cycle, which is why compilers turn x * 2 into x << 1 whenever they can prove the multiplication won't overflow.

The wrinkle is right-shift on signed numbers. Logical right-shift feeds a zero in at the top, which corrupts negative values. Arithmetic right-shift copies the sign bit instead, preserving the sign so that division by 2 still works. CPUs offer both, and high-level languages choose which one based on the operand's signedness.

Compilers turn x * 2 into a shift when they can. The shift takes one cycle; the multiply may take several.

Negative numbers

Unsigned integers can't represent debt, temperature below zero, or any subtraction whose result goes negative. The naive fix — reserve one bit for the sign — works but produces two zeros (+0 and -0) and requires the adder to branch on the sign before computing.

The trick that wins is two's complement. The top bit's weight is negated: in a 4-bit register, instead of +8, the leading bit is worth -8. So 1000 is −8, 1111 is −1, and 0111 is +7. The range becomes −2ⁿ⁻¹ to 2ⁿ⁻¹ − 1, with one zero and no special cases. To negate a value: invert every bit and add 1.

The payoff is that the same adder hardware handles signed and unsigned addition with no branch on the sign bit. Subtraction a − b becomes a + (−b), and the CPU's overflow flag tells the program when the result wrapped.

Pitfall — the asymmetric range. There's one more negative number than positive. INT_MIN (−2³¹ in 32-bit) has no positive counterpart, so negating it overflows back to itself. In C this is undefined behaviour; in Rust it traps in debug builds and silently wraps in release. The classic place this surfaces: abs(INT_MIN) returns INT_MIN.

Worked example: encoding −5 in 8-bit two's complement

To negate a value, invert every bit and add one. Apply that to +5:

Start with +5 in 8 bits: 0000 0101.
Invert every bit (the ones' complement): 1111 1010.
Add 1: 1111 1011.

That's the 8-bit two's-complement encoding of −5.

Verify by reading the place values. With the top bit's weight negated, the columns are worth −128, 64, 32, 16, 8, 4, 2, 1. The lit columns in 1111 1011 are −128 + 64 + 32 + 16 + 8 + 2 + 1 = −5. Matches.

Verify by addition. Add +5 and −5 and you should get zero:

  0000 0101   (+5)
+ 1111 1011   (-5)
-----------
1 0000 0000   (carry out of bit 7 is discarded)
  0000 0000   (= 0)

The carry past the top bit falls off the register. The eight bits that remain are exactly zero. Same adder hardware, no branch on the sign — that's the payoff.

Subtraction reduces to addition; the same adder hardware handles both. Adding past 0111 wraps to 1000 — overflow.

Floats

Integers handle whole quantities but not fractions, distances, or measurements. Real numbers are infinite, so we round. IEEE 754 is the standard for how to round: each value is stored in binary scientific notation — a sign, a magnitude, and where the binary point sits. Most decimal fractions have no exact binary form, so they round at the boundary; this is why 0.1 + 0.2 prints 0.30000000000000004. Not a bug. Predictable arithmetic with a finite alphabet.

The format splits the bits across three fields. A sign bit, an exponent that says where the binary point goes, and a mantissa that holds the leading digits. Single precision (float) uses 32 bits and gives about 7 decimal digits of precision; double precision (double) uses 64 bits and gives about 15. A few exponent values are reserved to encode ±0, ±∞, and NaN (not-a-number, the result of 0/0 or sqrt(-1)).

Single precision: 32 bits split 1/8/23. Double precision: 64 bits split 1/11/52 with the same shape and bigger fields.

Pitfall — decimal rounds on entry. 0.1 becomes a repeating binary fraction that gets cut off at 23 or 52 bits, so the stored value is slightly off. Add two slightly-off numbers and the errors compound. The fix is not "use more bits" — it's "don't use floats for exact decimal arithmetic." Money goes in fixed-point or decimal types; equality comparisons use abs(a - b) < epsilon, never ==.

Worked example: encoding 0.15625 in IEEE 754 binary32

The whole encode-decode cycle in seven steps. We encode 0.15625 as a 32-bit single-precision float and verify it round-trips.

1. Decimal → binary fraction. 0.15625 = 1/8 + 1/32 = 2⁻³ + 2⁻⁵, so in binary it's 0.00101. (This number happens to be exact in binary; most decimals are not.)

2. Normalize to scientific form. Shift the binary point until exactly one 1 sits to its left: 0.00101 = 1.01 × 2⁻³. The leading 1. is implicit in the format and is not stored.

3. Sign bit. Positive → S = 0.

4. Exponent. True exponent is -3. Add the bias of 127 to get the stored value: 124. In 8-bit binary: 01111100.

5. Mantissa. Take the bits after the implicit leading 1. — here 01 — and pad to 23 bits: 01000000000000000000000.

6. Assemble. Concatenate S | EEEEEEEE | MMM…M:

0 01111100 01000000000000000000000

Grouped as bytes: 0011 1110 0010 0000 0000 0000 0000 0000 = 0x3E200000 (big-endian).

7. Verify by decoding. Sign 0 → positive. Exponent 01111100 = 124, minus bias 127 = -3. Mantissa bits 01 → fractional value 1.01₂ = 1.25₁₀. Final: +1.25 × 2⁻³ = 0.15625. Round-trip clean.

The numbers that don't round-trip cleanly — like 0.1 — are the ones whose binary expansion is non-terminating. Step 1 is where the loss happens; every step after is bookkeeping.

0.1 has no exact binary representation. The stored float32 is the nearest representable value, off by ≈1.5×10⁻⁹ — and the error compounds when you add it to another float.

Text from bits

ASCII

Text needs a byte-to-character agreement, and the first widely adopted one was ASCII. Seven bits, 128 codepoints — enough for every English letter, digit, and punctuation mark, plus 33 invisible control codes for things like newline and tab. The 8th bit was left unused, which various vendors filled with incompatible "code pages" until Unicode arrived.

ASCII is still the floor of every modern text encoding. Look at any HTTP header, source file, or config file in a hex dump and you're reading ASCII bytes. The CRLF-versus-LF newline argument between Windows and Unix descends directly from two of those control codes (0x0D and 0x0A) and which combination counts as "end of line."

The first widespread agreement on what text bytes mean. UTF-8 preserves every ASCII byte unchanged, which is why ASCII is still everywhere.

A hex dump is the standard way to look at any file's raw bytes: an offset on the left, the hex values in the middle, and an ASCII gutter on the right where printable bytes show their letter and everything else shows as a dot. It's the universal "what's actually in this file?" view.

The hex column is the truth; the ASCII gutter is a hint. Anything below 0x20 or above 0x7E renders as a placeholder.

Unicode

ASCII covers English. The world has many more scripts. Unicode is the standard that assigns a unique number — a codepoint — to every character in every script: 1,114,112 slots organized into 17 planes, written U+0000 through U+10FFFF. The first plane (the BMP) holds almost all living text; later planes hold emoji, historic scripts, and CJK extensions.

Unicode is only an assignment of numbers to characters. It says nothing about how to write those numbers as bytes — that's the job of an encoding (UTF-8, UTF-16, UTF-32). Confusing the two is the source of half the "encoding bugs" in legacy systems.

A codepoint is a number; a glyph is a drawing. Unicode standardizes the numbers; fonts decide the drawings.

Three concepts get conflated and shouldn't be. A codepoint is the number Unicode assigns. A glyph is the visual shape a font draws for that codepoint. A grapheme cluster is what the user perceives as one character. Sometimes one codepoint, one glyph, one grapheme. Often not — Á can be one codepoint or two, and a family emoji is seven codepoints joined into one grapheme.

This is why "string".length lies. Most languages count codepoints (or worse, UTF-16 code units), not graphemes. So a single emoji can have a "length" of 4, 7, or 11. To count what the user sees, use a grapheme-aware library.

"string".length in most languages counts codepoints, not graphemes. That's why an emoji can have a length surprisingly larger than what you see.

A related hazard: the same visible string can have two different byte sequences. "café" can be stored as a single precomposed codepoint for é, or as e plus a combining acute accent. Both render identically; neither is byte-equal to the other. Unicode normalization (NFC, NFD, NFKC, NFKD) collapses both forms into one canonical sequence so comparisons work. Always normalize on the way in for filenames, identifiers, and database keys — Linux and macOS filesystems disagree on which form they prefer, and the bug surfaces at file-not-found time on the other OS.

Always normalize on the way in. Cross-platform tooling has to convert because filesystems disagree on which form they prefer.

UTF-8

Unicode codepoints go up to 21 bits. Storing every character in 21 (or 32) bits would quadruple the size of every English text file. UTF-8 solves this with a variable-length encoding: ASCII characters take 1 byte, Latin-script accents take 2, most other living scripts take 3, and emoji take 4.

The leading bits of each byte say which form is in use. A byte starting 0 is a 1-byte ASCII codepoint. A byte starting 110, 1110, or 11110 begins a 2-, 3-, or 4-byte sequence. Continuation bytes always start 10, so any byte tells you whether you're at a character boundary or in the middle of one. This is what "self-synchronizing" means — drop into a UTF-8 stream at a random offset and you can resync at the next byte that doesn't start with 10.

Self-synchronizing: every continuation byte begins 10. Lose your place mid-stream and the next byte starting with anything else marks the next character boundary.

ASCII bytes are valid UTF-8 unchanged. That backwards compatibility is why UTF-8 won out over UTF-16 and UTF-32 on disk and on the wire — every existing English file became a valid UTF-8 file with zero conversion.

Worked example: encoding "€" in UTF-8

The euro sign sits at codepoint U+20AC. Walk it through the encoder.

1. Write the codepoint in binary. 0x20AC = 0010 0000 1010 1100 (16 bits, leading zeros included).

2. Pick a length. The codepoint needs at least 14 bits of payload (drop the two leading zeros). One byte holds 7 payload bits, two bytes hold 11, three bytes hold 16. Fourteen fits in three.

3. Lay out the template. A 3-byte sequence has the shape 1110xxxx 10xxxxxx 10xxxxxx, with 4 + 6 + 6 = 16 payload slots.

4. Pad the codepoint to 16 bits and pour it into the slots. U+20AC is already 16 bits: 0010 000010 101100. Slide each group into the xs:

Template	Filled
`1110xxxx`	`1110 0010`
`10xxxxxx`	`1000 0010`
`10xxxxxx`	`1010 1100`

5. Read the resulting bytes. E2 82 AC. That's the three-byte UTF-8 encoding for €.

Decoding goes the other way. A reader sees E2, notes the 1110 prefix and knows "three-byte sequence, two continuations to follow." It strips the prefix bits from each byte (0010 from E2, 000010 from 82, 101100 from AC), concatenates the payload (0010 000010 101100), and reads it as U+20AC. If the second byte didn't start with 10, the decoder would know the stream was corrupted and could resynchronize at the next non-continuation byte — that's "self-synchronizing" made concrete.

Pitfall — invalid UTF-8. A decoder must reject overlong encodings (using 2 bytes for an ASCII character, for instance), unpaired surrogates (codepoints reserved by UTF-16), and codepoints above U+10FFFF. Accepting these has historically been a security hole: an attacker encodes / as an overlong 2-byte sequence, the URL parser thinks it's harmless, the filesystem decodes it and traverses the path.

Beyond text

Pictures, sound, and motion don't fit naturally into bits. They're continuous: smooth gradients of colour, smooth waves of pressure. To store them you sample — read the value at fixed points and quantize each reading to a finite bit depth. The agreements get bigger (grid size, sample rate, bit depth, channel count, layout) but the principle is the same as a byte: bits are meaningless without a header that says what they describe.

Image — a 2D grid of pixels. Each pixel holds the intensity of red, green, and blue, usually 8 bits per channel for 24-bit colour (about 16.7 million shades). HDR formats use 10 or 16 bits per channel for more headroom in highlights and shadows. Raw size is width × height × bytes-per-pixel: a 4K frame at 24-bit colour is 25 MB, before any compression.

Coral red lives at one point in 3-D colour space; the two coordinate systems just label that point differently. CSS accepts both, browsers convert HSL → RGB at paint time.

Audio — a 1D stream of samples, each a fixed-width integer or float capturing instantaneous air pressure. CD-quality audio is 44,100 samples per second, 16 bits each, two channels — 1.41 Mbps uncompressed. Sample rate decides the highest frequency you can capture; bit depth decides the dynamic range between the loudest and quietest sounds you can distinguish.

Video — a stack of images at 24, 30, or 60 frames per second, plus a parallel audio track. Raw 1080p60 is about 3 Gbps; codecs like H.264, HEVC, and AV1 squeeze that by 100–500× by exploiting the redundancy between consecutive frames (most pixels barely change frame to frame).

Sampling rate and bit depth trade fidelity for size. Information not captured at sample time cannot be recovered later.

How densely should you sample? The Nyquist–Shannon theorem sets the floor: to reconstruct a signal containing frequencies up to f, you need at least 2f samples per second. CD audio's 44.1 kHz captures up to about 22 kHz, just above the limit of human hearing.

Sample below the Nyquist rate and you don't just lose the high frequencies — they fold down into the captured range as fake low-frequency content. This is aliasing, and once it's in the samples no software can remove it. The only fix is a low-pass filter in front of the analog-to-digital converter that strips out frequencies above Nyquist before they get sampled.

Software fixes are impossible; the anti-aliasing filter has to live in the analog domain, before the samples are taken.

Pitfall — bit depth and channel count are part of the agreement. A 16-bit, mono, 44.1 kHz file decoded as 24-bit stereo produces silence or noise. File formats embed these parameters in their headers; raw streams require out-of-band agreement. Each extra bit of depth doubles the level count and adds about 6 dB of dynamic range, so 16-bit gives about 96 dB (CD), 24-bit about 144 dB (studio headroom for editing).

Compression

Raw representations are wasteful because real data is repetitive. English text uses e and t constantly and q rarely. Photos contain large patches of similar colour. Audio has near-silence between notes. Compression rewrites data in fewer bits by exploiting that redundancy, in two flavours: lossless (every byte recoverable, used for code and archives) and lossy (close-enough recovery, used for media where perception forgives small errors).

There's a hard floor on lossless compression. Claude Shannon's entropy measures, in bits per symbol, how unpredictable the source is: a uniform random byte stream needs all 8 bits and can't be compressed; a stream that's 99% letter A carries almost no information per byte and compresses heavily. No lossless coder can beat this floor on average. Compression is the art of estimating symbol probabilities accurately and spending bits in proportion.

The uniform case is already at the limit; the skewed case has headroom no lossless coder can exceed on average.

Three families of lossless tricks cover almost everything in production:

Run-length encoding. Replace AAAAAA with (6, A). Trivially fast, but only beats raw on highly repetitive input — fax pages, simple icons, black-and-white bitmaps.
Dictionary coding (LZ77 family). Slide a window over the input. When you see a string that already appeared earlier, emit a back-reference (go back 12 bytes, copy 4 bytes) instead of the literal. This is the engine inside gzip, ZIP, PNG, and Brotli.
Entropy coding (Huffman, arithmetic, ANS). Assign short bit-codes to frequent symbols and long codes to rare ones. The codes have a prefix property — no code is the start of another — so the stream decodes uniquely without delimiters.

gzip, ZIP, PNG, and Brotli all start here. The encoder tracks recent bytes; the decoder rebuilds them from literals and back-references.

Build the tree by repeatedly merging the two least-frequent nodes. Frequent symbols rise to short codes; rare ones sink deep.

Real-world formats stack these. DEFLATE (used by gzip, ZIP, PNG) runs LZ77 first to remove repetition, then Huffman to entropy-code what's left. Brotli adds a built-in static dictionary of common web strings.

Lossy compression goes further by discarding information humans don't perceive:

JPEG transforms 8×8 pixel blocks into frequency components and discards the high-frequency detail the eye barely registers.
MP3, AAC, Opus use psychoacoustic masking — quiet sounds near louder ones are inaudible and can be removed entirely.
H.264, HEVC, AV1 apply JPEG-style spatial compression per frame plus motion compensation between frames, encoding most pixels as small offsets from a previous frame.

The trade-off triangle. Compression ratio, encode/decode speed, and (for lossy) perceived quality. You pick two. gzip -9 compresses smaller than gzip -1 but spends more CPU. HEVC compresses smaller than H.264 but burns more cycles per frame. Pick the corner that matches your bottleneck — bandwidth, latency, or storage — not the corner that sounds best in benchmarks.

The curve is concave: the first quality drop is almost free, the last point of "perfect" costs an order of magnitude. Pick the elbow, not the corner.

Error detection and correction

Bits don't always come back the way they went in. Disks bit-rot, RAM cells flip from cosmic rays, radio packets lose bits to interference. To know data is intact, the system has to send extra bits whose only job is to fail when something else is wrong.

The cheapest scheme is a parity bit: one extra bit set so the total count of 1s in the byte is even. A single flipped bit makes the count odd, which the receiver detects. Two flips cancel and slip through. So parity catches single-bit errors and nothing more.

A checksum generalizes this. The sender computes a small value (a sum, an XOR, a CRC) over the whole message and appends it. The receiver recomputes; mismatch means corruption. CRC32, used in Ethernet, ZIP, and PNG, catches all burst errors up to 32 bits and almost all longer ones. Internet packets carry a smaller, weaker 16-bit checksum because they're protected by other layers too.

Detection alone forces a retransmission, which costs latency. Error-correcting codes (ECC) add enough redundancy to fix small numbers of flipped bits on the spot. Server-grade RAM uses a Hamming-style code that corrects any single-bit error and detects any double-bit error per memory word; storage and wireless protocols use stronger codes (Reed–Solomon, LDPC, turbo codes) that correct dozens of errors per block. The trade-off is space: ECC RAM costs about 12% more bits than plain RAM in exchange for surviving the cosmic-ray flips a data centre sees at scale.

The system designer picks one based on whether retransmission is cheap (TCP can ask again) or impossible (RAM can't, and space probes are minutes of light away).

Hashing

Many problems need a small fixed-size summary of arbitrary data — to look it up in a hash table, to spot duplicates without storing copies, or to verify a download matches what was published. A hash function takes any-length input and returns a fixed-length output that changes drastically when any input bit changes.

The properties you want depend on the use case. A hash table needs fast and well-distributed — outputs spread evenly across buckets so lookups stay O(1) on average. A content-addressed store needs low collision probability — the chance two different inputs produce the same hash should be negligibly small. A download verifier needs collision-resistant against an adversary trying to forge a match. These are different bars, met by different families.

Non-cryptographic (FNV, MurmurHash, xxHash, SipHash). Fast, well-distributed, but easy to deliberately collide. Used inside hash tables, bloom filters, and load balancers — places where the only adversary is randomness.
Cryptographic (SHA-256, BLAKE3). Slower, but deliberately collision-resistant. Used for content addressing (Git commits, IPFS), download verification, and digital signatures.

A hash table maps hash(key) mod table_size to a bucket index. Two keys can hash to the same bucket — a collision — and the table handles it by chaining (a list per bucket) or open addressing (probe nearby buckets). With a good hash function and load factor below about 0.7, expected lookup is O(1); with a bad one, it degrades to O(n) and your "constant-time" data structure starts thrashing.

The same input always produces the same hash — that's how Git knows two files are identical without comparing them byte by byte.

Cryptographic hashes pin down "this is exactly that data" without storing the data. Git uses SHA-1 (now migrating to SHA-256) as a content address: a commit hash is the hash of its tree plus parents plus message, so two repositories with the same commit hash hold byte-identical history. The properties that make hashing useful for cryptography (preimage resistance, second-preimage resistance, collision resistance) are covered in Act VIIa.

How a hash function actually shrinks the input. "Any-length input → fixed-length output" sounds like magic; the mechanism is mundane. A typical hash function works in four steps:

Pad the message so its length is a multiple of a block size (often 512 or 1024 bits), embedding the original length in the padding so two different inputs can't pad to the same bytes.
Initialize a small fixed-size state (for SHA-256, eight 32-bit words = 256 bits). This is what will eventually become the output.
For each block, mix the block into the state. The mixing is a fixed sequence of bit operations — XORs, shifts, rotations, modular adds — repeated for many rounds. Each round spreads every input bit's influence across the entire state (the avalanche property).
Output the state once every block has been absorbed.

So the output is fixed-size because the state is fixed-size; the input affects the output only by repeatedly stirring the state. A one-bit change to the input changes one bit early in the mixing, but after enough rounds that single difference has propagated to roughly half the state bits — which is why hash("hello") and hash("hello!") look completely unrelated even though the inputs differ by one byte. The art is choosing mixing operations that no shortcut can unwind; the engineering is making the mixing fast enough to hash gigabytes per second.

Serialization

Programs hold structured data — objects with fields and types, references between them. To send that data across a network, write it to disk, or hand it to another program written in another language, the structure must be flattened to a byte sequence. Serialization is that flattening; both sides have to agree on the layout, or the receiver decodes nonsense.

Two design camps split the space:

Self-describing (JSON, XML, MessagePack, CBOR). Field names and type tags travel inside the bytes. Easy to inspect, larger on the wire, slower to parse.
Schema-bound (Protocol Buffers, Cap'n Proto, FlatBuffers, Avro, Thrift). Bytes are tightly packed; identity comes from a schema both sides hold separately. Compact, fast, but the schema is the contract — break it and the receiver decodes garbage with no warning.

The size difference is real. The same record encoded as JSON and as Protobuf might be 20 bytes versus 5. Multiply across millions of records or millions of RPCs and the format choice becomes a bandwidth bill.

Same record, two encodings. The 4× size difference is typical and turns into bandwidth and storage decisions at scale.

Endianness

A multi-byte integer like 0x12345678 has to be laid out in memory somehow, and CPUs disagree on the order. Big-endian stores the most significant byte first (12 34 56 78); little-endian stores the least significant first (78 56 34 12). Big-endian was historically the convention for network protocols ("network byte order"); little-endian dominates modern CPUs (x86, ARM by default).

Inside a single machine, endianness is invisible. The CPU writes bytes one way and reads them the same way. The trouble starts when bytes cross machines: a little-endian machine writes a 32-bit field, a big-endian machine reads it, and the value comes out scrambled with no error reported. Wire formats specify a byte order; the C functions htonl / ntohl ("host to network long") exist precisely to convert at the boundary.

The same four bytes; the order on disk decides what they mean. Protocols pin a byte order; htonl / ntohl exist precisely because hosts may not match the wire.

Versioning

Schemas evolve. The field you didn't anticipate last quarter has to ship today, while old clients are still in production. Two compatibility properties matter, and you have to think about both. Forward compatibility: new code can read data written by old code. Backward compatibility: old code can read data written by new code (skipping fields it doesn't understand).

Protobuf's design prioritizes both. Fields are identified by a numeric tag, not a name; the wire format is a list of (tag, value) pairs. Add a new field with a new tag — old readers skip it, new readers see it. Retire a field by reserving its tag forever in the schema. The one rule you can't break: never reuse a tag for a different type. Old code happily decodes the new bytes as the old type, and you ship corruption to production with no error message.

The wire format is a list of (tag, value) pairs. Tags are the contract; field names exist only in the .proto file. Add freely, retire with reserved, never recycle.

Standards

Every agreement on this page has a canonical specification.

ASCII — ANSI X3.4-1986 / ISO 646. 7-bit, 128 codepoints.
Two's complement integers — universal in modern hardware; no separate spec, but documented in every CPU ISA reference (e.g. Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 1).
Floating-point — IEEE 754-2019 (revision of IEEE 754-1985 / -2008). Defines binary32, binary64, rounding modes, NaN/Inf semantics.
Unicode — The Unicode Standard, current major version. Code charts, normalization, bidi, collation. Maintained by the Unicode Consortium; tracked by ISO/IEC 10646.
UTF-8 — RFC 3629 (obsoletes RFC 2279). Defines the byte sequences shown in the encoding diagram, including the prohibition on surrogate codepoints and overlong forms.
JSON — RFC 8259 / ECMA-404. Defines the grammar; JSON.stringify semantics live in ECMA-262 (JavaScript).
Protocol Buffers — Encoding and the .proto language reference. Wire format is stable across proto2/proto3; the schema language is not.
Run-length and other lossless schemes — RFC 1951 (DEFLATE) underpins gzip, zlib, PNG; RFC 7932 (Brotli) is the modern successor.