Write-Ahead Logging in Rust: The Cost of Durability

The Problem

Before this milestone, HozonDB could lose data silently. A DELETE would mark a row's slot dead in memory, write the dirty page to disk, then remove the corresponding entry from the B+ tree index. If the process crashed between the page write and the index update, the database would restart with a data page that says "this row is gone" and an index that still points to it. No error, no warning — just a database that returns wrong answers.

This is the general problem every database with in-memory state and on-disk persistence has to solve: how do you guarantee that a crash at any point leaves the database in a state it can recover from, rather than a state that's silently wrong?

The answer is a write-ahead log (WAL): before any change touches a data page, a record describing that change is written to a separate append-only log and flushed to disk. If the process crashes before the data pages are updated, the log has enough information to redo the work on restart. If it crashes after, the log entry is simply redundant — replaying it again has no effect.

That's the goal. Getting there took longer than expected, and the detour is as much the story as the destination.

Designing the Record Format

The first decision was what a log record should actually contain. There were three real options:

Logical logging — record the SQL operation itself (INSERT INTO users VALUES (1, 'Alice')). Compact, but replay means re-running the executor, including re-evaluating expressions and re-checking constraints. Making that idempotent — safe to run twice — is genuinely hard.

Physical logging — record the exact bytes written to a page: page ID, offset, before-image, after-image. Precise and trivial to redo, but records can be large, especially for full-page writes.

Physiological logging — record at the page/slot level using logical addressing: "page 5, slot 3, here's the new row bytes." This is what PostgreSQL uses, and it's the middle ground — small records, but redo is a direct byte-level operation, not a re-execution of business logic.

Physiological logging was the right fit for HozonDB specifically because the slotted page design already gives every row a stable (page_id, slot) address. The record format didn't need to invent a new addressing scheme — it just needed to describe what changed at an address that already existed.

The initial record shape was a single flat struct:

pub struct WalRecord {
    lsn: u64,
    record_type: WalRecordType,
    table_name: String,
    page_id: PageId,
    slot: u16,
    new_data: Vec<u8>,
    old_data: Vec<u8>,
}

old_data was included from the start even though HozonDB has no transactions yet — undo logging is meaningless without a concept of "uncommitted." But adding the field early meant the format wouldn't need to change shape when transactions arrive later; only the recovery logic around it would.

Every record also carries a CRC32 checksum, computed over the serialized bytes and appended after them. This isn't about correcting corruption — it's about detecting it. If the process crashes mid-write, the last record in the log can be a torn write: half its bytes on disk, the rest missing. On restart, the checksum catches this and recovery stops at the last verifiably complete record, rather than trying to parse garbage as a valid log entry.

The First Ordering Violation

With the record format settled, the writer (WalWriter) was straightforward: an append-only file, a monotonically increasing LSN, record-length framing so a reader can skip records without parsing them, and checkpointing so recovery doesn't have to replay the entire log from the beginning every time.

Wiring it into the executor is where the first real problem showed up.

The WAL guarantee depends entirely on ordering: log first, write pages second. But insert_row_into_page — the function responsible for finding space in a page and writing the row — determines the page_id and slot as a side effect of writing. There was no way to know what to put in the WAL record until after the page write had already happened.

The honest fix is to split page writes into two phases — "plan" (find the page and slot, without writing) and "execute" (write, now that the WAL record exists) — so the WAL record can be constructed and flushed before any byte on the data page changes.

That fix wasn't done at this stage, and the reason matters more than the fix itself.

Why the Buffer Pool Became the Dependency

Splitting insert_row_into_page into plan/execute phases is real work — but it's also work that a buffer pool would redo from scratch. A buffer pool fundamentally changes the page I/O contract: read_page returns a handle into an in-memory frame, write_page becomes "mark this frame dirty," and the executor stops working with raw byte copies and starts working with cached pages. Any plan/execute split done before the buffer pool existed would be rebuilt the moment the buffer pool arrived.

So the WAL-ordering fix was deliberately deferred — logged with an explicit TODO — and the buffer pool became the actual prerequisite for finishing WAL. This wasn't scope creep; it was recognizing that two pieces of "future work" were really one piece, and doing them in the wrong order would mean doing the first one twice.

There was a second gap pointing at the same dependency. WAL records were only being written for row data pages — execute_insert, execute_update, execute_delete. But B+ tree index nodes and system catalog pages are also pages, written via the same PageManager::write_page, with no WAL record at all. Recovery could restore row data perfectly and still leave an index pointing at a row that no longer existed.

Both gaps had the same shape: WAL needs to intercept every page write, not just the ones explicit in executor code. That's exactly what a buffer pool is positioned to do — it sits between all callers and disk, so it's the one place that can guarantee "log, then mark dirty" for every page mutation regardless of whether it's a row, an index node, or a catalog entry.

What the Buffer Pool Unlocked

With the buffer pool in place, every page mutation — row writes, B+ tree node updates, catalog changes, page allocation — goes through the same path: log to WAL first, then write to the in-memory frame, then mark it dirty with the LSN that was just assigned. The record format grew to match this reality. What started as one flat struct became an enum:

pub enum WalRecord {
    Slotted {
        lsn: u64,
        record_type: WalRecordType,
        table_name: String,
        page_id: PageId,
        slot: u16,
        new_data: Vec<u8>,
        old_data: Vec<u8>,
    },
    Raw {
        lsn: u64,
        record_type: WalRecordType,
        page_id: PageId,
        new_data: Vec<u8>,
        old_data: Vec<u8>,
    },
    Checkpoint { lsn: u64 },
    LinkPage { lsn: u64, page_id: PageId, next_page: PageId },
    AllocatePage { lsn: u64, page_id: PageId, page_type: u8 },
}

Slotted is the original physiological format for row-level DML. Raw carries a full page image — for B+ tree nodes and catalog pages, where "slot 3 changed" doesn't mean anything; the whole page is the unit of change. LinkPage and AllocatePage exist because page lifecycle — chaining pages together, allocating from the free list — is also state that must survive a crash, even though it isn't a row or index change at all.

The enum shape is itself the record of what was learned: a single record format couldn't describe everything that needed to be durable, because not everything that needs to be durable is a row.

Recovery: Redo with LSN Comparison

Recovery (WalReader::recover) runs on every Database::new, before catalogs load — there's no attempt to distinguish a crash from a clean shutdown, because a clean shutdown with nothing to replay is just a fast no-op pass over an empty range.

The mechanism is a redo pass from the last checkpoint. For each record, the page it targets is loaded, and its stored LSN is compared against the record's LSN:

fn recover_insert(record: &WalRecord, buffer_pool: &mut BufferPool) -> io::Result<()> {
    let page_id = record.page_id()...;
    let mut page_meta = buffer_pool.read_page_metadata(page_id, PageType::Slotted)?;
    let page_data = buffer_pool.get_page_mut(page_id)?;

    // compare lsn
    if page_meta.lsn() >= record.lsn() {
        // changes must have been applied
        return Ok(());
    }

    // ... apply new_data at the slot, update metadata, set_lsn(record.lsn())
}

This is the idempotency guarantee in three lines. Every page stores the LSN of the last WAL record that modified it. If that LSN is already at or past the record being replayed, the record's effect is already on disk — skip it. Otherwise, apply it and stamp the page with the new LSN. Replaying the same WAL twice, or replaying from a checkpoint that's behind the actual flushed state, produces the same result either way.

Each record variant has its own redo function — recover_insert, recover_update, recover_delete for Slotted; a single replay_raw_page_new_data for Raw, since "apply the new page image" is the same operation regardless of whether the page is an index node or a catalog page; recover_page_link and recover_allocate_page for the lifecycle records. Checkpoint records themselves require no redo — they exist purely as recovery boundaries.

One subtlety came up during this work: recover_allocate_page reinitializes a page's metadata when replaying an allocation, but the free list head — which page in PageManager thinks is "next available" — also needs to be consistent with that replay. Getting page-level recovery right wasn't just "replay the bytes"; it meant tracing through every piece of state that a given operation touches, including state that lives outside the page itself.

Results: The Cost of Durability

WAL's correctness story is straightforward once the buffer pool was in place. Its performance story is less flattering — and worth being honest about.

Every WAL append calls sync_all() — a synchronous fsync to disk — before returning. This is what makes the durability guarantee real: if the WAL record isn't on disk before the operation returns, a crash immediately after could lose it. But fsync is expensive, and calling it on every single write shows up immediately in benchmarks.

Comparing the same 10,000-row benchmark with WAL fsync enabled versus disabled (buffer pool only, no durability guarantee):

Figure 1: With WAL fsync — every operation's durability guarantee paid for individually.

Figure 2: Without WAL fsync — same buffer pool, no per-operation disk sync.

Operation	Without WAL fsync	With WAL fsync	Slowdown
INSERT (single row)	0.22ms	8.52ms	~39x
UPDATE (fits slot)	0.16ms	13.02ms	~81x
DELETE (single row)	0.07ms	9.89ms	~141x
UPDATE bulk 10% (1000 rows)	83.14ms	12,227.99ms	~147x
DELETE bulk 10% (1000 rows)	35.10ms	7,740.22ms	~220x

Read operations are largely unaffected — fsync only sits on the WAL write path, and reads don't touch the WAL. Full scans and index seeks fall within normal run-to-run variance across both configurations.

The write numbers are not a regression to be ashamed of; they're the literal cost of the guarantee. Every one of those fsyncs is a promise: "if power is lost right now, this operation already happened as far as the database is concerned." Without WAL durability, that promise doesn't exist — the sub-millisecond writes are fast precisely because they're making no such promise.

The gap between these numbers is also the next piece of work. The current writer calls sync_all() per record — O(n) fsyncs for n operations, with each fsync paying for exactly one operation's durability. The standard fix is group commit: batch multiple WAL records into a single fsync, amortizing the disk flush across all of them. This becomes natural once transactions exist — sync_all moves from "after every WAL append" to "at transaction commit," and a single commit can cover many writes. That's explicitly noted as a TODO in the writer and is the next milestone that touches this code path.

What's Next

WAL recovery is now correct for row data, index nodes, catalog pages, and page lifecycle operations — the two gaps that blocked it (ordering and coverage) are both closed, and both were closed by the same dependency: the buffer pool.

The next pieces build directly on this foundation. Group commit addresses the fsync cost shown above, and requires transaction boundaries (BEGIN/COMMIT/ROLLBACK) to exist first — which also means the old_data field that's been sitting unused in every Slotted record finally gets a purpose, as the basis for rollback.

Beyond that, the WAL record stream is the foundation for replication. A physiological log record — "page 5, slot 3, here's what changed" — is close to what a Raft log entry needs to carry to a replica. That's the long-term reason this format was chosen, and it's the next major direction for the project.

The full source code for HozonDB is available on GitHub. The WAL implementation lives in core/src/wal/ — writer.rs, reader.rs, and record.rs. The buffer pool is in core/src/storage/buffer_pool/.

HozonDB: Write-Ahead Logging and the Cost of Durability

The Problem

Designing the Record Format

The First Ordering Violation

Why the Buffer Pool Became the Dependency

What the Buffer Pool Unlocked

Recovery: Redo with LSN Comparison

Results: The Cost of Durability

What's Next

Comments

HozonDB

How I Rebuilt HozonDB's Storage Layer to Make Indexes Actually Work

More from this blog

Building a Buffer Pool: Why Your Database Shouldn't Talk to Disk Directly

Building B+ Tree Indexing from Scratch in Rust — HozonDB

How I Rebuilt HozonDB's Storage Layer to Make Indexes Actually Work

Understanding Pin<T> in Rust

Command Palette

The Problem

Designing the Record Format

The First Ordering Violation

Why the Buffer Pool Became the Dependency

What the Buffer Pool Unlocked

Recovery: Redo with LSN Comparison

Results: The Cost of Durability

What's Next

Comments

HozonDB

How I Rebuilt HozonDB's Storage Layer to Make Indexes Actually Work

More from this blog