Skip to main content

Command Palette

Search for a command to run...

Building a Buffer Pool: Why Your Database Shouldn't Talk to Disk Directly

Updated
14 min read
Building a Buffer Pool: Why Your Database Shouldn't Talk to Disk Directly

HozonDB is a relational database engine I'm building from scratch in Rust, mostly as a way to actually understand how databases work instead of just using them. Previous posts covered slotted page storage and B+ tree indexing. This one is about the buffer pool — the layer that sits between the query executor and the disk, and turns out to be one of the most consequential architectural decisions in the whole system.

The problem: every write was a disk write

Up to this point, HozonDB's storage layer looked like this:

PageManager owned the database file and exposed two methods: read_page(page_id) and write_page(page_id, data). Every read went straight to the kernel's page cache via a syscall, and every write went straight to disk.

This was simple and correct, but it had three problems that became impossible to ignore:

1. Bulk writes were O(n) disk operations. Deleting 1000 rows meant 1000 individual write_page calls, even when many of those writes landed on the same handful of pages — and if the row was indexed, that's a few more writes on top.

2. Read-then-write was double I/O. Inserting a row meant reading a page to check free space, then writing it back — two syscalls for one logical change, every time.

3. No place to enforce ordering guarantees. Once I started designing the write-ahead log, I needed a single interception point where every page mutation could be logged before it hit disk. With direct PageManager calls scattered across the executor, the B+ tree, and the catalog, there was no such point.

That third one is the real reason the buffer pool had to come before anything else. I'll mostly set WAL aside in this post — that's its own article — but it's worth knowing it was the forcing function.

Why "just cache the pages" isn't the whole answer

The obvious fix is a HashMap<PageId, [u8; 4096]> sitting in front of PageManager. Read? Check the map first. Write? Update the map, write through to disk later.

This gets you 80% of the way there, but two questions immediately follow:

  • What happens when the cache fills up? A database can have millions of pages; you can't hold them all in memory. Something has to get evicted, and you need a policy for choosing what.

  • What if the evicted page has unflushed changes? You can't just drop it — you'd lose data. Dirty pages need to be written back before their frame can be reused.

So the cache isn't a HashMap — it's a fixed pool of memory frames, each one tracking whether it's occupied, whether its contents are stale relative to disk, and a PageId → frame index lookup table to find things.

That's the buffer pool.

The Frame

Each frame needs to track enough state to make eviction decisions and know whether it's safe to discard:

pub struct Frame {
    page_id: Option<PageId>,
    data: [u8; PAGE_SIZE],
    dirty: bool,
    pin_count: u32,
    last_lsn: u64,
    referenced: bool,
}

A few of these fields earned their place through actual design discussion rather than being obvious from the start:

  • dirty — if false, the frame's contents already match disk. Eviction is free; just overwrite it.

  • referenced — used by the eviction algorithm (more on this below).

  • pin_count — exists for correctness under concurrent access, but isn't actively used yet. HozonDB is currently single-threaded and synchronous, so a page can't be evicted mid-operation — there's only one thing running at a time. The field is there because the moment you add concurrent transactions or async execution, you need it, and retrofitting it later would mean auditing every code path that touches a frame.

The constructor deliberately doesn't take initial data:

impl Frame {
    pub fn new() -> Self {
        Frame {
            page_id: None,
            data: [0u8; PAGE_SIZE],
            dirty: false,
            pin_count: 0,
            last_lsn: 0,
            referenced: false,
        }
    }
}

There's no scenario where you construct a frame with meaningful data — frames start empty and get load()ed later. Mutation happens through controlled methods (load, mark_dirty, mark_clean, pin/unpin, set_referenced/clear_referenced), not direct field access. The reason is invariant protection: if dirty could be set to false by anyone after a write, you could silently lose the write-ahead guarantee the WAL depends on.

Eviction: LRU, the hard way, and the pragmatic way

When every frame is occupied and a new page needs to be loaded, something has to go. The standard answer is LRU — evict whatever was least recently used. There are 3 ways to implement this, and they have meaningfully different costs:

A VecDeque of frame indices, reordered on every access — simple, but removing an element from the middle is O(n).

A doubly linked list + hashmap — The textbook O(1) LRU uses a doubly linked list + hashmap, but that comes with real ownership complexity in Rust. Instead of going down that path, I went with the clock algorithm.

The clock algorithm — instead of tracking exact recency, each frame gets a single referenced bit. A "clock hand" sweeps through frames looking for a victim:

On each step: if referenced == true, clear it and move on (the page gets a "second chance"). If referenced == false, evict it. The hand stops where it found a victim, ready for next time.

fn evict(&mut self) -> io::Result<usize> {
    let capacity = self.frames.len();

    for i in 0..capacity * 2 {
        let idx = (self.clock_hand + i) % capacity;

        if self.frames[idx].pin_count() > 0 {
            continue;
        }

        if self.frames[idx].referenced() {
            self.frames[idx].clear_referenced();
            continue;
        }

        // victim found
        if self.frames[idx].dirty() {
            let page_id = self.frames[idx].page_id().ok_or_else(|| {
                Error::new(ErrorKind::InvalidData, "Non empty frame has no page ID")
            })?;
            self.page_manager
                .write_page(page_id, self.frames[idx].data())?;
        }

        if let Some(page_id) = self.frames[idx].page_id() {
            self.page_table.remove(&page_id);
        }

        self.clock_hand = (idx + 1) % capacity;
        return Ok(idx);
    }

    Err(Error::new(ErrorKind::Other, "No evictable frames found"))
}

The capacity * 2 bound covers the worst case: one full pass clearing every referenced bit, then a second pass finding the first frame (now unreferenced) to evict. After a worst-case sweep, every frame is unreferenced — so the next eviction is fast. The algorithm self-corrects.

The (self.clock_hand + i) % capacity arithmetic is doing the work that a circular linked list would do, without the ownership headaches. cycle() on a mutable iterator doesn't work here — Rust won't let you re-borrow an iterator like that — so plain modular arithmetic over indices is both simpler and avoids the issue entirely.

The borrow checker shapes the architecture

This is the part that surprised me most. I went into the buffer pool design assuming the hard part would be the eviction algorithm. The actual hard part was where the WAL writer lives.

The buffer pool needs to log to the WAL before marking a frame dirty — that's the entire point of write-ahead logging. My first instinct was to give BufferPool ownership of both PageManager and WalWriter:

struct BufferPool {
    page_manager: PageManager,
    wal_writer: WalWriter,  // seemed natural at first
    frames: Vec<Frame>,
    page_table: HashMap<PageId, usize>,
}

This looks harmless until you trace through the actual write flow. The executor needs to:

  1. Get mutable access to a page's bytes (get_page_mut)

  2. Mutate them

  3. Log the change to WAL

  4. Mark the frame dirty with the resulting LSN

impl BufferPool {
    pub fn get_page_mut(&mut self, page_id: PageId) -> io::Result<&mut [u8; PAGE_SIZE]> {
        // ...
    }
}

get_page_mut takes &mut self — all of self, including the nested wal_writer field. The returned &mut [u8; PAGE_SIZE] borrows self for as long as it's alive. With WalWriter living inside BufferPool, step 1 and step 3 become two operations on the same object while one of them is still mid-flight:

let page = buffer_pool.get_page_mut(page_id)?; // borrows ALL of buffer_pool, including wal_writer
// page is still alive here...
buffer_pool.wal_writer.append(...)?; // ERROR — buffer_pool already mutably borrowed

This isn't a workaround-able quirk — it's the borrow checker correctly identifying that get_page_mut's signature gives it the right to touch wal_writer too, even though it never does. The method's type doesn't know it only cares about frames; from the caller's perspective, the whole struct is off-limits until page goes out of scope.

The fix is a well-known Rust pattern: pull the two things you need simultaneously out to sibling fields of a shared owner, so they're independently borrowable.

pub struct Database {
    buffer_pool: BufferPool,  // owns PageManager
    wal_writer: WalWriter,
    table_catalog: TableCatalog,
    index_catalog: IndexCatalog,
    indexes: HashMap<String, BPlusTree>,
}

Now db.buffer_pool and db.wal_writer are genuinely disjoint fields, and the borrow checker tracks them independently — a &mut borrow of one says nothing about the other:

let page = db.buffer_pool.get_page_mut(page_id)?; // borrows db.buffer_pool only
// ...
db.wal_writer.append(...)?;                       // borrows db.wal_writer — unrelated, fine
db.buffer_pool.mark_dirty(page_id, lsn)?;

The same disjointness is why functions can take both as separate arguments without any ceremony — this shows up constantly once WalWriter and BufferPool are siblings:

BPlusTree::new(
    column_type.order(),
    &mut self.buffer_pool,
    &mut self.wal_writer,
)?;

Two &mut borrows of two different fields of self, passed along to a function that needs both. Nothing special — but this only works because the fields are disjoint at the struct level. If wal_writer were still nested inside buffer_pool, &mut self.buffer_pool would transitively include it, and &mut self.wal_writer would be borrowing through an already-borrowed path.

The general lesson: the borrow checker doesn't see "logically separate concerns" — it sees struct field boundaries. Two things can be conceptually independent and still be borrow-checker-inseparable if one is nested inside the other. BufferPool shouldn't have known about WAL policy anyway — it's a cache, and WAL ordering is a higher-level concern. Keeping WalWriter out of it was the right call on pure design grounds. Rust just made sure that the moment I tried to blur that line — by reaching for the "obvious" nested struct — it would refuse to compile, for reasons that turned out to be about architecture, not syntax.

The read/write interface

With the ownership question settled, the buffer pool's interface ended up fairly small:

pub fn read_page(&mut self, page_id: PageId) -> io::Result<&[u8; PAGE_SIZE]>
pub fn get_page_mut(&mut self, page_id: PageId) -> io::Result<&mut [u8; PAGE_SIZE]>
pub fn mark_dirty(&mut self, page_id: PageId, lsn: u64) -> io::Result<()>
pub fn write_raw_page(&mut self, page_id: PageId, new_data: &[u8; PAGE_SIZE], lsn: u64) -> io::Result<()>
pub fn flush_dirty(&mut self) -> io::Result<()>
pub fn allocate_page(&mut self, wal_writer: &mut WalWriter, page_type: PageType) -> io::Result<PageId>
pub fn free_page(&mut self, page_id: PageId, wal_writer: &mut WalWriter) -> io::Result<()>

The split between get_page_mut + mark_dirty (for row-level slotted page edits) and write_raw_page (for catalog and B+ tree node pages, which have no slot directory) reflects a distinction that runs through the whole storage layer: slotted pages and raw pages have different mutation semantics, and forcing them through one generic method would have meant the buffer pool re-deriving information the caller already had.

The flow for a typical row mutation looks like this:

No copying of page data, no redundant reads. The executor — which already knows the slot, the old bytes, and the new bytes from its own logic — does the WAL bookkeeping, and the buffer pool just tracks that the frame is now dirty as of a given LSN.

flush_dirty: the other half of the contract

flush_dirty writes every dirty frame to disk and marks it clean:

pub(crate) fn flush_dirty(&mut self) -> io::Result<()> {
    for frame in self.frames.iter_mut() {
        if frame.is_empty() || !frame.dirty() {
            continue;
        }

        let page_id = frame.page_id().ok_or_else(|| {
            Error::new(ErrorKind::InvalidData, "A non empty frame should have a page ID")
        })?;

        self.page_manager.write_page(page_id, frame.data())?;
        frame.mark_clean();
    }

    Ok(())
}

It's deliberately pub(crate), not pub. The only correct way to flush is through Database::checkpoint(), which coordinates flush_dirty with the WAL checkpoint record — flushing pages without that coordination would break the recovery guarantees. pub(crate) lets internal tests call it directly while keeping it out of the public API that the executor and external crates see. It's a small thing, but it's the difference between "this invariant is enforced by convention" and "this invariant is enforced by the type system."

Results: two snapshots, one storage layer apart

The numbers below come from two real benchmark runs at two different points in this project — not a controlled before/after of the buffer pool in isolation, but an honest comparison of how the same 10,000-row workload performed before WAL and the buffer pool existed, versus after, with WAL's sync_all call commented out so writes aren't forced to disk on every append.

Figure 1: Benchmark before WAL and the buffer pool — every page write hits disk immediately.

Figure 2: Benchmark after the buffer pool — writes deferred, WAL sync_all commented out.

Operation Before (no WAL/buffer pool) After (buffer pool, sync deferred) Speedup
INSERT (single row) 7.34ms 0.11ms ~67x
UPDATE (fits slot) 9.89ms 0.10ms ~99x
DELETE (single row) 7.68ms 0.07ms ~110x
UPDATE bulk (1000 rows) 6884.09ms 97.19ms ~71x
DELETE bulk (1000 rows) 3375.21ms 46.16ms ~73x

The most telling number isn't the duration — it's Pages Dirtied, which is 8 in both runs for the bulk operations. The bulk update touches 1000 rows spread across 8 unique pages. That number doesn't change based on the buffer pool's presence — what changes is what happens with it.

Before: 8 unique pages, each written to disk every time any row on them changed — across 1000 row updates, that's a lot of redundant writes to the same 8 pages.

After: those same 8 pages get marked dirty in memory, and the actual disk write happens once per page, deferred until checkpoint — regardless of how many of the 1000 updates touched them.

Same logical work. Same page count. Two orders of magnitude difference, because "dirty" became a deferred promise instead of an immediate action.

What's still missing

This isn't a production buffer pool, and it's worth being explicit about the gaps:

  • Fixed capacity, set at startup. No online resizing. A real system would let you grow or shrink the pool while running, evicting or allocating frames as needed.

  • pin_count exists but isn't enforced. It's structurally present for when concurrent access becomes a thing, but right now nothing actually pins a frame, because nothing needs to — single-threaded, synchronous execution means eviction can only happen at controlled points.

None of these are wrong, exactly — they're the honest boundary of "what you need to make WAL and recovery possible" versus "what you need to run this in production." The first one was the actual goal.

The bigger picture

The buffer pool wasn't really about performance, even though the benchmark numbers are dramatic. It was about creating a single point of control over page mutations — one place where every write passes through, can be logged, can be deferred, and can be made durable on a schedule rather than on every individual operation.

That single point of control is what makes write-ahead logging and crash recovery possible at all. Without it, "log before write" would have to be enforced by convention across every call site that touches a page — the B+ tree, the catalogs and the row executor.

The performance win was almost a side effect. The real deliverable was a chokepoint.


HozonDB is open source and built in public — code, commit history, and the rest of the build-in-public series at Github.

HozonDB

Part 3 of 4

Building a relational database engine from scratch in Rust. Each article covers a core subsystem — storage, indexing, distributed systems — implemented and explained from first principles.

Up next

HozonDB: Write-Ahead Logging and the Cost of Durability

The Problem Before this milestone, HozonDB could lose data silently. A DELETE would mark a row's slot dead in memory, write the dirty page to disk, then remove the corresponding entry from the B+ tree