feat: object store, typed object layer, work-tree scanner, and vcs CLI #1

Open
Taco wants to merge 12 commits from changes into main

Builds the four layers needed to turn the reference store into a
commit-capable VCS, plus one recovery-path perf fix and a scan
progress indicator.

What's in it

Object-segment storage (SPEC.md §3.8-§3.9, §4.4; FORMAT.md §7b)

  • New ObjectSegment payload with a BLAKE3-addressed locator index.
  • SegmentFooter gains object_index_offset and an
    OBJECT_SEGMENT feature flag (FORMAT.md footer tail, still within
    the frozen 256B window).
  • Manifest wire format bumped to v2 so the active set carries both
    ref-segment and object-segment sequences. v1 manifests still
    decode.
  • Store API: put_object, get_object, flush_object_segment.
  • Recovery extended to verify object segments alongside ref
    segments (RECOVERY.md §C1).

Typed object layer (include/vcs/objects.hpp)

  • Deterministic encode/decode for blobs, trees, commits using
    vcs::varint and vcs::crc32c. Tree entries are validated for
    name legality, sort order, and duplicate rejection.

Staging and work-tree composition (include/vcs/workdir.hpp)

  • Persistent Index (sorted, unique, CRC32C-framed) with D1-style
    atomic save (create-tmp + fsync + rename + fsync_dir per
    RECOVERY.md §D1).
  • Recursive scanner that walks an IFileSystem, hashes every
    regular file through Store::put_object, and upserts the Index.
  • Helpers build_tree_from_index, put_commit,
    stage_commit_from_index compose a tree graph from the Index and
    stage a ref CAS through the writer protocol (SPEC.md §3.2, §3.4).
  • New IFileSystem::is_dir primitive; MemoryFs::list_dir now
    returns immediate children including subdir names (POSIX
    readdir semantics).
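
The D1-style save sequence named above (create-tmp, fsync, rename, fsync_dir) can be sketched with plain POSIX calls. This is a minimal illustration, not the code in this PR: the helper name `atomic_save` and the `.tmp` suffix are assumptions, and the real Index save goes through the project's own file-system abstraction.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: atomically replace dir/name with `bytes`.
// Returns 0 on success, or an errno value on failure.
inline int atomic_save(const std::string& dir, const std::string& name,
                       const std::string& bytes) {
    assert(!name.empty());
    const std::string tmp = dir + "/" + name + ".tmp";  // assumed temp naming
    const std::string dst = dir + "/" + name;

    // 1. create-tmp: write the new contents to a sibling temp file.
    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return errno;
    std::size_t off = 0;
    while (off < bytes.size()) {
        const ssize_t n = ::write(fd, bytes.data() + off, bytes.size() - off);
        if (n < 0) {
            if (errno == EINTR) continue;
            const int e = errno;
            ::close(fd);
            ::unlink(tmp.c_str());
            return e;
        }
        off += static_cast<std::size_t>(n);
    }

    // 2. fsync: make the temp file's bytes durable before it becomes visible.
    if (::fsync(fd) != 0) {
        const int e = errno;
        ::close(fd);
        ::unlink(tmp.c_str());
        return e;
    }
    if (::close(fd) != 0) {
        const int e = errno;
        ::unlink(tmp.c_str());
        return e;
    }

    // 3. rename: atomically swap the new file into place.
    if (::rename(tmp.c_str(), dst.c_str()) != 0) {
        const int e = errno;
        ::unlink(tmp.c_str());
        return e;
    }

    // 4. fsync_dir: make the directory entry change durable.
    const int dfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return errno;
    const int rc = (::fsync(dfd) == 0) ? 0 : errno;
    ::close(dfd);
    return rc;
}
```

A reader that opens the destination path sees either the old contents or the new contents in full, never a partially written file.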

CLI (cli/main.cpp, new build/bin/vcs)

  • vcs init, vcs commit -m <message>, vcs log.
  • Uses workdir::scan_into_index + stage_commit_from_index +
    the writer's flush/publish sequence.
  • Throttled stderr progress line during scan (TTY: single rewritten
    line; non-TTY: one line per tick). Suppressed for sub-250ms
    scans.

Recovery perf fix (RECOVERY.md §C1)

  • verify_segment no longer slurps and rehashes the whole segment
    body; it reads the fixed 256B footer via VFile::read_at and
    validates the header+footer pair. Body integrity remains the
    reader's job via per-block CRCs (FORMAT.md §2).
  • Startup cost for a 141 MB store: ~1.5s → ~0.02s.

What's deliberately not in it

  • vcs add, vcs status, vcs diff, vcs cat-file, vcs ls-tree,
    vcs show, branches, checkout, merge. All additive on top of
    this base.
  • Executable-bit detection in the scanner; every staged blob gets
    TreeMode::Regular.
  • .vcsignore; the CLI's hardcoded ignore set is {".vcs"}.
  • Stat cache in the Index: commit currently rescans and rehashes
    every path.
  • Author/committer are hardcoded to nobody <nobody@example>; no
    config file yet.
  • Format changes are strictly additive (new footer field behind a
    feature flag, manifest v2 parallel to v1). No bytes in any
    existing layout have moved.

Invariants relied on

  • SPEC.md §3.2 (CAS semantics), §3.4 (publish atomicity),
    §3.8-§3.9 (object put/get), §4.4 (object segment recovery).
  • RECOVERY.md §C1 (footer-only recovery scan), §D1 (atomic
    rename + fsync_dir), §D2 (manifest durability).
  • FORMAT.md §2 (block CRCs), §7b (object segment layout),
    frozen 256B footer.
  • COMPACTION.md is unchanged; object segments are not yet
    compacted.

Testing

  • 291 unit tests pass (./build/bin/vcs_tests).
  • Test deltas: new test_objects.cpp (239 lines), new workdir
    coverage in test_store.cpp (+475 lines), manifest/publish
    tests updated for the v2 manifest and footer-only recovery.
  • Two-commit end-to-end integration test exercises scanner +
    tree build + commit + publish + reopen + log walk.
  • Smoke: 100 MiB random blob on POSIX — commit ~1.4s end-to-end,
    scan ~560ms; 500 × 200 KiB files — ~1.5s end-to-end, scan ~620ms.
Taco added 12 commits 2026-04-24 18:21:12 +00:00
SPEC.md §3.4 requires a single manifest file to publish both the ref-
segment stack and the object-segment stack in one rename. The v1
layout only carried one sequence. v2 adds a second 4-byte count at
offset 24 and appends the object-segment hashes after the ref hashes:

  28              N_r*32  ref segment hashes
  28+N_r*32       N_o*32  object segment hashes
  28+(N_r+N_o)*32 4       CRC32C

Total bytes = 32 + 32*(N_r + N_o). Header grows 24 -> 28 to include
the obj_segment_count field. v1 manifests are rejected with
UnknownVersion per RECOVERY.md §C3-analogue.

Manifest gains object_segments; encoded_size takes (n_ref, n_obj).
test_manifest.cpp updates the byte-level pins, adds a round-trip with
object segments, and asserts v1 is rejected. test_publish.cpp fills
in the empty third aggregate member.
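
The v2 offsets above reduce to simple arithmetic, sketched here as compile-time checks. The struct and function names are hypothetical; only the numbers (28-byte header, 32-byte hashes, 4-byte CRC32C) come from the layout described in this commit.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kHeaderSizeV2 = 28;  // v1 header (24) + obj_segment_count (4)
constexpr std::size_t kHashSize = 32;      // one segment hash
constexpr std::size_t kCrcSize = 4;        // trailing CRC32C

// Hypothetical view of where each v2 region starts.
struct ManifestV2Layout {
    std::size_t ref_hashes_off;
    std::size_t obj_hashes_off;
    std::size_t crc_off;
    std::size_t total;
};

constexpr ManifestV2Layout manifest_v2_layout(std::size_t n_ref, std::size_t n_obj) {
    return ManifestV2Layout{
        kHeaderSizeV2,
        kHeaderSizeV2 + n_ref * kHashSize,
        kHeaderSizeV2 + (n_ref + n_obj) * kHashSize,
        kHeaderSizeV2 + (n_ref + n_obj) * kHashSize + kCrcSize};
}

// Total bytes = 32 + 32*(N_r + N_o), as stated in the commit message.
static_assert(manifest_v2_layout(0, 0).total == 32, "empty manifest");
static_assert(manifest_v2_layout(2, 3).total == 32 + 32 * 5, "2 ref + 3 obj");

// Byte offset of the i-th object-segment hash.
inline std::size_t obj_hash_offset(std::size_t n_ref, std::size_t n_obj, std::size_t i) {
    assert(i < n_obj);
    return manifest_v2_layout(n_ref, n_obj).obj_hashes_off + i * kHashSize;
}
```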

recover() now also walks manifest.object_segments, enforcing C1 per
sequence: the on-disk footer's content_hash must match and its
OBJECT_SEGMENT flag (FORMAT.md §8 bit 5, §10) must agree with the
sequence the hash was drawn from. A manifest that names a ref-segment
hash in its object sequence (or vice versa) fails recovery with
InvalidRecord, matching the existing hash-mismatch behavior for
ref segments.

Store now carries an ObjectStore, adds objBuf/pendingObjSegs to the
Writer (SPEC.md §3.1), and pins the object manifest alongside the ref
manifest in Reader (§4.1).

- put_object (§3.8): hashes the bytes, routes sizes above
  opts.loose_object_threshold to ObjectStore::put_loose, otherwise
  buffers in objBuf after confirming the oid is not already
  resolvable via pendingObjSegs, the active object manifest, or a
  loose file. Idempotent per the spec.
- flush_object_segment (§3.9): assembles buffered objects into an
  object segment via ObjectSegmentWriter, publishes the segment
  bytes durably (RECOVERY.md §D1), installs it in SegmentCache
  (empty data_index for OBJECT_SEGMENT), and appends the segId to
  pendingObjSegs. wState is unchanged, per spec.
- publish (§3.4): extends manifest.object_segments with
  pendingObjSegs[w] as part of the same single-rename commit that
  already extended manifest.segments.
- get_object (§2.2, §4.4): scans the pinned (or active) object
  stack newest-first via read_object_from_segment, then falls back
  to ObjectStore::get_loose. NotFound propagates.
- begin_read / end_read pin both stacks through SegmentCache; the
  pin map is keyed by segId and already supports mixed kinds.
- reclaim_pending's active-set check includes object_segments so a
  future object compactor cannot reclaim a still-referenced object
  segment.
- SegmentCache::get skips collect_data_block_index when the footer
  has OBJECT_SEGMENT set (FORMAT.md §10 — no ref index).
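
The put_object routing described above can be modeled with a toy in-memory writer. Everything here is an assumption made for illustration: `ToyObjectWriter` is not the Store, `std::hash` stands in for BLAKE3, and the four sets stand in for objBuf, pendingObjSegs, the active object manifest, and loose files.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>

using ToyOid = std::size_t;  // stand-in for a 32-byte BLAKE3 oid

enum class PutRoute { Loose, Buffered, AlreadyPresent };

struct ToyObjectWriter {
    std::size_t loose_object_threshold = 0;
    std::set<ToyOid> obj_buf;           // buffered, not yet flushed
    std::set<ToyOid> pending_obj_segs;  // flushed, not yet published
    std::set<ToyOid> active_manifest;   // published
    std::set<ToyOid> loose_files;       // stored as loose objects

    PutRoute put_object(const std::string& bytes) {
        const ToyOid oid = std::hash<std::string>{}(bytes);
        // Sizes above the threshold take the loose branch.
        if (bytes.size() > loose_object_threshold) {
            loose_files.insert(oid);
            return PutRoute::Loose;
        }
        // Otherwise buffer only if the oid is not already resolvable.
        if (obj_buf.count(oid) || pending_obj_segs.count(oid) ||
            active_manifest.count(oid) || loose_files.count(oid)) {
            return PutRoute::AlreadyPresent;
        }
        obj_buf.insert(oid);
        return PutRoute::Buffered;
    }

    // Moves buffered oids into the pending set; rejects an empty flush.
    bool flush_object_segment() {
        if (obj_buf.empty()) return false;
        pending_obj_segs.insert(obj_buf.begin(), obj_buf.end());
        obj_buf.clear();
        return true;
    }
};
```

In this model a repeated put of the same bytes never grows the buffer, which is the idempotence property the spec reference (§3.8) asks for.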

Seven new cases in test_store.cpp. No new test files.

- put_object_packs_into_object_segment_and_publishes: round-trip
  through objBuf, FlushObjectSegment, PublishManifest, get_object.
  Confirms manifest.object_segments grows to length 1 and the
  ref manifest is committed by the same publish.
- put_object_above_threshold_goes_loose: opts.loose_object_threshold
  forces the §3.8 loose branch. Loose files are readable without
  publish (§2.2 loose fallback) and do not populate objBuf, so an
  abort leaves object_manifest_length == 0.
- put_object_is_idempotent_within_txn: repeated put of the same
  bytes collapses to one entry (SPEC.md §3.8); a second
  FlushObjectSegment with nothing staged is rejected.
- put_object_is_idempotent_across_segments: a later txn that puts
  bytes already in objManifest does not buffer them again.
- reader_pinned_object_manifest_is_stable: §4.1 pin by value. A
  newer publish does not appear in the older reader's om.
- object_segment_survives_reopen: the manifest/object-segment pair
  is durable; recover() rehydrates both.
- abort_leaves_flushed_object_segment_orphan_not_in_manifest: §3.5
  AbortTxn drops pendingObjSegs without extending the manifest;
  the segment is an on-disk orphan and not reachable via lookup.

New files only; no existing behavior touched. SPEC.md §1 places
object typing outside the content-addressed store, so this layer
sits on top of put_object/get_object (SPEC.md §3.8, §4.4) and
hashes its own framed bytes with BLAKE3 unchanged.

Framing (include/vcs/objects.hpp):
  [magic:4]['TREE'|'COMT'][version:1=1][payload][crc32c:4]
  tree payload  : varint(count) + sorted entries of
                  (varint mode, varint name_len, name, 32B oid)
  commit payload: 32B tree_oid + varint(parent_count) + parent_oids
                  + author sig + committer sig + varint(msg_len) + msg
  signature     : varint name, varint email, i64 LE ts_ns, i16 LE tz_min

Encoder enforces: entries sorted and unique by name, non-empty names
free of NUL/'/'/'.'/'..', modes in {0100644, 0100755, 0040000},
tz_offset_min in [-1439, 1439]. Decoder checks magic, version, CRC,
varint framing, sort order, name and mode validity, tz range, and
rejects trailing bytes.
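
The encoder-side entry checks listed above can be sketched as follows. The type and function names are hypothetical, and two points are assumptions: that "free of '.'/'..'" rejects names equal to `.` or `..` rather than any name containing a dot, and that sort order is byte-wise by name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical tree entry: just the fields the checks need.
struct ToyTreeEntry {
    std::uint32_t mode;
    std::string name;
};

// Non-empty, not "." or "..", and free of NUL and '/'.
inline bool toy_is_valid_name(const std::string& name) {
    if (name.empty() || name == "." || name == "..") return false;
    for (char c : name) {
        if (c == '\0' || c == '/') return false;
    }
    return true;
}

// Modes allowed by the encoder: regular, executable, tree.
inline bool toy_is_valid_mode(std::uint32_t mode) {
    return mode == 0100644 || mode == 0100755 || mode == 0040000;
}

// Strictly increasing names imply both "sorted" and "unique".
inline bool toy_entries_encodable(const std::vector<ToyTreeEntry>& entries) {
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (!toy_is_valid_name(entries[i].name)) return false;
        if (!toy_is_valid_mode(entries[i].mode)) return false;
        if (i > 0 && !(entries[i - 1].name < entries[i].name)) return false;
    }
    return true;
}
```

Checking for strictly increasing names in one pass covers the sort-order and duplicate checks together, so a decoder can apply the same test without a separate set lookup.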

test/test_objects.cpp: 11 cases covering tree sort/round-trip,
order invariance (permutations hash identically), empty tree,
invalid names/modes/duplicates, bad framing (truncated, bad magic,
bad version, bad CRC), root/merge commits, tz bounds, and
format-drift hash smoke check. 282/282 tests pass.

Phase 3 scaffold. The working-tree scanner (include/vcs/workdir.hpp,
coming next) needs directory discrimination to recurse. list_dir
alone is ambiguous for MemoryFs's flat namespace and doesn't expose
type info on PosixFs either.

MemoryFs::is_dir returns true iff any file key has path+'/' as a prefix
(matching how list_dir treats implicit directories in its flat store).
PosixFs::is_dir uses stat + S_ISDIR. Non-existent paths return false on both.

All three IFileSystem implementations updated. 282/282 tests pass.
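
The prefix rule for the in-memory file system can be sketched over an ordered map. This assumes the flat store is keyed by full path in a `std::map`, which the commit message does not state; `FlatFiles` and `toy_is_dir` are illustrative names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed shape of a flat in-memory store: full path -> file bytes.
using FlatFiles = std::map<std::string, std::string>;

// True iff some file key starts with path + '/'.
inline bool toy_is_dir(const FlatFiles& files, const std::string& path) {
    assert(!path.empty());
    const std::string prefix = path + "/";
    // In an ordered map every key with this prefix sorts at or after the
    // prefix itself, so the first candidate is enough to decide.
    const auto it = files.lower_bound(prefix);
    return it != files.end() &&
           it->first.compare(0, prefix.size(), prefix) == 0;
}
```

With keys `a/b/c.txt` and `top.txt`, both `a` and `a/b` are directories, while `top.txt`, `a/b/c.txt`, and any missing path are not.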

Phase 3 staging area. Three concepts sit above SPEC.md's byte-store
interface:

* Index: persistent (path, mode, oid) list at <store_dir>/index,
  framed by 'VCSINDX\0' + version + varint entries + CRC32C. Saved
  via the D1-style tmp+fsync+rename+fsync_dir pattern used by
  publish.cpp; load() treats a missing file as an empty index and
  surfaces decode failures as errors (callers can clear+rebuild).
* workdir::scan_into_index: recursive IFileSystem walk that hashes
  every regular file through Store::put_object (SPEC.md §3.8) and
  upserts an IndexEntry{TreeMode::Regular, ...}. Names validated via
  objects::is_valid_name per segment; optional ignore set for the
  store directory when it lives inside the work tree.
* workdir::build_tree_from_index / put_commit / stage_commit_from_index:
  compose Tree objects by grouping entries on each '/' boundary,
  put_object each synthesized subtree, encode a Commit, and stage
  the ref CAS via Store::stage_write (SPEC.md §3.2). Path collisions
  (file and subtree at the same name) fall out as encode_tree's
  duplicate-name rejection.
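
Grouping index paths on each '/' boundary can be sketched as below. This is a simplified model with hypothetical names: it builds only the directory shape, stores no oids or modes, and detects a file/subtree collision eagerly, whereas the commit describes the collision surfacing later through encode_tree's duplicate-name rejection.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node: a file leaf or a directory of named children.
struct ToyNode {
    bool is_file = false;
    std::map<std::string, std::unique_ptr<ToyNode>> children;
};

inline std::unique_ptr<ToyNode> group_paths(const std::vector<std::string>& paths) {
    if (paths.empty()) throw std::invalid_argument("empty index");
    std::unique_ptr<ToyNode> root(new ToyNode);
    for (const std::string& p : paths) {
        assert(!p.empty());  // index paths are validated as non-empty
        ToyNode* cur = root.get();
        std::size_t start = 0;
        for (;;) {
            const std::size_t slash = p.find('/', start);
            const bool last = (slash == std::string::npos);
            const std::string seg =
                p.substr(start, last ? std::string::npos : slash - start);
            auto it = cur->children.find(seg);
            if (it == cur->children.end()) {
                std::unique_ptr<ToyNode> node(new ToyNode);
                node->is_file = last;
                it = cur->children.emplace(seg, std::move(node)).first;
            } else if (it->second->is_file || last) {
                // A file and a subtree (or two files) claim the same name.
                throw std::invalid_argument("path collision at '" + seg + "'");
            }
            if (last) break;
            cur = it->second.get();
            start = slash + 1;
        }
    }
    return root;
}
```

Each directory node in the result corresponds to one synthesized subtree object in the real pipeline.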

MemoryFs::list_dir now returns immediate children including
implicit subdirectory names, matching POSIX readdir. The previous
'flat namespace' filter was adequate for flat manifest directories
but blocked the recursive scanner. Updated the matching test case
in test/test_memory_fs.cpp.

Extends test/test_store.cpp with coverage for the staging layer:

* index save/load round-trip; save overwrites; missing file is empty
* add() rejects TreeMode::Tree and invalid paths (empty, leading
  slash, '..' segment); remove() is idempotent
* scan_into_index walks a nested work tree, skips the store dir via
  the ignore set, and populates sorted index entries
* stage_commit_from_index produces a commit whose decoded tree
  references the blobs we staged, and whose ref is updated via
  stage_write + publish
* build_tree_from_index rejects an empty index (InvalidArgument)
  and a path collision (file at 'a' and subtree at 'a/x' surfaces
  as encode_tree duplicate-name rejection)

290/290 tests pass.

Exercises the full Phase 3 pipeline against a MemoryFs-backed store:

  1. Scan initial work tree -> Index -> stage_commit_from_index ->
     flush_object_segment + flush_segment + publish. Save Index.
  2. Mutate the work tree (edit README, add src/util/u.hpp), load
     the Index back, re-scan, stage a second commit with parent=c1
     under a CAS pre-image of Direct{c1} (SPEC.md §3.4), publish.

After the second publish:

  * store.version() == 2 and HEAD resolves to c2.
  * decode_commit(c2).parents == [c1] and message == 'v2'.
  * Walking c2's tree reaches the post-mutation blobs via
    Store::get_object, and the README blob's bytes match the v2
    contents, confirming the content-addressed round trip through
    put_object/get_object (SPEC.md §3.8, §4.4).

291/291 tests pass.

A thin C++ driver over the Store and workdir layers. Builds as
build/bin/vcs via a new 'cli' Makefile target.

- 'vcs init [dir]' creates .vcs/ and opens the Store once to
  initialize the empty manifest (SPEC.md §3.2).
- 'vcs commit -m <msg>' scans the cwd, puts blobs and a tree via
  put_object (SPEC.md §3.8), encodes a commit with HEAD's current
  direct oid as the sole parent, stages a HEAD update, then
  flush_object_segment / flush_segment / publish (SPEC.md §3.4,
  §3.9). The Index is saved after publish as a cached view of the
  last committed tree.
- 'vcs log' walks commit parents from HEAD using get_object.

Scan + publish is collapsed into 'commit' because SPEC.md §3.4
requires every publish to bundle a ref segment; a separate 'add'
that only produces object segments cannot independently commit a
manifest.
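
The parent walk behind 'vcs log' can be sketched over a toy object map. The names are hypothetical, string keys stand in for oids, and following only the first parent is an assumption that matches the single-parent commits this CLI creates.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical decoded commit: parent ids plus a message.
struct ToyCommit {
    std::vector<std::string> parents;
    std::string message;
};
using ToyObjects = std::map<std::string, ToyCommit>;

// Collects messages from `head` back to the root, newest first.
inline std::vector<std::string> toy_log(const ToyObjects& objects,
                                        const std::string& head) {
    std::vector<std::string> out;
    std::string cur = head;
    while (!cur.empty()) {
        const auto it = objects.find(cur);
        if (it == objects.end()) break;  // lookup failed: stop the walk
        out.push_back(it->second.message);
        cur = it->second.parents.empty() ? std::string()
                                         : it->second.parents.front();
    }
    return out;
}
```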

RECOVERY.md §C1 specifies startup verification as O(#ref segments
+ #object segments) footer reads, explicitly 'not full file reads'.
Body corruption is caught at access time by per-block CRCs
(FORMAT.md §2).

verify_segment was slurping the whole file and rehashing the body
via verify_content_hash, which for a 98 MiB object segment costs
~1 s on every Store::open and is paid by every CLI invocation.

Read only the last kFooterSize bytes and decode. Compare
footer.content_hash to the manifest's expected value; check the
OBJECT_SEGMENT flag matches the sequence that names the segment
(FORMAT.md §10).

test/test_publish.cpp:rejects_hash_mismatch still passes: the
footer records the file's actual content_hash, which differs from
what the manifest names, so verify_segment still returns
InvalidRecord. No behavioral change for any of the six recover
cases; only the cost changes.
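
The footer-only check can be sketched with standard file streams. This is an illustration under stated assumptions: `read_footer_bytes`, `ToyFooter`, and `footer_matches` are hypothetical names, the real code reads through VFile::read_at, and decoding the 256 bytes into fields is left out because the footer layout lives in FORMAT.md.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

constexpr std::size_t kFooterSize = 256;  // frozen footer size

// Reads only the last kFooterSize bytes of the file. Returns an empty
// vector if the file is missing or shorter than one footer.
inline std::vector<std::uint8_t> read_footer_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamoff size = in.tellg();
    if (size < static_cast<std::streamoff>(kFooterSize)) return {};
    in.seekg(size - static_cast<std::streamoff>(kFooterSize));
    std::vector<std::uint8_t> buf(kFooterSize);
    in.read(reinterpret_cast<char*>(buf.data()),
            static_cast<std::streamsize>(buf.size()));
    if (in.gcount() != static_cast<std::streamsize>(kFooterSize)) return {};
    assert(buf.size() == kFooterSize);
    return buf;
}

// Hypothetical decoded view holding just the two fields the check uses.
struct ToyFooter {
    std::array<std::uint8_t, 32> content_hash;
    bool object_segment;  // OBJECT_SEGMENT feature flag
};

// The hash must match the manifest, and the flag must agree with the
// sequence (ref or object) that named the segment.
inline bool footer_matches(const ToyFooter& footer,
                           const std::array<std::uint8_t, 32>& expected_hash,
                           bool named_in_object_sequence) {
    return footer.content_hash == expected_hash &&
           footer.object_segment == named_in_object_sequence;
}
```

The read cost is constant per segment regardless of segment size, which is where the startup improvement comes from.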

Large scans were silent for seconds at a time with no indication of
progress. scan_into_index now accepts an optional ScanProgress
callback invoked once per file (path, cumulative files, cumulative
bytes) right after put_object. The default (empty std::function) is
a no-op, so existing call sites and tests are unaffected.

cmd_commit installs a throttled printer on stderr: suppressed for
the first 250ms, then at most 10Hz. TTY output uses a single
rewritten line; non-TTY emits one line per tick. A summary line
closes the scan when progress was shown.

Scope: library API is additive; no SPEC.md, FORMAT.md, RECOVERY.md,
or COMPACTION.md semantics change. 291/291 tests still pass.
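
The throttling rule (quiet for the first 250ms, then at most 10Hz) can be sketched as a small class with an injectable time source. `ProgressThrottle` is a hypothetical name, not the printer in cmd_commit; passing the time point in explicitly is a choice made here so the rule can be checked without sleeping.

```cpp
#include <cassert>
#include <chrono>

class ProgressThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit ProgressThrottle(
        Clock::time_point start,
        Clock::duration quiet = std::chrono::milliseconds(250),
        Clock::duration min_gap = std::chrono::milliseconds(100))  // 10Hz
        : start_(start), quiet_(quiet), min_gap_(min_gap) {
        assert(min_gap_ > Clock::duration::zero());
    }

    // Returns true when a progress line should be printed at `now`.
    bool should_emit(Clock::time_point now) {
        if (now - start_ < quiet_) return false;               // short scans stay silent
        if (shown_ && now - last_ < min_gap_) return false;    // rate limit
        last_ = now;
        shown_ = true;
        return true;
    }

    // True if any line was printed, i.e. a closing summary is warranted.
    bool shown() const { return shown_; }

private:
    Clock::time_point start_;
    Clock::duration quiet_;
    Clock::duration min_gap_;
    Clock::time_point last_{};
    bool shown_ = false;
};
```

A per-file callback would call `should_emit(Clock::now())` and print only when it returns true; `shown()` then decides whether to print the summary line.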
Reference
Hzel/Quire!1