
Building SGREP - Part Two

After shipping the first version of sgrep with ColBERT late interaction, I thought the hard work was done. The search accuracy was good - MRR of 0.70 on my test queries, significantly better than plain semantic search. But when I started using it on larger codebases, problems emerged.

The Storage Problem

ColBERT stores segment embeddings for each chunk - that’s what enables the token-level matching that gives it an edge over document-level embeddings. But float32 storage adds up fast: 768 dimensions times 4 bytes equals 3072 bytes per segment. For the dspy-go codebase with ~27,000 segments, that meant 105MB just for ColBERT data.

This wasn’t sustainable. On a monorepo with 100k files, I’d be looking at gigabytes of index storage just for one project.

Quantization to the Rescue

The fix was Int8 quantization. The idea is simple: instead of storing full float32 values, compress them to 8-bit integers using min-max scaling. Each value gets mapped from its original range to [-128, 127]:

scale = (max - min) / 255
quantized = round((value - min) / scale) - 128

We store two extra numbers per segment - the scale and min - so we can convert back to floats at query time. The storage math works out to:

  • Before: 3072 bytes per segment (768 * 4)
  • After: 776 bytes per segment (768 * 1 + 2 * 4 bytes for scale/min)

That’s a 4x reduction, bringing 105MB down to 27MB.
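
For concreteness, here's a minimal Go sketch of the scheme. The function names and the choice of float32 for the scale and min are illustrative, not sgrep's exact API:

import "math"

// quantizeInt8 compresses a float32 embedding to int8 with min-max scaling,
// returning the scale and min needed to reconstruct values at query time.
func quantizeInt8(values []float32) (q []int8, scale, min float32) {
    min, max := values[0], values[0]
    for _, v := range values {
        if v < min {
            min = v
        }
        if v > max {
            max = v
        }
    }
    scale = (max - min) / 255 // 256 representable levels between min and max
    if scale == 0 {
        scale = 1 // constant vector; avoid division by zero
    }
    q = make([]int8, len(values))
    for i, v := range values {
        q[i] = int8(math.Round(float64((v-min)/scale)) - 128)
    }
    return q, scale, min
}

// dequantizeInt8 maps the int8 values back to approximate float32s.
func dequantizeInt8(q []int8, scale, min float32) []float32 {
    out := make([]float32, len(q))
    for i, v := range q {
        out[i] = (float32(v)+128)*scale + min
    }
    return out
}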

The Surprise: Speed

Here’s what I didn’t expect: search got faster too. A lot faster.

Metric            Before    After
Storage           105 MB    27 MB
ColBERT scoring   136ms     14ms
Total search      167ms     50ms

ColBERT scoring went from 136ms to 14ms - that’s 10x faster. Total search time dropped from 167ms to 50ms.

Why? Dequantization should add overhead, not remove it. We’re doing extra math (multiplying by scale, adding min) for every single value.

The answer is cache efficiency. Modern CPUs spend most of their time waiting for data from memory. Int8 data is 4x smaller, which means 4x more data fits in the CPU cache. The dequantization math is basically free compared to the memory bandwidth savings.
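
To make "extra math per value" concrete, here's a rough sketch of a dot product that dequantizes on the fly. The names are illustrative, and the real hot loop is also unrolled, as described below:

// dotInt8 computes the dot product of a float32 query vector against an
// int8-quantized document vector, dequantizing each element as it goes.
func dotInt8(query []float32, doc []int8, scale, min float32) float32 {
    var sum float32
    for i := range query {
        d := (float32(doc[i])+128)*scale + min // two extra ops per element
        sum += query[i] * d
    }
    return sum
}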

This was a good reminder: when optimizing, measure first. My intuition about the dequantization cost was wrong.

Loop Unrolling

One more optimization: 8-way loop unrolling for the dot product. Instead of one accumulator:

var sum float64
for i := range query {
    sum += query[i] * doc[i]
}

Use eight independent accumulators:

n := len(query)
var s0, s1, s2, s3, s4, s5, s6, s7 float64
i := 0
for ; i <= n-8; i += 8 {
    s0 += query[i] * doc[i]
    s1 += query[i+1] * doc[i+1]
    // ... s2 through s7 continue the pattern up to query[i+7] * doc[i+7]
}
for ; i < n; i++ { // tail: covers any leftover elements
    s0 += query[i] * doc[i]
}
return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7

This breaks data dependencies between loop iterations, letting the CPU pipeline more instructions in parallel. For 768-dimension vectors, that’s 96 unrolled iterations instead of 768 dependent ones.

Did It Actually Help?

Numbers are easy to fudge. I wanted real validation, so I built two benchmark suites:

Comparative benchmark: Tested sgrep against mgrep (Mixedbread’s cloud semantic search) and osgrep on 20 queries with hand-labeled ground truth.

Tool                     MRR     Tokens   Cost
sgrep (hybrid+colbert)   0.698   2757     $0.17
mgrep (cloud)            0.262   4646     $0.28
osgrep                   0.050   332      $0.02

sgrep’s MRR is 2.7x better than mgrep. Cloud tools optimize for general text - code is a different beast.

CoSQA benchmark: Standard academic code search dataset with 500 queries:

Metric      Score
NDCG@10     0.622
MRR         0.571
Recall@10   0.782

Recall@10 of 0.782 means the right answer is in the top 10 results 78% of the time. Not state-of-the-art, but good enough for code exploration without needing cloud APIs or fine-tuned models.
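
For reference, here's roughly how those two numbers are computed from ranked results and labeled ground truth. This is a generic sketch, not the benchmark harness's actual code; Recall@k here is the hit-rate variant described above:

// mrr averages the reciprocal rank of the first relevant hit per query
// (contributing 0 when nothing relevant is returned).
func mrr(ranked [][]string, relevant []map[string]bool) float64 {
    var total float64
    for qi, results := range ranked {
        for i, id := range results {
            if relevant[qi][id] {
                total += 1.0 / float64(i+1)
                break
            }
        }
    }
    return total / float64(len(ranked))
}

// recallAtK is the fraction of queries whose top-k results contain at least
// one relevant item.
func recallAtK(ranked [][]string, relevant []map[string]bool, k int) float64 {
    hits := 0
    for qi, results := range ranked {
        if len(results) > k {
            results = results[:k]
        }
        for _, id := range results {
            if relevant[qi][id] {
                hits++
                break
            }
        }
    }
    return float64(hits) / float64(len(ranked))
}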

Memory-Mapped Storage

SQLite was still the bottleneck. Even with Int8 quantization, every search required parsing vector data from the database. The CPU was waiting on I/O instead of computing.

The solution: memory-mapped files. Instead of reading vectors through SQLite, map the file directly into the process’s address space. The OS handles paging - frequently accessed vectors stay in RAM, rarely used ones get swapped out automatically.

I built two MMap stores:

  • MMapSegmentStore: For ColBERT segment embeddings (int8 quantized)
  • MMapVectorStore: For chunk embeddings (float32)

The file format is simple:

Header (32 bytes): magic, version, dims, count, data offset
Index: chunk IDs with their offsets
Data: contiguous embedding arrays

The key insight is zero-copy access. With SQLite, getting a vector meant: parse SQL → deserialize blob → copy to application memory. With mmap, it’s just pointer arithmetic - the vector is already in addressable memory.
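
Here's a minimal sketch of what that can look like in Go using the Unix syscall.Mmap; the header offsets and helper names are placeholders rather than sgrep's actual layout:

import (
    "os"
    "syscall"
    "unsafe"
)

// mmapFile maps a file read-only into the process's address space.
// Unix-only sketch; Windows needs a different call.
func mmapFile(path string) ([]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close() // the mapping stays valid after the descriptor is closed
    info, err := f.Stat()
    if err != nil {
        return nil, err
    }
    return syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
        syscall.PROT_READ, syscall.MAP_SHARED)
}

// segmentAt returns one int8 segment as a slice over the mapped bytes.
// dataOff and dims would be read once from the 32-byte header; after that
// every lookup is offset arithmetic plus a reslice - no SQL, no blob copy.
func segmentAt(data []byte, dataOff, dims, i int) []int8 {
    start := dataOff + i*dims
    raw := data[start : start+dims]
    return unsafe.Slice((*int8)(unsafe.Pointer(&raw[0])), dims)
}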

The Results

Storage comparison on real indexes:

Chunks   SQLite     MMap      Ratio
100      0.70 MB    0.37 MB   1.89x
500      2.76 MB    1.86 MB   1.48x
1000     10.47 MB   3.73 MB   2.81x
2000     20.70 MB   7.45 MB   2.78x

At scale (1000+ chunks), MMap gives ~2.8x storage savings. Combined with Int8 quantization, that’s 11x smaller than the original float32 SQLite approach.

But the real win is access latency. Reading a ColBERT segment from SQLite took ~50μs. From mmap, it’s ~200ns - a 250x speedup. For MaxSim scoring across hundreds of segments, this adds up.

Making It Default

With the performance wins clear, I made MMap the default path:

  1. Indexing: Now exports to mmap files automatically after SQLite indexing
  2. Search: Auto-detects mmap files and uses them when available
  3. ColBERT pre-indexing: Enabled by default (--colbert-preindex=true)

The workflow is now: sgrep index . creates both SQLite (for metadata/hybrid search) and mmap files (for fast vector access). Search transparently uses whichever is faster.
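
The auto-detection itself can be as simple as a stat call. A hypothetical sketch - the file names here are placeholders, not sgrep's actual on-disk names:

import (
    "os"
    "path/filepath"
)

// vectorSource picks the store the search path should read vectors from:
// the mmap export when it exists, otherwise the SQLite database.
func vectorSource(indexDir string) string {
    mmapPath := filepath.Join(indexDir, "segments.mmap")
    if _, err := os.Stat(mmapPath); err == nil {
        return mmapPath // fast path: zero-copy mmap access
    }
    return filepath.Join(indexDir, "index.db") // fallback: SQLite
}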

Why Not Drop SQLite Entirely?

SQLite still serves important purposes:

  • FTS5 for BM25: Full-text search with proper tokenization
  • Metadata queries: Finding chunks by file path, line numbers
  • Transactions: Safe concurrent access during incremental updates

MMap handles the hot path (vector similarity), SQLite handles everything else. Each tool for what it’s good at.

What’s Next

Three things on the list:

  • Code-specific embedding models instead of general-purpose nomic-embed-text
  • Incremental indexing so we don’t re-index everything on each file change
  • Multi-repo search with a shared index

Code is at github.com/XiaoConstantine/sgrep.
