Building SGREP
Recently, the mixedbread team published mgrep, a great tool that addresses a real problem: LLM harnesses such as Claude Code, Codex, and Amp spend unnecessary time retrieving useless tokens when they search a codebase. Here’s what mgrep claims:
Why mgrep?
- Natural-language search that feels as immediate as grep.
- Semantic, multilingual & multimodal (audio, video support coming soon!)
- Smooth background indexing via mgrep watch, designed to detect and keep up-to-date everything that matters inside any git repository.
- Friendly device-login flow and first-class coding agent integrations.
- Built for agents and humans alike, and designed to be a helpful tool, not a restrictive harness: quiet output, thoughtful defaults, and escape hatches everywhere.
- Reduces the token usage of your agent by 2x while maintaining superior performance
This is quite interesting to me: I’m building a local coding/review agent of my own (https://github.com/XiaoConstantine/maestro), I’ve been in this domain for most of my career, and I even have Amp build crew credits from my experience with Amp! So why not vibe-build a local version myself? Not as versatile as mgrep, but just as good for my use case: code search.
Getting started
- So what’s the scope?
For my use case, it’s focused on code retrieval within a given repo. For example, when a user asks maestro how to do prompt optimization in the dspy-go repo, maestro should be able to use the tool to find the relevant code and examples.
- How should it be implemented?
With as few external dependencies as possible, local-first, and able to return accurate, reasonably fast, context-efficient results.
For the inference server, I decided to use llama.cpp.
Starting with embeddings
Fewer dependencies means no external embedding store: SQLite is the storage layer powering vector similarity search (and, later, keyword search in hybrid mode).
The next question is the embedding model. A general text embedding model is a good start: nomic-embed-text-v1.5.Q8_0.gguf (~130MB, 768 dimensions). It’s small enough to run locally with reasonable latency.
The initial version was straightforward: chunk code files, embed chunks, store in sqlite-vec, and do nearest neighbor search. It worked, but there were issues to solve.
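For reference, here is a minimal sketch of what that first pass looked like, assuming the documented sqlite-vec `vec0` virtual table and a plain line-count chunker. Names and the exact SQL are illustrative, not sgrep’s actual code:

```go
// Sketch of the first pipeline: naive fixed-line chunks, one 768-dim
// embedding per chunk stored in a sqlite-vec vec0 virtual table, and a KNN
// query at search time. Assumes the sqlite-vec extension is loaded and an
// embedding helper exists elsewhere; the SQL follows the sqlite-vec docs.
package index

import (
	"database/sql"
	"encoding/json"
	"strings"
)

const createVecTable = `CREATE VIRTUAL TABLE IF NOT EXISTS vec_chunks
    USING vec0(embedding float[768]);`

// chunkByLines is the naive splitter the first version used.
func chunkByLines(src string, linesPerChunk int) []string {
	lines := strings.Split(src, "\n")
	var chunks []string
	for i := 0; i < len(lines); i += linesPerChunk {
		end := i + linesPerChunk
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, strings.Join(lines[i:end], "\n"))
	}
	return chunks
}

// indexChunk stores one embedding; sqlite-vec accepts a JSON float array.
func indexChunk(db *sql.DB, rowid int64, vec []float32) error {
	js, _ := json.Marshal(vec)
	_, err := db.Exec(`INSERT INTO vec_chunks(rowid, embedding) VALUES (?, ?)`,
		rowid, string(js))
	return err
}

// search runs the documented sqlite-vec KNN pattern: MATCH plus a k constraint.
func search(db *sql.DB, queryVec []float32, k int) (*sql.Rows, error) {
	js, _ := json.Marshal(queryVec)
	return db.Query(`SELECT rowid, distance FROM vec_chunks
        WHERE embedding MATCH ? AND k = ? ORDER BY distance`, string(js), k)
}
```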
Performance Iteration #1: In-Memory Search
The first major performance win came from moving vector search to memory. The initial sqlite-vec queries were slow - after profiling, I realized loading vectors into memory and doing the search there gave an 88x speedup. The trade-off is memory usage, but for typical codebases (a few thousand chunks), it’s negligible.
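In-memory search here just means brute-force cosine similarity over every chunk vector, which is tiny work at this scale. A simplified sketch (not the actual sgrep code):

```go
// Brute-force in-memory nearest-neighbor search: load every chunk vector
// once, then score queries with cosine similarity. For a few thousand
// 768-dim vectors this is far cheaper than going through SQL per query.
package search

import "sort"

type Chunk struct {
	ID     int64
	Vector []float32 // assumed unit-normalized at index time
}

type Hit struct {
	ID    int64
	Score float32
}

// dot equals cosine similarity when both vectors are unit-normalized.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// TopK scans all chunks and returns the k best matches.
func TopK(chunks []Chunk, query []float32, k int) []Hit {
	hits := make([]Hit, 0, len(chunks))
	for _, c := range chunks {
		hits = append(hits, Hit{ID: c.ID, Score: dot(c.Vector, query)})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if len(hits) > k {
		hits = hits[:k]
	}
	return hits
}
```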
I also added a self-managed embedding server. Instead of spawning llama.cpp for each request, sgrep now starts it as a daemon with 16 parallel slots and continuous batching. This amortizes the model loading cost across queries.
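A rough sketch of the daemon launch, assuming the stock llama-server binary and its --embedding, --parallel, and --cont-batching flags (exact flag names vary across llama.cpp versions):

```go
// Start llama.cpp's HTTP server once as a long-lived embedding daemon,
// rather than spawning a process per request. Model path, port, and flag
// names are illustrative; check your llama.cpp build for the exact flags.
package embed

import "os/exec"

// StartEmbeddingDaemon launches llama-server in the background and returns
// the process handle so the caller can manage its lifetime.
func StartEmbeddingDaemon(modelPath string) (*exec.Cmd, error) {
	cmd := exec.Command("llama-server",
		"-m", modelPath,
		"--embedding", // serve the embedding endpoint
		"--parallel", "16", // 16 request slots
		"--cont-batching", // continuous batching across slots
		"--port", "8089",
	)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}
```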
AST-Aware Chunking
Naive chunking (splitting by line count) breaks code in awkward places - middle of functions, splitting imports from their usages. I integrated tree-sitter to parse Go, TypeScript, Python, and Rust, then chunk along AST boundaries:
- Functions/methods as complete units
- Type definitions kept together
- Doc comments stay with their declarations
This improved search quality significantly - when you search for “authentication middleware”, you get the complete AuthMiddleware function, not a fragment.
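To illustrate chunking along AST boundaries, here is a sketch using the go-tree-sitter bindings (github.com/smacker/go-tree-sitter). The library choice and node-type names are my assumptions, and the real sgrep chunker covers more node kinds and languages:

```go
// AST-aware chunking sketch: parse a Go file with tree-sitter and emit each
// top-level declaration (function, method, type) as a single chunk, so a hit
// returns a complete unit instead of a line-window fragment. A fuller version
// would also attach the preceding doc comment to its declaration.
package chunk

import (
	"context"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/golang"
)

func ChunkGoFile(src []byte) ([]string, error) {
	parser := sitter.NewParser()
	parser.SetLanguage(golang.GetLanguage())
	tree, err := parser.ParseCtx(context.Background(), nil, src)
	if err != nil {
		return nil, err
	}

	var chunks []string
	root := tree.RootNode()
	for i := 0; i < int(root.NamedChildCount()); i++ {
		n := root.NamedChild(i)
		switch n.Type() {
		case "function_declaration", "method_declaration", "type_declaration":
			// Content slices the original source for this node's byte range.
			chunks = append(chunks, n.Content(src))
		}
	}
	return chunks, nil
}
```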
Hybrid Search: Semantic + BM25
Pure semantic search has a weakness: it can miss exact keyword matches. If someone searches “parseAST”, they probably want files containing that exact symbol, not just semantically similar code about parsing.
The solution was hybrid search combining:
- Semantic (60%): Vector similarity for intent matching
- BM25 (40%): Lexical keyword matching via SQLite FTS5
FTS5 was already available in SQLite - no new dependencies. The hybrid mode catches cases where semantic search alone would rank results incorrectly.
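The fusion step itself is small: normalize both score lists to a common range, then take the weighted sum. A minimal sketch of the 60/40 blend (sgrep’s exact normalization may differ):

```go
// Weighted hybrid scoring: 0.6 * semantic + 0.4 * BM25. Both score maps are
// min-max normalized to [0,1] first so the weights mean something. Lexical
// scores are assumed to already be higher-is-better (SQLite FTS5's bm25()
// returns lower-is-better, so the caller flips the sign).
package hybrid

import "math"

func normalize(scores map[int64]float64) map[int64]float64 {
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, s := range scores {
		lo, hi = math.Min(lo, s), math.Max(hi, s)
	}
	out := make(map[int64]float64, len(scores))
	for id, s := range scores {
		if hi > lo {
			out[id] = (s - lo) / (hi - lo)
		} else {
			out[id] = 1 // all scores equal
		}
	}
	return out
}

// Fuse merges the two result sets; a chunk missing from one list simply
// contributes nothing for that component.
func Fuse(semantic, lexical map[int64]float64) map[int64]float64 {
	sem, lex := normalize(semantic), normalize(lexical)
	fused := make(map[int64]float64)
	for id, s := range sem {
		fused[id] += 0.6 * s
	}
	for id, s := range lex {
		fused[id] += 0.4 * s
	}
	return fused
}
```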
Storage Evolution: libSQL + DiskANN
sqlite-vec worked, but had a significant storage overhead: ~780KB per vector. For a codebase with 10,000 chunks, that’s 7.8GB just for vectors.
I switched to libSQL as the default backend, which uses DiskANN for approximate nearest neighbor search. The result: 93-177x more space efficient (~5-10KB per vector). DiskANN also scales better for larger codebases.
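For reference, here is roughly what the libSQL side can look like, based on libSQL’s documented native vector search; sgrep’s actual schema may differ:

```go
// libSQL vector schema sketch with a DiskANN-backed ANN index. F32_BLOB,
// libsql_vector_idx, vector32, and vector_top_k come from libSQL's native
// vector search; table and column names here are illustrative, not sgrep's.
package store

const schema = `
CREATE TABLE IF NOT EXISTS chunks (
    id        INTEGER PRIMARY KEY,
    path      TEXT,
    content   TEXT,
    embedding F32_BLOB(768)  -- 768-dim float32 vector
);
CREATE INDEX IF NOT EXISTS chunks_vec_idx
    ON chunks (libsql_vector_idx(embedding));  -- DiskANN index
`

// Approximate nearest-neighbor query: vector_top_k walks the DiskANN index
// instead of scanning every row, then joins back for the payload columns.
const knnQuery = `
SELECT c.id, c.path, c.content
FROM vector_top_k('chunks_vec_idx', vector32(?), ?) AS t
JOIN chunks c ON c.rowid = t.id
`
```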
Two-Stage Retrieval with Cross-Encoder
Vector search is fast but approximate. Cross-encoder reranking is slower but more accurate - it looks at the query and document together with full attention.
The two-stage pipeline:
- Fast retrieval: Get top 50 candidates via hybrid search (~50ms)
- Reranking: Score top candidates with cross-encoder (~300-700ms)
I integrated BGE-reranker-v2-m3 via llama.cpp. The cross-encoder takes query-document pairs and outputs relevance scores. RRF (Reciprocal Rank Fusion) then combines the initial retrieval ranking with the reranker’s ranking.
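The fusion step is only a few lines, since RRF just sums reciprocal ranks. A sketch, with the usual k = 60 constant as an assumption:

```go
// Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) to a
// document's score, and documents are re-ordered by the summed score.
// k = 60 is the conventional constant from the original RRF paper; whether
// sgrep uses the same value is an assumption.
package rerank

const rrfK = 60.0

// FuseRanks accepts any number of rankings (slices of chunk IDs, best first)
// and returns the fused score per chunk ID; higher is better.
func FuseRanks(rankings ...[]int64) map[int64]float64 {
	scores := make(map[int64]float64)
	for _, ranking := range rankings {
		for rank, id := range ranking {
			scores[id] += 1.0 / (rrfK + float64(rank+1))
		}
	}
	return scores
}
```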
ColBERT Late Interaction
Here’s where it gets interesting. Cross-encoders are trained on general text, not code. On my benchmarks, adding cross-encoder reranking actually hurt code search quality (MRR dropped from 0.70 to 0.60).
ColBERT offers a middle ground: late interaction. Instead of embedding the whole document as one vector, ColBERT:
- Decomposes queries into terms/bigrams
- Decomposes documents into code-aware segments
- Computes MaxSim between query terms and document segments
This gives token-level matching without the full cross-encoder cost. The best configuration for code search turned out to be --hybrid --colbert (MRR 0.70), outperforming the full cascade.
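The MaxSim computation itself is small: for each query term, take its best match among the document segments, then sum. A sketch over pre-computed embeddings (segmentation and embedding happen upstream):

```go
// ColBERT-style late interaction (MaxSim): score(query, doc) is the sum, over
// query-term vectors, of each term's best similarity to any document segment.
// Vectors are assumed unit-normalized so a dot product is cosine similarity.
package colbert

import "math"

func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// MaxSim gives token-level matching without running a full cross-encoder.
func MaxSim(queryTerms, docSegments [][]float32) float32 {
	var total float32
	for _, q := range queryTerms {
		best := float32(math.Inf(-1))
		for _, d := range docSegments {
			if s := dot(q, d); s > best {
				best = s
			}
		}
		if len(docSegments) > 0 {
			total += best
		}
	}
	return total
}
```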
Benchmark Results
I built a quality benchmark using the dspy-go corpus with 20 semantic queries and ground truth labels. Here’s how different modes compare:
| Mode | MRR | Latency |
|---|---|---|
| Semantic only | 0.61 | ~30ms |
| Hybrid | 0.62 | ~50ms |
| Hybrid + ColBERT | 0.70 | ~200ms |
| Full cascade (all 3) | 0.60 | ~500ms |
The takeaway: more stages isn’t always better. For code search, ColBERT’s token-level matching matters more than general-purpose cross-encoder reranking.
Smart Deduplication
One annoying issue: git worktrees. If you have multiple worktrees, the same file appears multiple times in search results (often 50%+ duplicates).
I added canonical path normalization to detect .worktrees/<branch>/... paths and deduplicate them. The result: each logical file appears once, with the best-scoring version kept.
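The normalization is a small path rewrite: strip the .worktrees/&lt;branch&gt;/ component so copies of the same file collapse onto one canonical key, then keep the best-scoring hit per key. A sketch, with the directory layout as an assumption:

```go
// Canonicalize worktree paths so the same logical file dedupes across git
// worktrees laid out as <repo>/.worktrees/<branch>/<path>. The layout and
// the Result struct here are illustrative, not sgrep's actual types.
package dedup

import "regexp"

var worktreeRe = regexp.MustCompile(`(^|/)\.worktrees/[^/]+/`)

// CanonicalPath maps ".worktrees/feature-x/pkg/foo.go" -> "pkg/foo.go".
func CanonicalPath(p string) string {
	return worktreeRe.ReplaceAllString(p, "$1")
}

type Result struct {
	Path  string
	Score float64
}

// Dedupe keeps the best-scoring result per canonical path.
func Dedupe(results []Result) []Result {
	best := make(map[string]Result)
	for _, r := range results {
		key := CanonicalPath(r.Path)
		if cur, ok := best[key]; !ok || r.Score > cur.Score {
			best[key] = r
		}
	}
	out := make([]Result, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	return out
}
```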
What’s Next
A few ideas I’m exploring:
- Better cross-encoder models trained on code (not general text)
- Incremental indexing for large monorepos
- IDE integrations beyond Claude Code
The full source is at github.com/XiaoConstantine/sgrep. It’s been a fun project combining information retrieval concepts with practical code search needs.