Performance tuning

simse is built with performance as a first-class concern. Most workloads run well with default settings. This guide covers the concepts that matter when the store grows large, latency is critical, or memory is constrained.

Index backend selection

The adaptive store supports two index modes for vector search:

Exact search

Exact search is a brute-force nearest-neighbor search. It scans all vectors in the store and returns the exact top-k results.

Use exact search when:

The store has fewer than approximately 1,000 entries
You need guaranteed exact results
Query latency is not critical, or memory is too tight for an additional index structure

Characteristics:

Query time scales linearly with store size: O(n)
Zero build time — no index construction needed
Always returns the exact nearest neighbors
Memory efficient — no additional index structure beyond the vectors themselves
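To make the O(n) scan concrete, here is a minimal sketch of brute-force top-k search. It assumes cosine similarity as the metric, which this guide does not specify, and the function names are hypothetical rather than part of simse.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Score every stored vector against the query, then keep the k best.

    Every vector is visited once, so the cost per query grows linearly
    with the number of stored vectors. No index is built beforehand.
    Returns (score, index) pairs, best match first.
    """
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)
```

Because nothing is precomputed, a newly written entry is searchable immediately, but every query pays for a full pass over the store.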

Approximate search (HNSW)

Approximate search uses a Hierarchical Navigable Small World (HNSW) graph index to provide nearest-neighbor results with sub-linear query time.

Use approximate search when:

The store has more than approximately 1,000 entries
Query latency matters more than exact results
You can accept a small recall tradeoff (typically 95–99% recall at default settings)

Characteristics:

Query time scales roughly as O(log n)
Build time increases with store size — index construction runs on writes
Recall is configurable through construction and search parameters
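Recall here means the fraction of the true nearest neighbors that the approximate index also returns. A minimal sketch of how recall at k could be measured, assuming you have both an exact and an approximate result list for the same query (the function name is hypothetical):

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k entries that also appear in the approximate top-k."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)
```

For example, if the approximate search returns three of the four true nearest neighbors, recall is 0.75. Averaging this over a sample of representative queries gives an estimate comparable to the 95–99% figure quoted above.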

The index backend is selected automatically based on store size. Small stores use exact search; larger stores switch to HNSW.

Quantization

Quantization compresses vector embeddings to reduce memory usage. The adaptive store supports two quantization modes:

Scalar quantization — 4x compression

Scalar quantization maps each floating-point component to a single byte. A 768-dimension embedding goes from 3,072 bytes to 768 bytes (4x reduction).

Characteristics:

4x memory reduction
Minimal recall impact: typically less than 1% degradation
Faster distance computations due to smaller working set
SIMD-accelerated on supported hardware
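The following sketch shows one common way to do scalar quantization: a per-vector min/max range mapped linearly onto 0–255. This scheme is an assumption for illustration; the guide does not state how simse chooses its quantization range, and the function names are hypothetical.

```python
def scalar_quantize(vector):
    """Map each float component to one byte using the vector's own min/max range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale when all components are equal
    codes = bytes(
        min(255, max(0, round((x - lo) / scale)))  # clamp against float rounding
        for x in vector
    )
    return codes, lo, scale


def scalar_dequantize(codes, lo, scale):
    """Recover approximate float components from the byte codes."""
    return [lo + c * scale for c in codes]
```

Each component is stored in one byte instead of four, which is where the 4x reduction comes from. The reconstruction error per component is at most half of `scale`, which helps explain why the recall impact is small.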

Binary quantization — 32x compression

Binary quantization maps each floating-point component to a single bit (sign bit). A 768-dimension embedding goes from 3,072 bytes to 96 bytes (32x reduction).

Characteristics:

32x memory reduction
Moderate recall impact: typically 5–15% degradation depending on the embedding model
Fastest distance computation
Best for very large stores where memory is the primary constraint
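A minimal sketch of sign-based binary quantization with Hamming distance as the comparison metric. The bit convention (1 for positive components, 0 otherwise) and the use of Hamming distance are assumptions for illustration; the function names are hypothetical.

```python
def binary_quantize(vector):
    """Pack one bit per component: 1 if the component is positive, else 0."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    n_bytes = (len(vector) + 7) // 8  # round up to whole bytes
    return bits.to_bytes(n_bytes, "big")


def hamming_distance(a, b):
    """Count the bit positions where two equal-length packed vectors differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

A 768-dimension embedding packs into 96 bytes, matching the figure above. Comparing two vectors reduces to an XOR and a bit count, which is why this mode has the fastest distance computation, at the cost of discarding every component's magnitude.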

Quantization tradeoffs

Mode	Memory	Recall	Speed
None	1x	100%	Baseline
Scalar	0.25x	~99%	Faster
Binary	0.03x	~85–95%	Fastest

For most workloads, scalar quantization is the right default: meaningful memory savings with negligible recall loss. Binary quantization is appropriate when the store has hundreds of thousands of entries and memory pressure is severe.
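As a worked example of the table, the sketch below estimates raw vector storage for each mode. It assumes 32-bit float components when unquantized (consistent with the 3,072-byte figure for 768 dimensions) and ignores entry metadata and any index overhead; the function name is hypothetical.

```python
def vector_storage_bytes(entries, dims, mode="none"):
    """Estimate bytes needed for the vectors alone under a quantization mode."""
    bits_per_component = {"none": 32, "scalar": 8, "binary": 1}[mode]
    return entries * dims * bits_per_component // 8


# 500,000 entries of 768 dimensions each:
for mode in ("none", "scalar", "binary"):
    print(mode, vector_storage_bytes(500_000, 768, mode))
```

Under these assumptions, 500,000 entries take about 1.5 GB unquantized, 384 MB with scalar quantization, and 48 MB with binary quantization.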

Memory management

Controlling store size

The maxResults setting in memory.json controls how many entries are retrieved per query (default: 10). It does not limit how large the store itself grows. To manage overall store growth, periodically remove outdated entries by asking simse to search for them and withdraw them using library_withdraw.

Store load time

The adaptive store loads from disk at startup. Load time scales with the number of entries. For large stores, expect a few seconds of startup time.

The store loads once and stays in memory for the session duration. There is no per-query disk I/O after startup.

Persistence format

The store uses a compressed binary format on disk. Compression runs on write and decompression runs on load. The stored format is not human-readable.

Tool output truncation

Long tool outputs can slow down the agentic loop, so output beyond a limit is truncated. The default output limit is 50,000 characters. Truncated output includes a notice with the number of omitted characters.
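The truncation behavior can be sketched as follows. The 50,000-character default comes from this guide; the function name and the exact wording of the notice are hypothetical.

```python
DEFAULT_OUTPUT_LIMIT = 50_000  # characters, per the default described above


def truncate_output(text, limit=DEFAULT_OUTPUT_LIMIT):
    """Return text unchanged if within the limit, else cut it and append a notice."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    # The notice format is illustrative, not simse's actual wording.
    return text[:limit] + f"\n[output truncated: {omitted} characters omitted]"
```

Reporting the omitted count lets the model judge how much it is missing and decide whether to re-run the tool with a narrower request.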

Doom loop prevention

The agentic loop detects repeated identical tool calls. After 3 consecutive identical calls (same tool name and same arguments), it injects a warning into the conversation to break the loop. This prevents the agent from getting stuck retrying the same failing operation.
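A minimal sketch of this kind of detection, assuming tool arguments are JSON-serializable so that two calls can be compared by tool name plus canonicalized arguments. The class and method names are hypothetical and do not reflect simse's internals.

```python
import json


class RepeatDetector:
    """Track consecutive identical tool calls and flag when a threshold is reached."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._last_key = None
        self._count = 0

    def record(self, tool_name, arguments):
        """Record one tool call; return True when a warning should be injected."""
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        if key == self._last_key:
            self._count += 1
        else:
            self._last_key = key
            self._count = 1
        return self._count >= self.threshold
```

In this sketch any different call resets the counter, so only an unbroken run of identical calls triggers the warning. It keeps returning True for every further identical call after the third; whether simse warns once or repeatedly is not stated in this guide.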