simse is built with performance as a first-class concern. Most workloads run well with default settings. This guide covers the concepts that matter when the store grows large, latency is critical, or memory is constrained.
The adaptive store supports two index modes for vector search:
Exact search is a brute-force nearest-neighbor search. It scans all vectors in the store and returns the exact top-k results.
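As a rough illustration of what a brute-force scan involves, here is a minimal sketch. It is not simse's implementation: the function names, the dictionary-based store, and the use of cosine similarity as the metric are all assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best.

    `vectors` is a hypothetical {id: vector} mapping. Cost grows linearly
    with the number of stored vectors, which is why this is exact but slow
    on large stores.
    """
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    best = heapq.nlargest(k, scored)
    return [(vec_id, score) for score, vec_id in best]


store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
results = exact_top_k([1.0, 0.1], store, k=2)
print(results)  # ids in order of decreasing similarity
```

Because every vector is scored, the result is always the true top-k; the trade-off is that query time grows in proportion to store size.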
Use exact search when:
Characteristics:
Approximate search uses a Hierarchical Navigable Small World (HNSW) graph index to provide nearest-neighbor results with sub-linear query time.
Use approximate search when:
Characteristics:
The index backend is selected automatically based on store size. Small stores use exact search; larger stores switch to HNSW.
Quantization compresses vector embeddings to reduce memory usage. The adaptive store supports two quantization modes:
Scalar quantization maps each floating-point component to a single byte. A 768-dimension embedding goes from 3,072 bytes to 768 bytes (4x reduction).
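The size arithmetic above can be checked with a small sketch. This assumes a simple per-vector min/max linear mapping to one byte per component; the actual mapping simse uses is not specified here, and the function names are illustrative only.

```python
import struct


def scalar_quantize(vec):
    """Map each float to one byte using a per-vector min/max range (assumed scheme)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255.0 or 1.0  # avoid dividing by zero for constant vectors
    codes = bytes(min(255, max(0, round((x - lo) / scale))) for x in vec)
    return codes, lo, scale


def scalar_dequantize(codes, lo, scale):
    """Recover approximate floats from the byte codes."""
    return [lo + c * scale for c in codes]


# A synthetic 768-dimension embedding with values in [-1.0, 1.0).
embedding = [((i * 37) % 200 - 100) / 100.0 for i in range(768)]

float32_size = len(struct.pack(f"{len(embedding)}f", *embedding))
codes, lo, scale = scalar_quantize(embedding)
restored = scalar_dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(embedding, restored))

print(float32_size, len(codes))  # 3072 bytes as float32 vs 768 bytes quantized
print(max_error <= scale / 2 + 1e-9)  # rounding error is bounded by half a step
```

In this scheme each vector also carries its `lo` and `scale` values, which adds a few bytes of overhead per vector on top of the 768 bytes of codes.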
Characteristics:
Binary quantization maps each floating-point component to a single bit (sign bit). A 768-dimension embedding goes from 3,072 bytes to 96 bytes (32x reduction).
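A minimal sketch of sign-bit packing follows. The bit order, the treatment of zero as non-negative, and the use of Hamming distance to compare codes are assumptions for illustration, not a description of simse's internals.

```python
def binary_quantize(vec):
    """Pack one sign bit per component into bytes (bit set if component >= 0)."""
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x >= 0.0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def hamming_distance(a, b):
    """Count differing bits between two equal-length codes (assumed comparison)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# A synthetic 768-dimension embedding with values in [-1.0, 1.0).
embedding = [((i * 37) % 200 - 100) / 100.0 for i in range(768)]
code = binary_quantize(embedding)
print(len(code))  # 768 bits packed into 96 bytes

small = binary_quantize([0.5, -0.2, 0.1, -0.9, 0.3, 0.3, -0.1, 0.0])
other = binary_quantize([0.5, 0.2, 0.1, -0.9, 0.3, 0.3, -0.1, 0.0])
print(hamming_distance(small, other))  # the two inputs differ in one sign
```

Only the sign of each component survives, so magnitude information is lost entirely; this is the source of the lower recall compared with scalar quantization.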
Characteristics:
| Mode | Memory | Recall | Speed |
|---|---|---|---|
| None | 1x | 100% | Baseline |
| Scalar | 0.25x | ~99% | Faster |
| Binary | 0.03x | ~85–95% | Fastest |
For most workloads, scalar quantization is the right default: meaningful memory savings with negligible recall loss. Binary quantization is appropriate when the store has hundreds of thousands of entries and memory pressure is severe.
The maxResults setting in memory.json controls how many entries are retrieved per query (default: 10). To manage overall store growth, periodically remove outdated entries by asking simse to search for and withdraw them using library_withdraw.
The adaptive store loads from disk at startup. Load time scales with the number of entries. For large stores, expect a few seconds of startup time.
The store loads once and stays in memory for the session duration. There is no per-query disk I/O after startup.
The store uses a compressed binary format on disk. Data is compressed on write and decompressed on load. The stored format is not human-readable.
Long tool outputs can slow down the agentic loop, so output is capped. The default output limit is 50,000 characters. Output beyond the limit is truncated, and the truncated output includes a notice with the number of omitted characters.
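The truncation behavior can be sketched as follows. The 50,000-character default comes from the text above; the function name, the wording of the notice, and the choice to keep the beginning of the output are assumptions for illustration.

```python
DEFAULT_OUTPUT_LIMIT = 50_000  # default limit stated in this guide


def truncate_output(output, limit=DEFAULT_OUTPUT_LIMIT):
    """Return output unchanged if within the limit, else cut it and append a notice.

    The notice format here is hypothetical; only the omitted-character count
    is taken from the documented behavior.
    """
    if len(output) <= limit:
        return output
    omitted = len(output) - limit
    return f"{output[:limit]}\n[output truncated: {omitted} characters omitted]"


short = truncate_output("ok")
long_result = truncate_output("x" * 50_123)
print(short)
print(long_result[-45:])
```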
The agentic loop detects repeated identical tool calls. After 3 consecutive identical calls (same tool name and same arguments), it injects a warning into the conversation to break the loop. This prevents the agent from getting stuck retrying the same failing operation.
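One way to picture this check is a counter keyed on the tool name and its serialized arguments. The sketch below is an assumption about how such a detector could work, not simse's code: the class name, the JSON-based comparison of arguments, and the decision to keep flagging after the third call are all hypothetical.

```python
import json

REPEAT_THRESHOLD = 3  # consecutive identical calls before a warning, per this guide


class LoopDetector:
    """Track consecutive identical tool calls (hypothetical helper)."""

    def __init__(self, threshold=REPEAT_THRESHOLD):
        self.threshold = threshold
        self._last_key = None
        self._count = 0

    def record(self, tool_name, arguments):
        """Record a call and return True when a warning should be injected."""
        # Sorting keys makes argument order irrelevant when comparing calls.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        if key == self._last_key:
            self._count += 1
        else:
            self._last_key = key
            self._count = 1
        return self._count >= self.threshold


detector = LoopDetector()
flags = [detector.record("read_file", {"path": "a.txt"}) for _ in range(3)]
after_change = detector.record("read_file", {"path": "b.txt"})
print(flags, after_change)
```

Any call with a different tool name or different arguments resets the counter, so only strictly consecutive repeats trigger the warning.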