ClawMem

Cloud embedding for ClawMem

By default, ClawMem embeds AI agent memory locally, either through llama-server or through the in-process node-llama-cpp fallback (Metal on Apple Silicon, Vulkan where available, CPU as a last resort). With GPU acceleration the fallback is fast; on CPU only it is significantly slower. Alternatively, you can use a cloud embedding provider so that no embedding model runs on your machine.

Supported providers

| Provider | URL | Model | Dimensions |
| --- | --- | --- | --- |
| Jina AI | https://api.jina.ai | jina-embeddings-v5-text-small | 1024 |
| OpenAI | https://api.openai.com | text-embedding-3-small | 1536 |
| Voyage AI | https://api.voyageai.com | voyage-4-large | 1024 |
| Cohere | https://api.cohere.com | embed-v4.0 | 1024 |

All providers use the OpenAI-compatible /v1/embeddings endpoint with Bearer token auth.
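
As a rough illustration of what such a request looks like on the wire, the sketch below builds (but does not send) a POST to the /v1/embeddings path with a Bearer token. The helper name build_embed_request is hypothetical and is not part of ClawMem; the body uses the standard OpenAI-style "model" and "input" fields, and the provider-specific extras are left out.

```python
import json
import urllib.request


def build_embed_request(base_url: str, api_key: str, model: str,
                        texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible embeddings endpoint.

    Hypothetical helper for illustration only; not ClawMem's own code.
    """
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_embed_request(
    "https://api.jina.ai",
    "jina_your-key-here",
    "jina-embeddings-v5-text-small",
    ["hello world"],
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req), then parse the JSON body.
```

Switching providers in this sketch only changes the base URL, key, and model name, which mirrors the three CLAWMEM_EMBED_* variables described under Configuration.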

Configuration

Copy .env.example to .env and set your provider credentials:

cp .env.example .env
# Edit .env:
CLAWMEM_EMBED_URL=https://api.jina.ai
CLAWMEM_EMBED_API_KEY=jina_your-key-here
CLAWMEM_EMBED_MODEL=jina-embeddings-v5-text-small

Or export them in your shell before running clawmem.

Precedence: shell environment > .env file > bin/clawmem wrapper defaults. The wrapper sources .env from the project root before applying its defaults, so .env values override defaults but explicit shell exports still win.
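
The precedence rule can be pictured as three layers merged from weakest to strongest. The following is a minimal sketch of that ordering only; the real logic lives in the bin/clawmem shell wrapper, and the function name and the default URL used here are hypothetical placeholders.

```python
def resolve_config(shell_env: dict, dotenv: dict, defaults: dict) -> dict:
    """Merge settings so that shell exports beat .env, which beats defaults."""
    merged = dict(defaults)   # weakest layer: wrapper defaults
    merged.update(dotenv)     # .env values override the defaults
    merged.update(shell_env)  # explicit shell exports win over everything
    return merged


# Hypothetical values, chosen only to show which layer wins.
defaults = {"CLAWMEM_EMBED_URL": "http://localhost:8088"}
dotenv = {
    "CLAWMEM_EMBED_URL": "https://api.jina.ai",
    "CLAWMEM_EMBED_MODEL": "jina-embeddings-v5-text-small",
}
shell_env = {"CLAWMEM_EMBED_URL": "https://api.openai.com"}

config = resolve_config(shell_env, dotenv, defaults)
print(config["CLAWMEM_EMBED_URL"])    # the shell export
print(config["CLAWMEM_EMBED_MODEL"])  # only set in .env, so .env provides it
```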

Cloud mode features

When CLAWMEM_EMBED_API_KEY is set, cloud mode activates:

Provider-specific parameters

ClawMem auto-detects your provider from the URL and sends the right params:

| Provider | Document embedding | Query embedding | Truncation | Extra |
| --- | --- | --- | --- | --- |
| Jina AI | task: "retrieval.passage" | task: "retrieval.query" | truncate: true | |
| Voyage AI | input_type: "document" | input_type: "query" | Default (true) | |
| Cohere | input_type: "search_document" | input_type: "search_query" | truncate: "END" | |
| OpenAI | Symmetric (none) | Symmetric (none) | None (8192 max) | dimensions via CLAWMEM_EMBED_DIMENSIONS |

No per-provider configuration is needed: set CLAWMEM_EMBED_URL to the provider and the correct parameters are applied. For OpenAI’s text-embedding-3-* models, you can optionally set CLAWMEM_EMBED_DIMENSIONS to reduce the output dimensions (e.g. 512 or 1024).

Note on OpenAI truncation: OpenAI does not auto-truncate — inputs exceeding 8192 tokens return an error. ClawMem’s default chunk size is ~800 tokens, well under this limit, so this is not an issue in practice.
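
To make the provider table concrete, here is a minimal sketch of URL-based parameter selection. It is not ClawMem's actual implementation: the function name provider_params and the hostname-suffix matching are assumptions, and only the parameter names and values come from the table above.

```python
from typing import Optional
from urllib.parse import urlparse


def provider_params(embed_url: str, is_query: bool,
                    dimensions: Optional[int] = None) -> dict:
    """Return the extra request fields for a provider, keyed off the URL host.

    Hypothetical sketch; the suffix matching below is an assumption.
    """
    host = urlparse(embed_url).hostname or ""
    if host.endswith("jina.ai"):
        task = "retrieval.query" if is_query else "retrieval.passage"
        return {"task": task, "truncate": True}
    if host.endswith("voyageai.com"):
        # Truncation is on by default for Voyage, so nothing extra is sent.
        return {"input_type": "query" if is_query else "document"}
    if host.endswith("cohere.com"):
        input_type = "search_query" if is_query else "search_document"
        return {"input_type": input_type, "truncate": "END"}
    if host.endswith("openai.com"):
        # Symmetric model: no query/document distinction, optional dimensions.
        return {"dimensions": dimensions} if dimensions else {}
    return {}  # unknown host: send a plain OpenAI-style request


print(provider_params("https://api.jina.ai", is_query=True))
print(provider_params("https://api.openai.com", is_query=False, dimensions=512))
```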

Rate limiting

Cloud providers enforce rate limits on both requests-per-minute (RPM) and tokens-per-minute (TPM). TPM is typically the binding constraint for embedding.

Set CLAWMEM_EMBED_TPM_LIMIT to match your provider tier:

| Tier | RPM | TPM | CLAWMEM_EMBED_TPM_LIMIT |
| --- | --- | --- | --- |
| Jina Free | 100 | 100,000 | 100000 (default) |
| Jina Paid | 500 | 2,000,000 | 2000000 |
| Jina Premium | 5,000 | 50,000,000 | 50000000 |

# Example: paid tier
export CLAWMEM_EMBED_TPM_LIMIT=2000000

The adaptive pacer computes delay as (batchTokens / (TPM_LIMIT × 0.85)) × 60s, using actual token counts from the API response when available, falling back to character-based estimation. The 0.85 safety factor leaves headroom for retries.
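
The pacing formula above can be worked through numerically. The sketch below is an illustration of the arithmetic rather than ClawMem's code: the function name is hypothetical, and the 4-characters-per-token ratio used for the fallback estimate is an assumption, since the actual estimation ratio is not stated here.

```python
SAFETY_FACTOR = 0.85   # from the formula: leaves headroom for retries
CHARS_PER_TOKEN = 4    # ASSUMPTION: rough ratio for the character-based fallback


def pacer_delay_seconds(tpm_limit, batch_tokens=None, batch_chars=0):
    """Delay before the next batch: (batchTokens / (TPM_LIMIT * 0.85)) * 60s."""
    if batch_tokens is None:
        # No token count reported by the API: estimate from characters instead.
        batch_tokens = batch_chars / CHARS_PER_TOKEN
    return batch_tokens / (tpm_limit * SAFETY_FACTOR) * 60.0


# An 8,000-token batch on the free tier (100,000 TPM) waits about 5.65 s;
# the same batch on the paid tier (2,000,000 TPM) waits about 0.28 s.
free_delay = pacer_delay_seconds(100_000, batch_tokens=8_000)
paid_delay = pacer_delay_seconds(2_000_000, batch_tokens=8_000)
print(f"{free_delay:.2f}s {paid_delay:.2f}s")
```

Because the delay scales inversely with the limit, leaving CLAWMEM_EMBED_TPM_LIMIT at the free-tier default on a paid plan makes embedding roughly 20 times slower than it needs to be.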

Truncation behavior

To override the local truncation limit:

export CLAWMEM_EMBED_MAX_CHARS=4000  # for models with smaller context

Localhost warning

If you set CLAWMEM_EMBED_API_KEY but your CLAWMEM_EMBED_URL points to localhost or 127.0.0.1, ClawMem prints a one-time warning. This catches accidental configurations where an API key is sent to a local server. If you’re using a local API gateway intentionally, the warning is safe to ignore.
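
A check of this kind might look like the following sketch. The class name, the warning text, and the exact host comparison are all hypothetical; only the behavior (warn once when a key is set and the URL host is localhost or 127.0.0.1) comes from the description above.

```python
from typing import Optional
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1"}


class LocalhostKeyCheck:
    """Emit a warning at most once when an API key would go to a local server."""

    def __init__(self) -> None:
        self._warned = False

    def check(self, embed_url: str, api_key: Optional[str]) -> Optional[str]:
        host = urlparse(embed_url).hostname
        if api_key and host in LOCAL_HOSTS and not self._warned:
            self._warned = True  # one-time warning: later calls stay quiet
            return (f"warning: CLAWMEM_EMBED_API_KEY is set but "
                    f"CLAWMEM_EMBED_URL points to {host}")
        return None


checker = LocalhostKeyCheck()
print(checker.check("http://localhost:8088", "jina_your-key-here"))  # warning text
print(checker.check("http://localhost:8088", "jina_your-key-here"))  # None
```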

Mixing local and cloud

The LLM (query expansion) and reranker always use local llama-server or in-process node-llama-cpp fallback. Only embedding supports cloud providers. This means:

Note: In-process fallback is silent — if a GPU server crashes, there is no warning. With Metal/Vulkan the fallback is fast; on CPU-only it is significantly slower. Set CLAWMEM_NO_LOCAL_MODELS=true to fail fast instead, or use systemd services to keep servers running.

Model recommendations

Default (QMD native combo, any GPU or in-process): EmbeddingGemma-300M-Q8_0 (314MB, 768d) + qwen3-reranker-0.6B (600MB) + qmd-query-expansion-1.7B (~1.1GB). All three auto-download via node-llama-cpp if no server is running (Metal on Apple Silicon, Vulkan where available, CPU as last resort). Fast with GPU acceleration; significantly slower on CPU-only.

llama-server -m embeddinggemma-300M-Q8_0.gguf \
  --embeddings --port 8088 --host 0.0.0.0 -ngl 99 -c 2048 --batch-size 2048

SOTA upgrade (12GB+ GPU): ZeroEntropy zembed-1 (2560 dimensions, 32K context, SOTA retrieval quality, ~4.4GB VRAM) paired with zerank-2 reranker (distillation-paired via zELO). CC-BY-NC-4.0 — non-commercial only.

llama-server -m zembed-1-Q4_K_M.gguf \
  --embeddings --port 8088 --host 0.0.0.0 -ngl 99 -c 8192 -b 2048 -ub 2048

llama-server -m zerank-2-Q4_K_M.gguf \
  --reranking --port 8090 --host 0.0.0.0 -ngl 99 -c 2048 -b 2048 -ub 2048

For cloud, Jina AI jina-embeddings-v5-text-small is recommended (1024 dimensions, 32K context, task-specific LoRA adapters for retrieval).

Switching embedding models

When changing to a model with different output dimensions (e.g. 768d → 2560d), a full re-embed is required:

clawmem embed --force

This clears all existing vectors and rebuilds with the new model’s dimensions. The vector table is automatically recreated with the correct dimension size on the first embedded fragment.

Important notes: