Alibaba's Qwen 3.5 9B is Jarvis's primary inference model. Selected for its strong structured output, tool-use formatting, and classification accuracy (100% on eval). Uses internal chain-of-thought ("thinking tokens") before generating visible output, which improves quality but adds latency.
| Metric | Value | Notes |
|---|---|---|
| Generation speed | 5.9–6.6 tok/sec | At hardware ceiling (~38.4 GB/s bandwidth) |
| Prompt processing | ~0.95–1.5s | For typical 50-200 token prompts |
| Cold load time | ~6.7s | From NVMe to RAM |
| Warm inference | <1s first token | When model is already loaded |
| Eval score | 0.658 (65.8%) | Highest across all models on 23-fixture suite |
| Classification | 100% | Perfect on eval classification tasks |
| Extraction | 78.8% | Strong structured data extraction |
| Reasoning | 20% | Low on eval (format mismatch with thinking tokens) |
Append `/nothink` to the prompt to skip chain-of-thought tokens. This reduces latency from 30s+ to under 10s for classification. Note: effectiveness varies; the model may still think internally.
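A minimal sketch of wiring the flag into a request, assuming Jarvis talks to the model through Ollama's `/api/generate` endpoint on localhost (the host, prompt wording, and `classifyFast` helper are illustrative, not Jarvis's actual code):

```typescript
// Sketch: classification call with thinking tokens suppressed.
async function classifyFast(message: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3.5:9b",
      // "/nothink" asks the model to skip chain-of-thought; it is a soft
      // switch, so some internal thinking may still occur.
      prompt: `/nothink Classify this message as URGENT, NORMAL, or SPAM:\n${message}`,
      stream: false,
    }),
  });
  const data = await res.json();
  return data.response.trim();
}
```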
qwen2.5-coder:7b is the fastest model in our lineup. Purpose-built for code tasks with 92+ language support and fill-in-the-middle (FIM) capability. No thinking-token overhead; responses are immediate and direct. Outperforms Codestral-22B and DeepSeek Coder 33B V1 on HumanEval (88.4%).
| Metric | Value | Notes |
|---|---|---|
| Generation speed | 8.9–12.4 tok/sec | Fastest model — 2x faster than qwen3.5 |
| Cold load time | ~10s | Smaller model, quick to load |
| Eval score | 0.543 (54.3%) | Lower overall but strong on code + classification |
| Classification | 80% | Good for fast triage when speed matters |
| Extraction | 71.3% | Decent structured output |
| HumanEval | 88.4% | Industry-leading for its size class |
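Since FIM is the feature that sets this model apart, here is a sketch of a fill-in-the-middle completion, assuming Ollama's `/api/generate` with its `suffix` parameter (host and helper name are illustrative):

```typescript
// Sketch: FIM completion with qwen2.5-coder. The model generates the code
// that belongs between `before` and `after`.
async function fillInMiddle(before: string, after: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5-coder:7b",
      prompt: before, // code before the cursor
      suffix: after,  // code after the cursor
      stream: false,
    }),
  });
  const data = await res.json();
  return data.response; // only the inserted middle section
}
```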
DeepSeek's R1 reasoning model distilled to 8B parameters. Uses chain-of-thought reasoning with explicit `<think>` blocks. Good balance of reasoning capability and speed. 92.8% on MATH-500 benchmark at full scale; this distill trades some accuracy for running on constrained hardware.
| Metric | Value | Notes |
|---|---|---|
| Generation speed | 6.3–7.7 tok/sec | Faster than qwen3.5 (smaller model) |
| Cold load time | ~5s | Quickest large model to load |
| Eval score | 0.400 (40%) | Lower overall — reasoning format hurts structured eval |
| Classification | 60% | Adequate but not its strength |
| Extraction | 50% | Thinking overhead for simple tasks |
The largest model in our local lineup. DeepSeek R1 distilled to 14B parameters — significantly more capable than the 8B version but pushes the limits of this hardware. Uses explicit chain-of-thought reasoning. Load on demand only; do not keep warm.
| Metric | Value | Notes |
|---|---|---|
| Generation speed | 4.5–4.6 tok/sec | Slowest model — near hardware limit for this size |
| Cold load time | ~15s | Largest model, longest load |
| Eval score | 0.585 (58.5%) | Second highest overall |
| Classification | 100% | Perfect on eval (ties with qwen3.5) |
| Extraction | 67.5% | Good, but slower than qwen3.5 |
Run it with `keep_alive: 0` so it unloads from RAM immediately after each request.
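For illustration, a sketch of an on-demand call under that policy, assuming the same Ollama `/api/generate` endpoint as above:

```typescript
// Sketch: on-demand call to the 14B model that unloads it afterwards.
// keep_alive: 0 tells Ollama to evict the model from RAM as soon as the
// response is generated, freeing ~12 GB for other models.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1:14b",
    prompt: "Walk through this problem step by step: ...",
    stream: false,
    keep_alive: 0, // unload right after this request completes
  }),
});
```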
Nomic AI's text embedding model. Converts text into 768-dimensional vectors for semantic search and similarity. Always loaded in memory — used by the Long-Term Memory (LTM) system for real-time embedding generation. Tiny footprint (578 MB loaded) makes it negligible alongside inference models.
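A sketch of how the LTM system might request a vector, assuming Ollama's `/api/embeddings` endpoint (the `embedForLTM` helper is illustrative):

```typescript
// Sketch: embed a memory snippet into a 768-dimensional vector
// for semantic search over long-term memory.
async function embedForLTM(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json(); // array of 768 numbers
  return embedding;
}
```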
| Model | Speed | Quality (eval) | RAM | Best For | Avoid For |
|---|---|---|---|---|---|
| qwen3.5:9b | 5.9–6.6 tok/s | 65.8% | 8.6 GB | General purpose, triage, extraction | Speed-critical, code |
| qwen2.5-coder:7b | 8.9–12.4 tok/s | 54.3% | ~6 GB | Code, fast classification | Reasoning, analysis |
| deepseek-r1:8b | 6.3–7.7 tok/s | 40% | ~7 GB | Quick reasoning, math | General tasks (niche) |
| deepseek-r1:14b | 4.5–4.6 tok/s | 58.5% | ~12 GB | Complex reasoning (offline) | Anything simple or timed |
| nomic-embed-text | n/a | n/a | 578 MB | LTM vectors, semantic search | Not an inference model |
| Task | Recommended Model | Why |
|---|---|---|
| Telegram message triage | qwen3.5:9b (or qwen2.5-coder for speed) | Best classification accuracy; coder is 2x faster if latency matters |
| Screener safety check | qwen3.5:9b | Highest structured output quality |
| Code generation / review | qwen2.5-coder:7b | Purpose-built, 88.4% HumanEval, fastest inference |
| JSON extraction from text | qwen3.5:9b | 78.8% extraction accuracy, best structured output |
| Summarization | qwen3.5:9b | Best balance of quality and coherence |
| Math / logic problem | deepseek-r1:14b | Most capable reasoning; 8b as faster fallback |
| Complex multi-step analysis | deepseek-r1:14b (or cloud) | Cloud models are better; use 14b only when cloud is unavailable |
| Embedding / vector search | nomic-embed-text | Only embedding model; always loaded |
| Emergency (offline, any task) | qwen3.5:9b | Highest overall score, widest capability |
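This routing is simple enough to express as a lookup. An illustrative sketch (the `TaskKind` names and `pickModel` helper are hypothetical, not Jarvis's actual code):

```typescript
// Sketch: map a task category to the recommended local model.
type TaskKind =
  | "triage" | "safety-check" | "code" | "extraction"
  | "summarize" | "math" | "analysis" | "embedding";

function pickModel(task: TaskKind, preferSpeed = false): string {
  switch (task) {
    case "triage":
      // coder is ~2x faster if latency matters; qwen3.5 classifies best
      return preferSpeed ? "qwen2.5-coder:7b" : "qwen3.5:9b";
    case "code":
      return "qwen2.5-coder:7b";
    case "math":
    case "analysis":
      return "deepseek-r1:14b"; // prefer cloud for analysis when available
    case "embedding":
      return "nomic-embed-text";
    default:
      // safety-check, extraction, summarize, emergency fallback
      return "qwen3.5:9b";
  }
}
```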
VEP content extraction uses a free-first cascade. Each model is tried in order; the paid fallback fires only when both free models fail (~5-10% of items).
| # | Model | Tier | Avg Latency | Cost/1M tok (in/out) | Role |
|---|---|---|---|---|---|
| 1 | Nemotron Nano 30B | FREE | 6.7s | $0 | Primary — best speed/quality ratio |
| 2 | GPT-OSS 120B | FREE | 10.8s | $0 | Secondary — deeper analysis, catches misses |
| 3 | Gemini 2.5 Flash Lite | PAID | ~2s | $0.02 / $0.08 | Paid fallback — fastest overall |
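A sketch of the cascade control flow (the model IDs are placeholders, and `ExtractFn` stands in for whatever client calls the provider's chat API; it should throw on 429s, timeouts, or unparseable output):

```typescript
type ExtractFn = (modelId: string, content: string) => Promise<string>;

// Sketch: free-first extraction with paid fallback on double free failure.
async function extractWithCascade(
  extract: ExtractFn,
  content: string,
): Promise<string> {
  const freeTier = ["nemotron-nano-30b:free", "gpt-oss-120b:free"]; // placeholder IDs
  for (const modelId of freeTier) {
    try {
      return await extract(modelId, content);
    } catch {
      // rate limit, timeout, or bad output: fall through to the next free model
    }
  }
  // Both free models failed (~5-10% of items): pay for the fastest option.
  return extract("gemini-2.5-flash-lite", content);
}
```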
`free-model-discovery.ts` re-runs the discovery sweep every Wednesday at 3 AM ET.
Tier rankings update automatically based on availability and extraction quality.
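If that schedule is driven from Node, it might look like the following sketch (the node-cron dependency and the `runFreeModelDiscovery` export are assumptions):

```typescript
import cron from "node-cron";
import { runFreeModelDiscovery } from "./free-model-discovery"; // assumed export

// "0 3 * * 3" = 03:00 every Wednesday; the timezone option pins it to ET.
cron.schedule("0 3 * * 3", () => runFreeModelDiscovery(), {
  timezone: "America/New_York",
});
```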
13 of 27 free models extract successfully. Tested on 3 corpus items (YouTube transcripts + articles). Quality = key points extracted (max 10).
| Model | Latency | Output | Reliability | Notes |
|---|---|---|---|---|
| Nemotron Nano 30B | 6.7s | 580-675 ch | 3/3 ✓ | Best overall. Recommended primary. |
| Gemma 3 4B | 6.4s | 753-776 ch | 1/3 ⚠ | Fast but intermittent 429s on deep test |
| GPT-OSS 120B | 10.8s | 596-683 ch | 3/3 ✓ | Reliable. Was 404 in March, now stable. |
| GPT-OSS 20B | 20.4s | 521-780 ch | 3/3 ✓ | Highest quality free model. Slow. |
| Nemotron Nano 9B | 29.5s | 521-605 ch | 3/3 ✓ | Good quality, too slow for primary |
| MiniMax M2.5 | 35.8s | 586-802 ch | 3/3 ✓ | Good quality but 30-300s latency range |
| GLM 4.5 Air | 33.8s | 458-624 ch | 3/3 ✓ | Z-AI (Zhipu). Consistent. |
| Hermes 3 405B | 41.6s | 368 ch | 1/3 ⚠ | Largest free model. Intermittent 429s. |
| Trinity Large | 52.5s | 353-485 ch | 3/3 ✓ | Arcee AI. Expires 2026-04-22. |
| Nemotron 12B VL | 56.0s | 373-416 ch | 2/3 ⚠ | Vision model. Needs strict JSON prompt. |
| OpenRouter /free | 5.6s | 392-492 ch | 3/3 ✓ | Smart router; routes to best available free model |
| Liquid LFM 1.2B (x2) | 1.0-2.5s | 125-359 ch | 3/3 ✓ | Tiny (1.2B). Fastest but lowest quality. |
| Model | Latency | Cost/1M tokens (in/out) | Best For |
|---|---|---|---|
| Gemini 2.5 Flash Lite | ~2s | $0.02 / $0.08 | Bulk extraction fallback. Cheapest paid option. |
| Claude Haiku 4.5 | ~3s | $0.80 / $4.00 | When quality matters. 40x cost of Flash Lite. |
| Claude Sonnet 4.6 | ~5s | $3.00 / $15.00 | Quality ceiling. 10% sampling for A/B comparison. |
14 models returned 429 (rate limited) or errors. Re-tested weekly — availability fluctuates.