MODEL DOCUMENTATION

Ollama Model Reference
Last updated: 2026-03-19 | Ollama 0.18.2 | Beelink SER5 (Ryzen 5 5500U, 28GB RAM, AMD Vulkan iGPU)

System Overview

Component | Details | Notes
CPU | Ryzen 5 5500U | 6C/12T, Zen 2, 4.06 GHz boost
Memory | 28 GiB DDR4 | ~38.4 GB/s bandwidth (bottleneck)
GPU | Radeon iGPU (Vulkan) | 4 GiB shared VRAM, 100% offload for ≤9B models
Ollama Config | 4 threads, Flash Attention | Vulkan=1, Nice=15, 14 GB memory cap
Performance Ceiling | ~6 tok/s (9B Q4) | Memory-bandwidth-bound, not compute-bound
Storage | 25.7 GB of models | 5 models on 476 GB NVMe (DRAM-less)
qwen3.5:9b PRIMARY
6.6 GB • 9B params • Q4

Overview

Alibaba's Qwen 3.5 9B is Jarvis's primary inference model. Selected for its strong structured output, tool-use formatting, and classification accuracy (100% on eval). Uses internal chain-of-thought ("thinking tokens") before generating visible output, which improves quality but adds latency.

Strengths: Classification • Triage • Summarization • Tool Use • Extraction | Avoid: Speed-critical tasks

Specifications

Provider: Alibaba (Qwen team)
Parameters: 9 billion
Quantization: Q4 (4-bit, ~6.6 GB on disk)
Loaded Size: 8.6 GB in RAM
GPU Offload: 100% (all layers on iGPU via Vulkan)
Context Window: 4,096 tokens (default), up to 262K
Thinking Model: Yes — generates internal CoT tokens before response
License: Apache 2.0
Temperature: 1.0 (default)
Top-K / Top-P: 20 / 0.95
Presence Penalty: 1.5

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 5.9–6.6 tok/sec | At hardware ceiling (~38.4 GB/s bandwidth)
Prompt processing | ~0.95–1.5 s | For typical 50–200 token prompts
Cold load time | ~6.7 s | From NVMe to RAM
Warm inference | <1 s to first token | When model is already loaded
Eval score | 0.658 (65.8%) | Highest across all models on the 23-fixture suite
Classification | 100% | Perfect on eval classification tasks
Extraction | 78.8% | Strong structured data extraction
Reasoning | 20% | Low on eval (format mismatch with thinking tokens)

Pre-upgrade comparison: 4.1–5.4 tok/sec on Ollama 0.17.5. The 0.18.0 upgrade delivered a +35% speed improvement for this model specifically.

Best Used For

  • Message triage — Classifying incoming Telegram messages by intent/priority
  • Screener tasks — Quick assessment of message safety and relevance
  • Structured extraction — Pulling entities, dates, categories from text
  • Summarization — Condensing long content into brief summaries
  • Tool use formatting — Generating structured JSON function calls
  • General Q&A — Conversational queries when cloud models aren't needed

When NOT to Use

  • Speed-critical classification — Thinking tokens add 5-30s latency even for trivial tasks. Use qwen2.5-coder:7b instead (2x faster, no thinking overhead)
  • Heavy multi-step reasoning — Use deepseek-r1:14b or cloud models for complex logic chains
  • Code generation — qwen2.5-coder:7b is purpose-built for code and faster

Usage Tips

Disable thinking for simple tasks: Prefix your prompt with /nothink to skip chain-of-thought tokens. Reduces latency from 30s+ to under 10s for classification. Note: effectiveness varies — the model may still think internally.
Keep-alive: Model stays loaded for 5 minutes after last use. First request after cold start takes ~7s extra for loading. Plan batch operations to keep it warm.
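The two tips above can be sketched as a request builder for Ollama's /api/generate endpoint. The model, prompt, stream, and keep_alive fields are standard Ollama API; the /nothink prefix is the hint described above, not a guarantee, and the helper name is ours:

```python
import json

def build_generate_request(prompt: str, think: bool = True,
                           keep_alive: str = "5m") -> dict:
    """Build an Ollama /api/generate payload for qwen3.5:9b.

    think=False prefixes the prompt with /nothink to ask the model
    to skip chain-of-thought tokens (effectiveness varies).
    keep_alive controls how long the model stays loaded after the
    request; "5m" matches the default described above.
    """
    if not think:
        prompt = "/nothink " + prompt
    return {
        "model": "qwen3.5:9b",
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

# A fast classification request that skips thinking tokens:
payload = build_generate_request("Classify: 'server is down!'", think=False)
print(json.dumps(payload, indent=2))
```

POST the resulting JSON to http://localhost:11434/api/generate; batch operations can pass a longer keep_alive to keep the model warm between calls.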
qwen2.5-coder:7b CODE
4.7 GB • 7B params • Q4

Overview

The fastest model in our lineup. Purpose-built for code tasks with 92+ language support and fill-in-the-middle (FIM) capability. No thinking token overhead — responses are immediate and direct. Outperforms CodeStral-22B and DeepSeek Coder 33B V1 on HumanEval (88.4%).

Strengths: Code Generation • Fast Classification • Speed-Critical Tasks • FIM / Autocomplete | Avoid: Complex Reasoning • Long-form Text

Specifications

Provider: Alibaba (Qwen team)
Parameters: 7 billion
Quantization: Q4 (4-bit, ~4.7 GB on disk)
Loaded Size: ~6 GB in RAM
GPU Offload: 100% (all layers on iGPU via Vulkan)
Context Window: Up to 128K tokens
Thinking Model: No — direct response, no CoT overhead
FIM Support: Yes (fill-in-the-middle for code completion)
Languages: 92+ programming languages
License: Apache 2.0

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 8.9–12.4 tok/sec | Fastest model — 2x faster than qwen3.5
Cold load time | ~10 s | Smaller model, quick to load
Eval score | 0.543 (54.3%) | Lower overall but strong on code + classification
Classification | 80% | Good for fast triage when speed matters
Extraction | 71.3% | Decent structured output
HumanEval | 88.4% | Industry-leading for its size class

Speed advantage: At 8.9–12.4 tok/sec, this model completes tasks in roughly half the time of qwen3.5:9b. For tasks where the quality difference is negligible (classification, simple extraction), it is the better choice.

Best Used For

  • Code generation — TypeScript, Python, JavaScript, and 89 more languages
  • Code completion (FIM) — Fill in the middle of existing code
  • Fast classification — When you need an answer in <5 seconds, not 30+
  • Screener backup — Quick triage when qwen3.5 latency is unacceptable
  • JSON formatting — Direct, no-thinking structured output

When NOT to Use

  • Complex reasoning — 0% on reasoning eval; no chain-of-thought capability
  • Long-form content generation — Not trained for essay/article writing
  • Nuanced analysis — Use qwen3.5 or cloud models for judgment calls

Usage Tips

Consider as screener primary: For the Jarvis screener pipeline (message classification), this model may be better than qwen3.5:9b — same accuracy on simple classifications but 2x faster response time.
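Fill-in-the-middle prompts for Qwen2.5-Coder are assembled from its FIM special tokens (<|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>, per the Qwen2.5-Coder model card). A minimal builder, with a hypothetical helper name:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a raw fill-in-the-middle prompt for qwen2.5-coder:7b.

    The model generates the code that belongs between prefix and
    suffix after the <|fim_middle|> token. Send the string as a raw
    completion (no chat template applied) so the special tokens
    reach the model verbatim.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def fahrenheit_to_celsius(f):\n    return ",
    suffix="\n\nprint(fahrenheit_to_celsius(212))",
)
print(prompt)
```

With Ollama's /api/generate, setting "raw": true bypasses the chat template so the FIM tokens are not escaped.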
deepseek-r1:8b REASONING
5.2 GB • 8B params • Q4

Overview

DeepSeek's R1 reasoning model distilled to 8B parameters. Uses chain-of-thought reasoning with explicit <think> blocks. Good balance of reasoning capability and speed. 92.8% on MATH-500 benchmark at full scale; this distill trades some accuracy for running on constrained hardware.

Strengths: Math Problems • Logical Reasoning • Step-by-Step Analysis | Avoid: Classification • Speed-Critical

Specifications

Provider: DeepSeek
Parameters: 8 billion (distilled from 671B)
Quantization: Q4 (4-bit, ~5.2 GB on disk)
Loaded Size: ~7 GB in RAM
GPU Offload: 100% (all layers on iGPU via Vulkan)
Thinking Model: Yes — explicit <think> blocks with visible reasoning
Temperature: 0.6 (default — lower for more deterministic reasoning)
License: MIT

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 6.3–7.7 tok/sec | Faster than qwen3.5 (smaller model)
Cold load time | ~5 s | Quickest large model to load
Eval score | 0.400 (40%) | Lower overall — reasoning format hurts structured eval
Classification | 60% | Adequate but not its strength
Extraction | 50% | Thinking overhead for simple tasks

Best Used For

  • Math and logic problems — Step-by-step reasoning with visible work
  • Debugging assistance — Methodical analysis of error conditions
  • Quick reasoning tasks — When deepseek-r1:14b is too slow

When NOT to Use

  • Classification/triage — Too much thinking overhead for simple categorization
  • Structured extraction — Thinking tokens pollute JSON output
  • Heavy reasoning — Use deepseek-r1:14b for genuinely hard problems (accepts slower speed)
Niche model: This model sits awkwardly between qwen3.5:9b (better at general tasks) and deepseek-r1:14b (better at hard reasoning). Consider whether one of those better fits your use case.
deepseek-r1:14b HEAVY REASONING
9.0 GB • 14B params • Q4

Overview

The largest model in our local lineup. DeepSeek R1 distilled to 14B parameters — significantly more capable than the 8B version but pushes the limits of this hardware. Uses explicit chain-of-thought reasoning. Load on demand only; do not keep warm.

Strengths: Complex Reasoning • Multi-Step Analysis • Classification (100%) | Avoid: Latency-Sensitive Tasks • Simple Tasks

Specifications

Provider: DeepSeek
Parameters: 14 billion (distilled from 671B)
Quantization: Q4 (4-bit, ~9.0 GB on disk)
Loaded Size: ~12 GB in RAM
GPU Offload: Partial (exceeds iGPU effective VRAM, some CPU fallback)
Thinking Model: Yes — explicit <think> blocks
License: MIT

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 4.5–4.6 tok/sec | Slowest model — near hardware limit for this size
Cold load time | ~15 s | Largest model, longest load
Eval score | 0.585 (58.5%) | Second highest overall
Classification | 100% | Perfect on eval (ties with qwen3.5)
Extraction | 67.5% | Good, but slower than qwen3.5

Resource warning: At ~12 GB loaded, this model consumes nearly all of the 14 GB Ollama memory cap. Running it alongside other memory-intensive services may trigger MemoryHigh reclaim or OOM. Always unload after use via keep_alive: 0.
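The unload-after-use rule maps to a single API call: an Ollama /api/generate request with an empty prompt and keep_alive set to 0 evicts the model immediately instead of waiting out the keep-alive timer (this is Ollama's documented unload mechanism; the helper name is ours):

```python
def build_unload_request(model: str = "deepseek-r1:14b") -> dict:
    """Payload that evicts a model from memory immediately.

    An empty prompt with keep_alive: 0 loads nothing new and tells
    Ollama to drop the model right away, freeing the ~12 GB this
    model occupies under the 14 GB cap.
    """
    return {"model": model, "prompt": "", "keep_alive": 0}

print(build_unload_request())
```

POST this to http://localhost:11434/api/generate as soon as the heavy-reasoning task completes.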

Best Used For

  • Complex multi-step reasoning — Problems requiring 5+ logical steps
  • Difficult analysis — When qwen3.5 or r1:8b give wrong answers
  • Emergency fallback — Local heavy reasoning when cloud is unavailable

When NOT to Use

  • Anything qwen3.5:9b can handle — The 14B model is 30% slower with marginal quality improvement for most tasks
  • Time-sensitive operations — 4.5 tok/sec + thinking tokens = minutes per response
  • Batch operations — Will monopolize system resources
Decision framework: Only use this model when (1) the task genuinely requires heavy reasoning, (2) cloud models (Claude, OpenRouter) are unavailable, and (3) you can tolerate 2-5 minute response times. In almost all other cases, qwen3.5:9b or a cloud model is the better choice.
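The three-condition framework above can be expressed as a small gate function (a sketch; the function and parameter names are ours):

```python
def should_use_r1_14b(needs_heavy_reasoning: bool,
                      cloud_available: bool,
                      can_wait_minutes: bool) -> bool:
    """Gate for deepseek-r1:14b per the decision framework:
    all three conditions must hold, otherwise prefer qwen3.5:9b
    or a cloud model."""
    return needs_heavy_reasoning and not cloud_available and can_wait_minutes

# Cloud is reachable, so the local 14B model stays unloaded:
print(should_use_r1_14b(True, True, True))  # prints False
```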
nomic-embed-text:latest EMBEDDING
274 MB • 137M params • F16

Overview

Nomic AI's text embedding model. Converts text into 768-dimensional vectors for semantic search and similarity. Always loaded in memory — used by the Long-Term Memory (LTM) system for real-time embedding generation. Tiny footprint (578 MB loaded) makes it negligible alongside inference models.

Strengths: Semantic Search • LTM Embedding • Similarity Scoring

Specifications

Provider: Nomic AI
Parameters: 137 million
Quantization: F16 (full precision — small enough not to need quantization)
Loaded Size: 578 MB in RAM
GPU Offload: 100% GPU
Context Window: 8,192 tokens
Embedding Dims: 768
License: Apache 2.0
Always Loaded: Yes — permanent resident

Usage in Jarvis

  • LTM vector generation — Every knowledge entry is embedded for semantic retrieval
  • Lesson surfacing — Session startup searches LTM using embedded queries
  • Similarity dedup — Detects near-duplicate knowledge entries
Do not unload this model. It runs permanently with negligible resource impact. Unloading and reloading adds latency to every LTM operation.
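Similarity scoring over nomic-embed-text output reduces to cosine similarity on the 768-dimensional vectors. A dependency-free sketch of the scoring step (the vectors themselves come from Ollama's embeddings endpoint; the math below is standard):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1].
    This is the comparison used for semantic retrieval and
    near-duplicate detection over stored LTM vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # prints 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # prints 0.0
```

In practice the two inputs are 768-element vectors returned for a query and a stored knowledge entry; a high score surfaces the entry, and a near-1.0 score against an existing entry flags a duplicate.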

Model Comparison Matrix

Model | Speed | Quality | RAM | Best For | Avoid For
qwen3.5:9b | 6.0 t/s | 65.8% | 8.6 GB | General purpose, triage, extraction | Speed-critical, code
qwen2.5-coder:7b | 12.4 t/s | 54.3% | ~6 GB | Code, fast classification | Reasoning, analysis
deepseek-r1:8b | 7.7 t/s | 40.0% | ~7 GB | Quick reasoning, math | General tasks (niche)
deepseek-r1:14b | 4.5 t/s | 58.5% | ~12 GB | Complex reasoning (offline) | Anything simple or timed
nomic-embed-text | N/A | Embedding | 578 MB | LTM vectors, semantic search | Not an inference model

Model Selection Guide

Task | Recommended Model | Why
Telegram message triage | qwen3.5:9b (or qwen2.5-coder for speed) | Best classification accuracy; coder is 2x faster if latency matters
Screener safety check | qwen3.5:9b | Highest structured output quality
Code generation / review | qwen2.5-coder:7b | Purpose-built, 88.4% HumanEval, fastest inference
JSON extraction from text | qwen3.5:9b | 78.8% extraction accuracy, best structured output
Summarization | qwen3.5:9b | Best balance of quality and coherence
Math / logic problem | deepseek-r1:14b | Most capable reasoning; 8b as faster fallback
Complex multi-step analysis | deepseek-r1:14b (or cloud) | Only when cloud unavailable; cloud models are better
Embedding / vector search | nomic-embed-text | Only embedding model; always loaded
Emergency (offline, any task) | qwen3.5:9b | Highest overall score, widest capability

Cloud vs Local decision: Local models are for (1) triage/screening that must be fast and cheap, (2) offline/emergency fallback, (3) embedding generation. For any task requiring judgment, analysis, or creativity, use Claude (Opus/Sonnet) or OpenRouter models — they are orders of magnitude more capable.

OpenRouter Cloud Models

VEP extraction benchmark — April 2026 | 27 free + 3 paid models tested

Extraction Cascade (Active)

VEP content extraction uses a free-first cascade. Each model is tried in order; paid fallback only fires on double free failure (~5-10% of items).

# | Model | Tier | Avg Latency | Cost/1M tok | Role
1 | Nemotron Nano 30B | FREE | 6.7 s | $0 | Primary — best speed/quality ratio
2 | GPT-OSS 120B | FREE | 10.8 s | $0 | Secondary — deeper analysis, catches misses
3 | Gemini 2.5 Flash Lite | PAID | ~2 s | $0.02 / $0.08 | Paid fallback — fastest overall

Weekly auto-discovery: Sentinel runs free-model-discovery.ts every Wednesday 3 AM ET. Tier rankings update automatically based on availability and extraction quality.
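The free-first cascade amounts to trying each free tier in ranked order and invoking the paid fallback only after every free tier has failed. A sketch with mock extractors standing in for the real OpenRouter calls (the extractor interface and names here are ours):

```python
from typing import Callable, Optional

# An extractor takes the source text and returns extracted key
# points, or None on failure (rate limit, error, empty output).
Extractor = Callable[[str], Optional[str]]

def cascade_extract(text: str, free_tiers: list[Extractor],
                    paid_fallback: Extractor) -> Optional[str]:
    """Try each free extractor in ranked order; the paid fallback
    fires only after all free tiers fail (the ~5-10% case)."""
    for extract in free_tiers:
        result = extract(text)
        if result is not None:
            return result
    return paid_fallback(text)

# Mocks: primary and secondary free tiers fail, paid succeeds.
nemotron = lambda t: None            # e.g. hit a 429
gpt_oss = lambda t: None             # also failed
flash_lite = lambda t: "key points"  # paid fallback

print(cascade_extract("article text", [nemotron, gpt_oss], flash_lite))
```

The real pipeline would wrap each OpenRouter call in error handling that maps failures to None, so the fallback order stays a plain list that the weekly discovery job can reorder.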

Free Tier — Extraction Benchmark (April 2026)

13 of 27 free models extract successfully. Tested on 3 corpus items (YouTube transcripts + articles). Quality = key points extracted (max 10).

Model | Latency | Quality | Output | Reliability | Notes
Nemotron Nano 30B | 6.7s | 5-7 pts | 580-675 ch | 3/3 ✓ | Best overall. Recommended primary.
Gemma 3 4B | 6.4s | 5 pts | 753-776 ch | 1/3 ⚠ | Fast but intermittent 429s on deep test
GPT-OSS 120B | 10.8s | 5 pts | 596-683 ch | 3/3 ✓ | Reliable. Was 404 in March, now stable.
GPT-OSS 20B | 20.4s | 7-8 pts | 521-780 ch | 3/3 ✓ | Highest-quality free model. Slow.
Nemotron Nano 9B | 29.5s | 7 pts | 521-605 ch | 3/3 ✓ | Good quality, too slow for primary
MiniMax M2.5 | 35.8s | 7 pts | 586-802 ch | 3/3 ✓ | Good quality but 30-300s latency range
GLM 4.5 Air | 33.8s | 5 pts | 458-624 ch | 3/3 ✓ | Z-AI (Zhipu). Consistent.
Hermes 3 405B | 41.6s | 5 pts | 368 ch | 1/3 ⚠ | Largest free model. Intermittent 429s.
Trinity Large | 52.5s | 5-6 pts | 353-485 ch | 3/3 ✓ | Arcee AI. Expires 2026-04-22.
Nemotron 12B VL | 56.0s | 7 pts | 373-416 ch | 2/3 ⚠ | Vision model. Needs strict JSON prompt.
OpenRouter /free | 5.6s | 5-6 pts | 392-492 ch | 3/3 ✓ | Smart router — routes to best available free model
Liquid LFM 1.2B (x2) | 1.0-2.5s | 3-5 pts | 125-359 ch | 3/3 ✓ | Tiny (1.2B). Fastest but lowest quality.

Paid Models — Quality Baseline

Model | Latency | Quality | Cost/1M tokens | Best For
Gemini 2.5 Flash Lite | ~2s | 6-7 pts | $0.02 / $0.08 | Bulk extraction fallback. Cheapest paid option.
Claude Haiku 4.5 | ~3s | 7-8 pts | $0.80 / $4.00 | When quality matters. 40x cost of Flash Lite.
Claude Sonnet 4.6 | ~5s | 9 pts | $3.00 / $15.00 | Quality ceiling. 10% sampling for A/B comparison.

Cost context: Nemotron Nano 30B (free) produces 80-90% of Gemini Flash Lite quality at zero cost. Sonnet is 150x more expensive than Flash Lite — used only for quality sampling, not bulk extraction.

Unavailable Free Models (April 2026)

14 models returned 429 (rate limited) or errors. Re-tested weekly — availability fluctuates.

Gemma 4 26B • Gemma 4 31B • Gemma 3 12B • Gemma 3 27B • Gemma 3n E2B • Gemma 3n E4B • Gemma 3 4B (deep test)
Llama 3.3 70B • Llama 3.2 3B • Qwen3 Next 80B • Qwen3 Coder • Dolphin Mistral 24B • Nemotron Super 120B (error) • Hermes 405B (deep test)
Local models: 2026-03-19 (Ollama 0.18.2, Beelink SER5) | Cloud models: 2026-04-11 (OpenRouter free-model-discovery benchmark)