MODEL DOCUMENTATION

Ollama Model Reference
Last updated: 2026-03-19 | Ollama 0.18.2 | Beelink SER5 (Ryzen 5 5500U, 28GB RAM, AMD Vulkan iGPU)

System Overview

Component | Details | Notes
CPU | Ryzen 5 5500U | 6C/12T, Zen 2, 4.06 GHz boost
Memory | 28 GiB DDR4 | ~38.4 GB/s bandwidth (bottleneck)
GPU | Radeon iGPU (Vulkan) | 4 GiB shared VRAM, 100% offload for ≤9B models
Ollama Config | 4 threads, Flash Attention | Vulkan=1, Nice=15, 14 GB memory cap
Performance Ceiling | ~6 tok/s (9B Q4) | Memory-bandwidth-bound, not compute-bound
Storage | 25.7 GB of models | 5 models on 476 GB NVMe (DRAM-less)
qwen3.5:9b PRIMARY
6.6 GB • 9B params • Q4

Overview

Alibaba's Qwen 3.5 9B is Jarvis's primary inference model. Selected for its strong structured output, tool-use formatting, and classification accuracy (100% on eval). Uses internal chain-of-thought ("thinking tokens") before generating visible output, which improves quality but adds latency.

Strengths: Classification • Triage • Summarization • Tool Use • Extraction | Avoid: Speed-critical tasks

Specifications

Provider: Alibaba (Qwen team)
Parameters: 9 billion
Quantization: Q4 (4-bit, ~6.6 GB on disk)
Loaded Size: 8.6 GB in RAM
GPU Offload: 100% (all layers on iGPU via Vulkan)
Context Window: 4,096 tokens (default), up to 262K
Thinking Model: Yes — generates internal CoT tokens before response
License: Apache 2.0
Temperature: 1.0 (default)
Top-K / Top-P: 20 / 0.95
Presence Penalty: 1.5

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 5.9–6.6 tok/sec | At hardware ceiling (~38.4 GB/s bandwidth)
Prompt processing | ~0.95–1.5 s | For typical 50–200 token prompts
Cold load time | ~6.7 s | From NVMe to RAM
Warm inference | <1 s to first token | When model is already loaded
Eval score | 0.658 (65.8%) | Highest across all models on the 23-fixture suite
Classification | 100% | Perfect on eval classification tasks
Extraction | 78.8% | Strong structured data extraction
Reasoning | 20% | Low on eval (format mismatch with thinking tokens)

Pre-upgrade comparison: 4.1–5.4 tok/sec on Ollama 0.17.5. The 0.18.0 upgrade delivered a +35% speed improvement for this model specifically.

Best Used For

  • Message triage — Classifying incoming Telegram messages by intent/priority
  • Screener tasks — Quick assessment of message safety and relevance
  • Structured extraction — Pulling entities, dates, categories from text
  • Summarization — Condensing long content into brief summaries
  • Tool use formatting — Generating structured JSON function calls
  • General Q&A — Conversational queries when cloud models aren't needed

When NOT to Use

  • Speed-critical classification — Thinking tokens add 5-30s latency even for trivial tasks. Use qwen2.5-coder:7b instead (2x faster, no thinking overhead)
  • Heavy multi-step reasoning — Use deepseek-r1:14b or cloud models for complex logic chains
  • Code generation — qwen2.5-coder:7b is purpose-built for code and faster

Usage Tips

Disable thinking for simple tasks: Prefix your prompt with /nothink to skip chain-of-thought tokens. Reduces latency from 30s+ to under 10s for classification. Note: effectiveness varies — the model may still think internally.
Keep-alive: Model stays loaded for 5 minutes after last use. First request after cold start takes ~7s extra for loading. Plan batch operations to keep it warm.
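The two tips above can be sketched as a request builder for Ollama's /api/generate endpoint. The model, prompt, stream, and keep_alive fields are standard Ollama API; the /nothink prefix is the hint described above, not a guarantee, and the helper name is ours:

```python
import json

def build_generate_request(prompt: str, think: bool = True,
                           keep_alive: str = "5m") -> dict:
    """Build an Ollama /api/generate payload for qwen3.5:9b.

    think=False prefixes the prompt with /nothink to ask the model
    to skip chain-of-thought tokens (effectiveness varies).
    keep_alive controls how long the model stays loaded after the
    request; "5m" matches the default described above.
    """
    if not think:
        prompt = "/nothink " + prompt
    return {
        "model": "qwen3.5:9b",
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

# A fast classification request that skips thinking tokens:
payload = build_generate_request("Classify: 'server is down!'", think=False)
print(json.dumps(payload, indent=2))
```

POST the resulting JSON to http://localhost:11434/api/generate; batch operations can pass a longer keep_alive to keep the model warm between calls.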
qwen2.5-coder:7b CODE
4.7 GB • 7B params • Q4

Overview

The fastest model in our lineup. Purpose-built for code tasks with 92+ language support and fill-in-the-middle (FIM) capability. No thinking token overhead — responses are immediate and direct. Outperforms CodeStral-22B and DeepSeek Coder 33B V1 on HumanEval (88.4%).

Strengths: Code Generation • Fast Classification • Speed-Critical Tasks • FIM / Autocomplete | Avoid: Complex Reasoning • Long-form Text

Specifications

Provider: Alibaba (Qwen team)
Parameters: 7 billion
Quantization: Q4 (4-bit, ~4.7 GB on disk)
Loaded Size: ~6 GB in RAM
GPU Offload: 100% (all layers on iGPU via Vulkan)
Context Window: Up to 128K tokens
Thinking Model: No — direct response, no CoT overhead
FIM Support: Yes (fill-in-the-middle for code completion)
Languages: 92+ programming languages
License: Apache 2.0

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 8.9–12.4 tok/sec | Fastest model — 2x faster than qwen3.5
Cold load time | ~10 s | Smaller model, quick to load
Eval score | 0.543 (54.3%) | Lower overall but strong on code + classification
Classification | 80% | Good for fast triage when speed matters
Extraction | 71.3% | Decent structured output
HumanEval | 88.4% | Industry-leading for its size class

Speed advantage: At 8.9–12.4 tok/sec, this model completes tasks in roughly half the time of qwen3.5:9b. For tasks where the quality difference is negligible (classification, simple extraction), it is the better choice.

Best Used For

  • Code generation — TypeScript, Python, JavaScript, and 89 more languages
  • Code completion (FIM) — Fill in the middle of existing code
  • Fast classification — When you need an answer in <5 seconds, not 30+
  • Screener backup — Quick triage when qwen3.5 latency is unacceptable
  • JSON formatting — Direct, no-thinking structured output

When NOT to Use

  • Complex reasoning — 0% on reasoning eval; no chain-of-thought capability
  • Long-form content generation — Not trained for essay/article writing
  • Nuanced analysis — Use qwen3.5 or cloud models for judgment calls

Usage Tips

Consider as screener primary: For the Jarvis screener pipeline (message classification), this model may be better than qwen3.5:9b — same accuracy on simple classifications but 2x faster response time.
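Fill-in-the-middle prompts for Qwen2.5-Coder are assembled from its FIM special tokens (<|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>, per the Qwen2.5-Coder model card). A minimal builder, with a hypothetical helper name:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a raw fill-in-the-middle prompt for qwen2.5-coder:7b.

    The model generates the code that belongs between prefix and
    suffix after the <|fim_middle|> token. Send the string as a raw
    completion (no chat template applied) so the special tokens
    reach the model verbatim.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def fahrenheit_to_celsius(f):\n    return ",
    suffix="\n\nprint(fahrenheit_to_celsius(212))",
)
print(prompt)
```

With Ollama's /api/generate, setting "raw": true bypasses the chat template so the FIM tokens are not escaped.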
deepseek-r1:8b REASONING
5.2 GB • 8B params • Q4

Overview

DeepSeek's R1 reasoning model distilled to 8B parameters. Uses chain-of-thought reasoning with explicit <think> blocks. Good balance of reasoning capability and speed. 92.8% on MATH-500 benchmark at full scale; this distill trades some accuracy for running on constrained hardware.

Strengths: Math Problems • Logical Reasoning • Step-by-Step Analysis | Avoid: Classification • Speed-Critical

Specifications

Provider: DeepSeek
Parameters: 8 billion (distilled from 671B)
Quantization: Q4 (4-bit, ~5.2 GB on disk)
Loaded Size: ~7 GB in RAM
GPU Offload: 100% (all layers on iGPU via Vulkan)
Thinking Model: Yes — explicit <think> blocks with visible reasoning
Temperature: 0.6 (default — lower for more deterministic reasoning)
License: MIT

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 6.3–7.7 tok/sec | Faster than qwen3.5 (smaller model)
Cold load time | ~5 s | Quickest large model to load
Eval score | 0.400 (40%) | Lower overall — reasoning format hurts structured eval
Classification | 60% | Adequate but not its strength
Extraction | 50% | Thinking overhead for simple tasks

Best Used For

  • Math and logic problems — Step-by-step reasoning with visible work
  • Debugging assistance — Methodical analysis of error conditions
  • Quick reasoning tasks — When deepseek-r1:14b is too slow

When NOT to Use

  • Classification/triage — Too much thinking overhead for simple categorization
  • Structured extraction — Thinking tokens pollute JSON output
  • Heavy reasoning — Use deepseek-r1:14b for genuinely hard problems (accepts slower speed)
Niche model: This model sits awkwardly between qwen3.5:9b (better at general tasks) and deepseek-r1:14b (better at hard reasoning). Consider whether one of those better fits your use case.
deepseek-r1:14b HEAVY REASONING
9.0 GB • 14B params • Q4

Overview

The largest model in our local lineup. DeepSeek R1 distilled to 14B parameters — significantly more capable than the 8B version but pushes the limits of this hardware. Uses explicit chain-of-thought reasoning. Load on demand only; do not keep warm.

Strengths: Complex Reasoning • Multi-Step Analysis • Classification (100%) | Avoid: Latency-Sensitive Tasks • Simple Tasks

Specifications

Provider: DeepSeek
Parameters: 14 billion (distilled from 671B)
Quantization: Q4 (4-bit, ~9.0 GB on disk)
Loaded Size: ~12 GB in RAM
GPU Offload: Partial (exceeds iGPU effective VRAM, some CPU fallback)
Thinking Model: Yes — explicit <think> blocks
License: MIT

Performance (Ollama 0.18.2, 2026-03-19)

Metric | Value | Notes
Generation speed | 4.5–4.6 tok/sec | Slowest model — near hardware limit for this size
Cold load time | ~15 s | Largest model, longest load
Eval score | 0.585 (58.5%) | Second highest overall
Classification | 100% | Perfect on eval (ties with qwen3.5)
Extraction | 67.5% | Good, but slower than qwen3.5

Resource warning: At ~12 GB loaded, this model consumes nearly all of the 14 GB Ollama memory cap. Running it alongside other memory-intensive services may trigger MemoryHigh reclaim or OOM. Always unload after use via keep_alive: 0.
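The unload-after-use rule maps to a single API call: an Ollama /api/generate request with an empty prompt and keep_alive set to 0 evicts the model immediately instead of waiting out the keep-alive timer (this is Ollama's documented unload mechanism; the helper name is ours):

```python
def build_unload_request(model: str = "deepseek-r1:14b") -> dict:
    """Payload that evicts a model from memory immediately.

    An empty prompt with keep_alive: 0 loads nothing new and tells
    Ollama to drop the model right away, freeing the ~12 GB this
    model occupies under the 14 GB cap.
    """
    return {"model": model, "prompt": "", "keep_alive": 0}

print(build_unload_request())
```

POST this to http://localhost:11434/api/generate as soon as the heavy-reasoning task completes.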

Best Used For

  • Complex multi-step reasoning — Problems requiring 5+ logical steps
  • Difficult analysis — When qwen3.5 or r1:8b give wrong answers
  • Emergency fallback — Local heavy reasoning when cloud is unavailable

When NOT to Use

  • Anything qwen3.5:9b can handle — The 14B model is 30% slower with marginal quality improvement for most tasks
  • Time-sensitive operations — 4.5 tok/sec + thinking tokens = minutes per response
  • Batch operations — Will monopolize system resources
Decision framework: Only use this model when (1) the task genuinely requires heavy reasoning, (2) cloud models (Claude, OpenRouter) are unavailable, and (3) you can tolerate 2-5 minute response times. In almost all other cases, qwen3.5:9b or a cloud model is the better choice.
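The three-condition framework above can be expressed as a small gate function (a sketch; the function and parameter names are ours):

```python
def should_use_r1_14b(needs_heavy_reasoning: bool,
                      cloud_available: bool,
                      can_wait_minutes: bool) -> bool:
    """Gate for deepseek-r1:14b per the decision framework:
    all three conditions must hold, otherwise prefer qwen3.5:9b
    or a cloud model."""
    return needs_heavy_reasoning and not cloud_available and can_wait_minutes

# Cloud is reachable, so the local 14B model stays unloaded:
print(should_use_r1_14b(True, True, True))  # prints False
```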
nomic-embed-text:latest EMBEDDING
274 MB • 137M params • F16

Overview

Nomic AI's text embedding model. Converts text into 768-dimensional vectors for semantic search and similarity. Always loaded in memory — used by the Long-Term Memory (LTM) system for real-time embedding generation. Tiny footprint (578 MB loaded) makes it negligible alongside inference models.

Strengths: Semantic Search • LTM Embedding • Similarity Scoring

Specifications

Provider: Nomic AI
Parameters: 137 million
Quantization: F16 (full precision — small enough not to need quantization)
Loaded Size: 578 MB in RAM
GPU Offload: 100% GPU
Context Window: 8,192 tokens
Embedding Dims: 768
License: Apache 2.0
Always Loaded: Yes — permanent resident

Usage in Jarvis

  • LTM vector generation — Every knowledge entry is embedded for semantic retrieval
  • Lesson surfacing — Session startup searches LTM using embedded queries
  • Similarity dedup — Detects near-duplicate knowledge entries
Do not unload this model. It runs permanently with negligible resource impact. Unloading and reloading adds latency to every LTM operation.
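Similarity scoring over nomic-embed-text output reduces to cosine similarity on the 768-dimensional vectors. A dependency-free sketch of the scoring step (the vectors themselves come from Ollama's embeddings endpoint; the math below is standard):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1].
    This is the comparison used for semantic retrieval and
    near-duplicate detection over stored LTM vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # prints 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # prints 0.0
```

In practice the two inputs are 768-element vectors returned for a query and a stored knowledge entry; a high score surfaces the entry, and a near-1.0 score against an existing entry flags a duplicate.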

Model Comparison Matrix

Model | Speed | Quality | RAM | Best For | Avoid For
qwen3.5:9b | 6.0 t/s | 65.8% | 8.6 GB | General purpose, triage, extraction | Speed-critical, code
qwen2.5-coder:7b | 12.4 t/s | 54.3% | ~6 GB | Code, fast classification | Reasoning, analysis
deepseek-r1:8b | 7.7 t/s | 40.0% | ~7 GB | Quick reasoning, math | General tasks (niche)
deepseek-r1:14b | 4.5 t/s | 58.5% | ~12 GB | Complex reasoning (offline) | Anything simple or timed
nomic-embed-text | N/A | Embedding | 578 MB | LTM vectors, semantic search | Not an inference model

Model Selection Guide

Task | Recommended Model | Why
Telegram message triage | qwen3.5:9b (or qwen2.5-coder for speed) | Best classification accuracy; coder is 2x faster if latency matters
Screener safety check | qwen3.5:9b | Highest structured output quality
Code generation / review | qwen2.5-coder:7b | Purpose-built, 88.4% HumanEval, fastest inference
JSON extraction from text | qwen3.5:9b | 78.8% extraction accuracy, best structured output
Summarization | qwen3.5:9b | Best balance of quality and coherence
Math / logic problem | deepseek-r1:14b | Most capable reasoning; 8b as faster fallback
Complex multi-step analysis | deepseek-r1:14b (or cloud) | Only when cloud unavailable; cloud models are better
Embedding / vector search | nomic-embed-text | Only embedding model; always loaded
Emergency (offline, any task) | qwen3.5:9b | Highest overall score, widest capability

Cloud vs Local decision: Local models are for (1) triage/screening that must be fast and cheap, (2) offline/emergency fallback, (3) embedding generation. For any task requiring judgment, analysis, or creativity, use Claude (Opus/Sonnet) or OpenRouter models — they are orders of magnitude more capable.

OpenRouter Cloud Models

VEP extraction benchmark — April 2026 | 27 free + 3 paid models tested

Extraction Cascade (Active)

VEP content extraction uses a free-first cascade. Each model is tried in order; paid fallback only fires on double free failure (~5-10% of items).

# | Model | Tier | Avg Latency | Cost/1M tok | Role
1 | Nemotron Nano 30B | FREE | 6.7 s | $0 | Primary — best speed/quality ratio
2 | GPT-OSS 120B | FREE | 10.8 s | $0 | Secondary — deeper analysis, catches misses
3 | Gemini 2.5 Flash Lite | PAID | ~2 s | $0.02 / $0.08 | Paid fallback — fastest overall

Weekly auto-discovery: Sentinel runs free-model-discovery.ts every Wednesday 3 AM ET. Tier rankings update automatically based on availability and extraction quality.
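The free-first cascade amounts to trying each free tier in ranked order and invoking the paid fallback only after every free tier has failed. A sketch with mock extractors standing in for the real OpenRouter calls (the extractor interface and names here are ours):

```python
from typing import Callable, Optional

# An extractor takes the source text and returns extracted key
# points, or None on failure (rate limit, error, empty output).
Extractor = Callable[[str], Optional[str]]

def cascade_extract(text: str, free_tiers: list[Extractor],
                    paid_fallback: Extractor) -> Optional[str]:
    """Try each free extractor in ranked order; the paid fallback
    fires only after all free tiers fail (the ~5-10% case)."""
    for extract in free_tiers:
        result = extract(text)
        if result is not None:
            return result
    return paid_fallback(text)

# Mocks: primary and secondary free tiers fail, paid succeeds.
nemotron = lambda t: None            # e.g. hit a 429
gpt_oss = lambda t: None             # also failed
flash_lite = lambda t: "key points"  # paid fallback

print(cascade_extract("article text", [nemotron, gpt_oss], flash_lite))
```

The real pipeline would wrap each OpenRouter call in error handling that maps failures to None, so the fallback order stays a plain list that the weekly discovery job can reorder.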

Free Tier — Extraction Benchmark (April 2026)

13 of 27 free models extract successfully. Tested on 3 corpus items (YouTube transcripts + articles). Quality = key points extracted (max 10).

Model | Latency | Quality | Output | Reliability | Notes
Nemotron Nano 30B | 6.7s | 5-7 pts | 580-675 ch | 3/3 ✓ | Best overall. Recommended primary.
Gemma 3 4B | 6.4s | 5 pts | 753-776 ch | 1/3 ⚠ | Fast but intermittent 429s on deep test
GPT-OSS 120B | 10.8s | 5 pts | 596-683 ch | 3/3 ✓ | Reliable. Was 404 in March, now stable.
GPT-OSS 20B | 20.4s | 7-8 pts | 521-780 ch | 3/3 ✓ | Highest-quality free model. Slow.
Nemotron Nano 9B | 29.5s | 7 pts | 521-605 ch | 3/3 ✓ | Good quality, too slow for primary
MiniMax M2.5 | 35.8s | 7 pts | 586-802 ch | 3/3 ✓ | Good quality but 30-300s latency range
GLM 4.5 Air | 33.8s | 5 pts | 458-624 ch | 3/3 ✓ | Z-AI (Zhipu). Consistent.
Hermes 3 405B | 41.6s | 5 pts | 368 ch | 1/3 ⚠ | Largest free model. Intermittent 429s.
Trinity Large | 52.5s | 5-6 pts | 353-485 ch | 3/3 ✓ | Arcee AI. Expires 2026-04-22.
Nemotron 12B VL | 56.0s | 7 pts | 373-416 ch | 2/3 ⚠ | Vision model. Needs strict JSON prompt.
OpenRouter /free | 5.6s | 5-6 pts | 392-492 ch | 3/3 ✓ | Smart router — routes to best available free model
Liquid LFM 1.2B (x2) | 1.0-2.5s | 3-5 pts | 125-359 ch | 3/3 ✓ | Tiny (1.2B). Fastest but lowest quality.

Paid Models — Quality Baseline

Model | Latency | Quality | Cost/1M tokens | Best For
Gemini 2.5 Flash Lite | ~2s | 6-7 pts | $0.02 / $0.08 | Bulk extraction fallback. Cheapest paid option.
Claude Haiku 4.5 | ~3s | 7-8 pts | $0.80 / $4.00 | When quality matters. 40x cost of Flash Lite.
Claude Sonnet 4.6 | ~5s | 9 pts | $3.00 / $15.00 | Quality ceiling. 10% sampling for A/B comparison.

Cost context: Nemotron Nano 30B (free) produces 80-90% of Gemini Flash Lite quality at zero cost. Sonnet is 150x more expensive than Flash Lite — used only for quality sampling, not bulk extraction.

Unavailable Free Models (April 2026)

14 models returned 429 (rate limited) or errors. Re-tested weekly — availability fluctuates.

Gemma 4 26B • Gemma 4 31B • Gemma 3 12B • Gemma 3 27B • Gemma 3n E2B • Gemma 3n E4B • Gemma 3 4B (deep test)
Llama 3.3 70B • Llama 3.2 3B • Qwen3 Next 80B • Qwen3 Coder • Dolphin Mistral 24B • Nemotron Super 120B (error) • Hermes 405B (deep test)
Local models: 2026-03-19 (Ollama 0.18.2, Beelink SER5) | Cloud models: 2026-04-11 (OpenRouter free-model-discovery benchmark)