Pivert's Blog

OpenCode + Oh My OpenAgent: Building a Multi-Model AI Coding Stack for €30/Month


Reading Time: 17 minutes


The Power of OpenCode + Oh My OpenAgent

OpenCode is an open-source AI coding framework. It supports 75+ LLM providers, specialized agents, MCP servers, and a plugin architecture that lets you wire anything to anything. Out of the box, it is powerful. But it is also raw — you pick a model, you talk to it, and that is the experience.

Oh My OpenAgent is the plugin that transforms OpenCode from a single-model chat interface into a coordinated multi-agent development system. It adds 11 specialized agents, each with independent model routing, category-based delegation, and ordered fallback chains:

  • Sisyphus — The orchestrator. Receives every message and delegates to the right agent.
  • Oracle — Architecture consultant. High-stakes design decisions.
  • Prometheus — Task planner. Decomposes complex work into steps.
  • Metis — Pre-planning analyst. Detects ambiguities before execution.
  • Momus — QA critic. Post-implementation review and critique.
  • Librarian — Documentation search. Finds relevant docs and examples.
  • Explore — Codebase search. Grep and pattern matching across files.
  • Atlas — Todo orchestrator. Tracks progress and manages task lists.
  • Sisyphus-Junior — Implementation worker. Writes the actual code.
  • Multimodal-Looker — Vision and image analysis. UI evaluation, diagrams.
  • Hephaestus — Deep technical work. Hard problems requiring extended reasoning.

The real power is this: you can assign ANY model to ANY agent. You are not locked to one provider. You are not stuck with one thinking style. You can put the cheapest fast model on documentation search while routing architecture decisions to a frontier model with extended reasoning. Total freedom.

The tradeoff: this freedom comes with work. You have to understand models, benchmarks, consumption levels, and make informed choices about which agent gets which brain. This article is about finding the sweet spot — the Oh My OpenAgent configuration that delivers professional-grade AI coding for €30/month while handling 15–20 hours of heavy development per week.

Analysis current as of May 30, 2026. Model benchmarks, pricing, and quota policies change frequently. Verify current values before implementing.


Installation

Getting started is straightforward:

  • OpenCode: Install via npm (npm install -g opencode), Homebrew (brew install opencode), or download from opencode.ai. Works in terminal, desktop, or IDE.
  • Oh My OpenAgent: Add "plugin": ["oh-my-openagent"] to your ~/.config/opencode/opencode.json file. On next startup, OpenCode automatically installs the plugin from npm. No manual npm install needed.
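For readers who would rather script the second step, here is a minimal Python sketch of the edit. It operates on an already-parsed config dictionary. The helper name `add_plugin` is hypothetical, and the sketch assumes your `opencode.json` is plain JSON without comments.

```python
import json


def add_plugin(config: dict, plugin: str = "oh-my-openagent") -> dict:
    """Return a copy of the config with `plugin` listed exactly once under "plugin"."""
    updated = dict(config)
    plugins = list(updated.get("plugin", []))
    if plugin not in plugins:
        plugins.append(plugin)
    updated["plugin"] = plugins
    return updated


# Illustrative input only: a real opencode.json will usually hold more keys.
print(json.dumps(add_plugin({}), indent=2))
```

Running it twice on the same dictionary leaves a single entry, so it is safe to re-apply.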

The Winning Team: Ollama Cloud Pro + OpenCode Zen

Two providers, two budgets, one system.

Ollama Cloud Pro (Flat-Rate, €15/month)

Ollama Cloud Pro is a flat-rate remote inference subscription. Plans range from Free (1 concurrent model) to Pro (3 concurrent) to Max (10 concurrent models) — the Max plan is worth considering if you run heavy multi-agent workflows that need high parallelism. This is not the same as local Ollama, which runs models on your own hardware. Ollama Cloud runs inference on NVIDIA GPU clusters in the United States (AWS us-east-1), with capacity routing to Europe and Singapore. You get generous GPU-time quotas that reset every 5 hours (session limits) and every 7 days (weekly limits). The Pro tier provides roughly 50× the Free tier allocation.

In practice, running 15–20 hours of heavy coding per week has never exhausted the available quota. This is where 90%+ of all calls go.

OpenCode Zen (Pay-Per-Use, ~€15/month)

OpenCode Zen provides pay-per-use API access to frontier models: GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.7. You pay per token. This is where the remaining <10% of calls go — reserved for agents that fire rarely but where quality has outsized impact.

Why They Pair Well

Flat-rate absorbs all the high-frequency bulk work. Pay-per-use handles rare critical decisions. The flat-rate subscription becomes the foundation; pay-per-use becomes the precision instrument. Together, they deliver a complete multi-model stack at roughly 7–10% of what developers typically spend on premium pay-per-use alone.

Parallelism: Higher Throughput Without Higher Cost

Ollama Cloud Pro supports 3 concurrent inference sessions on its base plan (€15/month annual). OpenCode Zen supports 5 concurrent requests on its pay-per-use tier. When you combine them, your system can handle up to 8 parallel model calls — more than enough for complex multi-agent workflows where Sisyphus delegates to several juniors simultaneously. A single OpenAI or Anthropic subscription with comparable concurrency would cost significantly more.
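To make the concurrency arithmetic concrete, here is a small Python sketch of per-provider caps using one semaphore per provider. The limits (3 and 5) come from the figures above. The function name `run_with_limits` and the simulated calls are assumptions for illustration, not how the plugin itself schedules requests.

```python
import asyncio

# Limits quoted in this article: 3 slots on Ollama Cloud Pro, 5 on OpenCode Zen.
PROVIDER_LIMITS = {"ollama-cloud": 3, "opencode": 5}


async def run_with_limits(calls, limits):
    """Run one simulated call per entry in `calls`; return peak concurrency per provider."""
    semaphores = {name: asyncio.Semaphore(n) for name, n in limits.items()}
    in_flight = {name: 0 for name in limits}
    peak = {name: 0 for name in limits}

    async def one_call(provider):
        async with semaphores[provider]:
            in_flight[provider] += 1
            peak[provider] = max(peak[provider], in_flight[provider])
            await asyncio.sleep(0)  # stand-in for the real model request
            in_flight[provider] -= 1

    await asyncio.gather(*(one_call(p) for p in calls))
    return peak


calls = ["ollama-cloud"] * 10 + ["opencode"] * 10
peak = asyncio.run(run_with_limits(calls, PROVIDER_LIMITS))
print(peak, "total:", sum(peak.values()))
```

With ten queued calls per provider, the peaks settle at the configured caps, which is where the combined figure of 8 parallel calls comes from.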

Redundancy: When One Provider Goes Down

It happens. This week, Ollama Cloud had a multi-hour outage. No inference, no responses, no fallback within the same provider. Because the Oh My OpenAgent configuration routes through both Ollama Cloud and OpenCode Zen, the system automatically degraded: high-frequency agents fell back to OpenCode’s free tier models, critical agents stayed on Zen pay-per-use, and work continued. A single-provider architecture would have been completely dark. The multi-provider fallback chain is not just about cost — it is about resilience.

Ollama Cloud Consumption Levels

Different models consume GPU quota at different rates. Ollama Cloud groups them into consumption tiers:

Level | Label | Example Models | Quota Impact
1 | Small | gpt-oss:20b-cloud | Minimal GPU time
2 | Medium | deepseek-v4-flash, nemotron-3-nano | Moderate
3 | High | kimi-k2.6, glm-5.1 | Significant
4 | Extra High | deepseek-v4-pro | Maximum GPU time

The Problem: Why Inference Gets Expensive

Claude Opus 4.7 costs $75 per 1M output tokens. GPT-5.4 costs $15 per 1M. A single heavy coding session — architecture discussion, multi-file refactor, test generation, review cycle — easily burns 200,000+ tokens.

That is $3 to $15 per conversation.

At four sessions per day, twenty days per month, that is 80 sessions: roughly $240–1,200/month in pure inference costs at those output rates. Before subscriptions. Before team licenses. Before API quotas.
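The per-conversation range quoted above is plain arithmetic on output-token prices. The sketch below reproduces it under one simplifying assumption: all 200,000 tokens are billed at the output rate, which makes it an upper bound for a session that also contains cheaper input tokens. The function name is illustrative.

```python
def session_cost_usd(output_tokens: int, usd_per_million_output: float) -> float:
    """Cost of one session if every token is billed at the output rate."""
    return output_tokens * usd_per_million_output / 1_000_000


SESSION_TOKENS = 200_000  # heavy session size used in this article

low = session_cost_usd(SESSION_TOKENS, 15.0)   # GPT-5.4 output price
high = session_cost_usd(SESSION_TOKENS, 75.0)  # Claude Opus output price
print(f"${low:.2f} to ${high:.2f} per session")
```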

To understand why — and how to reduce it — you have to understand what drives inference cost. The answer is not total parameters. It is active parameters per token. A model with 1 trillion parameters that only activates 32 billion per token is cheaper to run than a model with 100 billion parameters that activates all 100 billion every time. The next section explains why.

Why Mixture of Experts Changes the Economics

MoE models like DeepSeek V4 Pro and Kimi K2.6 activate only a fraction of their total parameters per token. Dense models like Llama 3.1 405B activate everything. Every token. Every time.

The reason: MoE architectures break the model into specialized sub-networks called “experts.” For any given token, only a small subset of experts is activated, typically 3–5% of the total model. The rest stays idle for that token, occupying memory but doing no computation.

The practical formula is simple:

inference cost ∝ active_parameters_per_token

Model | Architecture | Total Parameters | Active per Token | Activation Rate
DeepSeek V4 Pro | MoE | 1.6T | 49B | 3.1%
Kimi K2.6 | MoE | 1T | 32B | 3.2%
GLM 5.1 | MoE | 754B | 40B | 5.3%
DeepSeek V4 Flash | MoE | 284B | 13B | 4.6%
Qwen 3.6-27B | Dense | 27B | 27B | 100%
Llama 3.1 405B | Dense | 405B | 405B | 100%
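The activation rates in the table are simply active parameters divided by total parameters. A short Python check, using the parameter counts listed above (in billions):

```python
# (total parameters, active parameters per token), both in billions, from the table.
MODELS = {
    "DeepSeek V4 Pro": (1600, 49),
    "Kimi K2.6": (1000, 32),
    "GLM 5.1": (754, 40),
    "DeepSeek V4 Flash": (284, 13),
    "Qwen 3.6-27B": (27, 27),
    "Llama 3.1 405B": (405, 405),
}


def activation_rate_pct(total_b: float, active_b: float) -> float:
    """Share of parameters used per token, as a percentage rounded to one decimal."""
    return round(100 * active_b / total_b, 1)


for name, (total_b, active_b) in MODELS.items():
    print(f"{name}: {activation_rate_pct(total_b, active_b)}%")
```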

The VRAM Tradeoff

MoE models do require loading all parameters into GPU memory: DeepSeek V4 Pro needs enough VRAM to hold 1.6 trillion parameters. But computation only flows through the active experts. You pay for the memory footprint once; you pay for the computation per token. Fewer active parameters means less GPU computation, less electricity, and lower cost.
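A rough back-of-the-envelope sketch of that tradeoff follows. It rests on two assumptions that are not from this article: weights stored at 1 byte per parameter (8-bit), and the common rule of thumb of about 2 FLOPs per active parameter per generated token, ignoring attention and KV-cache costs. Treat the outputs as orders of magnitude, not measurements.

```python
def weights_gb(total_params: float, bytes_per_param: float = 1.0) -> float:
    """Memory needed just to hold the weights, in gigabytes (assumed precision)."""
    return total_params * bytes_per_param / 1e9


def forward_gflops_per_token(active_params: float) -> float:
    """Approximate forward-pass compute per token: ~2 FLOPs per active parameter."""
    return 2 * active_params / 1e9


# DeepSeek V4 Pro (MoE): 1.6T total, 49B active per token.
print(weights_gb(1.6e12), "GB of weights,", forward_gflops_per_token(49e9), "GFLOPs/token")
# Llama 3.1 405B (dense): 405B total, 405B active per token.
print(weights_gb(405e9), "GB of weights,", forward_gflops_per_token(405e9), "GFLOPs/token")
```

Under these assumptions the MoE model needs roughly four times the memory of the dense one but about one-eighth of the compute per token.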

Real-World Consumption Data

Here is the empirical evidence from actual usage on Ollama Cloud Pro. These screenshots show the same session quota bar with the cursor hovering over each model segment to reveal request counts.

Session Quota: 100% Consumed

The session quota was fully exhausted. The breakdown by model tells the story:

Model | Requests | Level | Active Params | Visual Share | Notes
gemini-3-flash-preview | 2 | ? | ? | ~5% | Too few requests to analyze
Kimi K2.6 | 139 | 3 (High) | 32B | ~20% | Efficient per-request consumption
DeepSeek V4 Pro | 208 | 4 (Extra High) | 49B | ~80% | Dominant quota consumer
DeepSeek V4 Flash | 42 | 2 (Medium) | 13B | ~1–2% | Extremely cost-effective

Consumption Analysis

The critical insight: V4 Pro had only 50% more requests than Kimi (208 vs 139), yet consumed 4× the quota (~80% vs ~20%). This confirms the structural cost difference between Level 4 and Level 3 models.

Normalized per request:

  • Kimi K2.6: ~20% ÷ 139 requests = ~0.14% of session quota per request
  • V4 Pro: ~80% ÷ 208 requests = ~0.38% of session quota per request
  • Ratio: 0.38 ÷ 0.14 = 2.7× more expensive per request
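The same normalization as a few lines of Python, using the visual shares and request counts from the table above. The shares are read off the quota bar, so they are approximate.

```python
# (approximate share of session quota in %, number of requests)
USAGE = {
    "Kimi K2.6": (20.0, 139),
    "DeepSeek V4 Pro": (80.0, 208),
}


def quota_per_request(share_pct: float, requests: int) -> float:
    """Average percentage of session quota consumed by one request."""
    return share_pct / requests


kimi = quota_per_request(*USAGE["Kimi K2.6"])
pro = quota_per_request(*USAGE["DeepSeek V4 Pro"])
ratio = pro / kimi

print(f"Kimi K2.6: {kimi:.2f}% per request")
print(f"V4 Pro:    {pro:.2f}% per request")
print(f"Ratio:     {ratio:.1f}x")
```

Computed without intermediate rounding the ratio is about 2.67, which rounds to the same 2.7×.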

DeepSeek V4 Flash is the standout: 42 requests at Level 2 consumed a barely visible fraction of the bar. With 13B active parameters, roughly 40% of Kimi’s 32B, its per-request cost can be estimated at about 40% of Kimi’s, or ~0.06% of session quota per request. This makes Flash the ideal choice for high-frequency, low-complexity tasks like code search and documentation lookup.

Visual Evidence

Each screenshot shows the cursor hovering over one bar segment, revealing the model name and request count. The proportional bar lengths tell the consumption story at a glance.

Figure 1: gemini-3-flash-preview, 2 requests, minimal quota impact.
Figure 2: Kimi K2.6, 139 requests, ~20% of session quota. Efficient Level 3 consumption.
Figure 3: DeepSeek V4 Pro, 208 requests, ~80% of session quota. Level 4 dominance.
Figure 4: DeepSeek V4 Flash, 42 requests, almost invisible bar. Extraordinary Level 2 efficiency.

The Solution Architecture

Two-Tier Cost Model

Tier 1 — Flat-Rate (90%+ of calls): Ollama Cloud Pro

Ollama Cloud Pro is a flat-rate subscription with GPU-time quotas. Quotas reset every 5 hours (session limits) and every 7 days (weekly limits). The Pro tier provides approximately 50× the Free tier allocation. In practice, running 15–20 hours of heavy coding per week has never exhausted the available quota.

Level | Label | Active Parameters | Best For
1 | Small | ~20B | Minimal work, micro-queries
2 | Medium | 13B | Bulk work, fast queries, code search
3 | High | 32B–40B | Balanced reasoning, multi-file tasks, vision
4 | Extra High | 49B | Maximum reasoning, long context, agentic tasks

  • DeepSeek V4 Pro (Level 4): 1.6T total, 49B active. Strongest agentic reasoning, 35% faster inference, 1M context. Best for implementation and infrastructure.
  • Kimi K2.6 (Level 3): 1T total, 32B active. Beats V4 Pro on SWE-Bench Pro, HumanEval, HLE with tools. Native vision. 300-agent swarms. Best for orchestration and multi-file refactors.
  • DeepSeek V4 Flash (Level 2): 284B total, 13B active. Minimum quota consumption. Best for search, lookup, trivial tasks.
  • GLM 5.1 (Level 3): 754B total, 40B active. Reliable backup with different strengths.

Tier 2 — Pay-Per-Use (<10% of calls): OpenCode Zen

Reserved for agents that fire rarely but where quality has outsized impact:

Model | Price per 1M tokens (input / output) | Used For | Monthly Cost
GPT-5.4 | $2.50 / $15 | Architecture, QA review | ~€7–10
Gemini 3.1 Pro | $2.00 / $12 | Vision, image analysis | ~€1–2
Claude Opus 4.7 | $5.00 / $75 | Ultimate fallback | ~€0–1

Free Tier Models

OpenCode Zen provides free-tier models that are identical to their paid counterparts. DeepSeek V4 Flash Free is the same model as the flat-rate V4 Flash: same weights, same performance, zero cost. We use it as the primary for six high-volume roles: three agents (librarian, explore, atlas) and three task categories (quick, unspecified-low, writing).

Why this works: These roles fire 5–20 times per task. On the flat-rate plan, they would compete with V4 Pro and Kimi for the 3 Ollama Cloud slots. On OpenCode’s free tier, they get their own concurrency lane without touching Ollama quota. The only risk is model retirement or rate limits, which is why every free role has ollama-cloud/deepseek-v4-flash as its #1 fallback. If the free tier disappears, those roles drop back to the flat-rate V4 Flash automatically, with no config changes.

Monthly Cost Breakdown

Component | Cost | Notes
Ollama Cloud Pro (annual) | €15/month | Flat-rate, generous GPU-time quotas
OpenCode Zen (pay-per-use) | ~€12–15/month | Rare agents only (<10% of calls)
TOTAL | ~€30/month | 15–20h/week heavy coding

At 15–20 hours per week, this represents approximately 7–10% of the cost of using premium pay-per-use models for the same workload.

Deep Dive: How to Choose Between V4 Pro and Kimi K2.6

Benchmark Evidence (May 2026)

Benchmark | Kimi K2.6 | V4 Pro | Winner | Margin
SWE-Bench Pro (real-world multi-file PRs) | 58.6% | 55.4% | Kimi | +3.2
SWE-Bench Verified | 80.2% | 80.6% | Tie | ~0
HumanEval (function-level code) | 92.0% | 76.8% | Kimi | +15.2
LiveCodeBench (competitive programming) | 89.6% | 93.5% | V4 Pro | +3.9
GPQA Diamond (grad-level STEM) | 74.7% | 76.2% | V4 Pro | +1.5
MMLU-Pro (general knowledge) | 75.6% | 78.5% | V4 Pro | +2.9
HLE (hard tasks with tools) | 40.1% | 23.8% | Kimi | +16.3
Agentic Elo ranking | 1484 | 1554 | V4 Pro | +70
AA Index | 67.0 | 65.0 | Kimi | +2.0
Context Window | 256K | 1M | V4 Pro | -
Inference Speed | 52ms/token | 38ms/token | V4 Pro | 35% faster
Native Vision | Yes | No | Kimi | Exclusive

Task-Specific Routing

Neither model is universally better. The correct approach is routing by task:

Kimi K2.6 is superior when:

  • Multi-file code changes (SWE-Bench Pro: +3.2)
  • Function-level code generation (HumanEval: +15.2)
  • Hard tasks requiring tool use (HLE: +16.3)
  • Vision or image understanding needed
  • Agentic coordination at scale (300-agent swarms)

DeepSeek V4 Pro is superior when:

  • Maximum reasoning depth required (Agentic Elo: +70)
  • Long context essential (1M vs 256K tokens)
  • Inference speed matters (35% faster)
  • Competitive programming (LiveCodeBench: +3.9)

Real-World Lesson: Consumption Matters for High-Frequency Agents

After running this Oh My OpenAgent configuration, the Sisyphus orchestrator — which fires on every message — was consuming quota at roughly 2.7× the rate when assigned to V4 Pro versus Kimi K2.6. The fix: move the orchestrator to Kimi. V4 Pro remains the first fallback. Result: equivalent orchestration quality, one-third the quota burn.

The Orchestration Layer: Oh My OpenAgent

The open-source plugin oh-my-openagent integrates with OpenCode to provide the multi-agent architecture.

How It Works

  1. Sisyphus (orchestrator) receives every message and delegates to specialized agents
  2. Sisyphus-Junior workers handle implementation using category-specific models
  3. Task categories route to different models:
    • quick → V4 Flash Free (trivial fixes, zero cost)
    • unspecified-low → V4 Flash Free (moderate tasks, zero cost)
    • unspecified-high → Kimi K2.6 (complex coding)
    • deep → GPT-5.4 (hard problems, pay-per-use)
    • visual-engineering → Gemini 3.5 Flash (UI/UX, pay-per-use, 25% cheaper & faster than 3.1 Pro)
    • ultrabrain → Kimi K2.6 (maximum reasoning)
    • infrastructure → V4 Pro (Terraform/K8s)
    • writing → V4 Flash Free (documentation, zero cost)
  4. Fallback chains provide resilience: if a model is unavailable, the system automatically tries the next model
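The category routing in step 3 amounts to a lookup table with a precedence rule. Here is a minimal Python sketch of that idea. The model identifiers mirror the configuration file in this article; the function `model_for` and the choice of V4 Pro as the default are assumptions for illustration, not the plugin's internal code.

```python
from typing import Optional

# Category -> primary model, as assigned in this article's configuration.
CATEGORY_MODELS = {
    "quick": "opencode/deepseek-v4-flash-free",
    "unspecified-low": "opencode/deepseek-v4-flash-free",
    "unspecified-high": "ollama-cloud/kimi-k2.6",
    "deep": "opencode/gpt-5.4",
    "visual-engineering": "opencode/gemini-3.5-flash",
    "ultrabrain": "ollama-cloud/kimi-k2.6",
    "infrastructure": "ollama-cloud/deepseek-v4-pro",
    "writing": "opencode/deepseek-v4-flash-free",
}

# Assumed default: the model assigned to sisyphus-junior in this setup.
DEFAULT_MODEL = "ollama-cloud/deepseek-v4-pro"


def model_for(category: str, agent_override: Optional[str] = None) -> str:
    """Precedence: agent override, then category mapping, then default."""
    if agent_override is not None:
        return agent_override
    return CATEGORY_MODELS.get(category, DEFAULT_MODEL)


print(model_for("quick"))
print(model_for("infrastructure"))
print(model_for("something-unmapped"))
```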

Oh My OpenAgent — Agent Assignments

Agent | Primary Model | Role | Cost
Sisyphus | Kimi K2.6 | Orchestrator, every message | Flat-rate
Oracle | GPT-5.4 (high) | Architecture consulting | ~€2–3/mo
Prometheus | V4 Pro | Task planning | Flat-rate
Metis | Kimi K2.6 | Pre-planning analysis | Flat-rate
Momus | GPT-5.4 (high) | QA review | ~€5–8/mo
Librarian | V4 Flash Free | Documentation search | Free
Explore | V4 Flash Free | Codebase search | Free
Atlas | V4 Flash Free | Todo orchestration | Free
Sisyphus-Junior | V4 Pro | Implementation worker | Flat-rate
Multimodal-Looker | Gemini 3.1 Pro | Vision/image analysis | ~€1–2/mo

Fallback Chain Philosophy

Every agent in the Oh My OpenAgent configuration has an ordered fallback chain with three principles:

  1. Capability diversity: Fallback offers different reasoning style (Kimi’s tool reasoning vs V4 Pro’s agentic reasoning)
  2. Cost gradient: Flat-rate first, then pay-per-use, then free
  3. Provider diversity: Chains span ollama-cloud → opencode pay-per-use → opencode free
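An ordered fallback chain can be sketched in a few lines of Python: walk the primary and then each fallback, and take the first model that is reachable. The chain below is the Sisyphus chain from this article's configuration. The helper names and the availability check are hypothetical, and the plugin's real resolution logic may differ.

```python
from typing import Callable, List, Optional, Union

ModelRef = Union[str, dict]  # "provider/model" or {"model": ..., "variant": ...}

SISYPHUS_PRIMARY = "ollama-cloud/kimi-k2.6"
SISYPHUS_FALLBACKS: List[ModelRef] = [
    "ollama-cloud/deepseek-v4-pro",
    "ollama-cloud/glm-5.1",
    {"model": "opencode/gpt-5.4", "variant": "medium"},
    "opencode/qwen3.6-plus-free",
    "opencode/big-pickle",
]


def model_id(ref: ModelRef) -> str:
    """Extract the "provider/model" string from either entry form."""
    return ref if isinstance(ref, str) else ref["model"]


def resolve(primary: ModelRef, fallbacks: List[ModelRef],
            is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available model in chain order, or None if all are down."""
    for ref in [primary, *fallbacks]:
        candidate = model_id(ref)
        if is_available(candidate):
            return candidate
    return None


def providers(primary: ModelRef, fallbacks: List[ModelRef]) -> List[str]:
    """Distinct providers in the chain, in first-seen order."""
    seen: List[str] = []
    for ref in [primary, *fallbacks]:
        provider = model_id(ref).split("/", 1)[0]
        if provider not in seen:
            seen.append(provider)
    return seen


# Simulated Ollama Cloud outage: only opencode/* models respond.
print(resolve(SISYPHUS_PRIMARY, SISYPHUS_FALLBACKS, lambda m: m.startswith("opencode/")))
print(providers(SISYPHUS_PRIMARY, SISYPHUS_FALLBACKS))
```

A chain whose `providers` list has only one entry offers no protection against a provider-wide outage, which is a quick check worth running on any chain you edit.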

Privacy, Data Residency, and the “China” Question

“When I use Chinese models like DeepSeek or Qwen, does my data go to China?”

No. This is the most common misconception.

Where Does Inference Actually Happen?

  • Models are open-weight: DeepSeek V4, Kimi K2.6, GLM 5.1, Qwen 3.6 are released under Apache 2.0 or MIT licenses. Their weights are publicly available on HuggingFace.
  • Ollama hosts on US infrastructure: Ollama Cloud runs inference on NVIDIA GPU clusters primarily in the United States (AWS us-east-1), with capacity routing to Europe and Singapore. Your queries go to ollama.com’s US servers, not to api.deepseek.com or Chinese endpoints.
  • Architecture: Local proxy on your machine → Ollama proxy → Ollama Cloud (US/EU) → NVIDIA GPUs. No data path to China.

Data Retention Policy

Ollama’s privacy policy: “do not log, do not train, zero data retention.”

  • User prompts processed transiently
  • No logging of conversation content
  • No use of user data for model training
  • Zero data retention beyond request completion

For maximum control, Ollama’s local mode runs entirely on-device with zero external connections.

Common Pitfalls to Avoid

“I Need the Best Model for Everything”

Claude Opus and GPT-5.4 are excellent. But “best” is task-dependent. Using GPT-5.4 for code search (where V4 Flash is 90% as good) is burning $15/1M tokens on work that costs nothing extra on flat-rate. The skill is routing, not defaulting to the most expensive option.

Ignoring Consumption Levels

Level 4 models are 2–3× more expensive in quota than Level 3. If an agent fires 100× per session, that multiplier matters. Reserve Level 4 for agents where the benchmark advantage is clear and call frequency is low — or where speed/throughput justifies the cost.
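To see what that multiplier means for a busy agent, the sketch below turns the per-request estimates from the consumption analysis into a session budget. The percentages are this article's estimates from a single observed session, not published Ollama figures, and the function names are illustrative.

```python
import math

# Estimated share of one session quota per request (from the consumption analysis).
PER_REQUEST_PCT = {
    "deepseek-v4-flash": 0.06,  # Level 2
    "kimi-k2.6": 0.14,          # Level 3
    "deepseek-v4-pro": 0.38,    # Level 4
}


def requests_per_session(pct_per_request: float) -> int:
    """How many requests fit in 100% of a session quota."""
    return math.floor(100 / pct_per_request)


def session_burn_pct(calls: int, pct_per_request: float) -> float:
    """Quota share consumed by `calls` requests, rounded to one decimal."""
    return round(calls * pct_per_request, 1)


for model, pct in PER_REQUEST_PCT.items():
    print(f"{model}: {requests_per_session(pct)} requests/session, "
          f"{session_burn_pct(100, pct)}% for 100 calls")
```

An agent that fires 100 times per session uses about 14% of the quota on Kimi and about 38% on V4 Pro under these estimates.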

Treating Fallbacks as Afterthoughts

Fallback chains are part of the Oh My OpenAgent cost model. A well-designed chain means that if quotas are exhausted, the system degrades gracefully to pay-per-use rather than failing. Ordering matters: always try the cheaper alternative first.

Conclusion: The Sweet Spot for Oh My OpenAgent

Professional AI-assisted development does not require a $400/month inference budget. It requires three things:

  1. A flat-rate subscription for high-frequency bulk work (Ollama Cloud Pro at €15/month)
  2. Pay-per-use access for rare critical decisions (OpenCode Zen at ~€15/month)
  3. Smart routing through Oh My OpenAgent — each task gets the model it actually needs, not the most expensive model available

The empirical data is clear: DeepSeek V4 Pro with 208 requests consumed ~80% of a session quota. Kimi K2.6 with 139 requests consumed only ~20%. V4 Flash with 42 requests was barely visible. The difference between Level 4 and Level 3 is not theoretical; it is measurable in every quota bar. Mixture of Experts is what makes either level affordable: V4 Pro and Kimi both have massive total parameter counts but activate only about 3% of them per token, making them significantly cheaper to run per inference than dense models of comparable intelligence.

The Oh My OpenAgent configuration documented here handles 15–20 hours of heavy coding per week for approximately €30/month. It uses Kimi for orchestration, V4 Pro for implementation, Flash for search, and frontier models only where their specific strengths justify the cost.

The data does not flow to China. The models are open-weight and hosted on US infrastructure. The privacy policy is stronger than many American alternatives. And the energy consumption per token is a fraction of dense frontier models.

This is not a workaround. It is a better architecture — and this article gives you the exact configuration file to reproduce it.


The Configuration File

Here is the complete oh-my-openagent.jsonc (568 lines). Copy sections or the entire file to adapt for your own setup. Every agent assignment, fallback chain, and category routing is commented with the reasoning behind the choice.

// oh-my-openagent.jsonc
// Optimized for: 15-20h/week coding, ~30 EUR/month budget
// Strategy: Ollama Cloud Pro flat-rate (Kimi + V4 Pro) +
//           OpenCode Zen free tier for bulk work +
//           OpenCode Zen pay-per-use for critical agents only
//
// TIER 1 — Ollama Cloud Pro (flat-rate, unlimited within GPU slots)
//   Kimi K2.6        (Level 3) — Best for: orchestration,
//                                    complex refactors, vision
//   DeepSeek V4 Pro  (Level 4) — Best for: implementation,
//                                    planning, infrastructure
//   GLM-5.1          (Level 3) — Flat-rate backup / diversity
//
// TIER 2 — OpenCode Zen (pay-per-use, ~12-15 EUR/month)
//   GPT-5.4          — Oracle, Momus, Deep (critical reasoning)
//   Gemini-3.5-Flash — Visual-engineering (UI iteration, 25% cheaper
//                       than 3.1 Pro, faster responses)
//   Gemini-3.1-Pro   — Multimodal-looker (complex vision tasks)
//   Claude-Opus-4-6  — Ultrabrain fallback (max reasoning)
//
// TIER 3 — OpenCode Zen Free (zero cost, boosts parallelism)
//   DeepSeek V4 Flash Free — Bulk work: explore, librarian, atlas,
//                              quick, unspecified-low, writing
//   Qwen3.6-Plus, Nemotron-3-Super, Big-Pickle — Fallbacks
//
// === WHY FREE TIER FOR BULK WORK ===
// OpenCode Zen offers free models that are identical to flat-rate
// counterparts (e.g. DeepSeek V4 Flash Free = same model, $0).
// Using them for high-volume agents (librarian, explore, atlas)
// frees up all 3 Ollama Cloud slots for heavy models (V4 Pro + Kimi).
// Risk: free tier may change (models retired, rate limits added).
// Mitigation: every free agent has ollama-cloud/v4-flash as #1
// fallback — if the free tier disappears, you revert to current
// behavior with zero config changes.
//
// === PARALLELISM AFTER THIS CHANGE ===
// Ollama Cloud (3 slots): V4 Pro + Kimi only — no contention
// OpenCode Zen (5 slots): free flash + pay-per-use critical roles
//
// === QUOTA REALITY ===
// Ollama Cloud Pro has generous GPU-time quotas:
//   Level 1 (Small):     gpt-oss:20b-cloud    — 1.3%   per request
//   Level 2 (Medium):    deepseek-v4-flash    — 0.06%  per request
//   Level 3 (High):      kimi-k2.6, glm-5.1   — 0.15%  per request
//   Level 4 (Extra):     deepseek-v4-pro      — 0.38%  per request
// Reset: 5h/session, 7d/weekly. Pro = 50x Free tier.
// With 15-20h/week workload, this stays well within limits.

{
  "agents": {

    // SISYPHUS — Orchestrator (every message, ~100% of traffic)
    // CHOICE: Kimi K2.6 (Level 3)
    // WHY: ~2.7x cheaper per request than V4 Pro, strong enough for orchestration
    // Real-world: 139 requests = 20.6% quota (vs 208 V4 Pro = 80%)
    // Fallbacks: V4 Pro → GLM-5.1 → GPT-5.4 → Qwen-Free → Big-Pickle
    "sisyphus": {
      "model": "ollama-cloud/kimi-k2.6",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gpt-5.4", "variant": "medium" },
        "opencode/qwen3.6-plus-free",
        "opencode/big-pickle"
      ],
      "prompt_append": "CRITICAL DIRECTIVE: You are an orchestrator. Do NOT write, edit, or implement code yourself. You MUST use the `sisyphus_task` tool to delegate all implementation work to the appropriate sub-agents based on their category."
    },

    // ORACLE — Read-only consultant (critical architecture decisions)
    // CHOICE: GPT-5.4 (Zen pay-per-use, high variant)
    // WHY: Highest-stakes reasoning, architecture, debugging
    // Fallbacks: V4 Pro → Kimi → Gemini-3.1-Pro → Claude-Opus → GLM-5.1
    "oracle": {
      "model": "opencode/gpt-5.4",
      "variant": "high",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/kimi-k2.6",
        { "model": "opencode/gemini-3.1-pro", "variant": "high" },
        { "model": "opencode/claude-opus-4-6", "variant": "max" },
        "ollama-cloud/glm-5.1"
      ]
    },

    // PROMETHEUS — Planner (task decomposition)
    // CHOICE: DeepSeek V4 Pro (Level 4, flat-rate)
    // WHY: Best GDPval-AA Agentic Elo (1554), fastest inference
    // Fallbacks: Kimi → GLM-5.1 → GPT-5.4 → Qwen-Free
    "prometheus": {
      "model": "ollama-cloud/deepseek-v4-pro",
      "fallback_models": [
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gpt-5.4", "variant": "high" },
        "opencode/qwen3.6-plus-free"
      ]
    },

    // METIS — Pre-planning analyst (ambiguity detection)
    // CHOICE: Kimi K2.6 (Level 3)
    // WHY: Tool-augmented reasoning (HLE 54% vs V4 37%), swarm support
    // Fallbacks: V4 Pro → GLM-5.1 → GPT-5.4 → Qwen-Free
    "metis": {
      "model": "ollama-cloud/kimi-k2.6",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gpt-5.4", "variant": "medium" },
        "opencode/qwen3.6-plus-free"
      ]
    },

    // MOMUS — QA / Critic (plan review, quality assurance)
    // CHOICE: GPT-5.4 (Zen pay-per-use, high variant)
    // WHY: Rigorous analysis, plan evaluation, gap detection
    // Fallbacks: V4 Pro → Kimi → Claude-Opus → GLM-5.1
    "momus": {
      "model": "opencode/gpt-5.4",
      "variant": "high",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/kimi-k2.6",
        { "model": "opencode/claude-opus-4-6", "variant": "max" },
        "ollama-cloud/glm-5.1"
      ]
    },

    // LIBRARIAN — Doc search (external references, OSS lookup)
    // CHOICE: DeepSeek V4 Flash Free (Zen $0)
    // WHY: Same model as flat-rate V4 Flash, costs nothing.
    //      Frees up Ollama slots for heavy models.
    // RISK: Free tier may change. MITIGATION: ollama-cloud/v4-flash
    //      is #1 fallback — automatic revert if free tier breaks.
    // Fallbacks: ollama-cloud/v4-flash → Nemotron-Free → Big-Pickle → Qwen-Free
    "librarian": {
      "model": "opencode/deepseek-v4-flash-free",
      "temperature": 0.1,
      "fallback_models": [
        "ollama-cloud/deepseek-v4-flash",
        "opencode/nemotron-3-super-free",
        "opencode/big-pickle",
        "opencode/qwen3.6-plus-free"
      ]
    },

    // EXPLORE — Code grep (internal codebase search)
    // CHOICE: DeepSeek V4 Flash Free (Zen $0)
    // WHY: High call volume (5-20x per task). $0 cost = unlimited
    //      parallel grep without touching Ollama quota.
    // Fallbacks: ollama-cloud/v4-flash → Big-Pickle → Nemotron → Qwen
    "explore": {
      "model": "opencode/deepseek-v4-flash-free",
      "temperature": 0.1,
      "fallback_models": [
        "ollama-cloud/deepseek-v4-flash",
        "opencode/big-pickle",
        "opencode/nemotron-3-super-free",
        "opencode/qwen3.6-plus-free"
      ]
    },

    // ATLAS — Todo orchestrator (lightweight task tracking)
    // CHOICE: DeepSeek V4 Flash Free (Zen $0)
    // WHY: Todo management needs speed, not depth. Free tier
    //      provides both with zero quota impact.
    // Fallbacks: ollama-cloud/v4-flash → Kimi → Qwen → Big-Pickle
    "atlas": {
      "model": "opencode/deepseek-v4-flash-free",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-flash",
        "ollama-cloud/kimi-k2.6",
        "opencode/qwen3.6-plus-free",
        "opencode/big-pickle"
      ]
    },

    // SISYPHUS-JUNIOR — Implementation worker (all delegated tasks)
    // CHOICE: DeepSeek V4 Pro (Level 4, flat-rate)
    // WHY: Biggest quality win — all implementation flows through juniors
    // Fallbacks: Kimi → GLM-5.1 → Qwen → Nemotron → Big-Pickle
    "sisyphus-junior": {
      "model": "ollama-cloud/deepseek-v4-pro",
      "fallback_models": [
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/glm-5.1",
        "opencode/qwen3.6-plus-free",
        "opencode/nemotron-3-super-free",
        "opencode/big-pickle"
      ]
    },

    // MULTIMODAL-LOOKER — Vision analysis (images, PDFs, diagrams)
    // CHOICE: Gemini-3.1-Pro (Zen pay-per-use, high variant)
    // WHY: Gold standard for complex vision tasks.
    //      Gemini-3.5-Flash handles UI iteration; 3.1-Pro handles
    //      deep visual analysis (diagrams, mockups, screenshots).
    // Fallbacks: GPT-5.4 → Kimi (MoonViT) → GLM-5.1
    "multimodal-looker": {
      "model": "opencode/gemini-3.1-pro",
      "variant": "high",
      "fallback_models": [
        { "model": "opencode/gpt-5.4", "variant": "medium" },
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/glm-5.1"
      ]
    }
  },

  // === CATEGORIES ===
  // Categories control which model sisyphus-junior uses per task domain.
  // Precedence: Agent override > Category > System default

  "categories": {

    // QUICK — Single-file fixes, typos, trivial changes
    // CHOICE: DeepSeek V4 Flash Free (Zen $0)
    // WHY: Trivial tasks should cost nothing and not compete for
    //      Ollama slots with heavy work.
    "quick": {
      "model": "opencode/deepseek-v4-flash-free",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-flash",
        "opencode/qwen3.6-plus-free",
        "opencode/nemotron-3-super-free",
        "opencode/big-pickle"
      ]
    },

    // UNSPECIFIED-LOW — Moderate tasks, low reasoning needs
    // CHOICE: DeepSeek V4 Flash Free (Zen $0)
    // WHY: Moderate tasks don't benefit from deeper models.
    //      Free tier is the rational choice.
    "unspecified-low": {
      "model": "opencode/deepseek-v4-flash-free",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-flash",
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/glm-5.1",
        "opencode/qwen3.6-plus-free",
        "opencode/big-pickle"
      ]
    },

    // UNSPECIFIED-HIGH — Complex features, architecture
    // CHOICE: Kimi K2.6 (Level 3, flat-rate)
    // WHY: Complex multi-file = SWE-Bench Pro: Kimi +3.2 over V4 Pro.
    //      HumanEval: Kimi +15.2. Real-world PRs favor Kimi.
    "unspecified-high": {
      "model": "ollama-cloud/kimi-k2.6",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gpt-5.4", "variant": "medium" },
        "opencode/qwen3.6-plus-free"
      ]
    },

    // DEEP — Deep technical work, refactors, hard bugs
    // CHOICE: GPT-5.4 (Zen pay-per-use, medium variant)
    // WHY: Exists for tasks where "good enough" isn't acceptable.
    //      Low frequency → cost ~EUR 3-4/month.
    "deep": {
      "model": "opencode/gpt-5.4",
      "variant": "medium",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/glm-5.1",
        "opencode/qwen3.6-plus-free"
      ]
    },

    // VISUAL-ENGINEERING — UI, CSS, styling, layout, design
    // CHOICE: Gemini-3.5-Flash (Zen pay-per-use)
    // WHY: 25% cheaper than 3.1 Pro, faster responses for UI
    //      iteration. CSS/React work doesn't need deep reasoning.
    //      Gemini-3.1-Pro kept as fallback for complex visual analysis.
    "visual-engineering": {
      "model": "opencode/gemini-3.5-flash",
      "fallback_models": [
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gemini-3.1-pro", "variant": "high" }
      ]
    },

    // ULTRABRAIN — Maximum reasoning, logic-heavy tasks
    // CHOICE: Kimi K2.6 (Level 3, max variant, thinking=true)
    // WHY: Kimi IS a reasoning model — extended CoT by default.
    //      HLE with tools: 54.0% vs V4 Pro 37.7% (+16.3).
    //      "Maximum reasoning" is Kimi's core identity.
    "ultrabrain": {
      "model": "ollama-cloud/kimi-k2.6",
      "variant": "max",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-pro",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gpt-5.4", "variant": "high" },
        { "model": "opencode/claude-opus-4-6", "variant": "max" }
      ]
    },

    // WRITING — Documentation, prose, technical writing
    // CHOICE: DeepSeek V4 Flash Free (Zen $0)
    // WHY: Documentation generation is high-volume, low-stakes.
    //      Free tier handles it without touching Ollama quota.
    "writing": {
      "model": "opencode/deepseek-v4-flash-free",
      "fallback_models": [
        "ollama-cloud/deepseek-v4-flash",
        "opencode/nemotron-3-super-free",
        "opencode/qwen3.6-plus-free",
        "opencode/big-pickle"
      ]
    },

    // INFRASTRUCTURE — Terraform, K8s, Ansible, IaC
    // CHOICE: DeepSeek V4 Pro (Level 4, flat-rate)
    // WHY: IaC requires precision. V4 Pro's MMLU-Pro advantage
    //      (+2.9) matters for config syntax and API surfaces.
    "infrastructure": {
      "model": "ollama-cloud/deepseek-v4-pro",
      "fallback_models": [
        "ollama-cloud/kimi-k2.6",
        "ollama-cloud/glm-5.1",
        { "model": "opencode/gpt-5.4", "variant": "medium" },
        "opencode/qwen3.6-plus-free"
      ]
    }
  },

  // Provider concurrency: flat-rate moderate (3), pay-per-use higher (5)
  "providerConcurrency": {
    "ollama-cloud": 3,
    "opencode": 5
  },

  // Skills: registered abilities for specialized tasks
  "skills": {
    "terraform-expert": {
      "description": "IaC specialist for Terraform/OpenTofu"
    },
    "k8s-architect": {
      "description": "Kubernetes manifest specialist"
    }
  }
}

To use this configuration: save it as ~/.config/opencode/oh-my-openagent.jsonc and add "plugin": ["oh-my-openagent"] to your opencode.json. Restart OpenCode — the plugin auto-installs from npm on startup. Adjust model assignments based on your own usage patterns.

Resources

Author note: This Oh My OpenAgent configuration was built and refined through daily use. The initial 50/50 split was adjusted based on real-world quota consumption data. The orchestrator was moved to Kimi after observing ~2.7× higher per-request quota burn on V4 Pro. The result is a configuration file that is both cost-efficient and quality-optimal — not by cutting corners, but by routing each task to the right model.
