Cohere: North Mini Code model

Explore Cohere's North Mini Code, a 30B parameter MoE model for agentic software engineering. Learn its architecture, training, and when it's a good fit for your coding tasks.

North Mini Code MoE OpenCode vLLM Hugging Face North Mini Code (30B MoE, 3B active) Agent harness Live Demo vLLM for serving rollouts during async RL SWE-Agent and mini-SWE-agent harnesses for training and eval Harbor for containerized agent environments SWE-Bench Verified and Terminal-Bench v2 for benchmarking

Overview

North Mini Code is a 30B parameter MoE coding model (3B active) that Cohere released June 9 under Apache 2.0, trained specifically for agentic software engineering. We will demo it live in OpenCode so you can watch it work through an agentic coding task end to end. Alongside the live agent session, we’ll walk through the architecture and the post-training pipeline that got it there: two stages of SFT followed by async RLVR across terminal and SWE environments.

Links

https://huggingface.co/CohereLabs/North-Mini-Code-1.0
North Mini Code is an agentic coding model featuring interleaved thinking.

Tech stack

North Mini Code

Cohere's open-weights 30B Mixture-of-Experts model designed to run high-performance, agentic software engineering tasks on local hardware.

North Mini Code is a decoder-only, sparse Mixture-of-Experts (MoE) model built specifically for local, agentic coding workflows. Released under an Apache 2.0 license, it packs 30 billion total parameters but activates only 3 billion per token, giving developers the speed of a lightweight model alongside the reasoning capacity of a much larger system. It handles a massive 256K token context window and generates up to 64K tokens of continuous output, making it highly effective for parsing entire codebases, executing complex terminal-based agent tasks, and powering self-hosted software engineering pipelines on a single GPU.

https://huggingface.co/CohereLabs/North-Mini-Code-1.0

View projects
MoE

MoE scales model capacity by activating only a sparse subset of specialized parameters for each input token.

Mixture of Experts (MoE) replaces dense feed-forward layers with a collection of specialized sub-networks (experts) managed by a gating mechanism. This architecture allows models like Mixtral 8x7B or GPT-4 to scale to trillions of parameters while maintaining the inference latency of much smaller models. By routing each token to the top-2 most relevant experts, the system maximizes computational efficiency: it increases total parameter count by 10x or more without a linear increase in FLOPs (floating point operations). This sparse activation strategy is the primary driver for current state-of-the-art performance in large language models.

https://arxiv.org/abs/1701.06538

View projects
OpenCode

OpenCode is the open-source AI coding agent (CLI tool), integrating LLMs like GPT-5 and Claude Sonnet 4 directly into the terminal for fast, context-aware development.

OpenCode is the open-source AI coding agent, built for terminal-first developers who demand speed and privacy. It connects your local files, Git history, and a choice of LLMs (e.g., OpenAI's GPT-5 Nano, Anthropic's Claude Sonnet 4) to execute complex tasks directly from the command line . The tool bypasses IDE and browser dependencies, allowing developers to triage issues, fix errors, or implement features with commands like `opencode fix error in main.go` . With over 26,000 GitHub stars by October 2025, OpenCode delivers a secure, context-aware coding partner that keeps your code local and your workflow efficient .

https://opencode.ai

View projects
vLLM

vLLM is the high-throughput, memory-efficient LLM inference engine: it leverages PagedAttention to maximize GPU utilization and cut serving costs.

This is the engine for scaling LLM inference: vLLM (Virtual Large Language Model) is an open-source library engineered for high-throughput and low-latency serving. Its core innovation is PagedAttention, a memory management technique inspired by OS virtual memory, which efficiently handles the Key-Value (KV) cache. This optimization drastically reduces memory overhead—up to 90% in some reported cases—and allows for continuous batching of requests. The result: significantly higher request capacity on the same hardware, lower GPU usage, and a production-ready, cost-effective serving system that supports popular models like Llama and Mistral, complete with an OpenAI-compatible API server.

https://vllm.ai/

View projects
Hugging Face

Hugging Face is the central, open-source platform and community for building AI applications, hosting over 300,000 models and datasets via the popular Transformers library.

Hugging Face functions as the 'GitHub for machine learning,' providing a massive, collaborative Hub for AI assets (models, datasets, and demos). Its core technology is the open-source **Transformers** Python library, which simplifies the use of state-of-the-art models (e.g., BERT, GPT) for various tasks: natural language processing, computer vision, and audio. The platform hosts over 300,000 models and thousands of datasets, streamlining the entire ML workflow from research to deployment via **Spaces** (interactive demos). This ecosystem makes advanced AI accessible, efficient, and reproducible for developers and enterprises globally.

https://huggingface.co

View projects
North Mini Code (30B MoE, 3B active)

Cohere's open-weight 30B Mixture-of-Experts model designed to run high-performance, agentic software engineering tasks on local hardware.

Built specifically for developer workflows, North Mini Code utilizes a sparse Mixture-of-Experts (MoE) architecture with 128 experts (activating 8 per token) to deliver the punch of a 30B model with the speed and resource footprint of a 3B active parameter footprint. It features a massive 256K token context window and is optimized for complex multi-step tasks like terminal operations, system architecture mapping, and sub-agent orchestration. Released under an Apache 2.0 license, this model runs efficiently on a single H100 GPU (at FP8 precision) to bring robust, enterprise-grade coding intelligence directly to local developer environments.

https://huggingface.co/CohereLabs/North-Mini-Code-1.0

View projects
Agent harness

An agent harness is the operational infrastructure wrapping a raw language model to manage tool execution, state, and sandboxed environments, turning static text generation into an autonomous work engine.

While raw language models excel at text generation, they cannot natively execute code, manage persistent state, or call external APIs. The agent harness bridges this gap by serving as the execution layer that orchestrates system prompts, manages tool registries (like the Model Context Protocol), and runs isolated sandboxes. By handling the execution loops and error recovery that models cannot manage alone, a well-engineered harness can dramatically swing benchmark performance (such as LangChain's 13.7-point jump on Terminal-Bench 2.0) without changing the underlying model. This infrastructure layer is what ultimately transforms a static LLM into a reliable, production-ready autonomous agent.

https://langchain-ai.github.io/posts/anatomy-of-an-agent-harness/

View projects
Live Demo

LiveDemo is an AI-powered platform that lets go-to-market teams capture, customize, and deploy interactive product demonstrations in seconds.

Software sales move fast, and static screenshots do not close deals. LiveDemo solves this by letting sales, marketing, and customer success teams clone their web application front-end with a single click (no engineering resources required). Users record a standard walkthrough, and the platform automatically packages it into a fully interactive, sandbox-like guided tour. Teams can easily swap out data, customize logos for specific prospects, and embed the finished product directly onto websites or share personalized links post-call to keep sales cycles moving.

https://www.livedemo.live

View projects
vLLM for serving rollouts during async RL

vLLM decouples generation and training in RLHF pipelines, using an AsyncLLMEngine to serve rollouts continuously while trainers update weights mid-flight.

Standard reinforcement learning pipelines waste massive amounts of compute because training accelerators sit idle during rollout generation, and vice versa. vLLM solves this bottleneck by running generation and training as parallel coroutines. Using the AsyncLLMEngine alongside native weight-syncing APIs (utilizing NCCL or IPC), vLLM continuously streams rollout data to a shared buffer while the trainer updates the model. When new weights are ready, vLLM pauses generation with a specialized 'keep' mode, swaps the weights mid-flight, and resumes the active generation stream without losing in-flight requests. This asynchronous execution eliminates pipeline bubbles, drastically boosting GPU utilization and training throughput.

https://docs.vllm.ai/en/stable/features/async_rl.html

View projects
SWE-Agent and mini-SWE-agent harnesses for training and eval

An open source agentic framework and lightweight harness designed to run, train, and evaluate language models on real-world software engineering tasks.

SWE-agent turns language models into autonomous coding agents capable of resolving GitHub issues within secure, sandboxed environments. While the original system uses an Agent-Computer Interface (ACI) to let models browse files and execute tests, the newer mini-swe-agent streamlines this architecture into a radically simple 100-line Python implementation. Together, these harnesses serve as the standard infrastructure for running evaluations on SWE-bench and generating high-quality trajectory datasets for supervised fine-tuning (SFT) and reinforcement learning.

https://github.com/princeton-nlp/SWE-agent

View projects
Harbor for containerized agent environments

Harbor is an open-source framework for running, evaluating, and optimizing AI agents inside secure, containerized sandbox environments.

Built by the creators of Terminal-Bench, Harbor provides a unified harness to evaluate and optimize AI agents across thousands of isolated sandboxes. The framework ships with out-of-the-box support for popular agents (including Claude Code, OpenHands, and Codex CLI) and standard benchmarks like SWE-Bench and Terminal-Bench-2.0. By decoupling the agent from its execution layer, Harbor allows developers to run massive parallel evaluations locally via Docker or scale horizontally using cloud providers like Daytona, Modal, and E2B. It is the go-to infrastructure for generating clean rollout data, optimizing prompts, and training agents through reinforcement learning.

https://github.com/harbor-framework/harbor

View projects
SWE-Bench Verified and Terminal-Bench v2 for benchmarking

SWE-bench Verified and Terminal-Bench v2 serve as the industry-standard gauntlets for testing AI agents on real-world software engineering and sandboxed command-line operations.

Evaluating AI agents requires testing them on practical, messy engineering tasks rather than isolated code snippets. SWE-bench Verified solves this by offering a human-vetted subset of 500 real-world GitHub issues (drawn from major Python repositories like Django and pandas) to ensure tasks are clear and solvable. To complement this, Terminal-Bench v2 tests end-to-end command-line competence by requiring agents to execute complex system administration, compilation, and environment-setup tasks inside sandboxed containers. Together, these benchmarks provide the most reliable, reproducible metrics for measuring how effectively modern LLMs can operate as autonomous software engineers.

https://github.com/princeton-nlp/SWE-bench

View projects