Introducing Cogito v2.1

Published November 19, 2025

Takeaways

  • We are releasing the best open-weight LLM by a US company: Cogito v2.1 671B
  • On most industry benchmarks and our internal evals, the model performs competitively with frontier closed and open models, while staying ahead of every other US open model.
  • We also built an interface where you can try the model: chat.deepcogito.com. This is free and we don’t store any chats.
  • The model uses significantly fewer tokens than any model of similar capability, thanks to stronger reasoning. It also improves on instruction following, coding, longer queries, multi-turn conversations, and creativity.

Release Details

The model weights1 are available on Hugging Face.

The model is also available as an API on OpenRouter, Fireworks AI, Together AI, Ollama’s cloud, Baseten and RunPod. You can also run the model locally using Ollama or Unsloth.
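
For quick experiments against a hosted endpoint, the usual OpenAI-compatible client works. The sketch below targets OpenRouter; the model slug is an assumption, and support for the enable_thinking toggle varies by provider, so check the provider's model page.

from openai import OpenAI

# Hosted-endpoint sketch (OpenRouter shown; other providers are analogous).
# The model slug below is an assumption - use the identifier your provider lists.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_API_KEY")

response = client.chat.completions.create(
    model="deepcogito/cogito-v2.1",
    messages=[
        {"role": "system", "content": "Always respond in 1-2 words."},
        {"role": "user", "content": "Who created you?"},
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)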

Evaluation

We have added evaluations on a few standard benchmarks.

(Note: while these benchmarks provide a useful signal, they do not fully capture real-world performance. That said, our models have been tested across multiple internal and external evaluations and consistently perform well.

Ultimately, the best evals are the ones closest to the user's needs. We encourage users to test the models directly on the chat interface.

We are confident that our models will stand up to such real-world evaluations and deliver strong results in practice.)

[Figure: benchmark results across standard evaluations]

The evaluations and their setup2 are kept the same across all models.

[Figure: additional benchmark results]

Cogito models are trained via process supervision on the reasoning chains. As a result, the model develops a stronger intuition for the right search trajectory during reasoning and does not need long reasoning chains to arrive at the correct answer.

Cogito v2.1 has the lowest average token usage3 among reasoning models of similar capability.

[Figure: average tokens used per benchmark instance]

Usage

Cogito v2.1 is a 671B parameter Mixture of Experts model in BF16 format, consuming approximately 1.3 TB for parameters. You will need at least 8 B200s (1 node) or 16 H200s (2 nodes) to run this model. For serving on 8 H200s, use the quantized version: deepcogito/cogito-671b-v2.1-FP8.
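
As a rough sanity check on those figures (weights only, ignoring KV cache and activation memory), a back-of-the-envelope estimate:

# Back-of-the-envelope memory for the weights alone (no KV cache / activations).
params = 671e9
print(f"BF16: {params * 2 / 1e12:.2f} TB")  # 2 bytes/param -> ~1.34 TB
print(f"FP8:  {params * 1 / 1e12:.2f} TB")  # 1 byte/param  -> ~0.67 TB, fits on 8x H200 (141 GB each)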

With HuggingFace pipeline

import torch
from transformers import pipeline

model_id = "deepcogito/cogito-671b-v2.1"
pipe = pipeline("text-generation", model=model_id, model_kwargs={"dtype": "auto"}, device_map="auto")

messages = [
    {"role": "system", "content": "Always respond in 1-2 words."},
    {"role": "user", "content": "Who created you?"},
]

## without reasoning
outputs = pipe(messages, max_new_tokens=512, tokenizer_encode_kwargs={"enable_thinking": False})
print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': 'Deep Cogito'}

## with reasoning
outputs = pipe(messages, max_new_tokens=512, tokenizer_encode_kwargs={"enable_thinking": True})
print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': 'The question is asking about my creator. I know that I\'m Cogito, an AI assistant created by Deep Cogito, which is an AI research lab. The question is very direct and can be answered very briefly. Since the user has specified to always respond in 1-2 words, I should keep my answer extremely concise.\n\nThe most accurate 2-word answer would be "Deep Cogito" - this names the organization that created me without any unnecessary details. "Deep Cogito" is two words, so it fits the requirement perfectly.\n</think>\nDeep Cogito'}
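
When reasoning is enabled, the reasoning trace and the final answer come back in a single string separated by a </think> tag, as in the output above. A minimal helper to split the two, based only on that observed format:

# Split the reasoning trace from the final answer on the </think> separator
# visible in the reasoning output above.
def split_reasoning(content):
    if "</think>" in content:
        reasoning, answer = content.split("</think>", 1)
        return reasoning.strip(), answer.strip()
    return None, content.strip()

reasoning, answer = split_reasoning(outputs[0]["generated_text"][-1]["content"])
print(answer)  # 'Deep Cogito'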

With HuggingFace AutoModel

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-671b-v2.1"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "Always respond in 1-2 words."},
    {"role": "user", "content": "Who created you?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
# To enable reasoning, set `enable_thinking=True` above.

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

With vLLM

from transformers import AutoTokenizer
from vllm import SamplingParams, LLM

model_id = "deepcogito/cogito-671b-v2.1-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=8, gpu_memory_utilization=0.95, max_model_len=16384)
sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)

prompts = ["who created you?", "how are you doing?"]

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": "Always respond in 1-2 words."}, {"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    for prompt in prompts
]
# To enable reasoning, set `enable_thinking=True` above.

out = llm.generate(prompts, sampling_params=sampling_params)
print([res.outputs[0].text for res in out])

With SGLang

Start the local endpoint with:

# H200s
python3 -m sglang.launch_server --model deepcogito/cogito-671b-v2.1-FP8 --tp 8

# B200s
python3 -m sglang.launch_server --model deepcogito/cogito-671b-v2.1-FP8 --tp 8 --quantization compressed-tensors --moe-runner-backend triton

Then query the model:

import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "Always respond in 1-2 words."},
        {"role": "user", "content": "Who created you?"},
    ],
    temperature=0.6,
    max_tokens=8192,
    extra_body = {"chat_template_kwargs": {"enable_thinking": False}}
)
# To enable reasoning, set `enable_thinking=True` above.
print(response.choices[0].message.content)
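
The same endpoint also supports streaming through the standard OpenAI client; a minimal sketch reusing the client above:

# Stream the response token-by-token instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "Always respond in 1-2 words."},
        {"role": "user", "content": "Who created you?"},
    ],
    temperature=0.6,
    max_tokens=8192,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)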

1 Similar to Cogito v2, we fork off the open-licensed DeepSeek base model from November 2024 and run post-training in-house for Cogito v2.1.

2 For SWE-Bench, we used the Agentless framework for orchestration to perform code-repair localization and patch generation in each instance's repo; OpenAI's text-embedding-3-small model was used during the RAG step. The presented numbers are Single Patch without Test (accuracy), with max output tokens for each model call set to 16384 (except for Qwen3-VL 235B, which was set to 32768, since frequent response-length violations otherwise made the comparison unusable).

Metric types: 'accuracy' for non-coding benchmarks (AIME, GPQA Diamond, MATH-500, MMLU-Pro, HLE, MMMLU), 'f1' for SimpleQA Verified.

Repeats per example: AIME (32), GPQA Diamond (8), MATH-500 (3), MMLU-Pro (1), HLE (1), SimpleQA Verified (1), MMMLU (1)

3 The graph shows the average tokens used per benchmark instance, averaged over all benchmarks.
