Novoyaz 20B — Pre-reform → Modern Russian Orthography on gpt-oss-20b
Novoyaz 20B is a specialized text-generation model built on top of OpenAI’s open-weight gpt-oss-20b.
Its job is very focused:
Rewrite дореформенный (pre-revolution / pre-1918) Russian text into modern Russian orthography, preserving meaning and punctuation, without adding explanations, commentary, or translation.
This repository contains:
- A task-specific model configuration (`novoyaz-20b`) set up for 4-bit inference with bitsandbytes.
- A custom inference handler (`handler.py`) used for Hugging Face Inference Endpoints, which:
  - Loads `openai/gpt-oss-20b` (or `unsloth/gpt-oss-20b-bnb-4bit` via an env flag).
  - Wraps inputs in a fixed, Russian-language instruction prompt.
  - Generates a deterministic (non-sampled) normalization.
- A LoRA adapter checkpoint under `checkpoint-60/` (trained with Unsloth + TRL on `unsloth/gpt-oss-20b-unsloth-bnb-4bit`) for advanced users who want to attach the adapter manually.

Note: The current production `handler.py` loads the base `openai/gpt-oss-20b` model and applies a strict system prompt for orthography normalization. The LoRA adapter in `checkpoint-60/` is provided for offline / custom integration and future experiments.
1. Intended Use
1.1 Primary use case
Novoyaz 20B is designed for one main task:
- Input: Russian text in pre-reform / дореформенная orthography (e.g. word-final hard signs `ъ`, letters like `ѣ`, old spellings, etc.), often coming from OCR of older printed materials.
- Output: The same text in modern Russian spelling (see the short example after this list), with:
  - Meaning preserved
  - Punctuation preserved as much as possible
  - No commentary, explanation, or translation
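For illustration, a short before/after pair (the modernized line is an illustrative rendering for this card, not a verified model output):

```text
Pre-reform: Въ началѣ бѣ Слово, и Слово бѣ къ Богу...
Modern:     В начале бе Слово, и Слово бе к Богу...
```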
The production handler uses the following instruction (simplified):
```text
Ты – модель, которая строго переписывает дореформенный русский текст
в современную орфографию, не меняя смысл и пунктуацию.
Не добавляй комментарии и не переводь текст.
```

(In English: "You are a model that strictly rewrites pre-reform Russian text into modern orthography without changing the meaning or punctuation. Do not add comments and do not translate the text.")
Typical scenarios:
- Normalizing OCR output from pre-1918 printed sources
- Preparing historical Russian texts for NLP pipelines that expect modern orthography
- Making archival / émigré Russian materials more accessible without changing their meaning
1.2 Out-of-scope / not recommended
Novoyaz 20B is not designed for:
- General chat / assistant behavior
- Translation into other languages
- Content moderation or classification
- Creative writing or long-form generation unrelated to orthography normalization
Outside this domain the model may still produce usable text, but behavior is not tuned or evaluated.
2. Quickstart
2.1 Simple inference via transformers (local)
The snippet below mirrors what handler.py does internally:
- Loads `openai/gpt-oss-20b`
- Wraps your text with a fixed instruction prefix and suffix
- Generates a non-sampled, deterministic normalization
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "openai/gpt-oss-20b"  # base model used by handler

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto" if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if not torch.cuda.is_available():
    model.config.use_cache = False

PROMPT_PREFIX = (
    "Ты – модель, которая строго переписывает дореформенный русский текст "
    "в современную орфографию, не меняя смысл и пунктуацию. "
    "Не добавляй комментарии и не переводь текст.\n\nТекст:\n"
)
PROMPT_SUFFIX = "\n\nСовременный орфографический вариант:"

pre_reform = "Въ началѣ бѣ Слово, и Слово бѣ къ Богу..."
prompt = f"{PROMPT_PREFIX}{pre_reform}{PROMPT_SUFFIX}"

inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

gen_kwargs = dict(
    do_sample=False,
    temperature=0.0,
    num_beams=1,
    max_new_tokens=512,
    repetition_penalty=1.0,
)

with torch.inference_mode():
    outputs = model.generate(**inputs, **gen_kwargs)

# Strip the prompt tokens from the generated sequence
in_len = inputs["input_ids"].shape[-1]
gen_only = outputs[0][in_len:]
modern_text = tokenizer.decode(gen_only, skip_special_tokens=True).strip()
print(modern_text)
```
You can switch to your own quantized or fine-tuned variant by changing model_id to point to a local directory or a Hugging Face model that includes your weights.
2.2 Using this repo’s Inference Endpoint / HF Inference API
If you deploy this repository (ZennyKenny/novoyaz-20b) as an Inference Endpoint, handler.py expects requests in the following format:
```json
{ "inputs": "дореформенный текст" }
```

or

```json
{ "inputs": ["текст 1", "текст 2"] }
```
Example client code:
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/ZennyKenny/novoyaz-20b"
HF_TOKEN = "hf_..."  # your token

headers = {"Authorization": f"Bearer {HF_TOKEN}"}

pre_reform = "Въ началѣ бѣ Слово, и Слово бѣ къ Богу..."
response = requests.post(API_URL, headers=headers, json={"inputs": pre_reform})
data = response.json()

# data will look like: [{"generated_text": "..."}]
print(data[0]["generated_text"])
```
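Batched inputs work the same way; continuing from the snippet above (and assuming the handler returns one `generated_text` entry per input, in order):

```python
texts = ["Въ началѣ бѣ Слово...", "Миръ всѣмъ!"]
response = requests.post(API_URL, headers=headers, json={"inputs": texts})
for item in response.json():
    print(item["generated_text"])
```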
By default, the handler:
- Uses deterministic generation (`do_sample=False`, `temperature=0.0`, `num_beams=1`).
- Limits responses to `GEN_MAX_NEW_TOKENS` (default 512; configurable via environment variable).
- Truncates very long inputs to 2048 tokens to avoid pathological cases.
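For reference, here is a minimal sketch of what such an endpoint handler can look like. This is an illustrative reconstruction, not the exact production `handler.py`; the `EndpointHandler` class with `__init__` / `__call__` is the standard interface for custom Inference Endpoints handlers:

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPT_PREFIX = "..."  # the fixed Russian instruction from section 1.1 (elided here)
PROMPT_SUFFIX = "\n\nСовременный орфографический вариант:"


class EndpointHandler:
    def __init__(self, path: str = ""):
        model_id = "openai/gpt-oss-20b"
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_id, use_fast=True, trust_remote_code=True
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
        )
        self.max_new_tokens = int(os.environ.get("GEN_MAX_NEW_TOKENS", 512))

    def __call__(self, data: dict) -> list:
        inputs = data["inputs"]
        texts = [inputs] if isinstance(inputs, str) else list(inputs)
        results = []
        for text in texts:
            prompt = f"{PROMPT_PREFIX}{text}{PROMPT_SUFFIX}"
            # Truncate very long inputs to 2048 tokens, as described above.
            enc = self.tokenizer(
                prompt, return_tensors="pt", truncation=True, max_length=2048
            ).to(self.model.device)
            with torch.inference_mode():
                out = self.model.generate(
                    **enc,
                    do_sample=False,  # deterministic generation
                    num_beams=1,
                    max_new_tokens=self.max_new_tokens,
                )
            gen = out[0][enc["input_ids"].shape[-1]:]
            results.append(
                {"generated_text": self.tokenizer.decode(gen, skip_special_tokens=True).strip()}
            )
        return results
```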
3. LoRA Adapter (checkpoint-60/)
The directory checkpoint-60/ contains a LoRA adapter trained with Unsloth’s SFT trainer and TRL on:
- Base model: `unsloth/gpt-oss-20b-unsloth-bnb-4bit`
This is exposed as a PEFT adapter that you can attach manually to a compatible base model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "unsloth/gpt-oss-20b-unsloth-bnb-4bit"
adapter_repo = "ZennyKenny/novoyaz-20b"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=True,
    trust_remote_code=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto" if torch.cuda.is_available() else None,
    trust_remote_code=True,
)

# The adapter lives in the checkpoint-60/ subfolder of this repo, so point
# PEFT at the repo id and select the subfolder explicitly.
model = PeftModel.from_pretrained(base_model, adapter_repo, subfolder="checkpoint-60")

# Then use the same PROMPT_PREFIX / PROMPT_SUFFIX scheme as above.
```
The checkpoint-60/README.md contains the auto-generated PEFT model card template; this top-level README.md provides the task-specific details and recommended usage.
4. Model Architecture & Config
Core architecture details (from config.json):
- Base architecture: `GptOssForCausalLM` (`model_type: "gpt_oss"`)
- Hidden size: 2880
- Layers: 24 transformer decoder layers
- Attention heads: 64
- Key/value heads: 8
- Mixture-of-Experts (MoE):
  - `experts_per_token`: 4
  - `num_local_experts`: 32
  - `output_router_logits`: false
- Context & RoPE:
  - `max_position_embeddings`: 131072
  - `initial_context_length`: 4096
  - `rope_scaling`: YaRN-style scaling up to 131k tokens
- Quantization config (bitsandbytes):
  - `load_in_4bit`: true
  - `bnb_4bit_quant_type`: "nf4"
  - `bnb_4bit_compute_dtype`: "bfloat16"
  - `bnb_4bit_use_double_quant`: true
  - `llm_int8_skip_modules`: ["router", "lm_head", "embed_tokens"]
This config mirrors openai/gpt-oss-20b, but includes an explicit 4-bit quantization config so the model can be loaded efficiently on consumer hardware with bitsandbytes.
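If you load the base model yourself and want the same quantization behavior explicitly, the equivalent transformers `BitsAndBytesConfig` looks roughly like this (a sketch mirroring the values above; the shipped `config.json` already embeds them):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    # Keep the MoE router and output layers out of 4-bit quantization.
    llm_int8_skip_modules=["router", "lm_head", "embed_tokens"],
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```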
5. Training & Data
5.1 Training procedure (LoRA)
The LoRA adapter in checkpoint-60/ was trained using:
- Base model: `unsloth/gpt-oss-20b-unsloth-bnb-4bit`
- Method: Supervised fine-tuning (SFT) with LoRA
- Frameworks:
  - Unsloth SFT trainer (for efficient fine-tuning and long context)
  - TRL (Hugging Face's Transformer Reinforcement Learning library)
  - Transformers, Accelerate, bitsandbytes
The auto-generated training metadata lists:
- TRL: 0.21.0
- Transformers: 4.56.0.dev0
- PyTorch: 2.8.0
- Datasets: 3.6.0
- Tokenizers: 0.21.4
(See checkpoint-60/training_args.bin and checkpoint-60/trainer_state.json for exact hyperparameters.)
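The exact training script is not published, but a minimal sketch of this kind of Unsloth + TRL SFT setup looks like the following. All hyperparameters below are placeholders, not the values actually used (consult `training_args.bin` / `trainer_state.json` for those):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

# Toy stand-in for the parallel pre-reform -> modern corpus (see section 5.2).
dataset = Dataset.from_list(
    [{"text": "Текст:\nВъ началѣ...\n\nСовременный орфографический вариант: В начале..."}]
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    max_seq_length=2048,  # placeholder
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # placeholder LoRA rank
    lora_alpha=16,  # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,  # placeholder
        gradient_accumulation_steps=4,  # placeholder
        learning_rate=2e-4,             # placeholder
        max_steps=60,                   # checkpoint-60/ may correspond to step 60
        output_dir="outputs",
    ),
)
trainer.train()
```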
5.2 Training data (task description)
The fine-tuning data is a parallel corpus of:
- Source: Russian texts written in pre-reform orthography (pre-1918, plus émigré usage).
- Target: The same texts normalized to modern orthography.
The dataset includes:
- Excerpts from public-domain historical materials that preserve pre-reform spelling.
- Manually or semi-automatically normalized versions in modern spelling.
- Augmented examples to cover edge cases (rare letters, tricky word forms, punctuation quirks, mild OCR noise).
At this time, the exact dataset is not being released; if a cleaned public subset becomes available later, it will be linked from this model card.
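As a sketch of how such parallel pairs can be rendered into SFT training strings with the same prompt scheme the handler uses (the `pre_reform` / `modern` field names are assumptions about the unreleased dataset, not its actual schema):

```python
PROMPT_PREFIX = "..."  # the fixed Russian instruction from section 1.1 (elided here)
PROMPT_SUFFIX = "\n\nСовременный орфографический вариант:"


def to_sft_example(pair: dict) -> dict:
    """Join one parallel pair into a single training string."""
    return {"text": f"{PROMPT_PREFIX}{pair['pre_reform']}{PROMPT_SUFFIX} {pair['modern']}"}


# e.g. with a datasets.Dataset of {"pre_reform": ..., "modern": ...} rows:
# dataset = dataset.map(to_sft_example)
```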
6. Evaluation
Novoyaz 20B is evaluated on fidelity of orthography conversion, not on generic benchmarks:
Key questions:
- Does the output preserve the original meaning and sentence structure?
- Does it follow modern Russian spelling norms?
- Does it avoid adding commentary, translation, or explanations?
Current evaluation is mostly qualitative and task-specific, including:
- Manual review on held-out parallel examples.
- Spot-checking OCR output from historical books.
- Comparison to base `gpt-oss-20b` without the strict normalization prompt or adapter.
No standardized numerical benchmarks are published yet.
If you run your own evaluations (e.g. on a test set of historical texts), contributions via issues / PRs are very welcome.
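As a starting point for such evaluations, here is a small self-contained sketch that scores exact match and character error rate (CER) against reference normalizations; `hyps` and `refs` are your own parallel lists:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]


def evaluate(hyps: list[str], refs: list[str]) -> dict:
    """Exact-match rate and character error rate over parallel lists."""
    exact = sum(h == r for h, r in zip(hyps, refs)) / len(refs)
    cer = sum(edit_distance(h, r) for h, r in zip(hyps, refs)) / sum(len(r) for r in refs)
    return {"exact_match": exact, "cer": cer}


print(evaluate(["В начале бе Слово"], ["В начале бе Слово"]))
# -> {'exact_match': 1.0, 'cer': 0.0}
```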
7. Limitations, Risks & Bias
7.1 Limitations
- Domain-specific: The model is tuned for pre-reform → modern Russian orthography. It is not a general chat model.
- OCR noise: Very noisy OCR (missing characters, mis-segmented words, etc.) can cause:
  - Over-aggressive "fixing" of grammar/meaning.
  - Occasional hallucinated corrections where the original glyphs are unreadable.
- Edge cases: Very archaic or regional spellings not present in the training set may be normalized incorrectly or inconsistently.
7.2 Bias & content risks
Novoyaz 20B inherits the content and biases of:
- The base model `gpt-oss-20b` and its training data.
- The historical texts used for fine-tuning (which may contain outdated or discriminatory language).
The task is narrowly defined — rewriting spelling — so harmful or biased content present in the input will usually be preserved (in modern orthography) rather than removed.
Do not rely on this model for:
- Safety filtering
- Content moderation
- Debiasing text
For sensitive domains, consider chaining Novoyaz 20B with dedicated moderation / safety models.
8. Ethical & Historical Context
The Novoyaz project exists to help:
- Researchers, students, and archivists work with pre-revolutionary and émigré Russian sources.
- NLP practitioners integrate historical Russian text into modern pipelines.
The model:
- Does not attempt to change the meaning, ideology, or tone of the text.
- Does make the spelling conform to post-1918 standards, so the text can be read and processed more easily.
Historical texts necessarily reflect the values and biases of their time.
Modernized spelling does not imply endorsement, and users should approach such content with critical historical awareness.
9. How to Cite
If you use Novoyaz 20B in academic work or products, please consider citing both:
The base model (example):
OpenAI. “gpt-oss-120b & gpt-oss-20b Model Card.” 2025.
This project (example):
Hamilton, K. “Novoyaz 20B: Orthography Normalization for Pre-reform Russian on gpt-oss-20b.” Hugging Face, 2025.
10. Contact & Contributions
- Author / maintainer: @ZennyKenny
- Issues & questions: Open an issue on the model page.
- Contributions welcome:
- Examples of failure cases / tricky spellings
- Evaluation scripts and test sets for historical Russian
- Suggestions for improved prompts or integration patterns
If you build something with Novoyaz (OCR pipelines, research tools, teaching aids, etc.), I’d love to hear about it — it helps guide what to improve next.