Novoyaz 20B — Pre-reform → Modern Russian Orthography on gpt-oss-20b
Novoyaz 20B is a specialized text-generation model built on top of OpenAI’s open-weight gpt-oss-20b.
Its job is very focused:
Rewrite дореформенный (pre-revolution / pre-1918) Russian text into modern Russian orthography, preserving meaning and punctuation, without adding explanations, commentary, or translation.
This repository contains:
- A task-specific model configuration (`novoyaz-20b`) set up for 4-bit inference with bitsandbytes.
- A custom inference handler (`handler.py`) used for Hugging Face Inference Endpoints, which:
  - Loads `openai/gpt-oss-20b` (or `unsloth/gpt-oss-20b-bnb-4bit` via an env flag).
  - Wraps inputs in a fixed, Russian-language instruction prompt.
  - Generates a deterministic (non-sampled) normalization.
- A LoRA adapter checkpoint under `checkpoint-60/` (trained with Unsloth + TRL on `unsloth/gpt-oss-20b-unsloth-bnb-4bit`) for advanced users who want to attach the adapter manually.

Note: The current production `handler.py` loads the base `openai/gpt-oss-20b` model and applies a strict system prompt for orthography normalization. The LoRA adapter in `checkpoint-60/` is provided for offline / custom integration and future experiments.
1. Intended Use
1.1 Primary use case
Novoyaz 20B is designed for one main task:
- Input: Russian text in pre-reform / дореформенная orthography (e.g. word-final hard signs `ъ`, letters like `ѣ`, old spellings, etc.), often coming from OCR of older printed materials.
- Output: The same text in modern Russian spelling (see the short example after this list), with:
  - Meaning preserved
  - Punctuation preserved as much as possible
  - No commentary, explanation, or translation
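For illustration, a short before/after pair (the modernized line is an illustrative rendering for this card, not a verified model output):

```text
Pre-reform: Въ началѣ бѣ Слово, и Слово бѣ къ Богу...
Modern:     В начале бе Слово, и Слово бе к Богу...
```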
The production handler uses the following instruction (simplified):
```text
Ты – модель, которая строго переписывает дореформенный русский текст
в современную орфографию, не меняя смысл и пунктуацию.
Не добавляй комментарии и не переводь текст.
```

(In English: "You are a model that strictly rewrites pre-reform Russian text into modern orthography without changing the meaning or punctuation. Do not add comments and do not translate the text.")
Typical scenarios:
- Normalizing OCR output from pre-1918 printed sources
- Preparing historical Russian texts for NLP pipelines that expect modern orthography
- Making archival / émigré Russian materials more accessible without changing their meaning
1.2 Out-of-scope / not recommended
Novoyaz 20B is not designed for:
- General chat / assistant behavior
- Translation into other languages
- Content moderation or classification
- Creative writing or long-form generation unrelated to orthography normalization
Outside this domain the model may still produce usable text, but behavior is not tuned or evaluated.
2. Quickstart
2.1 Simple inference via transformers (local)
The snippet below mirrors what handler.py does internally:
- Loads `openai/gpt-oss-20b`
- Wraps your text with a fixed instruction prefix and suffix
- Generates a non-sampled, deterministic normalization
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "openai/gpt-oss-20b"  # base model used by handler

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto" if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if not torch.cuda.is_available():
    model.config.use_cache = False

PROMPT_PREFIX = (
    "Ты – модель, которая строго переписывает дореформенный русский текст "
    "в современную орфографию, не меняя смысл и пунктуацию. "
    "Не добавляй комментарии и не переводь текст.\n\nТекст:\n"
)
PROMPT_SUFFIX = "\n\nСовременный орфографический вариант:"

pre_reform = "Въ началѣ бѣ Слово, и Слово бѣ къ Богу..."
prompt = f"{PROMPT_PREFIX}{pre_reform}{PROMPT_SUFFIX}"

inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

gen_kwargs = dict(
    do_sample=False,
    temperature=0.0,
    num_beams=1,
    max_new_tokens=512,
    repetition_penalty=1.0,
)

with torch.inference_mode():
    outputs = model.generate(**inputs, **gen_kwargs)

# Strip the prompt tokens from the generated sequence
in_len = inputs["input_ids"].shape[-1]
gen_only = outputs[0][in_len:]
modern_text = tokenizer.decode(gen_only, skip_special_tokens=True).strip()
print(modern_text)
```
You can switch to your own quantized or fine-tuned variant by changing model_id to point to a local directory or a Hugging Face model that includes your weights.
2.2 Using this repo’s Inference Endpoint / HF Inference API
If you deploy this repository (ZennyKenny/novoyaz-20b) as an Inference Endpoint, handler.py expects requests in the following format:
```json
{ "inputs": "дореформенный текст" }
```

or

```json
{ "inputs": ["текст 1", "текст 2"] }
```
Example client code:
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/ZennyKenny/novoyaz-20b"
HF_TOKEN = "hf_..."  # your token

headers = {"Authorization": f"Bearer {HF_TOKEN}"}

pre_reform = "Въ началѣ бѣ Слово, и Слово бѣ къ Богу..."
response = requests.post(API_URL, headers=headers, json={"inputs": pre_reform})
data = response.json()

# data will look like: [{"generated_text": "..."}]
print(data[0]["generated_text"])
```
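Batched inputs work the same way; continuing from the snippet above (and assuming the handler returns one `generated_text` entry per input, in order):

```python
texts = ["Въ началѣ бѣ Слово...", "Миръ всѣмъ!"]
response = requests.post(API_URL, headers=headers, json={"inputs": texts})
for item in response.json():
    print(item["generated_text"])
```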
By default, the handler:
- Uses deterministic generation (`do_sample=False`, `temperature=0.0`, `num_beams=1`).
- Limits responses to `GEN_MAX_NEW_TOKENS` (default 512; configurable via environment variable).
- Truncates very long inputs to 2048 tokens to avoid pathological cases.
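For reference, here is a minimal sketch of what such an endpoint handler can look like. This is an illustrative reconstruction, not the exact production `handler.py`; the `EndpointHandler` class with `__init__` / `__call__` is the standard interface for custom Inference Endpoints handlers:

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPT_PREFIX = "..."  # the fixed Russian instruction from section 1.1 (elided here)
PROMPT_SUFFIX = "\n\nСовременный орфографический вариант:"


class EndpointHandler:
    def __init__(self, path: str = ""):
        model_id = "openai/gpt-oss-20b"
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_id, use_fast=True, trust_remote_code=True
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
        )
        self.max_new_tokens = int(os.environ.get("GEN_MAX_NEW_TOKENS", 512))

    def __call__(self, data: dict) -> list:
        inputs = data["inputs"]
        texts = [inputs] if isinstance(inputs, str) else list(inputs)
        results = []
        for text in texts:
            prompt = f"{PROMPT_PREFIX}{text}{PROMPT_SUFFIX}"
            # Truncate very long inputs to 2048 tokens, as described above.
            enc = self.tokenizer(
                prompt, return_tensors="pt", truncation=True, max_length=2048
            ).to(self.model.device)
            with torch.inference_mode():
                out = self.model.generate(
                    **enc,
                    do_sample=False,  # deterministic generation
                    num_beams=1,
                    max_new_tokens=self.max_new_tokens,
                )
            gen = out[0][enc["input_ids"].shape[-1]:]
            results.append(
                {"generated_text": self.tokenizer.decode(gen, skip_special_tokens=True).strip()}
            )
        return results
```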
3. LoRA Adapter (checkpoint-60/)
The directory checkpoint-60/ contains a LoRA adapter trained with Unsloth’s SFT trainer and TRL on:
- Base model: `unsloth/gpt-oss-20b-unsloth-bnb-4bit`
This is exposed as a PEFT adapter that you can attach manually to a compatible base model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "unsloth/gpt-oss-20b-unsloth-bnb-4bit"
adapter_repo = "ZennyKenny/novoyaz-20b"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=True,
    trust_remote_code=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto" if torch.cuda.is_available() else None,
    trust_remote_code=True,
)

# The adapter lives in the checkpoint-60/ subfolder of this repo, so point
# PEFT at the repo id and select the subfolder explicitly.
model = PeftModel.from_pretrained(base_model, adapter_repo, subfolder="checkpoint-60")

# Then use the same PROMPT_PREFIX / PROMPT_SUFFIX scheme as above.
```
The checkpoint-60/README.md contains the auto-generated PEFT model card template; this top-level README.md provides the task-specific details and recommended usage.
4. Model Architecture & Config
Core architecture details (from config.json):
- Base architecture: `GptOssForCausalLM` (`model_type: "gpt_oss"`)
- Hidden size: 2880
- Layers: 24 transformer decoder layers
- Attention heads: 64
- Key/value heads: 8
- Mixture-of-Experts (MoE):
  - `experts_per_token`: 4
  - `num_local_experts`: 32
  - `output_router_logits`: false
- Context & RoPE:
  - `max_position_embeddings`: 131072
  - `initial_context_length`: 4096
  - `rope_scaling`: YaRN-style scaling up to 131k tokens
- Quantization config (bitsandbytes):
  - `load_in_4bit`: true
  - `bnb_4bit_quant_type`: "nf4"
  - `bnb_4bit_compute_dtype`: "bfloat16"
  - `bnb_4bit_use_double_quant`: true
  - `llm_int8_skip_modules`: ["router", "lm_head", "embed_tokens"]
This config mirrors openai/gpt-oss-20b, but includes an explicit 4-bit quantization config so the model can be loaded efficiently on consumer hardware with bitsandbytes.
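If you load the base model yourself and want the same quantization behavior explicitly, the equivalent transformers `BitsAndBytesConfig` looks roughly like this (a sketch mirroring the values above; the shipped `config.json` already embeds them):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    # Keep the MoE router and output layers out of 4-bit quantization.
    llm_int8_skip_modules=["router", "lm_head", "embed_tokens"],
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```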
5. Training & Data
5.1 Training procedure (LoRA)
The LoRA adapter in checkpoint-60/ was trained using:
- Base model: `unsloth/gpt-oss-20b-unsloth-bnb-4bit`
- Method: Supervised fine-tuning (SFT) with LoRA
- Frameworks:
  - Unsloth SFT trainer (for efficient fine-tuning and long context)
  - TRL (Hugging Face's Transformer Reinforcement Learning library)
  - Transformers, Accelerate, bitsandbytes
The auto-generated training metadata lists:
- TRL: 0.21.0
- Transformers: 4.56.0.dev0
- PyTorch: 2.8.0
- Datasets: 3.6.0
- Tokenizers: 0.21.4
(See checkpoint-60/training_args.bin and checkpoint-60/trainer_state.json for exact hyperparameters.)
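The exact training script is not published, but a minimal sketch of this kind of Unsloth + TRL SFT setup looks like the following. All hyperparameters below are placeholders, not the values actually used (consult `training_args.bin` / `trainer_state.json` for those):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

# Toy stand-in for the parallel pre-reform -> modern corpus (see section 5.2).
dataset = Dataset.from_list(
    [{"text": "Текст:\nВъ началѣ...\n\nСовременный орфографический вариант: В начале..."}]
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    max_seq_length=2048,  # placeholder
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # placeholder LoRA rank
    lora_alpha=16,  # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,  # placeholder
        gradient_accumulation_steps=4,  # placeholder
        learning_rate=2e-4,             # placeholder
        max_steps=60,                   # checkpoint-60/ may correspond to step 60
        output_dir="outputs",
    ),
)
trainer.train()
```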
5.2 Training data (task description)
The fine-tuning data is a parallel corpus of:
- Source: Russian texts written in pre-reform orthography (pre-1918, plus émigré usage).
- Target: The same texts normalized to modern orthography.
The dataset includes:
- Excerpts from public-domain historical materials that preserve pre-reform spelling.
- Manually or semi-automatically normalized versions in modern spelling.
- Augmented examples to cover edge cases (rare letters, tricky word forms, punctuation quirks, mild OCR noise).
At this time, the exact dataset is not being released; if a cleaned public subset becomes available later, it will be linked from this model card.
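As a sketch of how such parallel pairs can be rendered into SFT training strings with the same prompt scheme the handler uses (the `pre_reform` / `modern` field names are assumptions about the unreleased dataset, not its actual schema):

```python
PROMPT_PREFIX = "..."  # the fixed Russian instruction from section 1.1 (elided here)
PROMPT_SUFFIX = "\n\nСовременный орфографический вариант:"


def to_sft_example(pair: dict) -> dict:
    """Join one parallel pair into a single training string."""
    return {"text": f"{PROMPT_PREFIX}{pair['pre_reform']}{PROMPT_SUFFIX} {pair['modern']}"}


# e.g. with a datasets.Dataset of {"pre_reform": ..., "modern": ...} rows:
# dataset = dataset.map(to_sft_example)
```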
6. Evaluation
Novoyaz 20B is evaluated on fidelity of orthography conversion, not on generic benchmarks:
Key questions:
- Does the output preserve the original meaning and sentence structure?
- Does it follow modern Russian spelling norms?
- Does it avoid adding commentary, translation, or explanations?
Current evaluation is mostly qualitative and task-specific, including:
- Manual review on held-out parallel examples.
- Spot-checking OCR output from historical books.
- Comparison to base `gpt-oss-20b` without the strict normalization prompt or adapter.
No standardized numerical benchmarks are published yet.
If you run your own evaluations (e.g. on a test set of historical texts), contributions via issues / PRs are very welcome.
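As a starting point for such evaluations, here is a small self-contained sketch that scores exact match and character error rate (CER) against reference normalizations; `hyps` and `refs` are your own parallel lists:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]


def evaluate(hyps: list[str], refs: list[str]) -> dict:
    """Exact-match rate and character error rate over parallel lists."""
    exact = sum(h == r for h, r in zip(hyps, refs)) / len(refs)
    cer = sum(edit_distance(h, r) for h, r in zip(hyps, refs)) / sum(len(r) for r in refs)
    return {"exact_match": exact, "cer": cer}


print(evaluate(["В начале бе Слово"], ["В начале бе Слово"]))
# -> {'exact_match': 1.0, 'cer': 0.0}
```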
7. Limitations, Risks & Bias
7.1 Limitations
- Domain-specific: The model is tuned for pre-reform → modern Russian orthography. It is not a general chat model.
- OCR noise: Very noisy OCR (missing characters, mis-segmented words, etc.) can cause:
  - Over-aggressive "fixing" of grammar/meaning.
  - Occasional hallucinated corrections where the original glyphs are unreadable.
- Edge cases: Very archaic or regional spellings not present in the training set may be normalized incorrectly or inconsistently.
7.2 Bias & content risks
Novoyaz 20B inherits the content and biases of:
- The base model `gpt-oss-20b` and its training data.
- The historical texts used for fine-tuning (which may contain outdated or discriminatory language).
The task is narrowly defined — rewriting spelling — so harmful or biased content present in the input will usually be preserved (in modern orthography) rather than removed.
Do not rely on this model for:
- Safety filtering
- Content moderation
- Debiasing text
For sensitive domains, consider chaining Novoyaz 20B with dedicated moderation / safety models.
8. Ethical & Historical Context
The Novoyaz project exists to help:
- Researchers, students, and archivists work with pre-revolutionary and émigré Russian sources.
- NLP practitioners integrate historical Russian text into modern pipelines.
The model:
- Does not attempt to change the meaning, ideology, or tone of the text.
- Does make the spelling conform to post-1918 standards, so the text can be read and processed more easily.
Historical texts necessarily reflect the values and biases of their time.
Modernized spelling does not imply endorsement, and users should approach such content with critical historical awareness.
9. How to Cite
If you use Novoyaz 20B in academic work or products, please consider citing both:
The base model (example):
OpenAI. “gpt-oss-120b & gpt-oss-20b Model Card.” 2025.
This project (example):
Hamilton, K. “Novoyaz 20B: Orthography Normalization for Pre-reform Russian on gpt-oss-20b.” Hugging Face, 2025.
10. Contact & Contributions
- Author / maintainer: @ZennyKenny
- Issues & questions: Open an issue on the model page.
- Contributions welcome:
- Examples of failure cases / tricky spellings
- Evaluation scripts and test sets for historical Russian
- Suggestions for improved prompts or integration patterns
If you build something with Novoyaz (OCR pipelines, research tools, teaching aids, etc.), I’d love to hear about it — it helps guide what to improve next.