---
license: apache-2.0
base_model:
- ByteDance-Seed/Seed-OSS-36B-Instruct
---

# RWKV-Seed-OSS-36B-hxa079

**Acknowledgment**
This project received computational resources and technical support from **Recursal.AI**. I'm deeply grateful for their support!

This is an experimental model that converts most of the base Transformer LLM's attention layers to RWKV linear attention using the **RADLADS** method.

---

## Model Overview

* **Model Name:** RWKV-Seed-OSS-36B-hxa079
* **Architecture:** RWKV “hxa079+” hybrid — RWKV-Attention strategically interleaved with NoPE FullAttention
* **Base Model:** ByteDance-Seed/Seed-OSS-36B-Instruct
* **Model Revision:** alpha
* **Parameters:** ~37.1B
* **Context Window (Passkey):** 130k

---

## Architecture Details

* **RWKV Layers:** Interleaved RWKV blocks based on the `hxa079` design
* **Transformer Layers:** Placed at strategic depths to enhance long-context performance
* **Hybrid Design:**
  * RWKV provides temporal decay and efficient recurrent-style state handling
  * NoPE (No Positional Embedding) FullAttention augments global reasoning without redundant positional encoding
* **LoRA Customization (ranks):**
  * Decay: 448
  * ICLR: 192
  * Value Residual Mix: 128
  * Key Residual Mix: 128
  * Gate: 576
* **RoPE Usage:** Enabled (`use_rope: true`), aligning positional encoding with the RWKV blocks

---

## Key Hyperparameters

* Hidden Size: 5120
* Intermediate Size: 27,648
* Head Dimension: 128
* Attention Heads: 80
* Key/Value Heads: 8
* Hidden Layers: 64
* Max Position Embeddings: 524,288
* Activation: SiLU
* Dropout: 0.1 (residual & attention)
* Bias: Disabled for MLP & attention output

---

## Evaluation

Performance evaluation is ongoing. The model shows promising results in:

- Maintaining base model capabilities while achieving linear attention efficiency
- Significantly improved needle-in-a-haystack performance compared to pure RWKV architectures
- Competitive performance on standard language modeling benchmarks

Current numbers (base model in parentheses):

- MMLU: 78.39% (base: 82.41%)
- GSM8K: 86.88% (base: 93.93%), with `gentoken=2048`
- Passkey: 130k+ (base: 500k); a toy retrieval sketch appears after the Transformers usage example below

## Usage with RWKV-Infer

- **RWKV-Infer**, a Triton-based hybrid RWKV inference engine; usage instructions are available at: [https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F](https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F)

## Usage with Hugging Face Transformers

Requires the `flash-linear-attention` package:

```bash
pip install flash-linear-attention
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley. I can't remember the name or the youtube URL that people use to link as an example url. What's the song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
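For interactive use, it can be more convenient to stream tokens as they are generated rather than waiting for the full completion. The sketch below reuses the `model`, `tokenizer`, and `model_inputs` from the example above together with `transformers`' built-in `TextStreamer`; the sampling settings are illustrative defaults, not tuned recommendations for this model.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced, skipping the prompt echo.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,  # illustrative sampling values, not tuned for this model
    top_p=0.9,
    streamer=streamer,
)
```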
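The passkey figure in the Evaluation section comes from a long-context retrieval test. As a rough, self-contained illustration (not the harness used for the reported numbers), the sketch below hides a random passkey inside filler text and asks the model to find it, reusing the `model` and `tokenizer` loaded above; the filler length here is deliberately small so it runs quickly and can be scaled up toward the 130k context.

```python
import random

# Toy needle-in-a-haystack probe (illustrative only; not the official passkey benchmark).
passkey = str(random.randint(10000, 99999))
filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. " * 200
haystack = filler + f"The pass key is {passkey}. Remember it. {passkey} is the pass key. " + filler
question = "What is the pass key mentioned in the text above? Reply with the number only."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": haystack + "\n\n" + question},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
answer = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("expected:", passkey, "| model answered:", answer)
```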
## Code Repositories

- **RADLADS Project Code:** The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
- **ARWKV Project Code:** The original ARWKV training code can be found at: [https://github.com/yynil/RWKVInside](https://github.com/yynil/RWKVInside)
- **Specific Training Code (OpenMOSE):** The training code for this particular model is available at: [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (Note: this repository is still under development and may contain bugs.)

## Model Card Contact

OpenMOSE - 2025