Model Card for dmitry315/llm-course-hw2-ppo

An LLM trained with PPO for the VK NLP course.

PPO is challenging to tune, so the resulting model quality is limited.

Model Details

Model Description

The base model is HuggingFaceTB/SmolLM-135M-Instruct.

The reward model is the same base model, fine-tuned to predict rewards on the training data.
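
A minimal sketch of how such a reward model can be trained with TRL's RewardTrainer. The preference dataset name is a placeholder, the hyperparameters are illustrative, and argument names (e.g. processing_class vs. tokenizer) vary across TRL versions:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# Same architecture as the policy, with a scalar head that scores a response.
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical preference dataset with "chosen" / "rejected" columns.
train_dataset = load_dataset("your/preference-dataset", split="train")

trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(output_dir="reward-model", per_device_train_batch_size=4),
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()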

The LLM was then trained with the PPO procedure against this reward model, using the TRL trainer.
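
A minimal sketch of the PPO loop, following the classic TRL PPOTrainer API (argument names differ across TRL versions); prompt_dataloader and score are hypothetical stand-ins for the prompt data and the reward-model forward pass:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# Policy with an added value head, as required by PPO.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(base)

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=8, mini_batch_size=4),
    model=policy,
    tokenizer=tokenizer,
)

for batch in prompt_dataloader:  # yields tokenized prompts; hypothetical
    query_tensors = [q for q in batch["input_ids"]]
    # Sample responses from the current policy.
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=64)
    texts = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # Score each response with the trained reward model (one scalar per sample);
    # `score` is a hypothetical wrapper around the reward model.
    rewards = [torch.tensor(score(t)) for t in texts]
    # One PPO optimization step on (query, response, reward) triples.
    ppo_trainer.step(query_tensors, response_tensors, rewards)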

Getting started

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained("dmitry315/llm-course-hw2-ppo")
check_model = AutoModelForCausalLM.from_pretrained("dmitry315/llm-course-hw2-ppo")
check_model = check_model.to(device)
check_model = check_model.eval()

messages = [{"role": "user", "content": "What's your morning routine like?"}]

# Build the chat prompt and tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = check_model.generate(model_inputs.input_ids, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and decode only the generated continuation.
generated_ids = generated_ids[:, model_inputs.input_ids.shape[1]:]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# > As a digital AI assistant, I don't have personal experiences or emotions, but I can provide you with a general morning routine that many people follow.
Model size: 0.1B params, F32 tensors (Safetensors format).
