Model Card for dmitry315/llm-course-hw2-ppo

An LLM trained with PPO for the VK NLP course.

PPO is challenging to tune, so the resulting model quality is limited.

Model Details

Model Description

The base model is HuggingFaceTB/SmolLM-135M-Instruct.

The reward model is the same base model, fine-tuned to predict rewards on the training data.
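
A minimal sketch of how such a reward model can be trained with TRL's RewardTrainer. The preference dataset name is a placeholder, the hyperparameters are illustrative, and argument names (e.g. processing_class vs. tokenizer) vary across TRL versions:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# Same architecture as the policy, with a scalar head that scores a response.
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical preference dataset with "chosen" / "rejected" columns.
train_dataset = load_dataset("your/preference-dataset", split="train")

trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(output_dir="reward-model", per_device_train_batch_size=4),
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()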

The LLM was then trained with the PPO procedure against this reward model, using the TRL trainer.
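
A minimal sketch of the PPO loop, following the classic TRL PPOTrainer API (argument names differ across TRL versions); prompt_dataloader and score are hypothetical stand-ins for the prompt data and the reward-model forward pass:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# Policy with an added value head, as required by PPO.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(base)

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=8, mini_batch_size=4),
    model=policy,
    tokenizer=tokenizer,
)

for batch in prompt_dataloader:  # yields tokenized prompts; hypothetical
    query_tensors = [q for q in batch["input_ids"]]
    # Sample responses from the current policy.
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=64)
    texts = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # Score each response with the trained reward model (one scalar per sample);
    # `score` is a hypothetical wrapper around the reward model.
    rewards = [torch.tensor(score(t)) for t in texts]
    # One PPO optimization step on (query, response, reward) triples.
    ppo_trainer.step(query_tensors, response_tensors, rewards)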

Getting started

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained("dmitry315/llm-course-hw2-ppo")
check_model = AutoModelForCausalLM.from_pretrained("dmitry315/llm-course-hw2-ppo")
check_model = check_model.to(device)
check_model = check_model.eval()

messages = [{"role": "user", "content": "What's your morning routine like?"}]

# Build the chat prompt and tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = check_model.generate(model_inputs.input_ids, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and decode only the generated continuation.
generated_ids = generated_ids[:, model_inputs.input_ids.shape[1]:]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# > As a digital AI assistant, I don't have personal experiences or emotions, but I can provide you with a general morning routine that many people follow.
Model size: 0.1B params, F32 tensors (Safetensors format).
