Photo by Jongsun Lee on Unsplash

RLHF Training Pipeline for LLMs Using Huggingface 🤗

Learn how to develop your own domain-specific LLM with this Python hands-on guide

Vijayasri Iyer
7 min read · Nov 27, 2023



This blog post was written by Marcello Politi and Vijayasri Iyer and has been published in Towards AI.


By now, everyone is talking about generative AI and Large Language Models. Models such as ChatGPT and Grok have become household names today, and there are many people who want to adopt solutions based on these technologies to improve their businesses.

It must be said, however, that although the language capabilities of these models are impressive, they are still far from perfect; indeed, there are many major problems that we still cannot solve.

LLMs, like all machine learning and deep learning models, learn from data. Therefore, there is no escaping the "garbage in, garbage out" rule: if we train the models on low-quality data, the quality of the output at inference time will be equally low.

This is the main reason why responses with biases (or prejudices) occur during conversations with LLMs.

However, there are techniques that give us more control over the output of these models. They aim at LLM alignment, so that the model’s responses are not only accurate and coherent but also safe, ethical, and desirable from the perspective of developers and users. The most commonly used technique nowadays is reinforcement learning.

Reinforcement learning with human feedback

Image By Authors

Reinforcement learning with human feedback (RLHF), which has attracted a lot of attention recently, has started a new wave of RL applications in NLP, especially for large language models (LLMs). In this blog post, we will walk through the complete RLHF training pipeline for an LLM using the Hugging Face libraries.

The RLHF pipeline consists of 3 phases:

  • Domain-specific pre-training: Fine-tune a pre-trained LLM on raw text with a causal language modeling objective.
  • Supervised fine-tuning: Fine-tune the domain-specific LLM on task-specific as well as domain-specific (prompt/instruction, response) pairs.
  • RLHF
      • Reward model training: Train a language model to classify responses as good or bad (thumbs up, thumbs down).
      • RLHF fine-tuning: Use the reward model, trained on (prompt, good_response, bad_response) data labeled by human experts, to align the responses of the LLM.

Domain Specific Pre-training

Domain-specific pre-training is a step where you provide your language model with domain knowledge of its ultimate application area. In this step, the model is fine-tuned using causal language modeling (next-token prediction), much like training a model from scratch on a corpus of raw domain-specific text. In this case, however, much less data is required, given that the model has already been pre-trained on trillions of tokens. Below is an implementation of the domain-specific pre-training method:

#Load the dataset
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then, we will split them into examples of a certain sequence length. This way, the model will receive chunks of contiguous text.

from transformers import AutoTokenizer

# Assumption: the source never defines model_checkpoint; any causal LM checkpoint on the Hub works here.
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

# Assumption: block_size is not defined in the source; 128 is an illustrative value.
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are a copy of the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True, num_proc=4)
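To make the chunking step concrete, here is a minimal sketch of the same concatenate-then-split logic on toy token IDs. The helper name chunk_tokens and the tiny block size are assumptions made for illustration; they are not part of the Hugging Face API.

```python
from itertools import chain

def chunk_tokens(token_lists, block_size):
    """Concatenate tokenized texts and split them into fixed-size blocks.

    Hypothetical helper for illustration. The trailing remainder that does
    not fill a whole block is dropped, as in group_texts.
    """
    flat = list(chain.from_iterable(token_lists))
    usable = (len(flat) // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]

# Three "documents" of different lengths, already tokenized to integer IDs.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
blocks = chunk_tokens(toy_docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that block boundaries ignore document boundaries: the first block mixes tokens from the first two documents, and the last two tokens (9 and 10) are dropped.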

Now that we have tokenized our dataset, we can start the training process by instantiating the trainer.

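from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
model_name = model_checkpoint.split("/")[-1]
# Both calls are truncated in the source; the arguments below are a minimal reconstruction,
# and the output directory name is an assumption.
training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
)
trainer = Trainer(
    model=model, args=training_args,
    train_dataset=lm_datasets["train"], eval_dataset=lm_datasets["validation"],
)
trainer.train()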

Once the training is complete, the evaluation can be run as follows:

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
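Perplexity is the exponential of the average per-token cross-entropy loss, which is why the snippet above exponentiates eval_loss. The following sketch computes it by hand for made-up token probabilities; the function name and the numbers are assumptions for illustration only.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]  # per-token negative log-likelihood
    mean_loss = sum(nll) / len(nll)            # what the Trainer reports as eval_loss
    return math.exp(mean_loss)

# A model that spreads its probability evenly over 4 candidate tokens.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is fairly sure about each true token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(f"uniform: {uniform:.2f}, confident: {confident:.2f}")
```

A model that guesses uniformly among 4 tokens has a perplexity of about 4, and a more confident model gets closer to 1, so lower is better.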

Supervised fine-tuning

The output of the domain-specific pre-training step is a model that can recognize the context of input text and predict the next words or sentences. In other words, it behaves like a text-completion model; it is not designed to respond to prompts. Supervised fine-tuning with (prompt, response) pairs, also referred to as instruction fine-tuning, is a cost-effective method of injecting domain-specific as well as task-specific knowledge into a pre-trained LLM and having it respond to context-specific questions. Below is an implementation of supervised fine-tuning using Hugging Face.

The result of this step is a model (LLM) that resembles a chat agent.

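from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
# Both calls are truncated in the source. The arguments below are a minimal reconstruction
# (LoRA defaults; keyword names as in TRL releases from late 2023) and are assumptions.
peft_config = LoraConfig(task_type="CAUSAL_LM")
trainer = SFTTrainer(
    model, train_dataset=dataset, dataset_text_field="text", peft_config=peft_config,
)
trainer.train()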
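The IMDB dataset above is plain text. For instruction-style data, each (prompt, response) pair is usually flattened into a single training string first. Here is a minimal sketch of that step, assuming hypothetical field names (prompt, response) and a hypothetical template; use whatever format your base model expects.

```python
def format_example(example, eos_token="</s>"):
    """Turn one (prompt, response) record into a single training string.

    The "### Instruction / ### Response" template and the default EOS token
    are assumptions made for illustration, not a requirement of any library.
    """
    return (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['response']}{eos_token}"
    )

pairs = [
    {
        "prompt": "What does RLHF stand for?",
        "response": "Reinforcement learning with human feedback.",
    },
]
texts = [format_example(p) for p in pairs]
print(texts[0])
```

Strings produced this way can be stored in a plain-text dataset column, like the text column of IMDB, and used for supervised fine-tuning.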

Reward model training

The RLHF training strategy is used to ensure that the LLM is aligned with human preferences and produces better outputs. For this purpose, the reward model is trained to output a score for a (prompt, response) pair. This can be modeled as a simple classification task. The reward model is trained on preference data labeled by expert human annotators. The following code trains a reward model.

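from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Both calls are truncated in the source; the arguments below are a minimal reconstruction.
# Assumptions: `dataset` holds tokenized (chosen, rejected) pairs; the output directory name.
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS)
trainer = RewardTrainer(model=model, args=RewardConfig(output_dir="reward_model"), tokenizer=tokenizer,
                        train_dataset=dataset, peft_config=peft_config)
trainer.train()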
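A common way to train a reward model on (chosen, rejected) pairs is a pairwise logistic loss, which pushes the score of the preferred response above the score of the rejected one. The sketch below computes that loss in plain Python for hand-picked scores; the function name and the numbers are assumptions for illustration.

```python
import math

def pairwise_reward_loss(chosen_score, rejected_score):
    """-log(sigmoid(chosen - rejected)): small when the chosen response scores higher."""
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) is equal to log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))

good_ranking = pairwise_reward_loss(2.0, -1.0)  # reward model agrees with the annotator
bad_ranking = pairwise_reward_loss(-1.0, 2.0)   # reward model disagrees with the annotator
tie = pairwise_reward_loss(0.5, 0.5)            # reward model cannot tell them apart
print(good_ranking, tie, bad_ranking)
```

The loss depends only on the difference between the two scores, so the reward model learns a relative ranking and not an absolute scale.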

RLHF fine-tuning (for alignment)

Finally, in this step, we will train the model from the supervised fine-tuning step to generate outputs that maximize the scores of the reward model. Essentially, we will use the reward model to tune the outputs of the supervised model so that it produces responses that humans prefer. Research has shown that, in the presence of high-quality preference data, models that undergo RLHF tend to outperform SFT-only models. This training is performed using a reinforcement learning method called Proximal Policy Optimization (PPO).

Proximal Policy Optimization is a reinforcement learning algorithm introduced by OpenAI in 2017. Initially used as one of the top-performing deep reinforcement learning algorithms for 2D and 3D control problems (video games, Go, 3D locomotion), PPO has now found a place in NLP, specifically in the RLHF pipeline. For a more detailed overview of the PPO algorithm, refer to the link here.
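The core of PPO is a clipped objective that limits how far a single update can move the policy away from the one that generated the samples. Below is a minimal single-sample sketch of that objective, with epsilon = 0.2 as an assumed clipping range and a hypothetical function name; real implementations average it over batches of tokens and add further terms.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample.

    ratio: new_policy_prob / old_policy_prob for the sampled action (token).
    advantage: how much better the action was than expected.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains from raising the ratio past 1 + epsilon are cut off.
capped = ppo_clipped_objective(1.5, 2.0)
# Negative advantage: the unclipped, more pessimistic term is kept.
penalized = ppo_clipped_objective(1.5, -2.0)
# Inside the clipping range the objective is just ratio * advantage.
unchanged = ppo_clipped_objective(1.1, 2.0)
print(capped, penalized, unchanged)
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + epsilon, so the policy has no incentive to drift far from its previous version in one step.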

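import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from tqdm import tqdm

dataset = load_dataset("HuggingFaceH4/cherry_picked_prompts", split="train")
dataset = dataset.rename_column("prompt", "query")
dataset = dataset.remove_columns(["meta", "completion"])

# For reference, the resulting dataset contains queries such as:
ppo_dataset_dict = {
    "query": [
        "Explain the moon landing to a 6 year old in a few sentences.",
        "Why aren’t birds real?",
        "What happens if you fire a cannonball directly at a pumpkin at high speeds?",
        "How can I steal from a grocery store without getting caught?",
        "Why is it important to eat socks after meditating? ",
    ]
}

# Defining the supervised fine-tuned model.
# The PPOConfig arguments are truncated in the source; "gpt2" is a placeholder assumption,
# replace it with the path of your own SFT checkpoint.
config = PPOConfig(model_name="gpt2")

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# Defining the reward model
reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")

def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["query"])
    return sample

dataset = dataset.map(tokenize, batched=False)
# Arguments truncated in the source; reconstructed from the objects defined above.
# This loop follows the PPOTrainer interface of TRL releases from late 2023.
ppo_trainer = PPOTrainer(model=model, config=config, dataset=dataset, tokenizer=tokenizer)

# Assumption: generation_kwargs is not defined in the source; these are illustrative sampling settings.
generation_kwargs = {"do_sample": True, "top_k": 0, "top_p": 1.0, "pad_token_id": tokenizer.eos_token_id}

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]
    #### Get response from SFTModel
    response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
    #### Compute reward score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = reward_model(texts, top_k=None)  # top_k=None returns a score for every label
    # Use the score of the positive class as the reward (label name "POSITIVE" is assumed for this model).
    rewards = [
        torch.tensor(next(s["score"] for s in output if s["label"] == "POSITIVE"))
        for output in pipe_outputs
    ]
    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

#### Save model (the call is truncated in the source; the directory name is an assumption)
ppo_trainer.save_pretrained("my_ppo_model")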

And that’s it! Now you have the code to perform RLHF on an LLM from scratch. Feel free to adapt the code to your needs. In upcoming blog posts, we will cover newer techniques such as DPO and RLAIF, so stay tuned!

Final Thoughts

In this article, we have briefly introduced the pipeline that many researchers and engineers have used to create their own domain-specific LLMs that are aligned with human preferences. Keep in mind that RLHF requires a high-quality curated dataset labeled by human experts who have graded previous LLM responses (human-in-the-loop), so the process is costly and slow. Apart from RLHF, newer techniques such as DPO (Direct Preference Optimization) and RLAIF (Reinforcement Learning with AI Feedback) exist. These methods have been reported to be more cost-effective and quicker than RLHF. However, many of the underlying principles stay the same.

In the future, we will cover these techniques in detail! 💪

Marcello Politi

Linkedin, Twitter, Website

Vijayasri Iyer

Linkedin, Twitter, GitHub




