Photo by Edz Norton on Unsplash

A Quick Introduction to LLM Alignment

A speed-dating style introduction to some famous LLM alignment methods

Vijayasri Iyer
3 min read · Apr 29, 2024


With the rise of LLMs, the term "alignment" has become highly popularized in AI literature. Although the field of study existed well before, alignment techniques such as RLHF became widely sought after following the release of GPT-3. Let's discuss a few of them briefly.

First of all, what is AI alignment?

Alignment is an emerging field of study concerned with ensuring that an AI system does exactly what you want it to do. Think of a framework like Asimov's Three Laws of Robotics as a rough example. A quick Google search might lead you to this definition from IBM: "Alignment is the process of encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible." In the context of LLMs specifically, alignment is the process of training an LLM so that its generated outputs align with human values and goals. Since this is usually framed as alignment with respect to human preferences, it is also called "preference optimization".

What are the current methods for LLM alignment?

You will find many alignment methods in the research literature; for the sake of discussion, we will stick to three of them.

RLHF (Reinforcement Learning from Human Feedback):

  • Steps 1 & 2: Train an LLM (pre-training for the base model + supervised/instruction fine-tuning for a chat variant).
  • Step 3: RLHF uses an ancillary language model (it can be much smaller than the main LLM) to learn human preferences. This is trained on a preference dataset, which contains prompts and a response (or set of responses) graded by expert human labelers. The resulting model is called the "reward model" (a sketch of its training loss follows this list).
  • Step 4: Use a reinforcement learning algorithm (e.g., PPO, proximal policy optimization), where the LLM is the agent and the reward model provides a positive or negative reward based on how well the LLM's responses align with the human-preferred responses.
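
To make Step 3 a little more concrete, here is a minimal sketch (in PyTorch) of the pairwise, Bradley-Terry-style loss commonly used to train reward models. The function name and the toy scores are illustrative assumptions, not taken from any particular library:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise loss for training the reward model (Step 3).

    chosen_scores / rejected_scores: scalar rewards the model assigns to the
    human-preferred and the rejected responses for the same prompts.
    """
    # Encourage the reward of the preferred response to exceed
    # the reward of the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.7, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(reward_model_loss(chosen, rejected))
```

In Step 4, the trained reward model scores the LLM's generations, and an RL algorithm such as PPO updates the LLM to increase that score (usually with a KL penalty that keeps it close to the original model).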

In theory, it is as simple as that. In practice, however, implementation isn't easy: it requires many human experts and substantial compute resources. To overcome the expense of RLHF, researchers developed DPO.

DPO (Direct Preference Optimization):

  • Steps 1 & 2 remain the same.
  • Step 4: DPO eliminates the need to train a reward model (i.e., Step 3). How? DPO defines a preference loss directly as a function of the policy and uses the language model itself as the reward model (see the sketch after this list). The idea is simple: if you are already training such a powerful LLM, why not train it to distinguish between good and bad responses itself, instead of using another model?
  • DPO has been shown to be more computationally efficient (with RLHF you also need to constantly monitor the behavior of the reward model) and to perform better than RLHF in several settings.
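
For intuition, here is a minimal sketch of the DPO loss from the original paper, assuming you have already computed the summed log-probabilities of each chosen/rejected response under both the policy (the LLM being trained) and a frozen reference model (usually the SFT checkpoint). The function name and toy numbers are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-response log-probabilities."""
    # Implicit rewards are the policy-vs-reference log-ratios.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected rewards to be positive.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -8.0]),
                torch.tensor([-12.5, -9.0]), torch.tensor([-13.0, -8.5]))
print(loss)
```

Note that the language model's own log-probabilities act as the reward signal, which is exactly why no separate reward model is needed.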

ORPO (Odds Ratio Preference Optimization):

  • The newest of the three methods, ORPO combines Steps 2, 3 & 4 into a single step, so the dataset it requires is a combination of a fine-tuning and a preference dataset.
  • Supervised fine-tuning and alignment/preference optimization are performed in a single step. The motivation is that the fine-tuning step, while allowing the model to specialize to tasks and domains, can also increase the probability of undesired responses.
  • ORPO combines the steps into a single objective function by adding an odds ratio (OR) term that rewards preferred responses and penalizes rejected ones (sketched below).
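
Here is a minimal sketch of the ORPO objective, assuming length-normalized (per-token averaged) log-probabilities for the preferred and rejected responses; the function name, toy numbers, and weighting value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, sft_nll, lam=0.1):
    """ORPO sketch: SFT loss on the preferred response plus an
    odds-ratio penalty that pushes down the rejected response.

    chosen_logps / rejected_logps: per-token-averaged log-probabilities.
    sft_nll: standard negative log-likelihood of the preferred response.
    lam: weight of the odds-ratio term (a tunable hyperparameter).
    """
    # log-odds(y|x) = log P(y|x) - log(1 - P(y|x))
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Reward the preferred response and penalize the rejected one.
    or_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    return sft_nll + lam * or_term

# Toy usage with made-up per-token-averaged log-probabilities.
loss = orpo_loss(torch.tensor([-0.4, -0.6]), torch.tensor([-1.2, -0.9]),
                 sft_nll=torch.tensor(0.5))
print(loss)
```

Because the odds-ratio term only needs log-probabilities from the model being fine-tuned, no reference model or separate reward model is required.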

Some great resources:

  1. RLHF primer by Chip Huyen: https://huyenchip.com/2023/05/02/rlhf.html
  2. Concise blog post by Manish Chablani: https://medium.com/@ManishChablani/aligning-llms-with-direct-preference-optimization-dpo-background-overview-intuition-and-paper-0a72b9dc539c
  3. Blog post by Zain Ul Abideen: https://medium.com/@zaiinn440/orpo-outperforms-sft-dpo-train-phi-2-with-orpo-3ee6bf18dbf2

In conclusion, we have briefly covered some of the most well-known alignment methods. There are many more nuances in the theory and implementation of each, which can only be gleaned by reading the respective papers. There are also some excellent resources on Medium and Substack that teach you how to implement these methods, which is the best way to understand them.
