Yoga-LLM, Part 3: Extending the LLM to a new language

A tutorial on training LLMs on an Indian language

Vijayasri Iyer

Welcome to Part 3 of this blog series, part of the small-scale Yoga LLM project. Previously, we discussed how to perform instruction fine-tuning on a Gemma 2B model for Yoga theory questions. In this installment, we will explore another crucial aspect of LLM applications: multilinguality. Many use cases, including this one, require support for multiple languages, so in this post we will adapt our LLM to Hindi. Alright, let’s begin!

LLMs and Multilinguality

Large Language Models (LLMs) demonstrate remarkable skills in understanding and generating human language, but adapting them to new grammar and vocabulary can be quite challenging. When an LLM encounters a new language, it needs to learn the specific syntactic and morphological rules that dictate sentence structure, along with the unique lexical items that make up the vocabulary. The process typically involves pre-training and/or fine-tuning the model on a corpus rich in the target language. During this process, the model adjusts its weights to better capture the linguistic patterns and nuances of the new language. This in turn allows the LLM to generate text that is grammatically correct and contextually appropriate in the given target language.

Training an LLM for a new language typically involves the following stages:

Language-specific pre-training

The first step in language-specific pre-training is to collect a large and diverse corpus of text in the target language, so that it captures a wide range of linguistic patterns and vocabulary. Sources might include books, articles, websites, social media, and more. Although a large amount of data is recommended, there have been experiments such as Tamil-Llama where a model was adapted to a new language using as little as 16K tokens for pre-training, which is encouraging for small-scale projects. Once the data is collected and preprocessed, the LLM is trained on the corpus and learns the statistical properties of the language, including its grammar, syntax, and vocabulary. This stage is often assumed to be computationally expensive, but the cost really depends on the size of the pre-training corpus: a large, exhaustive corpus helps model performance, but it is not always required. We will implement a small-scale setup for pre-training our model on Hindi data.

Tokenizer Adaptation for New Vocabulary

The tokenizer of an LLM plays a crucial role in handling new vocabulary. Tokenizers break down text into smaller units, such as words or subwords, which are then processed by the model. When adapting an LLM to a new language, the tokenizer must be adjusted to recognize and appropriately segment the unique words and subword units of that language.

There are several strategies for tokenizer adaptation:

  • Training from Scratch: This involves creating a new tokenizer based on a comprehensive corpus of the target language. While this ensures the tokenizer is well-suited to the new language, it can be computationally expensive.
  • Augmenting Existing Tokenizers: Existing tokenizers can be augmented with new vocabulary items. This is less computationally intensive and allows the model to retain its capabilities in the original language(s) while also accommodating the new language (a short sketch of this approach follows the list).
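To make the augmentation strategy concrete, here is a minimal sketch using the Hugging Face `transformers` API. The token list is purely illustrative; in practice you would derive new tokens from a tokenizer trained on your target-language corpus.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative only: load a base model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

# Hypothetical new Hindi tokens; in practice, derive these from your corpus
new_tokens = ["योगासन", "प्राणायाम", "ध्यान"]
new_tokens = [t for t in new_tokens if t not in tokenizer.get_vocab()]

tokenizer.add_tokens(new_tokens)
# Grow the embedding matrix to match the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))

The newly added embedding rows start untrained, which is exactly why a continued pre-training stage on target-language text is needed afterwards.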

Instruction Fine-tuning

The pre-trained model is then fine-tuned on specific tasks/instructions. During this phase, the model learns to apply its language understanding to perform well on particular tasks. Fine-tuning adjusts the model’s weights based on the specific requirements of the tasks, enhancing its performance and accuracy. This is followed by further stages of alignment and evaluation.

Alignment

Alignment involves ensuring that the model’s outputs align with human values, ethical guidelines, and cultural sensitivities. This may include incorporating human feedback to refine the model’s behavior, avoiding biases, and ensuring that the generated content is appropriate for the target audience. We won’t cover the topics of evaluation and alignment in this post, as they are extensive subjects. Perhaps we’ll explore them in a future installment.

Choosing an LLM for Multilingual Training

When comparing LLMs for multilingual applications, several factors should be considered:

  • Training Data Diversity: Models trained on diverse, high-quality datasets tend to perform better across multiple languages. For example, models such as the GPT series, BLOOM, and Aya, among others, are trained on large multilingual corpora; several curated lists of multilingual LLMs are available online. Choosing an already multilingual LLM is an advantage in terms of cost and effort, because you can directly fine-tune it on your use case.
  • Model Size: Larger models generally have a greater capacity to learn and adapt to new languages but come with higher computational requirements.
  • Tokenizer Compatibility: The tokenizers of different model families vary in vocabulary and in how efficiently they represent non-English languages, which affects how well they can adapt to new languages (see the snippet after this list for a quick comparison). There is an interesting blog post on the suitability of Gemma for Indic-LLM fine-tuning; the LLaMA and Mistral families of models are also proving quite capable of generalizing to new languages.
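As a rough illustration of tokenizer compatibility, the snippet below counts how many tokens different tokenizers need for the same Hindi sentence; fewer tokens generally indicates better coverage of the script. The model names are just examples, and some of these checkpoints require accepting a license on Hugging Face.

from transformers import AutoTokenizer

# "Yoga is beneficial for the mind and body."
sentence = "योग मन और शरीर के लिए लाभदायक है।"

for name in ["google/gemma-2b-it", "meta-llama/Llama-2-7b-hf", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", len(tok.tokenize(sentence)), "tokens")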

Creating a dataset

We will proceed with the dataset created in Part 1, which includes a combination of a publicly available dataset and synthetic data generated from raw text. Since this dataset is in English, we need to translate it into Hindi. The best options for this task are the Google Translate API or GPT-3.5/4. However, there are also free options available, such as HuggingFace models and free proxies to the Google Translate API. The latter option is not very stable, so I do not recommend it for large datasets or critical use cases. In this instance, I am using the `googletrans` library.

!pip install googletrans==3.1.0a0

from googletrans import Translator
import pandas as pd

translator = Translator()
df = pd.read_csv("combined_dataset.csv")

# List of columns to translate
columns_to_translate = ['prompt', 'output']

for col in columns_to_translate:
    # Translate each cell from English to Hindi and keep only the translated text
    df[f'{col}_hindi'] = df[col].apply(lambda x: translator.translate(x, src='en', dest='hi').text)

Now, we can proceed with the rest of the implementation.

Implementation

For the implementation, I will follow this tutorial notebook by Unsloth. However, the code can be easily adapted to the Huggingface library since the principles remain the same.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2b-it", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add embed_tokens and lm_head for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Pre-training

Pre-training the model requires a large corpus of raw Hindi text. In this case, we will use the Hindi Wikipedia dump directly to train our model.

# Load the Hindi Wikipedia dataset
from datasets import load_dataset

dataset = load_dataset("wikimedia/wikipedia", "20231101.hi", split = "train")

# Construct a prompt for Wikipedia articles
_wikipedia_prompt = """Wikipedia Article
### Title: {}

### Article:
{}"""
# becomes:
wikipedia_prompt = """विकिपीडिया लेख
### शीर्षक: {}

### लेख:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    titles = examples["title"]
    texts = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }

# Apply the formatting to the dataset so the trainer sees the prompted text
dataset = dataset.map(formatting_prompts_func, batched = True,)

Now, we set up the configuration parameters for the training job.

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()

And that’s it for the pre-training! The output of this step is a model that can perform text completions in Hindi; a quick generation check is sketched below. To make it capable of chat, we need to perform instruction tuning.
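Here is a minimal sketch of such a check, assuming a GPU runtime: we ask the model to complete a Hindi Wikipedia-style article using the same prompt format as above. The title "योग" ("Yoga") is just an example.

from transformers import TextStreamer

FastLanguageModel.for_inference(model) # Enable Unsloth's faster inference mode

# Prompt with a title and an empty article body for the model to complete
inputs = tokenizer([wikipedia_prompt.format("योग", "")], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)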

Tokenizer compatibility

Depending on the model you have chosen, you can train your tokenizer from scratch, augment it, or skip this step altogether. The decision depends on how much representative text from the target language was used to train the original base or instruction-tuned model. In this case, Gemma 2B already supports the Devanagari script, so we do not need to train a new tokenizer; a quick sanity check is shown below. For techniques on training or augmenting a tokenizer with Hugging Face, refer to the tokenizer adaptation section above.
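A simple sanity check, assuming the Gemma tokenizer loaded earlier: encode a Devanagari sentence and confirm it decodes back cleanly rather than exploding into long runs of byte-level fragments.

# Illustrative sanity check for Devanagari support
sample = "योग स्वास्थ्य के लिए अच्छा है।"
ids = tokenizer(sample)["input_ids"]
print(len(ids), "tokens")
print(tokenizer.decode(ids, skip_special_tokens = True))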

Instruction Tuning

For instruction tuning, we will perform two stages: a general-purpose instruction tuning step followed by domain-specific instruction tuning. Is there a single recommended recipe for this? It is an active area of research, and I urge you to check the relevant literature for guidance.

  1. General purpose instruction tuning: For this, we will use a Hindi-translated version of the “Alpaca” dataset by Stanford, a popular instruction-tuning dataset.
# Load the Alpaca dataset
from datasets import load_dataset
alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-hindi",
                              split = "train")
_alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
# Becomes:
alpaca_prompt = """नीचे एक निर्देश दिया गया है जो किसी कार्य का वर्णन करता है। एक प्रतिक्रिया लिखें जो अनुरोध को उचित रूप से पूरा करे।

### निर्देश:
{}

### प्रतिक्रिया:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(conversations):
    texts = []
    conversations = conversations["conversations"]
    for convo in conversations:
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

2. Domain-specific instruction tuning: From this point onwards, refer to Part 2 of this blog series for the code walkthrough on instruction tuning. The process is identical to typical instruction tuning on an English dataset, just with the Hindi dataset instead. Both stages 1 and 2 can be performed in the same fashion (a minimal trainer sketch for stage 1 follows below), and the same holds for inference. And that’s it: you have now seen a large part of the process of adding multilingual support to an LLM!
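For completeness, here is a minimal sketch of what the stage-1 run could look like, reusing the UnslothTrainer setup from the pre-training step on the formatted alpaca_dataset. The hyperparameters are illustrative, not tuned.

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset, # the formatted Hindi Alpaca dataset from above
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        max_steps = 120, # illustrative; use num_train_epochs for a full pass
        warmup_steps = 10,
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5, # smaller LR for the embedding matrices
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs_instruct",
    ),
)
trainer_stats = trainer.train()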

Conclusion

While this is a relatively simple approach to adapting an LLM to new languages, there are several complexities to deal with, such as the unavailability of data, tokenizer training, instability in the pre-training/fine-tuning process, and evaluation and alignment of the model in the target language. So it is easy, with some skill, to build a prototype, but incredibly hard to build reliable working solutions. However, you now have a high-level yet hands-on overview of the process. In the next blog, we will move on to dataset creation for a multimodal LLM. Stay tuned!

