Yoga-LLM, Part 1: Dataset Creation
Two methods of data creation for instruction fine-tuning
Welcome to Part 1 of this blog series on the Yoga-LLM project. In this blog, we will go through the process of creating a fine-tuning dataset for our use case. For convenience, from here on "fine-tuning" will refer to instruction tuning, the process that enables an LLM to answer domain-specific questions. There are several methods you can adopt to transfer domain knowledge to your LLM effectively, one of them being domain-specific pre-training. However, given that this is a small-scale project, we will stick to a small instruction-tuning dataset to fine-tune the model. Before we start with the implementation, some preliminaries.
What is Instruction fine-tuning/instruction tuning?
Instruction tuning is a stage of LLM training where the model is fine-tuned to follow specific instructions more effectively. This involves training the model on a diverse set of tasks and instructions, through which the model improves its generalization capabilities (e.g., question answering, summarization, and dialogue generation). Instruction tuning is characterized by the need for a low quantity of high-quality data. The best kind of instruction-tuning data is created entirely by human experts. However, not everyone has that kind of time and expertise, so we rely on AI-assisted methods to aid the data creation process. If you’re interested in reading about the multiple stages involved in building an LLM, you can refer to this blog as a starting point. Alright, let’s begin with the implementation!
1. Identifying the data sources
For the LLM, I have identified two datasets: one from Kaggle and another from an NIOS textbook on Yoga. You can find the links to the original text below.
Always remember: when developing commercial LLM applications (and non-commercial ones too, if you plan to release them publicly), it is extremely important to make sure that your training data has been sourced under a permissive license. You don’t want legal trouble knocking at your door :)
2. Prepare the dataset
Let’s cover this in two sections: we will first look at the already available Kaggle dataset, followed by the raw text gathered from the textbook.
Kaggle Dataset
The Kaggle dataset is a dataframe of Hatha Yoga asanas, along with the steps, benefits, contraindications, and level of expertise required to perform each asana. This is perfect material for a question-answering dataset!
Studying the dataset, we can see that each asana row yields 4 potential question-answer pairs:
- What are the steps to perform <asana_name>?
- What are the benefits of performing <asana_name>?
- What are the contraindications of <asana_name>?
- What is the level of <asana_name>?
Now, let’s make some tweaks to get this data ready to feed to the LLM. The following snippet converts the existing data into a question-answer format.
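Here is a minimal sketch of what that conversion could look like. Note that the file and column names below are placeholders I am assuming for illustration; adjust them to match the actual Kaggle dataframe.

import pandas as pd

# Placeholder file/column names -- adjust to the actual Kaggle dataset schema
df = pd.read_csv("yoga_asanas.csv")

qa_pairs = []
for _, row in df.iterrows():
    asana = row["asana_name"]
    # Each asana row yields four question-answer pairs
    qa_pairs.extend([
        {"question": f"What are the steps to perform {asana}?", "answer": row["steps"]},
        {"question": f"What are the benefits of performing {asana}?", "answer": row["benefits"]},
        {"question": f"What are the contraindications of {asana}?", "answer": row["contraindications"]},
        {"question": f"What is the level of {asana}?", "answer": row["level"]},
    ])

# Save the question-answer pairs as the instruction-tuning dataset
pd.DataFrame(qa_pairs).to_csv("yoga_qa_dataset.csv", index=False)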
This will result in the final data output shown below:
And there you have it! Now let’s move on to the raw text dataset.
Raw text dataset
The Yoga textbook is a set of multiple PDFs consisting of raw text, which needs to be scraped and cleaned before it can be used. So, first we will go through the code to scrape the data from the PDFs. The process I followed to extract the raw text is as follows:
- Step 1: Scrape each PDF to a txt file.
- Step 2: Manually clean the text of each chapter.
- Step 3: Concatenate the text files to form a larger corpus of text.
For step 1, I’m using the pdftotext library. While there are several libraries to choose from, I found that this one worked really well for my use case. Following is the code for installation and extraction.
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!pip install pdftotext
import json
import re
import pdftotext
# Load your PDF
with open("Yoga_(Level-B)_ch-11-final.pdf", "rb") as f:
pdf = pdftotext.PDF(f)
# Open a text file in append mode
with open("Yoga_levelb_ch-11.txt", "a", encoding="utf-8") as text_file:
for page in pdf:
# Write each page's content to the text file
text_file.write(page)
And concatenate the resulting text files after cleaning.
file_names = ['Yoga_levelb_ch-1.txt', 'Yoga_levelb_ch-2.txt', 'Yoga_levelb_ch-3.txt', 'Yoga_levelb_ch-4.txt', 'Yoga_levelb_ch-5.txt',
'Yoga_levelb_ch-6.txt', 'Yoga_levelb_ch-7.txt', 'Yoga_levelb_ch-8.txt', 'Yoga_levelb_ch-9.txt', 'Yoga_levelb_ch-10.txt']
output_file = 'yoga_dataset.txt'
with open(output_file, 'w') as outfile:
    for file_name in file_names:
        with open(file_name, 'r') as infile:
            outfile.write(infile.read() + '\n')
In this case, I scraped each of the PDFs chapter-wise and then manually removed unwanted content, including stray whitespace, to clean up the text. You can view the dataset txt files here (kindly check the data license before using it for any commercial purposes). Now we have the final dataset, which looks something like this:
Technically, since this is a raw text corpus, it could also be used to pre-train an LLM; however, it is too small to be useful unless combined with a much larger dataset. So instead, I will feed this raw text to another LLM that can generate question-answer pairs. We will discuss that in the upcoming section.
3. Synthetic Data Creation
Synthetic data creation is a buzzing area of development in the age of LLMs. Many new LLMs are either trained on significant amounts of synthetic data, such as Microsoft’s Phi, or promise high-quality generation of synthetic data, such as NVIDIA’s Nemotron (not to mention an array of closed-source LLMs). So far, the gold standard for AI-aided synthetic data creation has been GPT-3.5/4 and other closed-source models (Google Gemini, Anthropic Claude). But, for a change, we will explore some open-source options for data creation. I have identified two to explore below:
1. Bonito LLM: It is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning.
2. Genstruct 7B LLM: Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
There are many more libraries and models (LLM/Seq2Seq) for data creation today (e.g., txtinstruct, distilabel). Make sure you explore the landscape of tools before arriving at a final decision. Following are the code snippets to generate datasets with Bonito and Genstruct.
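Below are rough sketches of how the generation can be wired up, adapted from the Bonito quickstart and the Genstruct-7B model card. Treat the example text, column names, task type, and generation parameters as assumptions for this project rather than exact reproductions of my scripts.

# BONITO (sketch adapted from the Bonito quickstart)
from bonito import Bonito
from vllm import SamplingParams
from datasets import Dataset

bonito = Bonito("BatsResearch/bonito-v1")  # I actually ran a quantized variant

# Wrap the cleaned yoga text chunks into a dataset with an "input" column
unannotated = Dataset.from_dict({
    "input": ["The word 'Yoga' is derived from the Sanskrit root 'Yuj' ..."]
})

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated,
    context_col="input",
    task_type="exqa",  # extractive question answering, one of Bonito's supported task types
    sampling_params=sampling_params,
)

# GENSTRUCT-7B (sketch based on the model card; title/content values are placeholders)
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "NousResearch/Genstruct-7B"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="cuda", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

msg = [{
    "title": "Yoga",
    "content": "The word 'Yoga' is derived from the Sanskrit root 'Yuj', which means to attach, join, harness or yoke ...",
}]
# Genstruct's chat template turns the title/content pair into an instruction-generation prompt
inputs = tokenizer.apply_chat_template(msg, return_tensors="pt").cuda()
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0]))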
You can view the original tutorials for these tools on their respective pages linked above (refer to the description section for each tool). Here are some observations from the entire data generation process:
- Having good GPU compute is very important for getting quality outputs from the LLM in a reasonable time (so it’s technically not free). I tried this generation pipeline with a T4 GPU and a Colab Pro subscription, and I still had to spend time re-running scripts because they would stop abruptly halfway through the process. Genstruct-7B gave me more trouble than the quantized version of Bonito.
- Both Genstruct and Bonito generate situational, QA-style questions. I’ve put some examples below:
#GENSTRUCT-7B
"""
[[[Title]]] Yoga
[[[Content]]] body and spirit in balance. The word 'Yoga' is derived from the
sanskrit root 'Yuj' which means to attach, join, harness or yoke.
By practicing the differentyoga techniques you will achieve good
health, relaxation, and inner fulfillment. These will also help you
to meet the different challenges of life with greater ease and
confidence. Yoga at an early age encourages confidence and
awareness to undertake life's activities in a noncompetitive
manner.Yoga is observed to nurture cooperation and compassioninstead of opposition. This is a great gift to give to our children.
The purpose of yoga is an inner cleaning of the individuals
thoughts, emotions and feelings.
The following is an interaction between a user and an AI assistant that is related to the above text.
[[[User]]] Two kids graduated from high school in the same year. Jane went to study yoga in India, while Dana started studying medicine. Recently Jane came back home.
Which girl has studied longer the word 'yuj' means?
[[[Assistant]]] Yoga is a practice with roots in ancient Indian philosophy. The word 'Yoga' itself is derived from the Sanskrit root 'Yuj', which means to attach, join, harness or yoke. Through the practice of different yoga techniques, one can achieve good health, relaxation, and inner fulfillment, helping them to face the challenges of life with greater ease and confidence. It is also observed to nurture cooperation and compassion instead of opposition, a great gift to give to individuals, especially children.
The given scenario involves two girls, Jane and Dana, after graduating from high school. Jane decided to study yoga in India, while Dana chose to study medicine. Notably, Jane recently returned home from her studies in India.
The practice of yoga involves understanding and applying the principles behind its origin, including the meaning and application of the Sanskrit root 'Yuj'. This involves an inner cleaning of one's thoughts, emotions, and feelings, which is achieved through various techniques.
Given that Jane studied yoga, she would have been exposed to the meaning and application of the Sanskrit root 'Yuj'. This would have been a part of her studies and practice of yoga. On the other hand, Dana, who chose to study medicine, wouldn't have had the same exposure since medicine does not inherently involve the study of Sanskrit or the origins of yoga.
Therefore, based on their chosen fields of study, it can be inferred that Jane has studied the word 'Yuj' - the Sanskrit root of the word 'Yoga' - for a longer period than Dana.
"""
#BONITO-LLM
----Generated Instructions----
Input: Question: "What should happen with your eyes?"
Context: "• To start with breathe deeply. Later the breathing should be
fine and slow, with no jerky movements to disturb the spine
or the body.
• Concentrate on deep and fine exhalations, in which the nostrils
do not feel the warmth of breath.
• The lower jaw should hang loose and not be clenched. The
tongue should not be disturbed, and even the pupils of the
eyes should be kept completely passive.
• Relax completely and breathe out slowly.
• If the mind wanders, pause without any strain after each slow"
Answer:
Output: pupils of the eyes should be kept completely passive
Although Bonito does offer more flexibility, there is no substitute for human oversight when it comes to refining the input context. In this case, I provided chunks of overlapping context (around 10 sentences with a 1-sentence overlap) to the LLM; see the sketch below. I think this pipeline can be improved a lot more. You can find links to these generated datasets here.
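For reference, here is a rough sketch of that overlapping-chunk approach. The naive sentence splitting and chunk sizes are just illustrative; a proper sentence tokenizer would be more robust.

import re

def chunk_sentences(text, chunk_size=10, overlap=1):
    # Naive sentence split on punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunk = " ".join(sentences[start:start + chunk_size]).strip()
        if chunk:
            chunks.append(chunk)
    return chunks

# Build overlapping context chunks from the cleaned corpus
with open("yoga_dataset.txt", encoding="utf-8") as f:
    contexts = chunk_sentences(f.read())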
Conclusion
If, after reading this blog, you have developed even a little insight into 1) the state of open-source synthetic data creation and 2) how to create datasets by hand, my job is done. As for the Yoga LLM, we now have 3 datasets (one manually created, two synthetic) for instruction tuning our LLM. In the next post, we will go through the process of fine-tuning the LLM using this data. If you liked this, do give it a clap and follow me for the next posts in this series.