Yoga-LLM, Part 4: Multimodal Dataset Creation

A tutorial on creating custom datasets for a multimodal LLM

Vijayasri Iyer

Multimodal large language models (LLMs) are a step ahead of regular LLMs because they integrate and process multiple types of data — such as text, images, and audio — simultaneously. The ability to combine these modalities enhances a model's versatility and accuracy, making it more powerful and useful in a broader range of applications. In this blog, we will deepen our understanding of multimodal LLMs, specifically vision-language models (VLMs), by creating a custom dataset and fine-tuning a multimodal LLM for Yoga pose identification. Let’s begin!

LLMs and Multimodality

The future of LLMs is multimodal. As deep learning progresses, an increasing number of models have started adding support for multiple modalities. But how does a multimodal model differ from a regular LLM? Let’s discuss this briefly below.

Inside a Multimodal LLM

Since the term “multimodal model” broadly refers to LLMs that capture relationships between multiple modalities, in this blog, we will narrow this definition to focus on “vision-language models” (VLMs). A vision-language model typically comprises three components: a vision encoder, a text encoder, and a modality alignment module. Let’s explore each of these components in detail.

  1. Vision Encoder: A vision encoder learns efficient representations of image data. It converts images into embeddings, which the rest of the model can then query. The most popular architecture choice for this module is the Vision Transformer (ViT) and its derivatives; several VLMs use either a plain ViT or the vision encoder of a CLIP/SigLIP model in their architecture.
  2. Text/Language Encoder: A text or language encoder is responsible for encoding text data into embeddings. This component comprises the large language model (LLM) itself.
  3. Modality Alignment Module: This module is responsible for aligning the image and text embeddings. Its ultimate goal is to ensure that the image and text representations are compatible and can be used together effectively for tasks such as image captioning, visual question answering, and multimodal retrieval. The architecture of this module varies greatly across vision-language model (VLM) families. Common approaches include: 1) contrastive learning techniques that bring the embeddings from both modalities into a common latent space where they can be directly compared or matched (a minimal sketch of this idea follows this list), and 2) cross-attention mechanisms that allow the model to dynamically focus on relevant parts of the image and text.
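
To make the contrastive-alignment idea concrete, here is a minimal sketch using a CLIP checkpoint from the Hugging Face transformers library. The checkpoint name and image path are placeholders, and this is only an illustration of the shared-latent-space idea, not the alignment mechanism of any particular VLM: the image and a few candidate captions are embedded into the same space and compared by similarity.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP model; its vision and text encoders were trained
# contrastively so that matching image/text pairs land close together.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("downward_dog.jpg")  # placeholder image path
captions = [
    "a person performing downward dog pose",
    "a person performing tree pose",
    "an empty yoga mat",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarities between the image embedding
# and each text embedding in the shared latent space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")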

Some prominent examples of multimodal LLMs include GPT-4V, Flamingo, LLaVA, BLIP-2, and PaliGemma. For this use case, we will be using the PaliGemma vision-language model.
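
Since we will be fine-tuning PaliGemma, it helps to see the shape of the data it expects: one image plus a text prompt in, text out. The sketch below assumes a recent transformers version and that you have accepted the PaliGemma license on the Hugging Face Hub; the image path is a placeholder. It shows a single inference call, which mirrors the (image, question, answer) structure our dataset needs to provide.

import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("warrior_pose.jpg")  # placeholder image path
prompt = "What asana is being performed by the model?"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=50)

# Drop the prompt tokens and decode only the newly generated answer.
answer = processor.decode(generated[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer)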

Dataset Creation

1. Image data selection

For the image data, we will be using the Yoga-82 dataset, which is also available on Kaggle. It consists of 17k images across 82 distinct Yoga poses, which makes it a great starting point for fine-tuning a vision-language model.
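
After downloading and extracting the dataset, it is worth sanity-checking the folder layout and per-pose image counts before generating any text. The sketch below assumes the class-per-folder layout used later in this post (e.g. data/raw/yoga-82/train/<asana_name>/); adjust the path to wherever you extracted the Kaggle archive.

import os

data_folder = "data/raw/yoga-82/train"  # assumed layout: one subfolder per asana

counts = {}
for asana_name in sorted(os.listdir(data_folder)):
    asana_folder = os.path.join(data_folder, asana_name)
    if os.path.isdir(asana_folder):
        counts[asana_name] = len(os.listdir(asana_folder))

print(f"{len(counts)} pose classes, {sum(counts.values())} images in total")

# Show the five smallest classes; heavily under-represented poses may need extra care.
for asana_name, n in sorted(counts.items(), key=lambda kv: kv[1])[:5]:
    print(f"{n:5d}  {asana_name}")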

2. Text data generation

We will generate a simple question-answering dataset over these images for our instruction-tuning step. It will cover the following questions:

  • What asana is being performed by the model?
  • What are the steps to perform <asana_name>?

In a previous exercise on data creation for instruction tuning, we explored publicly available text data and open-source synthetic data generation options. This time, we will use a paid option, the Gemini API, to generate text descriptions for the asanas in our dataset. Below you can find the code for the data generation:

!pip install -U -q google-generativeai

import google.generativeai as genai
import pandas as pd

from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

# Read the dataframe containing the asana names
df = pd.read_csv('asana_names.csv')
responses = []

system_prompt = """
You will be provided with an asana name. Your job is to provide a detailed answer to each of the following questions.
1. Describe how to perform this asana in a stepwise manner. The steps should be concise.

Return answer in the form of a JSON.
"""

# Set the model to Gemini 1.5 Pro with the system prompt above.
model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest",
                              system_instruction=system_prompt)

# Loop through each of the asana names
for asana_name in df['Yoga-82 Asana Names']:
    # Make the LLM request.
    response = model.generate_content(asana_name)
    responses.append(response.text)
    print(f"Record {asana_name} added successfully.")

# Store the raw responses alongside the asana names
df['Responses'] = responses
df.to_csv('asana_names_with_descriptions.csv', index=False)
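
Because the prompt asks Gemini to return JSON, the raw Responses column needs a light post-processing pass before it can be joined with the images. The exact keys of the returned JSON are not guaranteed (and replies are sometimes wrapped in Markdown code fences), so the sketch below is one hedged way to flatten the responses into the Prompt_1 / Output_1 columns that the next step expects; adapt the key handling to whatever structure you actually get back.

import json
import pandas as pd

df = pd.read_csv('asana_names_with_descriptions.csv')

def extract_steps(raw_response: str) -> str:
    """Pull the step-by-step description out of Gemini's JSON reply."""
    text = raw_response.strip()
    # Gemini often wraps JSON in ```json ... ``` fences; strip them if present.
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    try:
        parsed = json.loads(text)
        # The key names depend on the model's reply; fall back to the raw text.
        for key in ("steps", "Steps", "answer", "1"):
            if key in parsed:
                value = parsed[key]
                return "\n".join(value) if isinstance(value, list) else str(value)
        return text
    except json.JSONDecodeError:
        return text

df['Prompt_1'] = [f"What are the steps to perform {name}?" for name in df['Yoga-82 Asana Names']]
df['Output_1'] = df['Responses'].apply(extract_steps)
df.to_csv('asana_names_with_descriptions.csv', index=False)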

3. Creating the final dataset

We now have a CSV file with the names and generated descriptions of the Yoga poses. In a business setting, you would typically have an expert (or a set of experts) review this data thoroughly for accuracy. We will move forward with combining the asana descriptions with the corresponding pose images to build our final dataset.

import csv
import os
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_dataset(data_folder, dataset_path, output_file):
    try:
        # Step 2: Read the asana descriptions dataset
        with open(dataset_path, 'r') as file:
            reader = csv.DictReader(file)
            dataset = [row for row in reader]

        # Step 3: Generate the image/question/answer dataset
        with open(output_file, 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(["Image", "Question_1", "Answer_1", "Question_2", "Answer_2"])
            for row in dataset:
                asana_name = row['Yoga-82 Asana Names']
                q1 = "What asana is being performed by the model?"
                a1 = f"The model is performing {asana_name}"
                # 'Prompt_1'/'Output_1' hold the question and the step-by-step
                # description derived from the Gemini responses.
                q2 = row['Prompt_1']
                a2 = row['Output_1']
                asana_folder = os.path.join(data_folder, asana_name)

                # List images in the folder
                images = os.listdir(asana_folder)

                # Generate a dataset entry for each image
                for image in images:
                    image_path = os.path.join(asana_folder, image)

                    # Write row to CSV file
                    writer.writerow([image_path, q1, a1, q2, a2])

                logging.info(f"Finished {asana_name}")

        logging.info("Image prompt-answer dataset created successfully.")
    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    # Step 1: Define paths
    data_folder = "data/raw/yoga-82/test"  # Update this with the path to your image split folder
    dataset_path = "data/processed/asana_names_with_descriptions.csv"  # Update this with the path to your descriptions file
    output_file = "data/processed/yoga-82-it-test.csv"

    generate_dataset(data_folder, dataset_path, output_file)

You should repeat this process for the train, val, and test splits, which are already present in the image dataset. And that’s it! We have a multimodal dataset to train our model to identify Yoga poses and describe the steps of each pose in detail.
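
Since generate_dataset already takes the split folder and output path as arguments, running it over all three splits is a small loop. The folder and file names below follow the paths used earlier and are assumptions about how you have organised the data.

# Build the instruction-tuning CSV for every split of the image dataset.
dataset_path = "data/processed/asana_names_with_descriptions.csv"

for split in ["train", "val", "test"]:
    data_folder = f"data/raw/yoga-82/{split}"
    output_file = f"data/processed/yoga-82-it-{split}.csv"
    generate_dataset(data_folder, dataset_path, output_file)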

Conclusion

The aim of this blog is to provide a high-level overview of the process of building a multimodal dataset. In both academia and industry, an increasing number of multimodal models are being trained using a combination of synthetic and real-world data, yielding surprisingly good performance. In the next blog, we will explore how to fine-tune a multimodal model with the dataset we have created. Stay tuned!

