Fine-Tuning an Open-Source LLM using QLoRA with MLflow and PEFT
Overview
Many powerful open-source LLMs have emerged and are easily accessible. However, they are not designed to be deployed to your production environment out-of-the-box; instead, you have to fine-tune them for your specific task, such as a chatbot or content generation. One challenge, though, is that training LLMs is usually very expensive: even if your fine-tuning dataset is small, the backpropagation step still has to compute gradients for billions of parameters. For example, fully fine-tuning the Llama 7B model requires 112GB of VRAM, i.e. at least two 80GB A100 GPUs. Fortunately, there has been a lot of research on reducing the cost of LLM fine-tuning.
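Where does the 112GB figure come from? A rough back-of-the-envelope accounting (assuming standard mixed-precision training with the Adam optimizer; the exact numbers vary with the precision and optimizer setup) puts the cost at about 16 bytes per trainable parameter:

$$
\underbrace{2}_{\text{fp16 weights}} + \underbrace{2}_{\text{fp16 gradients}} + \underbrace{4}_{\text{fp32 master weights}} + \underbrace{8}_{\text{fp32 Adam moments}} = 16 \text{ bytes/param}
$$

$$
7 \times 10^{9} \text{ params} \times 16 \text{ bytes/param} = 112 \text{ GB}
$$

QLoRA attacks exactly this cost: the base weights are quantized to 4 bits and frozen, so gradients and optimizer states are only needed for a small set of low-rank adapter parameters.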
In this tutorial, we will demonstrate how to build a powerful text-to-SQL generator by fine-tuning the Mistral 7B model with a single 24GB VRAM GPU.
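To make the memory savings concrete, here is a minimal sketch of what QLoRA-style model loading typically looks like with the Hugging Face transformers and peft libraries. The LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative placeholders, not the final configuration used in this tutorial:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights -- the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Small trainable low-rank adapters on the attention projections;
# the 4-bit base weights stay frozen. Values below are illustrative.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

With this setup, only the adapter weights need gradients and optimizer states, which is what brings the footprint down to a single 24GB GPU.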
What You Will Learn
- Gain hands-on experience with the typical LLM fine-tuning process.
- Understand how QLoRA and PEFT overcome GPU memory limitations during fine-tuning, as sketched above.
- Manage the model training cycle with MLflow, logging model artifacts, hyperparameters, metrics, and prompts (see the sketch after this list).
- Save the prompt template and inference parameters (e.g. max_token_length) with the model in MLflow to simplify the prediction interface.
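The MLflow side of the workflow might look roughly like the sketch below. It assumes the `model` and `tokenizer` from the loading sketch above (after training has run), uses illustrative parameter and metric values, and the `prompt_template` argument of `mlflow.transformers.log_model` requires MLflow 2.10 or later:

```python
import mlflow

with mlflow.start_run():
    # Log training hyperparameters and metrics (illustrative values)
    mlflow.log_params({"lora_r": 32, "lora_alpha": 64, "learning_rate": 2e-4})
    mlflow.log_metric("train_loss", 0.85)

    # Persist the prompt template and generation parameters with the model,
    # so callers pass only the raw question at inference time.
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="text-generation",
        prompt_template="{prompt}",  # placeholder; substitute your text-to-SQL template
        model_config={"max_new_tokens": 256},
    )
```

Loading the logged model back with `mlflow.pyfunc.load_model` then exposes a simple `predict` interface that applies the stored template and generation parameters automatically.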