Fine-Tuning Meta Llama 3.1 8B with Unsloth
In this guide, we fine-tune the Meta-Llama-3.1-8B-bnb-4bit model, optimized by Unsloth, using Low-Rank Adaptation (LoRA) on the Alpaca-cleaned dataset. We leverage Beam’s infrastructure for compute and storage, then deploy an inference endpoint. Throughout the process, we’ll track and evaluate our fine-tuning performance using Weights & Biases (wandb).
View the Code
See the full code for this example on GitHub.
Setup
Environment Configuration
We define a shared `Image` configuration for both fine-tuning and inference, ensuring consistency. The image includes the necessary dependencies and installs Unsloth from its GitHub repository.
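The exact dependency list lives in the example repository; the sketch below shows what such a shared image can look like (the package set and Python version here are assumptions, not the repository's exact configuration):

```python
from beam import Image

# Shared image for both the fine-tuning and inference scripts.
# The package list is illustrative; pin versions to match your environment.
image = Image(
    python_version="python3.11",
    python_packages=[
        "torch",
        "transformers",
        "datasets",
        "trl",
        "peft",
        "bitsandbytes",
        "accelerate",
        "wandb",
    ],
    commands=[
        # Install Unsloth straight from its GitHub repository.
        'pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"',
    ],
)
```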
To use Weights & Biases (wandb) for tracking, you’ll need your API key. You can find it in your wandb dashboard under the “API keys” section. Copy the key and replace `YOUR_WANDB_KEY` in the `wandb login` command.
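If you prefer not to bake the key into an image command, wandb can also be authenticated from Python. A small sketch; reading the key from an environment variable is a recommended practice, not the guide's exact approach:

```python
import os
import wandb

# wandb also picks up WANDB_API_KEY automatically when it is set;
# the explicit call just makes the authentication step visible.
wandb.login(key=os.environ["WANDB_API_KEY"])
```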
Fine-Tuning
The fine-tuning script (`finetune.py`) uses Unsloth to adapt the model to the Alpaca-cleaned dataset while tracking metrics with Weights & Biases.
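The repository contains the complete script; at its core it follows Unsloth's standard LoRA recipe. Below is a condensed sketch: the LoRA rank, learning rate, and batch size are illustrative assumptions, while the 60-step budget matches the run described under Training Performance Metrics.

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the 4-bit quantized base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Alpaca-cleaned; assumes prompts have been formatted into a "text" column.
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # newer trl versions move this into SFTConfig
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,            # matches the 60-step run discussed below
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        report_to="wandb",       # stream metrics to Weights & Biases
    ),
)
trainer.train()

# Persist the LoRA adapter and tokenizer, e.g. to a mounted Beam Volume.
model.save_pretrained("./model-weights")
tokenizer.save_pretrained("./model-weights")
```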
Running Fine-Tuning
Execute the script:
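Assuming `finetune.py` triggers its Beam function with `.remote()` under an `if __name__ == "__main__":` guard, running the file locally launches the job on Beam:

```sh
python finetune.py
```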
After completion, verify that the files are saved in your Beam Volume:
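One way is the Beam CLI's `ls` command; the volume name `model-weights` below is an assumption, so substitute whatever name your script mounts:

```sh
beam ls model-weights
```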
The volume should now contain the fine-tuned files. For a LoRA run these are typically the adapter weights and config (`adapter_model.safetensors`, `adapter_config.json`) along with the saved tokenizer files.
Training Performance Metrics
We tracked the fine-tuning run using Weights & Biases, which provided detailed metrics on training progress. The dashboard showed the training loss starting at approximately 1.85 and, despite noticeable step-to-step fluctuations, trending downward to roughly 0.95 by step 60. This indicates that the model was learning patterns from the Alpaca-cleaned dataset over the 60 training steps.
Evaluation
To understand the impact of fine-tuning the Meta Llama 3.1 8B model with Unsloth on the Alpaca-cleaned dataset, we evaluated both the base model and the fine-tuned model on two widely used benchmarks: HellaSwag (a commonsense reasoning task) and MMLU (Massive Multitask Language Understanding, covering a broad range of subjects). The results highlight the fine-tuned model’s improvements over the base model, demonstrating the effectiveness of our fine-tuning process.
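The guide does not reproduce the evaluation code itself; benchmark numbers like these are commonly produced with EleutherAI's lm-evaluation-harness. A sketch of that approach follows, assuming the adapter was saved to `./model-weights` (both model args are placeholders):

```python
import lm_eval

# Evaluate the 4-bit base model with the LoRA adapter applied.
# The "pretrained" and "peft" values are placeholders for your actual paths.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=unsloth/Meta-Llama-3.1-8B-bnb-4bit,peft=./model-weights",
    tasks=["hellaswag", "mmlu"],
)
print(results["results"])
```

Running the same call without the `peft` argument produces the base-model numbers for comparison.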
Overall Performance
The table below summarizes the overall performance on HellaSwag and MMLU. The fine-tuned model shows modest but consistent gains across both benchmarks.
| Benchmark | Base Model | Fine-tuned Model | Improvement |
|---|---|---|---|
| HellaSwag (acc) | 59.09% | 60.37% | +1.28 pp |
| HellaSwag (acc_norm) | 77.93% | 78.75% | +0.82 pp |
| MMLU (overall) | 61.42% | 62.33% | +0.91 pp |
- HellaSwag: The fine-tuned model improves accuracy (acc) by 1.28 percentage points and normalized accuracy (acc_norm) by 0.82 points, indicating stronger commonsense reasoning.
- MMLU: An overall gain of 0.91 percentage points suggests enhanced general knowledge and reasoning across diverse topics.
Analysis
The fine-tuned model demonstrates consistent improvements over the base model, particularly in tasks requiring logical reasoning, ethical judgment, and commonsense understanding. These gains align with the Alpaca-cleaned dataset’s focus on instruction-following and coherent responses.
Inference
Inference Script
The inference script (`inference.py`) loads the fine-tuned model and exposes an endpoint for generating responses.
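Condensed, the script pairs Unsloth's fast inference mode with Beam's `@endpoint` decorator. The sketch below is illustrative: the endpoint name, GPU type, volume name, and generation settings are all assumptions:

```python
from beam import Image, Volume, endpoint

@endpoint(
    name="llama31-finetune-inference",         # hypothetical endpoint name
    gpu="A10G",                                # illustrative GPU choice
    image=Image(python_version="python3.11"),  # use the shared image from Setup
    volumes=[Volume(name="model-weights", mount_path="./model-weights")],
)
def generate(prompt: str):
    from unsloth import FastLanguageModel

    # Load the base model plus the fine-tuned LoRA adapter from the volume.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="./model-weights",
        max_seq_length=2048,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # enable Unsloth's fast decoding path

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return {"output": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```

As written this reloads the model on every request; in a production endpoint you would load it once in Beam's `on_start` hook and reuse it across calls.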
Deploying the Endpoint
Run this command to deploy the inference endpoint:
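Assuming the handler above is named `generate` and lives in `inference.py`, the deploy command takes the form:

```sh
beam deploy inference.py:generate
```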
Once deployment finishes, Beam prints the invocation URL for the endpoint.
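You can then call the endpoint with an authenticated POST request; the URL and token below are placeholders:

```sh
curl -X POST '<YOUR_ENDPOINT_URL>' \
  -H 'Authorization: Bearer <YOUR_BEAM_TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is the capital of France?"}'
```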