SageMaker: Enhancing a Private LLM with the QLoRA Technique (Fine-tuning with SageMaker HuggingFace DLC)
- Dec 20, 2024

Written by Hyeonmin Kim
Fine-tuning means further training a pre-trained AI model on a dataset from a specific domain in order to strengthen the model's performance in that domain.

Today, based on the architecture above, we will:
- Use a SageMaker JumpStart model (Mistral) with the QLoRA technique
- Use the SageMaker HuggingFace DLC (Deep Learning Containers)
- Fine-tune the LLM on the nlpai-lab/databricks-dolly-15k-ko dataset
- Finally, enable Korean inference on the Mistral-7B model, which previously couldn't handle Korean
The overall flow is as follows:
After loading the dataset and converting it into a training dataset suitable for the Mistral model, we run the QLoRA training script as a SageMaker Training job. The training environment is a container provided by the SageMaker HuggingFace DLC. Once training is complete, we deploy the model to an endpoint using the HuggingFace inference container. Users only need to send requests to this endpoint.
Environment Setup

First, we need an environment in which to build the ML pipeline. Refer to the previous posts to set up the SageMaker Canvas environment and create a notebook for building ML pipelines.
Once the environment is ready, install the necessary Python packages, including huggingface-cli for working with the HuggingFace Hub.
!pip install "transformers==4.34.0" "datasets[s3]==2.13.0" "sagemaker>=2.190.0" "gradio==3.50.2" "huggingface_hub[cli]" --upgrade --quiet
After installation, you need to log in with huggingface-cli. The required token can be found at huggingface.co under Profile - Edit Profile - Access Tokens.

Once you have the token, log in:
!huggingface-cli login --token hf_xxxxxxxxxxxxxxxxxxx
Finally, specify the IAM role and default bucket:
import sagemaker
import boto3

sess = sagemaker.Session()
# set a bucket name here to override the session's default bucket
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # not running inside SageMaker: look the execution role up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Data Preparation

First, we need to prepare a Korean training set. We'll use nlpai-lab/databricks-dolly-15k-ko, a Korean translation of the databricks-dolly dataset provided by the Korea University research lab that developed the KULLM model.
The databricks-dolly dataset is an open-source dataset created by Databricks. It contains instructions in categories including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. It has 15,011 records in total.

Since we installed the Datasets library, we can easily load the data:
from datasets import load_dataset
from random import randrange
# Load dataset from the hub
dataset = load_dataset("nlpai-lab/databricks-dolly-15k-ko", split="train")
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
To train the Mistral model on this data, every sample must follow one consistent prompt format so the model can tell the parts of a record apart. Here the sections are marked with ### Instruction, ### Context (optional), and ### Answer, so we convert the dolly dataset to match this format.
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

print(format_dolly(dataset[randrange(len(dataset))]))
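To make the template concrete, here is a small self-contained sketch that applies the same formatting to two hand-written records. The records are hypothetical and exist only for illustration; they are not taken from the dataset. The second record has an empty context, so its ### Context section is dropped.

```python
def format_dolly(sample):
    """Render one dolly-style record as ### Instruction / ### Context / ### Answer."""
    instruction = f"### Instruction\n{sample['instruction']}"
    # the context section is optional and skipped when the field is empty
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# Hypothetical records for illustration only (not from the dataset)
with_context = {
    "instruction": "Summarize the text.",
    "context": "Seoul is the capital of South Korea.",
    "response": "Seoul is Korea's capital.",
}
without_context = {"instruction": "Name a color.", "context": "", "response": "Blue."}

formatted_with = format_dolly(with_context)
formatted_without = format_dolly(without_context)

print(formatted_with)
print("---")
print(formatted_without)
```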
Next, we create a tokenizer. HuggingFace's AutoTokenizer loads the tokenizer that matches the Mistral model:
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
Now, build a data preprocessing pipeline with this tokenizer: apply the template to each sample, tokenize, and then pack the dataset into fixed-length chunks:
from random import randrange
import sys

# make the helper module in ../scripts/utils importable
sys.path.append("../scripts/utils")
from pack_dataset import pack_dataset

def template_dataset(sample):
    # apply the prompt template and append the end-of-sequence token
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randrange excludes the upper bound, so the index is always valid)
print(dataset[randrange(len(dataset))]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")
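The chunking step concatenates all tokenized samples and cuts the stream into pieces of exactly chunk_length tokens, carrying leftover tokens into the next batch. The toy sketch below shows that idea on plain integer lists with a chunk length of 4. The function name pack_tokens and the sample batches are made up for illustration; this is not the pack_dataset utility itself.

```python
from itertools import chain


def pack_tokens(sequences, chunk_length, remainder=None):
    """Concatenate token lists and cut them into chunks of exactly chunk_length.

    Returns (chunks, new_remainder). The remainder holds the tokens that did not
    fill a whole chunk and should be passed in with the next batch.
    """
    tokens = list(remainder or []) + list(chain.from_iterable(sequences))
    # largest multiple of chunk_length that fits (0 if there are too few tokens)
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i:i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]


# Hypothetical "tokenized" batches for illustration
batch_1 = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # 9 tokens in total
batch_2 = [[10, 11], [12]]                   # 3 tokens in total

chunks_1, rest_1 = pack_tokens(batch_1, chunk_length=4)
chunks_2, rest_2 = pack_tokens(batch_2, chunk_length=4, remainder=rest_1)

print(chunks_1, rest_1)
print(chunks_2, rest_2)
```

Sample boundaries are ignored on purpose: with an EOS token appended to every sample, one chunk can contain the end of one record and the start of the next, and no padding tokens are wasted.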
The source code of the pack_dataset utility is as follows:
from itertools import chain
from functools import partial

# remainder from previous batches, carried over to the next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def pack_dataset(dataset, chunk_length=2048):
    print(f"Chunking dataset into chunks of {chunk_length} tokens.")

    def chunk(sample, chunk_length=chunk_length):
        # define global remainder variable to save remainder from batches to use in next batch
        global remainder
        # Concatenate all texts and add remainder from previous batch
        concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
        concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
        # get total number of tokens for batch
        batch_total_length = len(concatenated_examples[list(sample.keys())[0]])
        # largest multiple of chunk_length that fits in this batch
        # (0 if the batch is shorter than one chunk, so everything is carried over)
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
            for k, t in concatenated_examples.items()
        }
        # add remainder to global variable for next batch
        remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
        # prepare labels
        result["labels"] = result["input_ids"].copy()
        return result

    # tokenize and chunk dataset
    lm_dataset = dataset.map(
        partial(chunk, chunk_length=chunk_length),
        batched=True,
    )
    print(f"Total number of samples: {len(lm_dataset)}")
    return lm_dataset
Once the training data is ready, save it to S3 for use in the SageMaker Training job:
training_input_path = f's3://{sess.default_bucket()}/processed/mistral/dolly-ko/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

Training Process

With the dataset prepared, let's run the training as a SageMaker Training job.
Training uses a QLoRA script. QLoRA is an efficient fine-tuning approach that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving the task performance of full 16-bit fine-tuning. QLoRA backpropagates gradients through a frozen, 4-bit quantized pre-trained language model into Low-Rank Adapters (LoRA).
The script can be found at: https://github.com/artidoro/qlora/blob/main/qlora.py
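A rough back-of-the-envelope sketch can show where the savings come from. The numbers below are assumptions chosen for illustration: a nominal 7 billion parameters, and a single 4096 x 4096 weight matrix with LoRA rank 64. The estimate covers weights only and ignores quantization constants, activations, and optimizer state, so real memory usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_out, d_in, rank):
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full d_out x d_in update."""
    return rank * (d_out + d_in)


NUM_PARAMS = 7_000_000_000  # assumption: nominal "7B" parameter count

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # weights stored in 16-bit
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # weights stored in 4-bit

# assumption: one 4096 x 4096 projection matrix, LoRA rank 64
full_matrix_params = 4096 * 4096
adapter_params = lora_trainable_params(4096, 4096, rank=64)
trainable_fraction = adapter_params / full_matrix_params

print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
print(f"adapter params: {adapter_params:,} of {full_matrix_params:,} ({trainable_fraction:.3%})")
```

Under these assumptions the frozen base weights shrink to a quarter of their 16-bit size, and only a few percent of each adapted matrix is actually trained.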
Back to SageMaker, let's define parameters for training:
from huggingface_hub import HfFolder

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                           # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',  # path where SageMaker mounts the training dataset
    'num_train_epochs': 3,
    'per_device_train_batch_size': 6,
    'gradient_accumulation_steps': 2,
    'gradient_checkpointing': True,
    'bf16': True,
    'tf32': True,
    'learning_rate': 2e-4,
    'max_grad_norm': 0.3,
    'warmup_ratio': 0.03,
    "lr_scheduler_type": "constant",
    'save_strategy': "epoch",
    "logging_steps": 10,
    'merge_adapters': True,                         # merge the LoRA adapters into the model after training
    'use_flash_attn': True,
    'output_dir': '/opt/ml/checkpoints',            # directory synced to checkpoint_s3_uri
}

if HfFolder.get_token() is not None:
    hyperparameters['hf_token'] = HfFolder.get_token()
Next, we define an Estimator. Since we are using the SageMaker HuggingFace DLC, we use the HuggingFace Estimator, which is integrated into the SageMaker SDK and is therefore straightforward to define:
from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = f'huggingface-qlora-{hyperparameters["model_id"].replace("/","-").replace(".","-")}'
checkpoint_s3 = f's3://{sess.default_bucket()}/checkpoints'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point = 'run_qlora.py',         # training script
    source_dir = '../scripts',            # directory containing the training script
    instance_type = 'ml.g5.4xlarge',
    instance_count = 1,
    checkpoint_s3_uri = checkpoint_s3,    # S3 location the checkpoints are synced to
    max_run = 2*24*60*60,                 # maximum runtime in seconds (2 days)
    base_job_name = job_name,
    role = role,
    volume_size = 300,                    # EBS volume size in GB
    transformers_version = '4.28',
    pytorch_version = '2.0',
    py_version = 'py310',
    hyperparameters = hyperparameters,
    environment = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" },
    disable_output_compression = True     # store the model uncompressed in S3
)
Now that all training preparations are complete, let's define the input data and start training with fit:
data = {'training': training_input_path}
huggingface_estimator.fit(data, wait=True)
SageMaker provisions the deep learning container for training, and training begins.

While training is in progress, you can check its status under SageMaker Training and monitor the logs through CloudWatch.
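When reading the training logs, it helps to know how much data one optimizer step covers. The sketch below works this out from the hyperparameters defined earlier and the chunk length of 2048 used when packing the dataset. The GPU count of 1 is stated as an assumption about the ml.g5.4xlarge instance type.

```python
per_device_train_batch_size = 6  # from the hyperparameters
gradient_accumulation_steps = 2  # from the hyperparameters
num_gpus = 1                     # assumption: ml.g5.4xlarge provides a single GPU
chunk_length = 2048              # tokens per packed training sample

# sequences that contribute to one weight update
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# tokens that contribute to one weight update
tokens_per_step = sequences_per_step * chunk_length

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens per optimizer step: {tokens_per_step}")
```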



Training took a total of 29,172 seconds (approximately 486.2 minutes, or 8.1 hours), and on a g5.4xlarge instance it cost $13.16. SageMaker Training works similarly to AWS Batch: you are billed only for the time the job actually runs, so you don't pay for instances that sit idle while being provisioned and maintained. This makes it much cheaper than training on an always-on GPU instance.
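The figures above can be checked with a few lines of arithmetic. The hourly rate in this sketch is not an official price: it is simply the rate implied by the reported duration and cost, and the helper training_cost is a made-up name for illustration. Check the current pricing page for your region before estimating your own jobs.

```python
training_seconds = 29_172   # reported training duration
reported_cost_usd = 13.16   # reported training cost

minutes = training_seconds / 60
hours = training_seconds / 3600
# hourly rate implied by the two reported numbers (not an official price)
implied_hourly_rate = reported_cost_usd / hours


def training_cost(seconds, hourly_rate_usd):
    """Cost of a job billed only for the seconds it actually runs."""
    return seconds / 3600 * hourly_rate_usd


print(f"{minutes:.1f} minutes, {hours:.1f} hours")
print(f"implied hourly rate: ${implied_hourly_rate:.3f}")
print(f"cost of a 1-hour job at that rate: ${training_cost(3600, implied_hourly_rate):.2f}")
```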
Model Deployment

Once model training is complete, we need to deploy the model. First, let's check that the model artifacts were created correctly. They are stored in the default S3 bucket, under the training job name as prefix, at output/model.

Once we confirm the model is saved properly, let's proceed with deployment.
First, we need a container image for deployment. Again, we use the inference image provided by HuggingFace:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.1.0",
    session=sess,
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
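Using this image, let's define the LLM model. Define the environment variables used for deployment and pass the S3 URI of the model confirmed above as the model data:
import json
from sagemaker.huggingface import HuggingFaceModel

# S3 prefix of the trained model confirmed above (placeholder, replace with your own path)
model_s3_path = "s3://<your-bucket>/<training-job-name>/output/model/"

instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300  # seconds to wait for the container to become healthy

config = {
    'HF_MODEL_ID': "/opt/ml/model",              # load the model from the path where SageMaker mounts it
    'SM_NUM_GPUS': json.dumps(number_of_gpu),    # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),        # max length of the input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),        # max length of input plus generated text
}

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data={'S3DataSource': {'S3Uri': model_s3_path, 'S3DataType': 'S3Prefix', 'CompressionType': 'None'}},
    env=config
)
Now that we've defined the model, we just need to deploy it: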
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
The deployment process took about 10 minutes.
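The two length settings in the deployment config bound every request: the prompt may use at most MAX_INPUT_LENGTH tokens, and prompt plus generated tokens may not exceed MAX_TOTAL_TOKENS. The following sketch, with a made-up helper name, shows how much room is left for generation under the values used above.

```python
MAX_INPUT_LENGTH = 1024   # value from the deployment config
MAX_TOTAL_TOKENS = 2048   # value from the deployment config


def max_new_tokens_allowed(prompt_tokens):
    """Largest number of tokens that can still be generated for a prompt of this size."""
    if prompt_tokens > MAX_INPUT_LENGTH:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {MAX_INPUT_LENGTH}"
        )
    return MAX_TOTAL_TOKENS - prompt_tokens


print(max_new_tokens_allowed(100))
print(max_new_tokens_allowed(1024))
```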
Korean Inference Testing

Now, to check whether the model has properly learned Korean, let's send the same prompt to the Mistral JumpStart base model and to the model we trained, and compare the responses.
The inference code is as follows:
import json
import boto3

newline, bold, unbold = "\n", "\033[1m", "\033[0m"
endpoint_name = "endpoint name"

def query_endpoint(payload):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName="inference component name (omit if none)",
        ContentType="application/json",
        Body=json.dumps(payload).encode("utf-8"),
    )
    # the endpoint returns a JSON list with one entry per generated sequence
    model_predictions = json.loads(response["Body"].read())
    generated_text = model_predictions[0]["generated_text"]
    print(f"Input Text: {payload['inputs']}{newline}" f"Generated Text: {bold}{generated_text}{unbold}{newline}")

General JumpStart Base Model

Model with Korean Dataset Applied

When asked the same question, "대한민국 수도 서울에 대해 알려줘" (Tell me about Seoul, the capital of South Korea), the base model responds in English. Since the base model doesn't support Korean, an English response is the best it can do, and sometimes it doesn't recognize the Korean at all and just produces nonsense. The model trained on the Korean dataset, however, gives a correct answer in Korean.
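A request for this comparison could be built roughly as follows. This is a sketch under assumptions: the prompt is wrapped in the same ### Instruction / ### Answer layout used for training, and the generation parameters (max_new_tokens, do_sample, temperature) are illustrative values, not the settings used for the results above. The helper name build_payload is also made up.

```python
import json


def build_payload(question, max_new_tokens=512):
    """Wrap a question in the training prompt layout and add generation parameters."""
    # end the prompt at the answer header so the model continues with the answer
    prompt = f"### Instruction\n{question}\n\n### Answer\n"
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # assumption: illustrative value
            "do_sample": True,                 # assumption: illustrative value
            "temperature": 0.7,                # assumption: illustrative value
        },
    }


payload = build_payload("대한민국 수도 서울에 대해 알려줘")
print(json.dumps(payload, ensure_ascii=False, indent=2))

# With a deployed endpoint, the payload would be sent like this:
# query_endpoint(payload)
```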
Today, we fine-tuned the Mistral model, which doesn't support Korean, on a Korean dataset so that it can be used in Korean. Since the training dataset has only about 15k records, which is not a large amount of data, we could only confirm that Korean now works to some extent. Training on more data should improve this.
Since training on more data would incur considerable costs, we plan to reduce them by using spot instances for training. We'll share updates on this when they are available.
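As a rough preview of that plan, the SageMaker SDK estimators accept use_spot_instances and max_wait in addition to the max_run and checkpoint_s3_uri arguments already used in this post. The sketch below only assembles those arguments. The one-hour buffer and the bucket placeholder are assumptions for illustration, and nothing here has been run as a training job.

```python
max_run = 2 * 24 * 60 * 60  # same runtime limit as the estimator in this post (seconds)

spot_kwargs = {
    "use_spot_instances": True,
    "max_run": max_run,
    # total time including waiting for spot capacity; must be at least max_run
    "max_wait": max_run + 60 * 60,                          # assumption: 1 hour buffer
    # checkpoints let an interrupted job resume instead of starting over
    "checkpoint_s3_uri": "s3://<your-bucket>/checkpoints",  # placeholder
}

# sanity check before passing the arguments to an estimator
assert spot_kwargs["max_wait"] >= spot_kwargs["max_run"]

# These would be merged into the estimator definition, for example:
# huggingface_estimator = HuggingFace(..., **spot_kwargs)
print(spot_kwargs)
```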
This concludes our post. Thank you.












