Fine-Tuning for Korean¶

Contents¶

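Run continued pretraining (CPT) on the Korean Wikipedia dataset.
- Wikipedia dataset
Run instruction tuning in Korean on a Korean translation of the Alpaca data.
- Alpaca GPT4 Dataset
To make the differences easy to observe, we use the meta-llama/Llama-3.1-8B model.

We use the Unsloth library: this is a teaching exercise, so we want it to run fast and use little memory. This material was written with reference to the official Unsloth documentation.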

In [ ]:

# You can skip this if you are running on a RunPod image with Unsloth preinstalled

# Also get the latest nightly Unsloth!
#!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install unsloth

In [ ]:

!pip install huggingface_hub

In [ ]:

!nvidia-smi

In [ ]:

from huggingface_hub import login

# Enter a token with Read permission!
login(token='')

In [ ]:

# Free up memory
import torch
import gc

#del model  # or any other variable
#gc.collect()
#torch.cuda.empty_cache()

In [ ]:

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

# 4-bit pre-quantized models provided directly by Unsloth. They are smaller, so use one if you want a faster run.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Inference!¶

Let's test the model as it stands. How capable is the base Llama?

In [ ]:

from transformers import TextStreamer

FastLanguageModel.for_inference(model)

text_streamer = TextStreamer(tokenizer)

input_text = "체첸 공화국은"
#input_text = "Chechen Republic is"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

model.generate(**inputs, streamer=text_streamer, max_new_tokens=256)

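Let's set up LoRA adapters so we can tune efficiently. Only about 1 to 10% of the parameters are updated. Note the three options below. All of them help when you change the corpus, as in CPT, or otherwise change a lot at once. The earlier style-tuning example did not use them.

embed_tokens
- Also trains the token embedding layer.
lm_head
- Also trains the final layer that projects into vocabulary space.
rsLoRA
- ref, Rank Stabilized LoRA
- Keeps LoRA adapter training stable so that learning still works well at larger ranks.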
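
To make the rsLoRA option concrete: standard LoRA multiplies the low-rank update by lora_alpha / r, whereas rank-stabilized LoRA multiplies it by lora_alpha / sqrt(r), so the update does not shrink as quickly when the rank grows. The sketch below only compares the two factors. The helper name lora_scaling is hypothetical, and lora_alpha = 32 with r = 128 mirrors the values used in this notebook.

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Return the factor that multiplies the low-rank update B @ A."""
    if use_rslora:
        # Rank-stabilized LoRA: divide by sqrt(r) instead of r.
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

# Compare both factors across a few ranks with a fixed alpha of 32.
for r in (8, 32, 128):
    standard = lora_scaling(32, r, use_rslora=False)
    stabilized = lora_scaling(32, r, use_rslora=True)
    print(f"r={r:>3}  alpha/r={standard:.3f}  alpha/sqrt(r)={stabilized:.3f}")
```

At r = 128 the standard factor is 0.25, while the rank-stabilized factor is roughly 2.83, which is one reason the same lora_alpha behaves differently once use_rslora is switched on.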

In [ ]:

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
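
As a rough check on adapter size, the sketch below counts the LoRA parameters that r = 128 adds to the seven projection modules. The layer dimensions are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024). embed_tokens and lm_head are left out of the count, since how they are trained depends on the library.

```python
# Assumed Llama 3.1 8B dimensions (not read from the loaded model).
HIDDEN, MLP, KV_DIM, LAYERS, RANK = 4096, 14336, 1024, 32, 128

# (in_features, out_features) of each adapted projection in one decoder layer.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in projections.values())
total = per_layer * LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters hold about 335 million parameters, roughly 4% of an 8B-parameter model, which sits inside the 1 to 10% range mentioned above.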

Data Prep¶

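We load the Korean subset of the Wikipedia dataset and use it as the corpus for continued pretraining, so that the model gets better at Korean. You can of course pick another language; see Wikipedia's List of Languages.

This stage trains on plain text completion; it is not chat-style for now. See the TRL documentation for details.
Be sure to append EOS_TOKEN to the end of each training text; otherwise generation never stops.

(Note) For ChatML-style training, see the conversational training code -> notebook.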

In [14]:

# Wikipedia provides a title and an article text.

wikipedia_prompt = """위키피디아 문서
### 제목: {}

### 내용:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    titles = examples["title"]
    texts  = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }
pass
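
To see what one formatted training example looks like without loading the tokenizer, here is a small sketch that runs the same formatting logic on a hand-made batch. The end-of-sequence string and the two sample rows are placeholders chosen for illustration, not values taken from the dataset, and format_batch is a hypothetical name.

```python
wikipedia_prompt = """위키피디아 문서
### 제목: {}

### 내용:
{}"""

EOS_TOKEN = "<|end_of_text|>"  # placeholder; the notebook uses tokenizer.eos_token

def format_batch(examples):
    """Turn a batch of (title, text) columns into single training strings."""
    outputs = []
    for title, text in zip(examples["title"], examples["text"]):
        # The EOS marker tells the model where a document ends.
        outputs.append(wikipedia_prompt.format(title, text) + EOS_TOKEN)
    return {"text": outputs}

# Hand-written rows standing in for a batched slice of the dataset.
toy_batch = {
    "title": ["서울", "부산"],
    "text": ["서울은 대한민국의 수도이다.", "부산은 항구 도시이다."],
}

formatted = format_batch(toy_batch)["text"]
print(formatted[0])
```

Each resulting string starts with the fixed header, carries the title and article body, and ends with the end-of-sequence marker.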

The dataset is very large, so let's sample just N% of it for training.

In [ ]:

from datasets import load_dataset

dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)

# select N% of the data to make training faster!
dataset = dataset.train_test_split(train_size = 0.10)["train"]

dataset = dataset.map(formatting_prompts_func, batched = True,)

In [ ]:

dataset[0]

Continued Pretraining¶

Let's train with UnslothTrainer. See the TRL SFT docs.

We run just 120 steps here (about 20 minutes on an A40).

Set num_train_epochs=1 and turn off the step limit (max_steps=None) to train for one full epoch. The more data you feed in, the longer it trains.

(Recommended) For CPT, set embedding_learning_rate 2x to 10x smaller than learning_rate. If the embeddings swing wildly, every layer after them is thrown off as well.
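
The numbers behind this run can be worked out by hand. The sketch below uses the settings from the trainer configuration in this notebook (per-device batch size 2, gradient accumulation 8, 120 steps, learning rates 5e-5 and 1e-5) and assumes a single GPU. The dataset size is a made-up round number used only to illustrate the epoch calculation.

```python
import math

per_device_batch = 2
grad_accum = 8
num_gpus = 1              # assumption: single-GPU run
max_steps = 120

# Sequences consumed per optimizer step, and over the whole short run.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps

dataset_size = 60_000     # hypothetical number of training examples
steps_per_epoch = math.ceil(dataset_size / effective_batch)

learning_rate = 5e-5
embedding_learning_rate = 1e-5
ratio = learning_rate / embedding_learning_rate

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"steps for one epoch over {dataset_size} examples: {steps_per_epoch}")
print(f"embedding learning rate is {ratio:.1f}x smaller")
```

With these values, 120 steps cover 1,920 examples, so the short run touches only a small slice of the sampled corpus, and the embedding learning rate sits in the middle of the recommended 2x to 10x range.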

In [ ]:

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        #warmup_ratio = 0.1,
        #num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

In [ ]:

#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [ ]:

trainer_stats = trainer.train()

Save & Load¶

Let's save the LoRA adapters. You can upload them to Hugging Face with push_to_hub, save them locally with save_pretrained, or do both.

[NOTE] This saves only the LoRA adapters, not the full model.

In [ ]:

model.save_pretrained("llama3.1-8b-kowiki") # Local saving
tokenizer.save_pretrained("llama3.1-8b-kowiki")
model.push_to_hub("jonhpark/llama3.1-8b-kowiki", token = "") # Online saving
tokenizer.push_to_hub("jonhpark/llama3.1-8b-kowiki", token = "") # Online saving

Inference!¶

Let's test the model in its current state, taking care to match the dataset format when running inference.

In [ ]:

from transformers import TextStreamer

FastLanguageModel.for_inference(model)

text_streamer = TextStreamer(tokenizer)

input_text="""위키피디아 문서
### 제목: 체첸공화국

### 내용:"""
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

model.generate(**inputs, streamer=text_streamer, max_new_tokens=256)
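
One way to keep the inference prompt aligned with the training format is to build it from the same template and leave the content slot empty. A minimal sketch, reusing the Wikipedia template from the data preparation step; the helper name build_inference_prompt is hypothetical.

```python
wikipedia_prompt = """위키피디아 문서
### 제목: {}

### 내용:
{}"""

def build_inference_prompt(title: str) -> str:
    # Fill only the title; the empty content slot is what the model continues.
    return wikipedia_prompt.format(title, "")

prompt = build_inference_prompt("체첸공화국")
print(repr(prompt))
```

A prompt built this way ends with a newline after the content header, exactly as the training examples do, whereas a hand-typed prompt that stops right after the header omits it. The difference is small, but it is worth checking when outputs look off.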

Instruction Finetuning¶

We use the Alpaca GPT4 dataset, in a version translated into Korean.

References

Original data: GPT4 for alpaca, vicgalle/alpaca-gpt4
Multilingual translations: MultilingualSIFT project

In [ ]:

from datasets import load_dataset
alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-korean", split = "train")

In [ ]:

print(alpaca_dataset[0])

In [ ]:

_alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
# Becomes:
alpaca_prompt = """다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
{}

### 응답:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(conversations):
    texts = []
    conversations = conversations["conversations"]
    for convo in conversations:
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)
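
The mapping function assumes each record holds a two-turn conversation whose entries expose their text under a value key. The sketch below runs the same logic on one hand-written record. The record contents, the from labels, the end-of-sequence string, and the name format_conversations are illustrative assumptions.

```python
alpaca_prompt = """다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
{}

### 응답:
{}"""

EOS_TOKEN = "<|end_of_text|>"  # placeholder; the notebook uses tokenizer.eos_token

def format_conversations(batch):
    """Convert two-turn conversations into instruction/response strings."""
    texts = []
    for convo in batch["conversations"]:
        # Turn 0 is the instruction, turn 1 is the response.
        instruction, response = convo[0]["value"], convo[1]["value"]
        texts.append(alpaca_prompt.format(instruction, response) + EOS_TOKEN)
    return {"text": texts}

# One hand-written record shaped like a batched slice of the dataset.
toy_batch = {
    "conversations": [
        [
            {"from": "human", "value": "한국의 수도는 어디인가요?"},
            {"from": "gpt", "value": "대한민국의 수도는 서울입니다."},
        ]
    ]
}

formatted = format_conversations(toy_batch)["text"]
print(formatted[0])
```

If a record had more than two turns, this logic would silently ignore everything after the first response, so it is worth inspecting a few raw records before training.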

In [ ]:

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use num_train_epochs and warmup_ratio for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        #warmup_ratio = 0.1,
        #num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
        report_to="none"
    ),
)

In [ ]:

trainer_stats = trainer.train()

In [ ]:

#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

In [ ]:

model.save_pretrained("llama3.1-8b-kowiki-instruct") # Local saving
tokenizer.save_pretrained("llama3.1-8b-kowiki-instruct")
model.push_to_hub("jonhpark/llama3.1-8b-kowiki-instruct-lora", token = "") # Online saving
tokenizer.push_to_hub("jonhpark/llama3.1-8b-kowiki-instruct-lora", token = "") # Online saving

Inference¶

This time we prepare the input in the same alpaca prompt format used for training, then run inference.

In [29]:

alpaca_prompt = """다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
{}

### 응답:
{}"""

In [ ]:

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        # "What is Korean music like?"
        "체첸공화국에 대해 설명해", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Toggle whether to load the LoRA model, then test it.

In [ ]:

if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        #model_name = "google/gemma-3-12b-it",
        model_name = "jonhpark/lora_model",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        token = "",
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

In [ ]:

inputs = tokenizer(
[
    alpaca_prompt.format(
        "체첸공화국에 대해 설명해",
        "", # output - leave this blank for generation!
    ),
], return_tensors = "pt").to("cuda")


from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Exporting for Serving with vLLM¶

You can save and export the model in whichever format you need. Both 16-bit and 4-bit are supported, and you can also upload just the LoRA adapters without merging.

Upload to Hugging Face with push_to_hub_merged. You need a write token from https://huggingface.co/settings/tokens.

In [ ]:

# Merge to 16bit
if True: model.save_pretrained_merged("llama3.1-8b-kowiki-instruct-16bit", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("jonhpark/llama3.1-8b-kowiki-instruct-16bit", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("Llama-3.1-8B-kowiki-alpaca-4bit", tokenizer, save_method = "merged_4bit_forced",)
if False: model.push_to_hub_merged("jonhpark/Llama-3.1-8B-kowiki-alpaca-4bit", tokenizer, save_method = "merged_4bit_forced", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("Llama-3.1-8B-kowiki-alpaca-lora", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("jonhpark/Llama-3.1-8B-kowiki-alpaca-lora", tokenizer, save_method = "lora", token = "")

GGUF¶

You can also export to GGUF format using save_pretrained_gguf and push_to_hub_gguf.

For the list of supported quantization methods, see the linked Wiki page.

q8_0 - Fast conversion. High resource use, but generally acceptable.
q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
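
For a rough sense of file sizes, the two simplest formats can be estimated from bits per weight. The sketch assumes about 8.03 billion parameters for the model, 16 bits per weight for f16, and 8.5 bits per weight for q8_0 (blocks of 32 int8 values plus one 16-bit scale). It ignores metadata and any tensors kept at a different precision, so treat the results as ballpark figures.

```python
PARAMS = 8.03e9  # assumed parameter count for an 8B-class model

# Bits stored per weight; q8_0 packs 32 int8 values with one 16-bit scale.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,
}

def estimated_gb(params: float, bits_per_weight: float) -> float:
    """Approximate file size in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {bits} bits/weight, about {estimated_gb(PARAMS, bits):.1f} GB")
```

The k-quant formats such as q4_k_m and q5_k_m mix several block types, so their sizes are not captured by a single fixed figure here.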

In [ ]:

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q5_k_m", token = "")
