Local LLM - The Steps

Here is the procedure to set up your Windows environment to train an LLM.
Important: this only works with an NVIDIA graphics card, because some of the libraries involved are built specifically for CUDA.
Install Python
Download Python 3.12.9 at this address: https://www.python.org/downloads/release/python-3129/
Install Python 3.12.9 using the wizard, and do not forget to enable long paths on Windows to avoid compilation problems.
Remark: do not install 3.13.x, because it is not compatible with some of the other libraries used here.
Check your python version with this command:
python --version
You should get the Python version installed on your machine.
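Remark: if you want to double-check the long-path setting mentioned above, here is a minimal sketch that reads the corresponding Windows registry value (this assumes the standard LongPathsEnabled key; it is only a convenience check, not part of the installation):

import winreg

# Read the LongPathsEnabled value (1 = long paths enabled).
key = winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Control\FileSystem",
)
value, _ = winreg.QueryValueEx(key, "LongPathsEnabled")
winreg.CloseKey(key)
print("Long paths enabled:", value == 1)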
Install Unsloth
Install the "unsloth" system to increase the speed of your LLM training system.
pip install unsloth
NVIDIA CUDA Version
Important: you have to know your CUDA version in order to install the matching libraries.
You can use one of these commands to check the version of CUDA:
nvidia-smi
or
nvcc --version
In my case, the version is 12.8. Note that nvidia-smi reports the highest CUDA version supported by your driver, so libraries built against an older CUDA version will still run.
For Unsloth, I install the build for CUDA 12.4, because it is stable, together with torch 2.6, using the following command:
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"
C++
Here is the list of C++ components you have to install from Visual Studio:
Install CUDA toolkit
You also have to install the CUDA toolkit.
Here is the download link: https://developer.nvidia.com/cuda-downloads
Install Unsloth (again)
Remark: I am not sure this second install is strictly required, but running it again is harmless and makes sure the Windows extras are installed.
pip install "unsloth[windows] @ git+https://github.com/unslothai/unsloth.git"
Install Triton
Install Triton for Windows by using this command:
pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.2.0-windows.post10/triton-3.2.0-cp312-cp312-win_amd64.whl
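To confirm that the wheel matches your interpreter (the cp312 wheel requires Python 3.12), here is a quick check you can run:

# The import fails if the wheel does not match your Python version.
import triton
print(triton.__version__)  # should print 3.2.0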
Install MSVC
You have to install MSVC if it is not already on your system.
Then add the following path to your Windows "Path" environment variable (adjust the version number to match your installation):
C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.43.34808\bin\Hostx64\x64
Open a terminal and run the following command to verify that the C++ toolchain is correctly installed:
cl
Install CUDA-Enabled PyTorch for Windows for CUDA 12.4
Here is the command to install PyTorch for the CUDA 12.4 version:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
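Before moving on, it is worth verifying that PyTorch actually sees your GPU. A small check script:

import torch

# All of these should confirm the CUDA 12.4 build is active.
print(torch.__version__)          # should end with +cu124
print(torch.version.cuda)         # should print 12.4
print(torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))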
Verify your installation
Create a file "test_triton.py" on your computer (e.g. C:\python\test_triton.py) and copy in the following code:
import torch
import triton
import triton.language as tl

# Triton kernel: element-wise addition of two vectors.
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

a = torch.rand(3, device="cuda")
b = a + a                 # reference result computed by PyTorch
b_compiled = add(a, a)    # result computed by the Triton kernel
print(b_compiled - b)
print("If you see tensor([0., 0., 0.], device='cuda:0'), then it works")

Verify that your environment is correctly installed with this command:
python.exe .\test_triton.py
If everything is set up correctly, the script prints tensor([0., 0., 0.], device='cuda:0').
Install llama-quantize for Windows
You need this because, on Windows, Unsloth does not include it!
Download the executables for Windows from this website:
https://github.com/ggml-org/llama.cpp/releases
Copy all the files to the folder "llama.cpp":
C:\python\SLM_Models\phi_3_5\llama.cpp
Remark: it's needed for this command in the script → model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
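Before running the training script, you can check that llama-quantize.exe is actually reachable. A minimal sketch, assuming you copied the files to the folder above (shutil.which only finds it if that folder is on your PATH):

import os
import shutil

# Look for llama-quantize on the PATH, then fall back to the full path.
exe = shutil.which("llama-quantize") or r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
print("llama-quantize found:", os.path.isfile(exe), "->", exe)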
Convert the model from BF16 to q4_k_m
Convert the model from "C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf" to "C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf".
Add this code to your Python script:
import subprocess

exe_path = r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
quantization_type = "q4_k_m"

# Run llama-quantize to produce the 4-bit q4_k_m model.
subprocess.run([exe_path, input_model, output_model, quantization_type], check=True)
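As a quick sanity check after the conversion, you can compare the file sizes; the q4_k_m file should be much smaller than the BF16 one (roughly a quarter of the size):

import os

# Print the size of both GGUF files in GB.
for path in (r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf",
             r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"):
    print(path, "->", round(os.path.getsize(path) / 1e9, 2), "GB")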
Final script
import os
import subprocess

import torch
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Make llama-quantize.exe reachable from the PATH.
exe_dir = r"C:\python\SLM_Models\phi_3_5\llama.cpp"
os.environ["PATH"] += os.pathsep + exe_dir

max_seq_length = 2048
dtype = None          # None = auto-detect (bfloat16 on recent GPUs, float16 otherwise)
load_in_4bit = True   # load the base model in 4 bits to reduce VRAM usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# Map the ShareGPT-style fields ("from"/"value") onto the phi-3 chat template.
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,  # keep at 1 on Windows: multiprocessing causes issues here
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer.train()

# Export the merged model to GGUF (produces unsloth.BF16.gguf in the "model" folder).
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# The following command is converted to Python:
# .\llama.cpp\llama-quantize.exe .\model\unsloth.BF16.gguf .\model\unsloth.q4_k_m.gguf q4_k_m
exe_path = r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
quantization_type = "q4_k_m"
subprocess.run([exe_path, input_model, output_model, quantization_type], check=True)
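Once training completes, you may want to try the fine-tuned model before (or instead of) converting it to GGUF. Here is a minimal sketch you can append after trainer.train(); the prompt is just an example:

# Quick test of the fine-tuned model (append after trainer.train()).
FastLanguageModel.for_inference(model)  # switch Unsloth to its faster inference mode

messages = [{"from": "human", "value": "Explain LoRA in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # append the assistant turn so the model answers
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True)
print(tokenizer.batch_decode(outputs))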
Optional Components
Install Jupyter Notebook
Install Jupyter Notebook to use the ready-made Unsloth notebook.
pip install notebook
Run Jupyter Notebook by using this command in your terminal:
jupyter notebook
unsloth Notebook
I use this unsloth Jupyter Notebook: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing
Download it, then open it from your Jupyter Notebook system by running the following command in your terminal:
jupyter notebook
Sources
What | Link | Description
unsloth | https://github.com/unslothai/unsloth | Improves the performance of LLM training on your machine.
triton-windows | https://github.com/woct0rdho/triton-windows | Triton builds for Windows.
CUDA Toolkit | https://developer.nvidia.com/cuda-downloads | NVIDIA CUDA development toolkit.
