Local LLM - The Steps

Here is the procedure for setting up a Windows environment to train an LLM.

Important: this only works with an NVIDIA graphics card, because some of the libraries used here are developed for NVIDIA hardware only.

Install Python

Download Python 3.12.9 at this address: https://www.python.org/downloads/release/python-3129/
Install Python 3.12.9 using the installation wizard, and do not forget to enable long paths on Windows to avoid compilation problems; you can verify this setting with the check below.
Remark: do not install 3.13.x, because it is not compatible with some of the other libraries used here.
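If you are not sure whether long paths are already enabled, you can query the registry from Python. This is a minimal sketch: LongPathsEnabled is the standard Windows registry value that the Python installer's "Disable path length limit" option sets.
import winreg

# Read the LongPathsEnabled value under HKLM, the setting toggled by the
# Python installer's "Disable path length limit" option.
try:
    key = winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE,
        r"SYSTEM\CurrentControlSet\Control\FileSystem",
    )
    value, _ = winreg.QueryValueEx(key, "LongPathsEnabled")
except FileNotFoundError:
    value = 0
print("Long paths enabled:", bool(value))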
 
Check your Python version with this command:
python --version
You should see the Python version installed on your machine, for example:
Python 3.12.9

Install Unsloth

Install "unsloth" to speed up LLM training on your machine.
pip install unsloth
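To confirm that the package is installed, you can print its version with a one-liner (this uses the standard importlib.metadata module, so it works for any pip-installed package):
python -c "import importlib.metadata as m; print(m.version('unsloth'))"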

NVIDIA CUDA Version

Important: you have to know your CUDA version so that you install the libraries that match it.
You can use one of these commands to check the version of CUDA:
nvidia-smi
or
nvcc --version
In my case, nvidia-smi reports CUDA version 12.8.
With unsloth, I install the CUDA 12.4 build (a stable choice) together with torch 2.6; a build for CUDA 12.4 runs fine on a newer 12.8 driver. Use the following command:
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"

C++

From the Visual Studio installer, install the C++ build components; you need at least the MSVC compiler toolset and a Windows SDK.

Install CUDA toolkit

You also have to install the CUDA toolkit.
Here is the download link:

Install Unsloth (again)

Remark: I am not sure this second installation is strictly necessary, but repeating it does no harm and ensures the Windows-specific extras are installed.
pip install "unsloth[windows] @ git+https://github.com/unslothai/unsloth.git"

Install Triton

Install Triton for Windows by using this command:
pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.2.0-windows.post10/triton-3.2.0-cp312-cp312-win_amd64.whl
Note that this wheel is built for Python 3.12 (cp312), which matches the Python version installed earlier.
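You can then confirm that Triton is importable (triton exposes the usual __version__ attribute):
python -c "import triton; print(triton.__version__)"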

Install MSVC

You have to install MSVC if it is not already on your system.

Then you have to add this path to your Windows "Path" environment variable (adjust the MSVC version number, 14.43.34808 here, and the Visual Studio edition to match your installation):
C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.43.34808\bin\Hostx64\x64
Open a terminal and type the following command to verify that the C++ compiler is correctly installed:
cl
You should see the Microsoft C/C++ compiler banner with its version number.

Install CUDA-Enabled PyTorch for Windows (CUDA 12.4)

Here is the command to install PyTorch built for CUDA 12.4:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
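To verify that PyTorch was built for CUDA and can see your GPU, run this short check; all of these are standard PyTorch attributes:
import torch

print(torch.__version__)          # should end with +cu124 for this build
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # True if the driver and CUDA setup work
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of your GPU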

Verify your installation

Create a file "test_triton.py" on your computer (for example C:\python\test_triton.py) and copy the following code:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    # Launch one kernel instance per block of BLOCK_SIZE elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

a = torch.rand(3, device="cuda")
b = a + a
b_compiled = add(a, a)
print(b_compiled - b)
print("If you see tensor([0., 0., 0.], device='cuda:0'), then it works")
Verify that your environment is correctly installed with this command:
python.exe .\test_triton.py
You should see tensor([0., 0., 0.], device='cuda:0') in the output.

Install llama-quantize for Windows

You need this because, on Windows, unsloth does not include the llama-quantize executable.
Download the executables for Windows from this website:
https://github.com/ggml-org/llama.cpp/releases 
Copy all the files into a "llama.cpp" folder, for example:
C:\python\SLM_Models\phi_3_5\llama.cpp
Remark: this is needed for this call in the script: model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
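To check that the executable runs on your system, you can launch it without arguments from Python; it should print its usage text (the exact output and exit code vary between llama.cpp releases, so this sketch deliberately does not use check=True):
import subprocess

# Running llama-quantize with no arguments should print its usage text.
subprocess.run([r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"])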

Convert the model from BF16 to q4_k_m

Convert the model from "C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf" to "C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf".
Add this code to your Python script:
import subprocess

exe_path = r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
quantization_type = "q4_k_m"

# Pass the arguments as a list; shell=True is unnecessary here, and
# check=True raises an error if the conversion fails.
subprocess.run([exe_path, input_model, output_model, quantization_type], check=True)
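As a quick sanity check, you can compare the two file sizes; since q4_k_m stores weights in roughly 4 bits instead of 16, the output should be around a quarter of the size of the BF16 file:
import os

input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
for path in (input_model, output_model):
    print(path, os.path.getsize(path) // (1024 * 1024), "MB")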

Final script

import os
import subprocess
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

exe_dir = r"C:\python\SLM_Models\phi_3_5\llama.cpp"
os.environ["PATH"] += os.pathsep + exe_dir

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False, 
    loftq_config = None,
)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,  # keep at 1 on Windows to avoid multiprocessing issues
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer.train()

model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# The following shell command, converted to Python:
# .\llama.cpp\llama-quantize.exe .\model\unsloth.BF16.gguf .\model\unsloth.q4_k_m.gguf q4_k_m
exe_path = r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
quantization_type = "q4_k_m"
subprocess.run([exe_path, input_model, output_model, quantization_type], check=True)
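Once training is finished, you can run a quick generation test on the fine-tuned model before quantizing it. This is a sketch based on the usual unsloth inference pattern (FastLanguageModel.for_inference plus the phi-3 chat template configured above); the prompt is only an example:
# Quick generation test, to run right after trainer.train().
FastLanguageModel.for_inference(model)  # switch unsloth to inference mode
messages = [{"from": "human", "value": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
print(tokenizer.batch_decode(outputs)[0])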

Optional Components

Install Jupyter Notebook

Install Jupyter Notebook so you can use the ready-made unsloth notebooks.
pip install notebook
Run Jupyter Notebook by using this command in your terminal:
jupyter notebook

unsloth Notebook

Download the unsloth notebook, then open it from Jupyter Notebook by running this command in your terminal:
jupyter notebook

Sources

unsloth
Link: https://github.com/unslothai/unsloth
Description: To improve the performance of LLM training on your machine.

triton-windows
Link: https://github.com/woct0rdho/triton-windows
Description: Triton wheels for Windows, used in the "Install Triton" step.

CUDA Toolkit
Description: NVIDIA's toolkit required to build and run CUDA code (see the "Install CUDA toolkit" step).