Local LLM - The Steps

Here is the procedure to set up your Windows environment to train an LLM model.
Important: this only works with an NVIDIA graphics card, because several of the libraries used here are built for CUDA.
Install Python
Download Python 3.12.9 at this address: https://www.python.org/downloads/release/python-3129/
Install Python 3.12.9 using the wizard, and do not forget to enable long paths on Windows to avoid compilation problems.
Remark: do not install 3.13.x, because it is not compatible with some of the libraries used here.
Check your Python version with this command:
python --version
You should get the Python version installed on your machine.
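For example, with the version above installed, the command prints:

Python 3.12.9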
Install Unsloth
Install the "unsloth" library to speed up your LLM training.
pip install unsloth
NVIDIA CUDA Version
Important: you have to know your CUDA version in order to install the libraries that match it.
You can use one of these commands to check the version of CUDA:
nvidia-smi
or
nvcc --version
In my case, the output shows version 12.8.
For unsloth, I install the CUDA 12.4 build, because it is stable, together with torch 2.6, using the following command:
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"
C++
You have to install the C++ build components from Visual Studio; the "Desktop development with C++" workload typically covers everything needed.
Install CUDA toolkit
You also have to install the CUDA toolkit.
Download it from the NVIDIA CUDA downloads page: https://developer.nvidia.com/cuda-downloads
Install Unsloth (again)
Remark: I am not sure this second installation is required, but it does no harm to run it again to make sure unsloth is installed.
pip install "unsloth[windows] @ git+https://github.com/unslothai/unsloth.git"
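You can confirm which unsloth version ended up installed with pip's standard "show" command (a quick sanity check, not required):

pip show unsloth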
Install Triton
Install Triton for Windows by using this command:
pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.2.0-windows.post10/triton-3.2.0-cp312-cp312-win_amd64.whl
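A quick way to confirm that the wheel is importable is to print its version; it should print 3.2.0 for the wheel above:

python -c "import triton; print(triton.__version__)"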
Install MSVC
You have to install MSVC if it is not already on your system.
Then add this path to your Windows "Path" environment variable (adjust the edition and version number to match your installation):
C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.43.34808\bin\Hostx64\x64
Open a new terminal (so the updated Path is picked up) and run the following command to verify that the C++ toolchain is correctly installed:
cl
Install CUDA-Enabled PyTorch for Windows (CUDA 12.4)
Here is the command to install the CUDA 12.4 build of PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
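You can then check that this PyTorch build actually sees your GPU before going further. A minimal sketch (the exact version strings depend on your machine):

import torch

print(torch.__version__)              # should report a +cu124 build
print(torch.version.cuda)             # should print 12.4
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should print your NVIDIA GPU name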
Verify your installation
Create a file "test_triton.py" on your computer (e.g. C:\python\test_triton.py) and copy the following code into it:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

a = torch.rand(3, device="cuda")
b = a + a
b_compiled = add(a, a)
print(b_compiled - b)
print("If you see tensor([0., 0., 0.], device='cuda:0'), then it works")
Verify that your environment is correctly installed with this command:
python.exe .\test_triton.py
You should see tensor([0., 0., 0.], device='cuda:0') followed by the confirmation message.
Install llama-quantize for Windows
You need this because, on Windows, unsloth does not include it!
Download the executables for Windows from this website:
https://github.com/ggml-org/llama.cpp/releases
Copy all the files to the folder "llama.cpp":
C:\python\SLM_Models\phi_3_5\llama.cpp
Remark: it's needed for this command in the script → model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
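To check that the executable is usable, you can run it without arguments from a terminal; llama-quantize should then print its usage text (assuming the release you downloaded matches your Windows/CPU architecture):

C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe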
Convert the model from BF16 to q4_k_m
Convert the model from "C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf" to "C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf".
Add this code to your Python script:
import subprocess

# Paths to the llama.cpp quantizer, the BF16 input model, and the q4_k_m output model
exe_path = r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
quantization_type = "q4_k_m"

# Run llama-quantize directly; no shell is needed when calling the executable by its full path
subprocess.run([exe_path, input_model, output_model, quantization_type])
Final script
import os
import subprocess

import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Make llama-quantize.exe reachable from this script
exe_dir = r"C:\python\SLM_Models\phi_3_5\llama.cpp"
os.environ["PATH"] += os.pathsep + exe_dir

max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load the base model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Attach the LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1, # multiprocessing (causing the issue)
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer.train()

# Export the fine-tuned model to GGUF (needs llama-quantize.exe, see above)
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# The following command is converted to Python:
# .\llama.cpp\llama-quantize.exe .\model\unsloth.BF16.gguf .\model\unsloth.q4_k_m.gguf q4_k_m
exe_path = r"C:\python\SLM_Models\phi_3_5\llama.cpp\llama-quantize.exe"
input_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.BF16.gguf"
output_model = r"C:\python\SLM_Models\phi_3_5\model\unsloth.q4_k_m.gguf"
quantization_type = "q4_k_m"
subprocess.run([exe_path, input_model, output_model, quantization_type])
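If you want a quick smoke test of the fine-tuned model, unsloth's own notebooks use a pattern like the sketch below; run it after trainer.train() in the script above. The question is just an example, and the "from"/"value" keys match the chat-template mapping defined in the script:

# Switch unsloth to its faster inference mode
FastLanguageModel.for_inference(model)

messages = [{"from": "human", "value": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # append the assistant turn marker
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
print(tokenizer.batch_decode(outputs))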
Optional Components
Install Jupyter Notebook
Install Jupyter Notebook to use the ready-made unsloth notebook.
pip install notebook
Run Jupyter Notebook by using this command in your terminal:
jupyter notebook
unsloth Notebook
I use this unsloth Jupyter Notebook: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing
Download it, then open it in your Jupyter Notebook system by running the following command in your terminal:
jupyter notebook
Sources
What | Link | Description
unsloth | https://github.com/unslothai/unsloth | To improve the performance of LLM learning on your machine.
triton-windows | https://github.com/woct0rdho/triton-windows | Triton builds for Windows.
CUDA Toolkit | https://developer.nvidia.com/cuda-downloads | NVIDIA's CUDA development toolkit.