How to Download Large Language Models and Launch Them with LLaMA-Factory

Downloading Models Using ModelScope

Downloading with Python:

# !pip install modelscope

from modelscope.hub.snapshot_download import snapshot_download

# Download the weights into local_dir; cache_dir holds the download cache.
model_dir = snapshot_download(
    model_id='Qwen/Qwen3-8B',
    local_dir='/ubuntu-22.04/LLaMA-Factory/models/qwen3-8b',
    cache_dir='/ubuntu-22.04/LLaMA-Factory/models/qwen3-8b-cache')
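Before pointing a server at the downloaded directory, it can be worth checking that the weight shards actually landed there. A minimal sketch — the helper and the example file names are illustrative, not part of ModelScope's API:

```python
from pathlib import Path

def list_weight_shards(filenames):
    """Pick out safetensors weight shards from a directory listing."""
    return sorted(n for n in filenames if n.endswith(".safetensors"))

# Hypothetical listing of a downloaded model directory:
files = ["config.json", "tokenizer.json",
         "model-00001-of-00002.safetensors",
         "model-00002-of-00002.safetensors"]
shards = list_weight_shards(files)
print(shards)

# Against the real download, list the directory returned by snapshot_download:
# shards = list_weight_shards(p.name for p in Path(model_dir).iterdir())
```

If the list comes back empty, the download was interrupted and should be re-run.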

Loading Models Using LLaMA-Factory

Launching from the terminal:

CUDA_VISIBLE_DEVICES=2,3 \
API_HOST=0.0.0.0 \
API_PORT=8001 \
API_KEY=sk-test \
llamafactory-cli api \
  --model_name_or_path /ubuntu-22.04/LLaMA-Factory/models/qwen3-8b \
  --template qwen \
  --finetuning_type lora \
  --trust_remote_code \
  --max_new_tokens 32768
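Once the server is up, LLaMA-Factory exposes an OpenAI-compatible API, so a request can be assembled with the host, port, and key from the command above. The model name passed in the payload is an assumption here — check what `GET /v1/models` on the running server actually reports:

```python
def build_chat_request(api_key, model, prompt, host="0.0.0.0", port=8001):
    """Assemble an OpenAI-style chat-completions request for the API above."""
    url = f"http://{host}:{port}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # API_KEY from the launch command
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, payload

url, headers, payload = build_chat_request("sk-test", "qwen3-8b", "Hello!")

# To actually send it (requires the server to be running):
# import requests
# resp = requests.post(url, headers=headers, json=payload)
# print(resp.json()["choices"][0]["message"]["content"])
```

Note that the key is sent as a Bearer token, matching the `API_KEY=sk-test` environment variable set at launch.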

Loading Models Using vLLM

Launching from the terminal:

CUDA_VISIBLE_DEVICES=5 vllm serve /ubuntu-22.04/LLaMA-Factory/models/qwen3-8b \
  --port 8004 \
  --host 0.0.0.0 \
  --max-num-seqs 4 \
  --max-model-len 4096 \
  --served-model-name deepseek-ocr \
  --gpu-memory-utilization 0.2
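The `--gpu-memory-utilization 0.2` flag caps the fraction of GPU memory vLLM may claim for weights plus KV cache, which is what allows several servers to share one card. A back-of-the-envelope sketch of that budget — the 80 GiB card size is a hypothetical example, not taken from the command above:

```python
def vllm_memory_cap_gib(gpu_total_gib, utilization):
    """Ceiling vLLM enforces: weights + KV cache must fit under it."""
    return gpu_total_gib * utilization

def bf16_weight_gib(n_params_billion):
    """Rough bf16 weight footprint: 2 bytes per parameter."""
    return n_params_billion * 2

# On a hypothetical 80 GiB card, a 0.2 cap allows ~16 GiB, which is about
# the bf16 weight footprint of an 8B model -- leaving little headroom for
# the KV cache, so a cap this low usually suits smaller or quantized models.
cap = vllm_memory_cap_gib(80, 0.2)
weights = bf16_weight_gib(8)
print(cap, weights)
```

If vLLM fails to start with an out-of-memory error, raising this fraction (or lowering `--max-model-len`) is the usual first adjustment.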