This article is transcoded by SimpRead, original article at zhuanlan.zhihu.com
Recommended by: Song Zhixue, Source: SwanLab
This article introduces a method of model stitching between the SmolVLM2 visual module and Qwen3-0.6B, and achieves a “Qwen3-SmVL” that features “ultra-small scale + multimodal + Chinese support” through fine-tuning. The fine-tuning process was completed entirely using Muxi GPU, with a complete Github repository and SwanLab records provided.
Abstract
Recently, the Huggingface team released the ultra-small multimodal model SmolVLM2, capable of running inference within 1GB of VRAM on edge devices. After excitedly testing it, the author found that although the model has very strong visual-text understanding capabilities, it cannot understand Chinese, which is not very friendly to the Chinese tech community.
Coincidentally, while adapting hardware for SwanLab recently, a Muxi Xiyun C500 server happened to be available, which inspired the idea of directly stitching together and fine-tuning the current leading Chinese small model Qwen3 with SmolVLM2.
This tutorial introduces one approach to model stitching: aligning and fine-tuning the visual module (0.09B) of SmolVLM2 with the smallest Qwen3 model (0.6B), ultimately giving the Qwen model a degree of visual understanding capability.
Note on computing resources: This tutorial involves VLM fine-tuning training, which requires high computing power. A GPU with 40G or more VRAM is needed to run the training code in this tutorial.
Contents
- Background knowledge of SmolVLM2
- Introduction to model stitching and fine-tuning ideas
- Implementation of model stitching and key code explanation
- Construction of fine-tuning dataset
- Fine-tuning method and code implementation
- Fine-tuning training & results display
- Summary of code and dataset links
- Background knowledge of SmolVLM2
First, let us review the architecture of the SmolVLM2 model. SmolVLM2 consists of three main parts: the visual model layer, the feature mapping layer, and the large language model layer, as shown below:

This design is a common VLM scheme nowadays. The core design idea is to concatenate the output features of the visual model with the embedded text features directly and feed them into the language model (LLM), without cross-attention or similar modules.
Compared to earlier architectures like LLaVA, the biggest advantage is maximizing reuse of existing language models. For example, the sizes of Qwen2.5-VL’s 3B, 7B, and 72B models refer only to the LLM part and do not include the vision module. In fact, the 3B model has about 4B parameters total, with the visual module about 0.4B, and all three sizes of VLM use the same visual model.
For larger VLMs, most training of the visual model focuses on the feature mapping and visual modules, only adjusting the language model during final overall fine-tuning for the best performance, preserving the language capability of the VLM.
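The "concatenate and feed" idea described above can be sketched in a few lines of PyTorch. This is a toy illustration with made-up dimensions and token ids, not the actual SmolVLM2 code: the image-placeholder embeddings in the text sequence are simply overwritten with projected vision features before the combined sequence enters the LLM decoder.

```python
import torch
import torch.nn as nn

# Illustrative sizes only, not the real SmolVLM2/Qwen3 dimensions
text_hidden, vision_hidden = 1024, 768
IMAGE_TOKEN_ID = 5  # made-up placeholder id

embed = nn.Embedding(10, text_hidden)              # stand-in for LLM word embeddings
connector = nn.Linear(vision_hidden, text_hidden)  # stand-in for the MLP connector

# A text sequence with two image placeholder tokens
input_ids = torch.tensor([[1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2, 3]])
vision_feats = torch.randn(1, 2, vision_hidden)    # two visual tokens from the ViT

inputs_embeds = embed(input_ids)                           # (1, 5, 1024)
image_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1)   # where to inject features
projected = connector(vision_feats).to(inputs_embeds.dtype)
# Replace placeholder embeddings with projected vision features;
# the resulting sequence goes straight into the language model.
inputs_embeds = inputs_embeds.masked_scatter(image_mask, projected)
print(inputs_embeds.shape)  # torch.Size([1, 5, 1024])
```

No cross-attention module is needed: the LLM treats the injected vision features as if they were ordinary token embeddings.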
Below is a brief explanation of each module:
- Visual model layer: The SmolVLM2-256M version uses Google's SigLip model, a ViT-based visual model; the smallest version, SigLip-93M, was chosen. The Huggingface paper does not explicitly say whether the SigLip parameters were used directly or trained from scratch (readers who know are welcome to comment). In the SmolVLM2 code, this corresponds to the `SmolVLMVisionTransformer` class.
- Feature mapping layer: A simple MLP. To reduce image resolution, SmolVLM2 applies a pixel shuffle to downsample further, cutting visual token usage and thus text length. The HF team mentioned this improves performance for small-parameter VLMs. The trainable parameters, however, amount to a single-layer neural network. Its core role is feature alignment: mapping visual features from 768 dimensions (SigLip) to 576 dimensions (SmolLM2).
- Large language model: The SmolVLM2-256M model uses the SmolLM2-135M text model. Due to its small size, the HF team employed two-stage training: large-scale image-text training followed by specialized fine-tuning for video tasks. To preserve text ability, about 14% of the training data was pure-text fine-tuning data. Because the visual module (93M) is close in size to the text model (135M), the author speculates that data balancing matters more here than freezing the text model.
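The pixel shuffle mentioned in the feature-mapping bullet is essentially a reshape trick that trades spatial resolution for channel width. Below is a sketch in the style of the Idefics3/SmolVLM space-to-depth rearrangement (dimensions are illustrative): with a scale factor of 4, the visual token count drops by 16× while each token's feature vector grows 16× wider.

```python
import torch

def pixel_shuffle(x, scale=4):
    # x: (batch, seq, dim) where seq is a square number of ViT patches
    b, seq, dim = x.shape
    h = w = int(seq ** 0.5)
    x = x.view(b, h, w, dim)
    x = x.view(b, h, w // scale, dim * scale)        # fold width into channels
    x = x.permute(0, 2, 1, 3)
    x = x.reshape(b, w // scale, h // scale, dim * scale * scale)  # fold height too
    x = x.permute(0, 2, 1, 3)
    return x.reshape(b, seq // scale**2, dim * scale**2)

feats = torch.randn(2, 1024, 768)      # a 32x32 patch grid from the ViT
out = pixel_shuffle(feats, scale=4)
print(out.shape)                        # (2, 64, 12288): 16x fewer visual tokens
```

The widened features are then projected down to the LLM hidden size by the connector MLP.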
The HF team also mentioned many tricks for improving performance of small image VLMs in their paper. Interested readers can refer to the SmolVLM2 paper for more details.
- Introduction to Model Stitching and Fine-tuning Ideas
As the saying goes, top-grade ingredients (models) only need the simplest cooking. The idea of model stitching is very straightforward, basically three steps:
- Adjust SmolVLM2's "context control format" to make it compatible with Qwen3.
- Replace the text part of the model, swapping SmolLM2 for Qwen3-0.6B, including the text tokenizer, word embeddings, text model, and the model's final language model head (LM Head).
- Reinitialize the feature mapping layer's MLP, changing it from a 768→576 single-layer neural network to a 768→1024 one.
The overall architecture and the preprocessing/postprocessing of image-text pairs remain the same as SmolVLM2. The specific changes are shown in the figure below:

Next, the author will describe in detail the specific changes made to achieve “stitching,” for readers who might have similar tasks in the future.
- Model Stitching Implementation and Key Code Explanation
First modification: SmolVLM2 Tokenizers part
The first part to change is SmolVLM2’s Tokenizers. There are two main issues involved here:
- The first is adding SmolVLM2's special token for indicating image position (Special Token) into Qwen3's tokenizer. The purpose is to prevent SmolVLM2's image token `<image>` from being split into `<`, `image`, `>`. Fortunately, Qwen3's tokenizer reserves a special token `<|image_pad|>` for future multimodal use, so the author directly replaced `<image>` with `<|image_pad|>`, which reserves a placeholder in the text for inserting image features.
- The second is that SmolVLM2's chat_template differs greatly from Qwen3's. The chat_template formats text so that the model clearly understands the background information carried by different tokens. In modern terms, this is "context engineering."
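Why an unreserved `<image>` gets shredded can be illustrated with a toy greedy tokenizer (pure Python, not a real BPE; the fallback splitting rule here is made up for illustration): any string not registered as a special token falls through to normal subword/character splitting.

```python
import re

def toy_tokenize(text, special_tokens):
    # Split out registered special tokens first; everything else falls back
    # to a naive word/character split (a stand-in for real BPE merges).
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if chunk in special_tokens:
            pieces.append(chunk)                          # kept atomic
        else:
            pieces.extend(re.findall(r"\w+|\S", chunk))   # shredded
    return [p for p in pieces if p]

print(toy_tokenize("<image>hello", ["<|image_pad|>"]))
# ['<', 'image', '>', 'hello']  -- unregistered <image> is broken apart
print(toy_tokenize("<|image_pad|>hello", ["<|image_pad|>"]))
# ['<|image_pad|>', 'hello']    -- the reserved token survives intact
```

Real tokenizers behave analogously: only tokens registered as special are guaranteed to map to a single id.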
Here the author lists the chat context formats of Qwen3, SmolVLM2, and Qwen2.5-VL in chat scenarios for reader reference.
Qwen3 Chat Context Format
For example, given an image and the question: “What is your name?” with the model answering “My name is Qwen,” the model context would be:
```
<|im_start|>user
What is your name?<|im_end|>
<|im_start|>assistant
<think>

</think>

My name is Qwen<|im_end|>
```
Note that the Qwen3 context does not reserve image positions, but it does add the special tokens <think></think> for inserting the model's thought process, along with additional function-call control text. Such function-call contexts are used to direct the model to call external functions, APIs, or MCP interfaces and to receive their return values.
Considering length limitations, this article will not include the full context information involving function calls, reasoning, thinking, etc. (the author printed it out and found it was really too long). Interested readers can refer to the official documentation of Qwen3 for detailed design.
- Qwen3 function call example case: https://qwen.readthedocs.io/zh-cn/latest/framework/function_call.html#the-example-case
It is these complex pieces of context information that enable diversified abilities such as reasoning and function calling. Multimodal understanding likewise requires designing the context first.
SmolVLM2 chat context format:
Taking a picture with the question "How many dog in there." and the model answering "There are Three dogs." as an example, the SmolVLM2 context is as follows:
```
<|im_start|>User:<fake_token_around_image><row_1_col_1><image>...<image><fake_token_around_image><row_1_col_2><image>...<image><fake_token_around_image><row_1_col_3><image>...<image>...<fake_token_around_image><row_4_col_4><image>...<image><fake_token_around_image><global-img><image>...<image><fake_token_around_image>How many dog in there.<end_of_utterance>
Assistant: There are Three dogs.<end_of_utterance>
Assistant:
```
It looks very messy because there are many <image> placeholders; for readability, the author deleted most of the placeholders between <image>...<image>. Note that the model's line breaks and spaces are part of the context and must be reproduced exactly during inference.
However, we can still find familiar content such as User:, Assistant: and other keywords used to indicate the user’s input and the model’s output. These keywords are similar to Qwen.
Readers will notice that besides <fake_token_around_image>, <image>, and other image indicators, there are position indicators like <row_1_col_1>. This is because SmolVLM2 uses an image splitting technique to prevent downsampling from affecting image resolution. Simply put, it inputs both global images and high-definition local images into the model (see the image splitting module in the figure below). Interested readers can find a detailed technical report from HF at the end of the article.

The stitched model Qwen3-SmVL in this blog post
Compared to Qwen3, SmolVLM2 lacks much of this context control.
In order to preserve or reserve Qwen3’s capability for reasoning and function calls as much as possible, the author finally chose to insert SmolVLM2’s image feature arrangement into Qwen3’s contextual format. The final context format is as follows:
<|im_start|>user<vision_start><row_1_col_1><|image_pad|> (image features inserted here) <|image_pad|><vision_end> (user question here) <|im_end|><|im_start|>assistant<think></think> (model answer here) <|im_end|><|endoftext|>
It can be seen that the author tries to keep the style consistent with Qwen3 and reuses special tokens. This avoids significant performance loss caused by context differences in subsequent spliced Qwen3-0.6B models. In fact, when designing fine-tuning context, one should try to stay close to the tasks the model was originally trained on to reduce performance degradation caused by fine-tuning.
In transformers, the code controlling a model's context format is not Python but Jinja, a text-templating language from the web frontend world. Its variable-scoping design is almost magical; combined with Qwen3's rich and complex context strategies, modifying the chat_template took the author two hours. The modification itself is not elaborated here; interested readers can find the formatted chat_template.jinja file linked at the end of the article. The author plans to write a dedicated blog on model context control and Jinja in the future.
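For readers unfamiliar with Jinja, here is a heavily stripped-down chat template in the spirit of the format above, rendered with the `jinja2` package directly. Real chat_template files in transformers are far longer and also handle tools, thinking blocks, and multi-turn logic; this sketch only shows the mechanics.

```python
from jinja2 import Template

# A minimal, illustrative template -- NOT the real Qwen3 or Qwen3-SmVL template
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message.role }}\n"
    "{{ message.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

rendered = Template(CHAT_TEMPLATE).render(
    messages=[{"role": "user", "content": "What animal is in the picture?"}],
    add_generation_prompt=True,
)
print(rendered)
```

`tokenizer.apply_chat_template` in transformers evaluates the same kind of template, just with many more variables in scope.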
Second Change: Replace SmolVLM2’s SmolLM2 Model with Qwen3-0.6B
Replacing the model is not complicated, mainly dealing with the nested logic of Transformers. Transformers usually recommend separating the pretrained model backbone and downstream task heads. The change logic is shown below:

Taking Qwen3 as an example, the pretrained backbone model is Qwen3Model, which contains only the embedding layer and various decoder layers, outputting hidden states for all input tokens. The downstream models provided by Qwen3 include: Qwen3ForCausalLM for causal language sequence generation, which is commonly used for language generation.
Qwen3ForSequenceClassification handles sentence classification by feeding the last generated token into a single-layer MLP for sequence-level classification, suitable for sentiment analysis and similar tasks; Qwen3ForTokenClassification is for token-level classification, such as named entity recognition tasks.
Qwen3ForQuestionAnswering is specialized for extractive QA tasks, where the input (question, reference text) guides the model to find the most relevant passage. This task has become less popular due to RAG systems, and the author plans to publish a tutorial series on fine-tuning for tasks other than causal language generation.
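The backbone-plus-head nesting convention described above can be mocked up in a few lines (the class names and sizes below are invented stand-ins, not the real Qwen3 classes). Because the backbone is just a plain attribute of the task model, swapping it is a single assignment, which is exactly what makes the SmolVLM2-to-Qwen3 replacement possible.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):          # plays the role of Qwen3Model
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.layer = nn.Linear(hidden, hidden)   # stand-in for the decoder stack
    def forward(self, input_ids):
        return self.layer(self.embed(input_ids)) # hidden states for every token

class TinyForCausalLM(nn.Module):       # plays the role of Qwen3ForCausalLM
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.model = TinyBackbone(vocab, hidden)
        self.lm_head = nn.Linear(hidden, vocab)  # next-token logits head
    def forward(self, input_ids):
        return self.lm_head(self.model(input_ids))

lm = TinyForCausalLM()
logits = lm(torch.tensor([[1, 2, 3]]))
print(logits.shape)  # (1, 3, 100)

# The swap pattern used in this article, in miniature:
# vlm.model.text_model = lm.model
# vlm.lm_head = lm.lm_head
```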
Key code is as follows
```python
import torch
from transformers import (
    AutoProcessor,
    AutoModelForImageTextToText,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# Replace text model and heads
smolvlm2_02B_model = AutoModelForImageTextToText.from_pretrained(
    "model/SmolVLM2-256M-Video-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",
).to(device)
qwen3_06b_model = AutoModelForCausalLM.from_pretrained(
    "model/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
).to(device)
smolvlm2_02B_model.model.text_model = qwen3_06b_model.model
smolvlm2_02B_model.lm_head = qwen3_06b_model.lm_head
...
```
Next comes the more fiddly part: replacing all the key variables, such as image_token_id (the placeholder for image features in text sequences), eos_token_id (the token indicating generation stop), and the vocab_size used for loss calculation. Qwen's vocabulary size is 151,936, much larger than SmolVLM2's 49,280. The specific code is:
```python
...
# Replace vocabulary size
smolvlm2_02B_model.vocab_size = qwen3_06b_model.vocab_size
smolvlm2_02B_model.model.vocab_size = qwen3_06b_model.vocab_size
smolvlm2_02B_model.config.vocab_size = qwen3_06b_model.vocab_size
smolvlm2_02B_model.config.text_config.vocab_size = qwen3_06b_model.vocab_size
smolvlm2_02B_model.model.config.vocab_size = qwen3_06b_model.vocab_size
smolvlm2_02B_model.model.config.text_config.vocab_size = qwen3_06b_model.vocab_size

# Replace the image token id (<|image_pad|> in Qwen3's vocabulary)
smolvlm2_02B_model.image_token_id = 151655
smolvlm2_02B_model.model.image_token_id = 151655
smolvlm2_02B_model.config.image_token_id = 151655
smolvlm2_02B_model.model.config.image_token_id = 151655

# Replace the generation stop token (<|im_end|>)
smolvlm2_02B_model.generation_config.eos_token_id = 151645
...
```
As the code above shows, when replacing variables you must also replace them on the nested inner model. The author initially replaced them only on SmolVLMForConditionalGeneration and forgot image_token_id inside SmolVLMModel, so the language model never received image features. The loss dropped extremely fast to a very low value and grad_norm looked normal, but inference results were very poor. Below is the training loss curve with this bug:

The author initially did not spot the error and ran full fine-tuning (blue curve); the loss quickly dropped below 0.1, yet inference showed the model had no image understanding at all. An experiment freezing the language model and fine-tuning only the visual model (red curve) then showed no loss decrease at all, which located the bug: image features were not being passed in. After the fix, the correct loss decline is the yellow curve.
Third Change: Build and Replace Feature Mapping Layer
This is relatively simple; you only need to rebuild a dimension-alignment SmolVLMConnector. Qwen3’s hidden_dim is 1024, SigLip’s hidden_dim is 768, so build a SmolVLMConnector mapping from 768 to 1024. Code:
```python
from dataclasses import dataclass, field

import torch
from transformers.models.smolvlm.modeling_smolvlm import SmolVLMConnector

# Build config and create connector
@dataclass
class VisionConfig:
    hidden_size: int = 768

@dataclass
class TextConfig:
    hidden_size: int = 1024

@dataclass
class ConnectConfig:
    scale_factor: int = 4
    vision_config: VisionConfig = field(default_factory=VisionConfig)
    text_config: TextConfig = field(default_factory=TextConfig)

new_connector_config = ConnectConfig()

# Replace the SigLip-to-LLM connector layer
new_connector = SmolVLMConnector(new_connector_config).to(device).to(torch.bfloat16)
smolvlm2_02B_model.model.connector = new_connector
```
- Fine-tuning Dataset Construction
The author originally planned to find a Chinese multimodal dataset but found the related resources sparse, and therefore decided to use English multimodal datasets for now, considering data synthesis later to translate part of the data into Chinese. Data synthesis and data-ratio issues will be discussed in future blogs.

For convenience, this project uses HuggingFace's integrated multimodal dataset the Cauldron. A "cauldron" is a large pot for boiling, so perhaps the HF team intended a pun on "alchemy" (炼丹). The dataset integrates the training sets of 50 visual fine-tuning task datasets, originally used to fine-tune HF's multimodal Idefics2 model. All of them have been uniformly formatted (see below), with a total of 1,880,992 entries and about 169GB when fully downloaded, which is very convenient.

However, the dataset text is all in English, and most replies in the subdatasets are very short, often a single word, which complicates later training. This blog will not discuss data construction and ratios; the focus is on adding visual capability to Qwen3.
Download links:
- HuggingFace Hub: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron
- ModelScope: https://modelscope.cn/datasets/AI-ModelScope/the_cauldron
During testing, some subdatasets such as "mimic_cgd," "localized_narratives," "okvqa," "ocrvqa," and "clevr_math" had loading issues. Users training on the full dataset are advised to handle these manually; there are also community reports on downloading them separately from the original sources. The author plans to fix and re-upload the complete the Cauldron dataset later.
- Fine-tuning Method and Code Implementation
Frozen Model Parameter Fine-tuning
The overall method is standard teacher forcing for causal language modeling (CLM) with cross-entropy loss. Since the goal of this tutorial is first enabling Chinese multimodal ability (performance-optimization blogs will follow), for efficiency only the feature projector and language-model head are trained during the alignment fine-tuning stage, while the visual and text models are frozen.
Core code for freezing parameters:
```python
def freeze_model(qwen_smvl):
    for _, param in qwen_smvl.model.text_model.named_parameters():
        param.requires_grad = False
    for _, param in qwen_smvl.model.vision_model.named_parameters():
        param.requires_grad = False
    return qwen_smvl
```
After freezing, training parameters, total parameters, and their ratio:
trainable params: 12.00M || all params: 662.87M || trainable %: 1.81
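The trainable/total report above can be reproduced for any model with a small helper (a generic sketch, not the project's code; the demo model below is a toy):

```python
import torch.nn as nn

def param_report(model):
    # Count trainable vs. total parameters and the trainable percentage
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total, 100 * trainable / total

# Demo: freeze the first of two identical layers
m = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10))
for p in m[0].parameters():
    p.requires_grad = False
print(param_report(m))  # (110, 220, 50.0)
```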
Text Length, Loss Masking, and Truncation Strategy
Text Length
Because visual features occupy a large number of tokens, the author measured that the_cauldron images take about 0.8K to 1.3K tokens, while most text in the dataset is 200-500 tokens and rarely 3-4K. A uniform 2K text length was therefore chosen, truncating the excess.
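A back-of-the-envelope check of that measured 0.8K-1.3K range, assuming (as in SmolVLM2-256M) 64 tokens per tile after pixel shuffle and a local tile grid plus one global image; the exact grid depends on input resolution, and separator/position tokens add a little on top:

```python
# Rough visual token budget under the stated assumptions
tokens_per_tile = 64
for grid in [(3, 3), (4, 4)]:
    n_tiles = grid[0] * grid[1] + 1          # local tiles + one global image
    print(grid, n_tiles * tokens_per_tile)   # (3, 3) -> 640, (4, 4) -> 1088
```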
Unlike text-only fine-tuning, the truncation length cannot be smaller than the number of image tokens; otherwise feature concatenation errors out (and truncating image tokens would make that training sample meaningless anyway). Readers with less than 64GB VRAM who need to reduce text length (not recommended below 1.5K) should reduce image resolution as well. Future blogs will focus on reducing image token usage.
Because of the text length limit and the inability to truncate image tokens, the "packing" method for improving training efficiency was not used.
Since some examples contain multiple images but this training uses a 2K text length (versus 8K for HF's SmolVLM-256M and 16K for the 2.2B version), only the first image of each example is used.
Loss Masking
Teacher Forcing in text fine-tuning involves two strategies:
- Fine-tune on entire text containing both “user question” and “model response”.
- Fine-tune only on the “model response” portion.
The comparison is shown below. Usually, fine-tuning only on model response enables better generalization (a trick also mentioned in HF’s SmolVLM2 paper). However, the author chose full text fine-tuning to improve training efficiency and will include ablation studies in future blogs.
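The two label strategies differ only in which positions carry the ignore index (-100) that PyTorch's cross-entropy skips. A minimal sketch with made-up token ids:

```python
import torch

IGNORE = -100
# [user prompt tokens | assistant reply tokens] -- ids are illustrative
input_ids = torch.tensor([10, 11, 12, 20, 21, 22])
prompt_len = 3

# Strategy 1: full-text fine-tuning -- loss on every token
labels_full = input_ids.clone()

# Strategy 2: response-only fine-tuning -- mask out the user portion
labels_resp = input_ids.clone()
labels_resp[:prompt_len] = IGNORE
print(labels_resp.tolist())  # [-100, -100, -100, 20, 21, 22]
```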
Note that during full-text fine-tuning, image tokens should be masked to prevent loss computation on image placeholder tokens, which hurts model performance.
Key code:
```python
def data_collate_fix2k(examples, processor, device, max_length=2048):
    batch_text = []
    batch_image = []
    for example in examples:
        images = example["images"][:1]  # keep only one image to reduce VRAM pressure
        batch_image.append(images)
        image_num = len(images)
        chat_texts = example["texts"][0]
        messages = [
            {
                "role": "user",
                "content": [{"type": "image"}] * image_num
                + [{"type": "text", "text": chat_texts["user"]}],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": chat_texts["assistant"]}],
            },
        ]
        text = processor.apply_chat_template(
            messages, enable_thinking=False, add_generation_prompt=False
        )
        batch_text.append(text)
    batch = processor(
        text=batch_text,
        images=batch_image,
        max_length=max_length,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
    )
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == processor.image_token_id] = -100
    batch["labels"] = labels
    return batch.to(device, dtype=torch.bfloat16)
```
Fine-tuning Hyperparameters
Learning Rate
Because only the feature mapping layer (connector) is trained, and it is randomly initialized to align with Qwen3's dimensions (in theory a special initialization strategy could improve performance, but this was not considered given the model size), the learning rate is set to 1e-4, a common choice in LoRA fine-tuning.
To ensure effective convergence, learning-rate decay is an essential trick; the community-popular cosine decay down to zero is used, with warm-up set to 10% of total steps (fixed at 50 steps when total steps exceed 1,000).
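The warm-up-then-cosine shape can be sketched with a plain LambdaLR (transformers' `get_cosine_schedule_with_warmup` implements the same curve); the step counts here are illustrative:

```python
import math
import torch

def lr_lambda(step, warmup_steps=20, total_steps=200):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                 # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to 0

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
lrs = []
for _ in range(200):
    opt.step()
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
print(round(max(lrs), 6))  # peaks at the base lr, then decays toward 0
```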
Batch Size
Batch size is generally better when larger; however, because the VLM's text length is very long, we use a per-GPU batch size of 1 with 4 gradient-accumulation steps, equivalent to a batch size of 32 when training on 8 GPUs.
Training Parameter Settings Code
```python
training_args = TrainingArguments(
    seed=42,
    data_seed=42,
    max_steps=200,
    # num_train_epochs=1,  # training 1 epoch is about 1k steps
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    dataloader_pin_memory=False,
    warmup_ratio=0.1,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    logging_steps=5,
    eval_strategy="steps",
    eval_steps=0.125,
    save_strategy="steps",
    save_steps=0.125,
    save_total_limit=8,
    optim="adamw_torch",
    bf16=True,
    output_dir="./model/freeze_except_connector_cocovqa",
    overwrite_output_dir=False,
    report_to="swanlab",
    run_,
    remove_unused_columns=False,
    gradient_checkpointing=False,
)
```
Training Environment
The fine-tuning code was run on Muxi Xiyun C500 general-purpose GPUs with 64GB VRAM each.
Readers trying this project can also run the tutorial on Nvidia GPUs with more than 40GB of VRAM.
Regarding the training environment, besides installing the GPU driver and PyTorch corresponding to the hardware, this tutorial requires additional installation of the full Huggingface suite, as follows:
```
torch          # recommended version >= 2.6.0
torchvision
transformers >= 4.53.0
accelerate
datasets
num2words      # required by SmolVLM2
```
An extra note: if training on Muxi GPUs, follow the official Muxi documentation to find and download the Muxi build of torch; the rest of the HF environment is essentially the same as on Nvidia. Here is a handy command to check GPUs on Muxi:
mx-smi
The output is as follows:
```
=================== MetaX System Management Interface Log ===================
Timestamp                    : Sat Jul 12 14:58:51 2025
Attached GPUs                : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.12                        Kernel Mode Driver Version: 2.12.13        |
| MACA Version: 2.29.0.19              BIOS Version: 1.22.3.0                     |
|------------------------------------+---------------------+----------------------+
| GPU    NAME                        | Bus-id              | GPU-Util             |
| Temp   Pwr:Usage/Cap               | Memory-Usage        |                      |
|====================================+=====================+======================|
| 0      MetaX C500                  | 0000:0e:00.0        | 0%                   |
| 36C    69W / 350W                  | 5680/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 1      MetaX C500                  | 0000:0f:00.0        | 0%                   |
| 38C    70W / 350W                  | 4986/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 2      MetaX C500                  | 0000:10:00.0        | 0%                   |
| 37C    69W / 350W                  | 4986/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 3      MetaX C500                  | 0000:12:00.0        | 1%                   |
| 37C    71W / 350W                  | 4986/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 4      MetaX C500                  | 0000:35:00.0        | 0%                   |
| 37C    70W / 350W                  | 4986/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 5      MetaX C500                  | 0000:36:00.0        | 1%                   |
| 36C    68W / 350W                  | 4986/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 6      MetaX C500                  | 0000:37:00.0        | 0%                   |
| 39C    73W / 350W                  | 4986/65536 MiB      |                      |
+------------------------------------+---------------------+----------------------+
| 7      MetaX C500                  | 0000:38:00.0        | 0%                   |
| 38C    71W / 350W                  | 4986/65536 MiB      |                      |
+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU    PID        Process Name            GPU Memory Usage(MiB)                |
|=================================================================================|
|  0      3496691    python3.10              4066                                 |
|  0      3496692    python3.10              102                                  |
|  0      3496693    python3.10              102                                  |
|  0      3496694    python3.10              102                                  |
|  0      3496695    python3.10              102                                  |
|  0      3496696    python3.10              102                                  |
|  0      3496697    python3.10              102                                  |
|  0      3496698    python3.10              170                                  |
|  1      3496692    python3.10              4154                                 |
|  2      3496693    python3.10              4154                                 |
|  3      3496694    python3.10              4154                                 |
|  4      3496695    python3.10              4154                                 |
|  5      3496696    python3.10              4154                                 |
|  6      3496697    python3.10              4154                                 |
|  7      3496698    python3.10              4154                                 |
+---------------------------------------------------------------------------------+
```
Training Code Implementation
When building the training code, I used the Trainer class from the HuggingFace Transformers framework; its built-in training loop covers most fine-tuning tasks. The one thing worth noting is that I used Qwen3-0.6B instead of Qwen3-0.6B-Base, the variant usually chosen for this kind of task. Unlike the Base model, Qwen3-0.6B has undergone instruction tuning and alignment, enabling chat Q&A functionality.
Usually, continuing to train an already-aligned model degrades its performance to some degree; however, since the LLM parameters are frozen in this fine-tuning, an aligned model must be chosen for the result to support multimodal Q&A.
I used bfloat16 precision during training. Compared with float16, bfloat16 trades mantissa bits for additional exponent bits, matching float32's dynamic range and making training more stable.
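The trade-off is easy to see in two lines (assuming PyTorch): fp16 overflows just above 65504, while bf16, sharing float32's exponent range, represents the same value with coarser rounding:

```python
import torch

x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf  (fp16 max is ~65504)
print(x.to(torch.bfloat16))  # finite, though rounded
```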
During the early validation phase, I used the cocoqa dataset for 200 steps of fine-tuning. After confirming feasibility, I planned a fuller run. Considering that only about 12M parameters are trainable, I sampled the dataset to give roughly a 1:10 ratio of trainable parameters to training tokens: 60K samples in total (at a nominal 2K tokens each this is 120M tokens, though padding means the actual token count is lower). I judged this budget sufficient for convergence, which subsequent experiments confirmed by achieving the expected performance.
Key Training Code Implementation
The code is quite long because checkpoint resume capability is added.
```python
import os

from transformers import Trainer
from transformers.trainer_utils import get_last_checkpoint

################# Start Training ################
last_checkpoint = None
# Load the last checkpoint if available
if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir:
    last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists"
        )
    if last_checkpoint is not None:
        print(f"Checkpoint detected, resuming training at {last_checkpoint}.")

# Init Trainer
trainer = Trainer(
    model=qwen_smvl,
    args=training_args,
    train_dataset=raw_data["train"],
    eval_dataset=raw_data["test"],
    data_collator=collate_fn,
)
trainer.train(resume_from_checkpoint=last_checkpoint)
qwen_smvl.save_pretrained(training_args.output_dir)
```
- Fine-tuning Training & Results Presentation
Code Preparation and Environment Installation
pip install -r requirements.txt
Dataset and Model Download
I provide an automatic download script; note that it uses the ModelScope community to download the model and dataset.
bash download_resource.sh
Small Batch Fine-tuning
For quick verification, I first trained 200 steps on the cocoqa dataset, with all parameters as described above. Run the experiment with the following command. Eight GPUs are recommended; training on 8 Muxi GPUs takes about 20 minutes.
```shell
# Single-GPU training
CUDA_VISIBLE_DEVICES=0 python train.py ./cocoqa_train.yaml
# 8-GPU training
accelerate launch --num_processes 8 train.py ./cocoqa_train.yaml
```
Note: This project uses SwanLab for training log recording and analysis. If you are not logged in to SwanLab, run `swanlab login` first. Seeing the following output after running means the experiment has started successfully:

Below are training loss and test loss plots from small batch fine-tuning:

After finishing training, the model automatically uses a dog image with the question “What animals are in the picture?” for inference. The inference result is as follows:
At first, when the model answered "rabbit" for an image of three dogs, I thought the fine-tuning had failed. But a truly failed fine-tuning would not output an animal type at all; it would produce garbled characters or claim no image was detected. The recognition error was simply due to too few training steps: increasing the steps and data volume later let the model correctly recognize the dogs and even accurately state that there are three.

PS: The author has published the training results on SwanLab. Interested readers can check for themselves. SwanLab also supports cloning the author’s training logs; you can clone my project during your training for comparison.
Full Fine-tuning Results Presentation
Run the experiment with the following commands. Eight GPUs are recommended; training on 8 Muxi Xiyun C500 GPUs takes about 1.5 hours.
```shell
# Single-GPU training
CUDA_VISIBLE_DEVICES=0 python train.py ./full_train.yaml
# 8-GPU training
accelerate launch --num_processes 8 train.py ./full_train.yaml
```
The figure below compares the loss of full fine-tuning against small-batch training. The loss becomes more oscillatory with full-dataset fine-tuning, because the richer data types pose a harder learning problem for the model.

Further comparing training and test losses of the full run and the small-batch run: the full run reached a training loss of 0.61, much lower than the cocoqa-only model, and its evaluation loss was also far lower, settling around 0.58.

It is worth mentioning that since the test set is relatively small (only 64 samples), the gap between training loss and test loss cannot be directly understood as evidence of overfitting. In large model training, given sufficiently large datasets, training loss is usually considered equivalent to evaluation loss.
Additionally, analyzing the training loss and average gradient norm (Grad Norm) after 1k steps shows the training task is more than halfway through, and the learning rate is beginning to decay rapidly. As shown below, despite the rapid learning rate decrease, the model loss does not show obvious further decline, indicating the model is sufficiently trained.

Regarding training efficiency, the GPUs are not fully utilized. The multimodal architecture is complex, with many image-text concatenation operations, which limits the achievable GPU utilization.

Similarly, after training completion, the model is tested again on a dog picture. This time the model understands the image and Chinese question, providing the correct answer. More importantly, the model fully retains all the original capabilities of Qwen3-0.6B, including function calls and reasoning. On top of that, adding only 0.09B parameters gives the model image understanding ability!
Model Inference and Effect Analysis
- Code and Dataset Links Summary
Fine-tuning The Cauldron dataset download links:
- HuggingFace Hub: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron
- ModelScope: https://modelscope.cn/datasets/AI-ModelScope/the_cauldron
Qwen3-0.6B model download:
- HuggingFace Hub: https://huggingface.co/Qwen/Qwen3-0.6B
- ModelScope: https://modelscope.cn/Qwen/Qwen3-0.6B
Complete experiment code GitHub link:
Experiment SwanLab logs:
- SwanLab training overview:
https://swanlab.cn/@ShaohonChen/Qwen3-SmVL/overview
References
- Huggingface SmolVLM2 technical report: https://arxiv.org/pdf/2504.05299