How to improve the recognition capability of a basic VLM-OCR model by combining traditional computer vision techniques




Alibaba Cloud’s Tongyi Qianwen (Qwen) large models have consistently delivered strong performance in both closed-source and open-source domains. At the end of 2024, the Qwen-VL (Vision-Language) series underwent a major price reduction—an excellent development for developers like myself who want to leverage large models for personal projects. Earlier this year, I primarily relied on classic convolutional neural networks such as ResNet for image classification. However, with the rapid advancement of multimodal vision-language models (VLMs) and the increasingly evident cost advantage of Qwen-VL, the expense of categorizing hundreds or thousands of personal photos has dropped to an acceptable and highly attractive level.

Thus, I began exploring how to use these powerful vision-language models (VLMs) to improve my workflow. Previously, I had to painfully label data, fine-tune ResNet models manually, and constantly monitor loss curves during training. Now, with the zero-shot learning capability of VLMs, I can simply write prompts, and the model understands and automatically classifies images—even producing nuanced results based on my needs. For example, when organizing travel photos, I can directly instruct the Qwen-VL model to identify which images are “sunsets,” “food,” “portraits,” or “landscapes.” I can even define more specific categories like “beaches at dusk” or “city skylines at sunrise,” and the model delivers reliable classifications—all without any additional training. This would have been unimaginable just a few years ago.

As a developer, this combination of ease of use and powerful performance is nothing short of a blessing. In the past, developing models like ResNet or EfficientNet required extensive effort in tedious parameter tuning, often involving overnight sessions of manual data annotation—leading to extremely low development efficiency. Now, leveraging high-performing pre-trained models and convenient inference mechanisms offered by VLMs, I only need to craft well-designed prompts, then call Alibaba Cloud’s open API to quickly achieve desired classification or analysis functions. This greatly enhances productivity, allowing me to focus more on building core business logic rather than being bogged down by foundational model training and optimization.


Section 1: Performing Image Classification Tasks Using Qwen-VL #

The principle here is quite simple: since VLM models understand both images and text, they can easily perform cross-modal understanding and reasoning. We merely send the model an image along with a specific prompt, and it automatically determines and outputs the image category. To simplify result handling, we only need to implement a basic result-matching parsing module. At the system architecture level, this project interacts with the Qwen-VL model via an OpenAI-compatible API interface, managing API keys and other configurations through environment variables to ensure flexibility, security, and maintainability.
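
To make this concrete, here is a minimal sketch of the classification prompt and the result-matching step; the category names and prompt wording are illustrative, not the exact ones used in the repository:

CATEGORIES = ["anime", "daily life", "pets", "work", "memes"]

CLASSIFY_PROMPT = (
    "Classify this image into exactly one of the following categories and reply with "
    "only the category name: " + ", ".join(CATEGORIES)
)

def match_category(reply):
    # Map the model's free-text reply back to a known category; fall back to "other".
    reply = reply.lower()
    for category in CATEGORIES:
        if category in reply:
            return category
    return "other"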

To improve classification speed, the project supports concurrent processing of multiple images and integrates image preprocessing and compression features, effectively balancing processing efficiency and image quality to guarantee stable and efficient operation. Below is the GitHub repository link:

Lapis0x0/VLMClassifier

Processing Pipeline Details #

The project follows this processing flow: First, input images undergo detailed preprocessing. This includes resizing images to a maximum dimension of 1024x1024 pixels, converting them to the RGB color space, and applying JPEG compression with a quality factor of 85 to optimize data transmission efficiency.
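
A sketch of this preprocessing step using Pillow; the parameter values mirror the description above, and the function name is illustrative:

from io import BytesIO
from PIL import Image

def preprocess_image(path, max_side=1024, quality=85):
    # Resize to at most 1024x1024 (aspect ratio preserved), convert to RGB, re-encode as JPEG.
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # only shrinks, never enlarges
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()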

After preprocessing, the image is encoded into Base64 format, and a request containing the image data and predefined classification prompts is constructed. Then, the Qwen-VL-Plus model is invoked via API for inference. The returned results are deeply parsed to accurately determine the final image category. Currently, the system includes default categories such as anime, daily life photos, pets, work-related images, and memes, while also supporting custom category extensions via environment variables to meet individual user needs.
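
A sketch of the request construction, sending the Base64-encoded image and the classification prompt to qwen-vl-plus through the same OpenAI-compatible endpoint used in the OCR example later in this post. Passing the image as a data URL is a common pattern for this interface and is assumed here, not necessarily the project's exact request format:

import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def classify_image(jpeg_bytes, prompt):
    # Embed the preprocessed JPEG as a Base64 data URL and ask the model for a category.
    b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    completion = client.chat.completions.create(
        model="qwen-vl-plus",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return completion.choices[0].message.content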

For performance optimization, the project employs thread pools to enable concurrent processing of multiple images and incorporates robust exception-handling mechanisms to address potential issues. Additionally, all major parameters are configurable via environment variables, and the image optimization strategy balances processing efficiency with image quality.
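
A sketch of the concurrency pattern; classify_one is assumed to chain the preprocessing, request, and category-matching steps from the sketches above, and the worker count is illustrative:

from concurrent.futures import ThreadPoolExecutor, as_completed

def classify_all(paths, classify_one, max_workers=8):
    # Classify many images concurrently; a failure is recorded instead of aborting the batch.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(classify_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                results[path] = f"error: {exc}"
    return results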

Key advantages of this project include:

  1. Leveraging advanced VLM models to ensure accurate understanding of image content;
  2. Implementing efficient concurrent processing to significantly boost speed;
  3. Offering excellent configurability and extensibility for user customization;
  4. Integrating comprehensive image preprocessing that maintains image quality while improving efficiency.

Thanks to these features, the project can be widely applied in scenarios such as bulk image categorization, photo library management, and automated image classification systems.

Summary #

I love this kind of straightforward, broadly applicable solution offered by VLMs—minimal design effort yields excellent results.


Section 2: Organizing Notes Using the Qwen-VL-OCR Model #

Approach Overview #

Beyond content classification, Qwen-VL has a specialized variant—Qwen-VL-OCR—specifically optimized for extracting text from images. It efficiently recognizes and extracts textual information from various types of images, including:

  • Documents: Scanned documents, PDFs
  • Tables: Various forms of tabular data
  • Exams: Test papers, practice questions
  • Handwritten Text: Handwritten notes, letters

Currently, the Qwen-VL-OCR model supports multiple languages: Chinese, English, French, Japanese, Korean, German, Russian, Italian, Vietnamese, and Arabic.

NOTE

The input/output pricing for this model is ¥5 per million tokens—extremely cost-effective.

Therefore, we can leverage the powerful OCR capabilities of Qwen-VL-OCR to automate note archiving and organization.

Implementation Steps:

  1. Image Preprocessing: My personal notes are often several pages stitched into a single image, so preprocessing steps such as page segmentation, resizing, grayscale conversion, and denoising are needed to improve OCR accuracy. While the Qwen-VL-OCR model exhibits some robustness to image quality, proper preprocessing further enhances recognition performance.
  2. Text Extraction: Feed the preprocessed image into the Qwen-VL-OCR model, which automatically identifies and extracts the text, outputting it in plain text form.
  3. Post-processing & Refinement: The raw OCR output is unstructured and may not follow the original reading order. We need to invoke another LLM to revise and polish the text to better suit our reading habits and archival requirements.
  4. Archival Organization: Based on the extracted text and our specific needs, organize the notes into different folders or databases—for instance, categorizing by keywords, topics, or dates.
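
A skeleton of this four-step flow; the function names are hypothetical placeholders for the pieces detailed in the rest of this post:

def process_notebook(image_path, out_dir):
    pages = detect_and_split_pages(image_path)        # step 1: preprocessing and page segmentation
    for page_no, page in enumerate(pages, start=1):
        raw_text = run_vl_ocr(page)                   # step 2: Qwen-VL-OCR text extraction
        clean_text = refine_with_llm(raw_text)        # step 3: LLM post-processing and refinement
        archive_note(clean_text, out_dir, page_no)    # step 4: file into folders or a database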

Project Repository:

Lapis0x0/NoteOCR

1. Note Detection and Page Segmentation — Edge and Contour-Based Page Recognition #

Given that models have limited context capacity, feeding a dozen pages of notes at once would likely degrade recognition quality. Therefore, we first need to “trim” the input—detecting and segmenting pages—so the model processes one page at a time, leading to easier and more accurate recognition.

My current detection method relies on classical computer vision techniques and operates in three stages to locate the boundaries of each note page: preprocessing, edge detection and line extraction, and page region identification.

(1) Preprocessing

  • Grayscale Conversion: Convert the color image to grayscale. This simplifies subsequent processing by reducing it to a single-channel representation.

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
  • CLAHE Contrast Enhancement: To handle uneven lighting or low contrast, I apply Contrast Limited Adaptive Histogram Equalization (CLAHE). CLAHE enhances local contrast by performing histogram equalization within small tiles, avoiding excessive noise amplification.

    • Mathematically, CLAHE divides the image into tiles, computes a histogram for each, clips each histogram at a preset limit (the clipLimit parameter below) to bound contrast amplification, performs equalization per tile, and uses bilinear interpolation to smoothly combine the results.
    • Core formula (for one tile):
    g = \frac{L - 1}{N} \sum_{i=0}^{f} \text{hist}(i)

    Where:

    • g: pixel value after equalization
    • f: original pixel value
    • L: number of gray levels (e.g., 256)
    • \text{hist}(i): count of pixels with intensity i
    • N: total number of pixels in the tile
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    
  • Gaussian Blur Denoising: Apply Gaussian blur to reduce high-frequency noise. Kernel size can be adjusted based on noise level.

    • 2D Gaussian function:
    G(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}

    Where:

    • (x,y): pixel coordinates
    • \sigma: standard deviation controlling blur strength
    blurred = cv2.GaussianBlur(enhanced, (5, 5), 0)
    

(2) Edge Detection and Line Extraction: Outlining Page Boundaries

After preprocessing, we proceed to detect page edges.

  • Canny Edge Detection: Use the classic Canny edge detection algorithm, which detects edges using gradient computation, non-maximum suppression, and double thresholding.

    edges = cv2.Canny(blurred, 50, 150)
    
  • Hough Transform for Line Detection: Apply the Hough Transform to extract straight-line features corresponding to page borders. It maps lines in image space to points in parameter space.

    • A line in polar coordinates:
    \rho = x\cos\theta + y\sin\theta

    Where:

    • \rho: distance from origin to line
    • \theta: angle between normal vector and x-axis
    • (x,y): point on the line
    • Hough transform identifies lines by finding peaks in the (\rho,\theta) parameter space
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    
  • Line Mask Creation and Morphological Operations: Detected lines are processed further to form complete page boundaries. A line mask is created and subjected to dilation and erosion to connect broken edges and remove small noise artifacts.
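
Continuing from the edges and lines variables above, a sketch of this step; the kernel size and iteration counts are illustrative:

import cv2
import numpy as np

# Draw the detected Hough lines onto a blank single-channel mask.
mask = np.zeros(edges.shape, dtype=np.uint8)
if lines is not None:
    for rho, theta in lines[:, 0]:
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho
        pt1 = (int(x0 + 2000 * (-b)), int(y0 + 2000 * a))
        pt2 = (int(x0 - 2000 * (-b)), int(y0 - 2000 * a))
        cv2.line(mask, pt1, pt2, 255, 2)

# Dilation connects broken segments into continuous boundaries; erosion then removes small noise.
kernel = np.ones((5, 5), np.uint8)
mask = cv2.dilate(mask, kernel, iterations=2)
mask = cv2.erode(mask, kernel, iterations=1)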

(3) Page Region Identification: Bounding Each Note Page

With clear edge information, we identify individual page regions.

  • Find Contours: Use cv2.findContours to detect contours in the image.

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
  • Filter Rectangular Regions: Select contour candidates likely to represent pages based on area and aspect ratio.

  • Perspective Transformation for Correction: Apply perspective transformation to correct skewed or distorted pages. By mapping the four corner points of the detected rectangle to a standard rectangular frame, the page is rectified.

    • The perspective transformation matrix M is derived by solving (in homogeneous coordinates):
    \begin{bmatrix} s_i x_i' \\ s_i y_i' \\ s_i \end{bmatrix} = M \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}

    Where:

    • (x_i, y_i): original corner points
    • (x_i', y_i'): transformed coordinates
    • s_i: homogeneous scale factor
    • i = 1,2,3,4: four corners
  • Sort by Position: Finally, sort detected page regions from left to right to preserve correct page order.
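
A sketch covering these three steps, picking up the image and contours variables from the snippets above; the area and aspect-ratio thresholds are illustrative:

import cv2
import numpy as np

def order_corners(pts):
    # Order four points as top-left, top-right, bottom-right, bottom-left.
    s = pts.sum(axis=1)
    d = np.diff(pts, axis=1).ravel()
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]], dtype=np.float32)

pages = []
min_area = 0.05 * image.shape[0] * image.shape[1]
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)
    aspect = h / float(w)
    # Keep only contours that are large enough and roughly page-shaped.
    if cv2.contourArea(cnt) < min_area or not (1.0 < aspect < 2.5):
        continue
    # Approximate the contour with four corners, then rectify via perspective transform.
    approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
    if len(approx) != 4:
        continue
    src = order_corners(approx.reshape(4, 2).astype(np.float32))
    dst = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    pages.append((x, cv2.warpPerspective(image, M, (w, h))))

# Sort the rectified pages left to right by their x position.
pages = [page for _, page in sorted(pages, key=lambda item: item[0])]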

(4) Alternative Method: Page Segmentation Based on Text Density Analysis

Although the primary method works well in most cases, I’ve designed a fallback approach (_fallback_page_detection) to enhance system robustness. If the main method fails to detect the expected number of pages (e.g., no pages found or count ≠ 3), this alternative is triggered.

This method assumes blank spaces exist between note pages, characterized by low text density.

  1. Image Binarization: Separate text from background via binarization.

    • Common methods include global and adaptive thresholding. Adaptive thresholding formula:
    T(x,y) = \mu(x,y) - C

    Where:

    • T(x,y): threshold for pixel (x,y)
    • \mu(x,y): mean intensity in local neighborhood
    • C: constant offset
  2. Compute Horizontal Text Density Distribution: Count foreground pixels row by row in the binary image to obtain horizontal text density.

  3. Moving Average Smoothing: Smooth the density curve using moving average to suppress noise.

  4. Locate Local Minima: Identify local minima in the smoothed curve—these typically correspond to inter-page gaps.

  5. Equal-Interval Splitting (Fallback): If no suitable split points are found, fall back to equal-interval division as a last resort.
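
A sketch of this fallback, assuming the stitched image stacks pages vertically so that blank gaps appear as low-density rows; the thresholds, window size, and margins are illustrative, and low-density rows stand in here for an explicit local-minima search:

import cv2
import numpy as np

def fallback_split(gray, expected_pages=3, window=25):
    # 1. Adaptive binarization: text becomes white (255) on a black background.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 10)
    # 2. Row-wise text density: number of foreground pixels per row.
    density = binary.sum(axis=1) / 255.0
    # 3. Moving-average smoothing to suppress noise.
    smooth = np.convolve(density, np.ones(window) / window, mode="same")
    # 4. Candidate gaps: very low-density rows away from the borders, kept only if far apart.
    margin = gray.shape[0] // 10
    low = np.where(smooth < 0.1 * smooth.max())[0]
    low = low[(low > margin) & (low < gray.shape[0] - margin)]
    gaps, min_sep = [], gray.shape[0] // (expected_pages * 2)
    for row in low:
        if not gaps or row - gaps[-1] > min_sep:
            gaps.append(int(row))
    # 5. If too few gaps were found, fall back to equal-interval splitting.
    if len(gaps) < expected_pages - 1:
        gaps = [gray.shape[0] * k // expected_pages for k in range(1, expected_pages)]
    cuts = [0] + gaps[:expected_pages - 1] + [gray.shape[0]]
    return [gray[cuts[k]:cuts[k + 1]] for k in range(expected_pages)]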

Reflection: Is This Page Detection Method Optimal? #

Certainly not. Recently, I noticed that models like Qwen-VL and Gemini seem to natively support object detection, capable of directly outputting bounding boxes. Future versions could test using VLMs directly for page detection and segmentation, potentially enhancing the project’s robustness.

2. Performing OCR and Post-Processing Refinement #

Once segmented, each note page is submitted directly to the model for OCR. Below is a code example using Qwen-VL-OCR:

import os
from openai import OpenAI

client = OpenAI(
    # If you haven't set environment variables, replace the next line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                    "min_pixels": 28 * 28 * 4,
                    "max_pixels": 28 * 28 * 1280
                },
                # For optimal recognition, the model internally uses "Read all the text in the image." regardless of user input.
                {"type": "text", "text": "Read all the text in the image."},
            ]
        }
    ]
)

print(completion.choices[0].message.content)

After OCR, each page’s result is sent to another LLM for refinement and polishing.

Here’s my personal editing prompt:

# ocr_text is assumed to hold the raw OCR output for the current page
messages = [
  {"role": "system", "content": "You are a professional note organization assistant. Your task is to organize and refine classroom notes extracted via OCR, making them clearer and better structured while preserving original emphasis markers. Only output the refined note content, with no additional explanations."},
  {"role": "user", "content": f"""Please help organize the following classroom notes with these requirements:
1. Maintain original structure and formatting
2. Preserve all emphasis markers
3. Correct obvious OCR errors (e.g., incorrect names, terms, nouns)
4. Improve paragraph breaks and indentation
5. Ensure mathematical formulas and symbols are accurate

{ocr_text}"""}
]

I typically use Deepseek v3 or gpt-4o-1120 for this step.

Finally, the program automatically merges the refined OCR results from all pages into a single Markdown file.
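
A sketch of this final merge step; the heading scheme and file naming are illustrative:

from pathlib import Path

def merge_pages(refined_pages, out_path, title="Notes"):
    # Concatenate the refined per-page results into a single Markdown file.
    parts = [f"# {title}\n"]
    for page_no, text in enumerate(refined_pages, start=1):
        parts.append(f"\n## Page {page_no}\n\n{text.strip()}\n")
    Path(out_path).write_text("".join(parts), encoding="utf-8")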

NOTE

The next blog post may analyze the technical report of Qwen-VL and explore using Qwen-VL or Gemini for note detection and page segmentation.