multi-ocr-sdk: A pip package supporting multiple OCR engines

Project Introduction

MULTI-OCR-SDK is a simple, efficient Python SDK for calling multiple OCR APIs. It currently supports DeepSeek-OCR, vision-language models (VLMs), and PaddleOCR, and converts documents (PDFs and images) into Markdown text with high accuracy and performance.

Usage

Installation

# Install via pip  
pip install multi-ocr-sdk  
# Or install via uv  
uv add multi-ocr-sdk  

Basic Usage of VLM

from multi_ocr_sdk import VLMClient  

API_KEY = "your_api_key_here"  
BASE_URL = "http://your_url/v1/chat/completions"  
file_path = "./examples/example_files/DeepSeek_OCR_paper_mini.pdf"  

client = VLMClient(api_key=API_KEY, base_url=BASE_URL)  

result = client.parse(  
    file_path=file_path,  
    prompt="You are an OCR bot. Recognize the content of the input file and output it in Markdown format, preserving formatting information such as charts and tables as much as possible. Do not comment on or summarize the file content—output only the recognized content.",  
    model="Qwen3-VL-8B",  
    # timeout=100,  # Optional; defaults to 60 seconds. Increase it for large files that need longer VLM processing time.  
    # dpi=60,  # Optional; defaults to 72. A lower DPI yields blurrier images and uses fewer input tokens, but degrades recognition quality. Adjust to suit your needs.  
    # pages=[1, 2],  # Optional; not needed for single images or single-page PDFs. Multi-page PDFs are processed in full by default; list page numbers here to process only those pages.  
)  
print(result)  
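
To get a feel for the `dpi` trade-off mentioned in the comments above, here is a small arithmetic sketch. It is not part of multi-ocr-sdk: the helper names `rendered_size` and `pixel_ratio` are made up for illustration, and it assumes pages are rasterized at the standard PDF scale of 72 points per inch. How many input tokens a given image costs depends on the model, so treat the pixel ratio only as a rough indicator.

```python
# Illustrative arithmetic only; these helpers are hypothetical, not SDK functions.
# PDF page sizes are expressed in points, where 1 point = 1/72 inch.
A4_POINTS = (595, 842)  # A4 width and height in points


def rendered_size(page_points, dpi):
    """Return the (width, height) in pixels of a page rasterized at `dpi`."""
    width_pt, height_pt = page_points
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def pixel_ratio(dpi, baseline_dpi=72):
    """Return how many times more pixels a page has at `dpi` than at `baseline_dpi`."""
    return (dpi / baseline_dpi) ** 2


for dpi in (60, 72, 144, 300):
    width, height = rendered_size(A4_POINTS, dpi)
    print(f"dpi={dpi}: {width}x{height} px, {pixel_ratio(dpi):.2f}x the pixels of 72 dpi")
```

Because pixel count grows with the square of the DPI, doubling the DPI roughly quadruples the image data sent to the model, which is why small DPI changes can noticeably affect both cost and legibility.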

Basic Usage of DeepSeek-OCR

from multi_ocr_sdk import DeepSeekOCR  

client = DeepSeekOCR(  
    api_key="your_api_key",  
    base_url="https://api.siliconflow.cn/v1/chat/completions"  # Or your provider's endpoint  
)  

# Simple document  
text = client.parse("invoice.pdf", mode="free_ocr")  

# Complex tables  
text = client.parse("statement.pdf", mode="grounding")  

# Custom DPI  
text = client.parse("document.pdf", dpi=300)  
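
Hosted endpoints may rate-limit requests, so it can help to wrap calls in a retry loop. The following is a minimal sketch, not an SDK feature: `call_with_retry` is a hypothetical helper, and it assumes that failures such as rate limiting surface as Python exceptions. This README does not say which exception types the SDK raises, so the sketch catches `Exception` broadly; narrow it once you know the actual types.

```python
import time


def call_with_retry(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Assumption: rate-limit and network failures are raised as exceptions.
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)


# Hypothetical usage with the client created above:
# text = call_with_retry(client.parse, "invoice.pdf", mode="free_ocr")
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use, leave it at its default.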

Background Story

Recently, DeepSeek released its OCR model. After trying it out, we found it works exceptionally well.

Later, we discovered that someone on GitHub had developed a DeepSeek-OCR-SDK, which was very convenient to use. We submitted several feature requests and worked with the original author to enhance and extend the SDK’s functionality.

In practice, we found that SiliconFlow’s free DeepSeek-OCR service hits rate limits easily, and we did not want to pay for an upgrade. Switching to other third-party services gave subpar results (we tested several L-site providers and were disappointed). So we looked into adding support for alternative OCR models, such as Qwen-OCR.

After considerable effort, we refactored the original DeepSeek-OCR-SDK and added support for VLMs. Real-world testing confirmed that Qwen3-VL-8B delivers excellent results.

Next, we plan to support more widely used OCR engines. Bug reports and pull requests are welcome! :heart: