- Sequential (for loop)
- The simplest approach and the easiest to debug. Suitable when there are few files or strict sequential processing is required.
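A minimal sketch of the sequential baseline; the `process()` function and file list are hypothetical stand-ins for real per-file work:

```python
def process(path: str) -> int:
    # Stand-in for real per-file work (parsing, OCR, counting, ...)
    return len(path)

paths = ["a.txt", "bb.txt", "ccc.txt"]

results = []
for p in paths:  # strictly one file at a time, trivial to step through in a debugger
    results.append(process(p))

print(results)
```

Every other option below is a way of speeding up exactly this loop.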
- multiprocessing / ProcessPoolExecutor (multi-processing)
- Suitable for CPU-intensive tasks (can utilize multiple cores).
- Costs: process startup/communication overhead, need to serialize (pickle) function arguments/results, some objects cannot be serialized (e.g., open sockets).
- When used directly in a notebook, you may need `if __name__ == "__main__":` protection, or you may need to move the logic into a script and run it there.
- asyncio + aiohttp / asynchronous I/O
- Suitable for a large number of concurrent, lightweight I/O operations (e.g., sending requests to many servers at once).
- Advantages: uses less memory than threads under high-concurrency HTTP request workloads, since each task is a coroutine rather than an OS thread.
- Disadvantages: requires rewriting code in async/await style, and third-party libraries need async equivalents (requests → aiohttp).
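A self-contained sketch of the async pattern. To stay runnable without aiohttp, `asyncio.sleep` stands in for a real network request; with aiohttp the body of `fetch()` would be an `async with session.get(url)` call instead:

```python
import asyncio

async def fetch(i: int) -> str:
    # Simulated I/O wait; with aiohttp this would be an HTTP request
    await asyncio.sleep(0.01)
    return f"response-{i}"

async def main() -> list:
    # gather runs all coroutines concurrently on a single thread
    return await asyncio.gather(*(fetch(i) for i in range(5)))

results = asyncio.run(main())
print(results)
```

All five "requests" overlap during their wait time, which is why this scales to thousands of concurrent connections without thousands of threads.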
- Distributed / task queue (Celery, RQ, Kafka, etc.)
- When task volume is very large or you need persistence, retries, or observability, use a queue to distribute tasks to a worker cluster.
- Production-grade solutions, at the cost of extra infrastructure.
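A toy in-process illustration of the pattern these systems implement: producers put tasks on a queue, a pool of workers consumes them. Celery/RQ/Kafka do the same across machines, adding a broker plus the persistence and retry machinery; everything below (the doubling "task", the sentinel shutdown) is illustrative only:

```python
import queue
import threading

tasks: queue.Queue = queue.Queue()
results = []
lock = threading.Lock()

def worker() -> None:
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        with lock:
            results.append(item * 2)  # stand-in for real task work
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for n in range(10):               # "producer" enqueues work
    tasks.put(n)
for _ in workers:                 # one sentinel per worker
    tasks.put(None)
tasks.join()
for w in workers:
    w.join()

print(sorted(results))
```

The key property carried over to the distributed versions is that producers and workers are decoupled: either side can scale independently.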
- Subprocess / external programs
- Run PDF parsing or OCR in a separate process or container; Python only handles scheduling.
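A sketch of driving an external tool from Python. A trivial Python one-liner stands in for a real parser binary (e.g. pdftotext or tesseract, which you would substitute into the command list):

```python
import subprocess
import sys

# Invoke the "parser" as an external process; replace the command list
# with the real tool, e.g. ["pdftotext", "in.pdf", "out.txt"].
proc = subprocess.run(
    [sys.executable, "-c", "print('parsed 3 pages')"],
    capture_output=True,
    text=True,
    timeout=30,   # always bound external programs with a timeout
    check=True,   # raise CalledProcessError on a non-zero exit code
)
print(proc.stdout.strip())
```

Because the heavy work lives in another process, a crash or memory blow-up in the parser cannot take down the Python scheduler.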
- Thread pool + rate limiting + retries (common combination)
- When making batch requests to external APIs, rate limiting (to avoid being blocked), retries, and timeout control are usually needed.
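A sketch combining all three pieces. The "API" here is a hypothetical flaky function that fails once per item so the retry path is exercised; a semaphore caps in-flight calls as a crude rate limit (a token bucket would bound calls per second instead):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical flaky API: each item fails on its first call, then succeeds.
_failed_once = set()
_fail_lock = threading.Lock()

def call_api(i: int) -> str:
    with _fail_lock:
        if i not in _failed_once:
            _failed_once.add(i)
            raise TimeoutError("transient error")
    return f"ok-{i}"

limiter = threading.Semaphore(2)  # at most 2 in-flight calls at once

def fetch_with_retry(i: int, retries: int = 3) -> str:
    for attempt in range(retries):
        with limiter:
            try:
                return call_api(i)
            except TimeoutError:
                time.sleep(0.01 * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"gave up on {i}")

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_with_retry, range(5)))

print(results)
```

Threads work well here because the workload is I/O-bound waiting on the remote API, so the GIL is not the bottleneck.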