Differences in Python Concurrency Handling (Thread Pools, Multiprocessing, asyncio, Distributed, etc.)

  1. Sequential (for loop)
  • The simplest and easiest to debug. Suitable for scenarios with few files or where strict sequential processing is required.
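As a baseline, a minimal sequential sketch (the `process_file` function is a hypothetical stand-in for whatever per-file work is actually done):

```python
def process_file(path: str) -> int:
    # Hypothetical placeholder for real work (parsing, uploading, ...);
    # here it just measures the path name.
    return len(path)

paths = ["a.pdf", "bb.pdf", "ccc.pdf"]

results = []
for path in paths:  # one file at a time, trivial to step through in a debugger
    results.append(process_file(path))

print(results)
```

Every later option in this list trades some of this simplicity for throughput.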
  2. multiprocessing / ProcessPoolExecutor (multiple processes)
  • Suitable for CPU-intensive tasks (can utilize multiple cores).
  • Costs: process startup/communication overhead, need to serialize (pickle) function arguments/results, some objects cannot be serialized (e.g., open sockets).
  • When used directly in notebooks or interactive sessions, you may need `if __name__ == "__main__"` protection, or to move the logic into a script, because worker processes re-import the main module.
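A minimal `ProcessPoolExecutor` sketch, assuming a picklable CPU-bound function (`cpu_heavy` is a made-up stand-in):

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n: int) -> int:
    # Stand-in for a CPU-bound task; arguments and results
    # must be picklable to cross the process boundary.
    return sum(i * i for i in range(n))

def run_pool() -> list[int]:
    # Each input is sent to a worker process; map preserves input order.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(cpu_heavy, [10, 100, 1000]))

if __name__ == "__main__":  # required with the spawn start method
    print(run_pool())
```

Keeping the pool behind the `__main__` guard is what prevents the re-import problem mentioned above.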
  3. asyncio + aiohttp (asynchronous I/O)
  • Suitable for a large number of concurrent, lightweight I/O operations (e.g., sending requests to many servers at once).
  • Advantages: under high-concurrency HTTP workloads it uses far less memory than one thread per request, and switching between coroutines is cheaper than switching between threads.
  • Disadvantages: requires rewriting code in async/await style, third-party libraries need async versions (requests → aiohttp).
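A sketch of the async/await style using only the stdlib; the network call is simulated with `asyncio.sleep` so the example is self-contained, where real code would use an `aiohttp.ClientSession` instead:

```python
import asyncio

async def fetch(url: str) -> str:
    # Simulated network call; with aiohttp this would be
    # `async with session.get(url) as resp: ...`
    await asyncio.sleep(0.01)
    return f"response from {url}"

async def main() -> list[str]:
    urls = [f"https://example.com/{i}" for i in range(3)]
    # gather schedules all coroutines concurrently on a single thread
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(results)
```

All three "requests" overlap in time, yet no threads or processes are created.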
  4. Distributed / task queues (Celery, RQ, Kafka, etc.)
  • When the task volume is very large or requires persistence/retries/observability, use queues to distribute tasks to a worker cluster.
  • Production-grade solutions, at the cost of extra infrastructure (a broker, worker deployment, monitoring).
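The queue/worker pattern can be illustrated in miniature with the stdlib; this is only an in-process sketch, since real deployments replace `queue.Queue` with a broker (Redis, RabbitMQ) and run Celery or RQ workers on separate machines:

```python
import queue
import threading

tasks: queue.Queue = queue.Queue()
results: list[str] = []
lock = threading.Lock()

def worker() -> None:
    # Each worker pulls tasks until it receives the shutdown sentinel.
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()
            return
        with lock:
            results.append(f"processed {item}")
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
for item in ["a", "b", "c"]:
    tasks.put(item)        # producer enqueues work
for _ in workers:
    tasks.put(None)        # one sentinel per worker
for w in workers:
    w.join()
print(sorted(results))
```

What Celery and RQ add on top of this shape is persistence, retries, and observability across machines.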
  5. Subprocess / external programs
  • Run heavy work such as PDF parsing or OCR in a separate process or container; Python then only does the scheduling.
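A hedged `subprocess` sketch; to stay self-contained it shells out to the Python interpreter itself, whereas a real pipeline might invoke a tool like `pdftotext` or `tesseract`:

```python
import subprocess
import sys

proc = subprocess.run(
    [sys.executable, "-c", "print('parsed 3 pages')"],
    capture_output=True,
    text=True,
    timeout=30,   # kill the external program if it hangs
    check=True,   # raise CalledProcessError on a nonzero exit code
)
print(proc.stdout.strip())
```

Crashes or memory leaks in the external tool stay isolated from the scheduling process.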
  6. Thread pool + rate limiting + retries (common combination)
  • When making batch requests to external APIs, rate limiting (to avoid being blocked), retries, and timeout control are usually needed.
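A sketch of that combination, assuming a hypothetical `flaky_api_call` that stands in for an HTTP request and fails on its first attempt for odd ids; a semaphore bounds in-flight calls and each task retries with exponential backoff:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 2
limiter = threading.Semaphore(MAX_IN_FLIGHT)  # caps concurrent calls

def flaky_api_call(i: int) -> str:
    # Hypothetical stand-in for an HTTP request;
    # odd ids time out once, then succeed.
    if i % 2 == 1 and not getattr(flaky_api_call, f"seen_{i}", False):
        setattr(flaky_api_call, f"seen_{i}", True)
        raise TimeoutError(f"request {i} timed out")
    return f"ok {i}"

def call_with_retries(i: int, attempts: int = 3) -> str:
    for attempt in range(attempts):
        with limiter:                  # rate limit via bounded concurrency
            try:
                return flaky_api_call(i)
            except TimeoutError:
                if attempt == attempts - 1:
                    raise              # out of retries: propagate the error
        time.sleep(0.01 * 2 ** attempt)  # exponential backoff between tries

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(call_with_retries, range(4)))
print(results)
```

For a real API you would also pass a per-request timeout to the HTTP client and consider a token-bucket limiter instead of a plain semaphore if the limit is expressed as requests per second.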