While skimming through the new Python 3.14 release notes [1], I noticed the introduction of the `concurrent.interpreters` module, which allows running multiple interpreters in the same process. This was apparently already supported by the underlying C API, but wasn't available as a module in the standard library.
This is interesting since, starting from Python 3.12, interpreters are isolated from one another and, crucially, each has its own Global Interpreter Lock (GIL). This means we can start a thread and run a new interpreter there, releasing the GIL and effectively running concurrent and parallel code in the same process. The caveat is that data has to be copied to and from the interpreters, since they are isolated and thus there is no shared memory [2]. This way we can achieve the isolation and parallelism of processes, but with the efficiency of threads (lower memory overhead and faster startup times).
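The new API is small; here is a minimal sketch of spawning a sub-interpreter and running code in it (my reading of the module docs, assuming Python 3.14):

```python
from concurrent import interpreters

# Create a brand-new interpreter, isolated from the current one
interp = interpreters.create()

# Run code in it: it has its own modules, its own state, its own GIL
interp.exec("print('hello from a sub-interpreter')")

interp.close()
```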
## A quick benchmark
A new `InterpreterPoolExecutor` has also been released; it extends `ThreadPoolExecutor` to run the work in a separate interpreter for each thread. We can then easily compare this to the implementations that use processes (`ProcessPoolExecutor`) and threads (`ThreadPoolExecutor`).
We want to run this CPU-heavy code:
```python
import math

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True
```
This compares:

- a sequential for loop
- a `ThreadPoolExecutor`
- a `ProcessPoolExecutor`
- an `InterpreterPoolExecutor` [3]
```python
import concurrent.futures
import time

# is_prime must live in an importable module for InterpreterPoolExecutor
# to find it (hypothetical module name; see footnote [3])
from primes import is_prime

start = time.perf_counter()
for number in PRIMES:
    _ = is_prime(number)
end = time.perf_counter()
print(f"For loop took: {end - start}")

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
end = time.perf_counter()
print(f"ThreadPoolExecutor took: {end - start}")

start = time.perf_counter()
with concurrent.futures.ProcessPoolExecutor() as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
end = time.perf_counter()
print(f"ProcessPoolExecutor took: {end - start}")

start = time.perf_counter()
with concurrent.futures.InterpreterPoolExecutor() as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
end = time.perf_counter()
print(f"InterpreterPoolExecutor took: {end - start}")
```
and prints the running time of all four on this input:
```python
PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419,
]
```
The results are:
```
For loop took: 1.5507157149986597
ThreadPoolExecutor took: 3.0324967070009734
ProcessPoolExecutor took: 0.6867031029978534
InterpreterPoolExecutor took: 0.5512552690015582
```
This looks promising, but just playing around with the number of processes and threads (`max_workers`) narrows the difference between the last two implementations.
```
For loop took: 1.5788738779992855
ThreadPoolExecutor took: 2.927569447001588
ProcessPoolExecutor took: 0.5861226909983088
InterpreterPoolExecutor took: 0.5852165760006756
```
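For reference, the tuning above just means passing `max_workers` to the pools, along these lines (a sketch; 4 is an arbitrary value, and the sweet spot depends on the machine's core count):

```python
# Same harness as before, but with an explicit worker count
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass

with concurrent.futures.InterpreterPoolExecutor(max_workers=4) as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
```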
This might be because starting the interpreters has not been optimized yet [1]. In fact, reusing the same pool results in better performance.
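To measure that, we can keep each pool alive and time `map()` twice on the same executor. A minimal sketch (`bench` is a hypothetical helper; `is_prime` and `PRIMES` are as above):

```python
def bench(executor, label):
    # Two timed runs on the same pool: the first pays the worker
    # startup cost, the second reuses the already-started workers
    for _ in range(2):
        start = time.perf_counter()
        for _ in executor.map(is_prime, PRIMES):
            pass
        end = time.perf_counter()
        print(f"{label} took: {end - start}")

with concurrent.futures.InterpreterPoolExecutor() as executor:
    bench(executor, "InterpreterPoolExecutor")
```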
```
For loop took: 1.5554543390026083
ThreadPoolExecutor took: 2.9296667560010974
ThreadPoolExecutor took: 2.820398619001935
ProcessPoolExecutor took: 0.695847008999408
ProcessPoolExecutor took: 0.6095230329992773
InterpreterPoolExecutor took: 0.5355863310032873
InterpreterPoolExecutor took: 0.4764342630005558
```
## Copying memory
The interpreters approach clearly outperforms `ProcessPoolExecutor` when copying large amounts of memory. This runs the CPU-heavy code from before but also returns 100 MB of memory.
```python
def give_me_memory(n):
    _ = is_prime(n)
    # 100 MB payload that must be copied back to the caller
    return b"\xaa" * (100 * 1024 * 1024)
```
This is faster since memory is copied inside the same process instead of between processes.
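The numbers below come from pushing `give_me_memory` through the same harness; for a single pool it looks like this (a sketch, assuming the imports and `PRIMES` from above):

```python
start = time.perf_counter()
with concurrent.futures.InterpreterPoolExecutor() as executor:
    for _ in executor.map(give_me_memory, PRIMES):
        # each result is a fresh 100 MB bytes object copied back to us
        pass
end = time.perf_counter()
print(f"InterpreterPoolExecutor took: {end - start}")
```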
```
For loop took: 1.336727849999079
ThreadPoolExecutor took: 2.5152056050010287
ProcessPoolExecutor took: 1.036969170001612
InterpreterPoolExecutor took: 0.7322701810007857
```
## GIL
The interpreter-per-thread approach is mostly beneficial for code that is normally limited by the GIL. Code that releases the GIL, like I/O tasks and many C libraries, will perform about as fast as the thread pool implementation (plus the overhead of starting a new interpreter).
For example, this code is CPU-heavy but releases the GIL:
```python
import hashlib
import os

def no_gil_task(_):
    # pbkdf2_hmac releases the GIL while it runs, so plain threads
    # can already execute it in parallel
    _ = hashlib.pbkdf2_hmac(
        hash_name="sha256",
        password=b"a_very_hard_password",
        salt=os.urandom(16),
        iterations=2_000_000,
    )
```
and we get:
```
For loop took: 5.7273291679994145
ThreadPoolExecutor took: 1.6703275659965584
ProcessPoolExecutor took: 1.6922476879990427
InterpreterPoolExecutor took: 1.658340528003464
```
## Third-party library support
Code in the standard library already supports being run in a sub-interpreter. Unfortunately, third-party libraries will have to adapt to be compatible. For example, `numpy` currently doesn't support them [4].
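This is easy to check by attempting the import inside a fresh interpreter. A sketch (I'm assuming the `ExecutionFailed` exception described in PEP 734; the exact error message depends on the numpy version):

```python
from concurrent import interpreters

interp = interpreters.create()
try:
    # numpy's extension modules don't declare sub-interpreter
    # support yet, so this import is expected to fail
    interp.exec("import numpy")
except interpreters.ExecutionFailed as exc:
    print(exc)
finally:
    interp.close()
```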
## References
- https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-multiple-interpreters
- https://docs.python.org/3.14/library/concurrent.interpreters.html#module-concurrent.interpreters
- https://docs.python.org/3.14/library/concurrent.futures.html#interpreterpoolexecutor
- https://peps.python.org/pep-0734/
## Footnotes

- [2] Actually there are a couple of types we can share: https://docs.python.org/3.14/library/concurrent.interpreters.html#sharing-objects
- [3] The `is_prime` function needs to be imported from another file to make `InterpreterPoolExecutor` work. This is because each interpreter has its own `__main__` module [5].
- [5] https://docs.python.org/3.14/library/concurrent.interpreters.html#running-in-an-interpreter