While skimming through the new Python 3.14 release notes[1], I noticed the introduction of the concurrent.interpreters module, which allows running multiple interpreters in the same process. This was apparently already supported in the underlying C API, but wasn't exposed as a module in the standard library.

This is interesting because, since Python 3.12, interpreters are isolated from one another and, crucially, each has its own Global Interpreter Lock (GIL).

This means we can start a thread and run a new interpreter in it: since that interpreter holds its own GIL, it doesn't contend with the main one, and we can effectively run concurrent and parallel code in the same process.

The caveat is that data has to be copied to and from the interpreters, since they are isolated and thus share no memory[2].

This way we can achieve the isolation and parallelism of processes, but with the efficiency of threads (lower memory overhead and faster startup times).
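
As a minimal sketch of what this looks like (based on the documented create(), prepare_main() and exec() API; the function name and the toy workload are mine):

import threading
from concurrent import interpreters

def work():
    interp = interpreters.create()  # a fresh, isolated interpreter
    try:
        # prepare_main() copies shareable objects (here an int) into the
        # new interpreter's __main__ module
        interp.prepare_main(limit=1_000_000)
        # exec() runs the code under that interpreter's own GIL, so other
        # threads in this process can keep running in parallel
        interp.exec("total = sum(range(limit))")
    finally:
        interp.close()

t = threading.Thread(target=work)
t.start()
t.join()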

A quick benchmark

A new InterpreterPoolExecutor has also been added: it extends ThreadPoolExecutor to run the work in a separate interpreter for each thread. This makes it easy to compare against implementations that use processes (ProcessPoolExecutor) and plain threads (ThreadPoolExecutor).

We want to run this CPU-heavy code:

import math

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True
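
One wrinkle, covered by footnote 3: for InterpreterPoolExecutor, is_prime has to live in a module the worker interpreters can import, not in the benchmark script itself. A hypothetical layout (primes.py is my name for the file, not from the original):

# primes.py (hypothetical file name) holds the is_prime definition above,
# together with its import of math.

# benchmark.py
from primes import is_prime  # each worker interpreter has its own __main__,
                             # so it must import the function from a module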

This compares:

  • a sequential for loop
  • a ThreadPoolExecutor
  • a ProcessPoolExecutor
  • an InterpreterPoolExecutor[3]

import concurrent.futures
import time

start = time.perf_counter()
for number in PRIMES:
    _ = is_prime(number)
end = time.perf_counter()
print(f"For loop took: {end - start}")

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
end = time.perf_counter()
print(f"ThreadPoolExecutor took: {end - start}")

start = time.perf_counter()
with concurrent.futures.ProcessPoolExecutor() as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
end = time.perf_counter()
print(f"ProcessPoolExecutor took: {end - start}")

start = time.perf_counter()
with concurrent.futures.InterpreterPoolExecutor() as executor:
    for _ in executor.map(is_prime, PRIMES):
        pass
end = time.perf_counter()
print(f"InterpreterPoolExecutor took: {end - start}")

and prints the running time of all four on this input:

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419,
]

The results are:

For loop took: 1.5507157149986597
ThreadPoolExecutor took: 3.0324967070009734
ProcessPoolExecutor took: 0.6867031029978534
InterpreterPoolExecutor took: 0.5512552690015582

This looks promising, but simply tuning the number of processes and threads (max_workers) narrows the difference between the last two implementations.

For loop took: 1.5788738779992855
ThreadPoolExecutor took: 2.927569447001588
ProcessPoolExecutor took: 0.5861226909983088
InterpreterPoolExecutor took: 0.5852165760006756

This might be due to the fact that starting the interpreters has not been optimized yet[1]. In fact, reusing the same pool results in better performance.
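
Concretely, "reusing the same pool" means creating the executor once and timing successive runs against it, so that only the first run pays the startup cost. A sketch, using is_prime and PRIMES from above (the original timing script for this part isn't shown):

import concurrent.futures
import time

with concurrent.futures.InterpreterPoolExecutor() as executor:
    for _run in range(2):
        start = time.perf_counter()
        # the first run pays the interpreter startup cost,
        # the second reuses the already-started workers
        for _ in executor.map(is_prime, PRIMES):
            pass
        end = time.perf_counter()
        print(f"InterpreterPoolExecutor took: {end - start}")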

For loop took: 1.5554543390026083
ThreadPoolExecutor took: 2.9296667560010974
ThreadPoolExecutor took: 2.820398619001935
ProcessPoolExecutor took: 0.695847008999408
ProcessPoolExecutor took: 0.6095230329992773
InterpreterPoolExecutor took: 0.5355863310032873
InterpreterPoolExecutor took: 0.4764342630005558

Copying memory

The interpreters approach massively outperforms the ProcessPoolExecutor when large amounts of memory have to be copied. The following runs the CPU-heavy code from before but also returns 100 MB of memory:

def give_me_memory(n):
    _ = is_prime(n)
    return b"\xaa" * (100 * 1024 * 1024)

This is faster because the returned bytes are copied within the same process instead of being transferred between processes.
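
The driver follows the same pattern as before; a sketch for the two pool variants, reusing the imports and definitions from above:

for pool_cls in (concurrent.futures.ProcessPoolExecutor,
                 concurrent.futures.InterpreterPoolExecutor):
    start = time.perf_counter()
    with pool_cls() as executor:
        for _ in executor.map(give_me_memory, PRIMES):
            pass
    end = time.perf_counter()
    print(f"{pool_cls.__name__} took: {end - start}")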

For loop took: 1.336727849999079
ThreadPoolExecutor took: 2.5152056050010287
ProcessPoolExecutor took: 1.036969170001612
InterpreterPoolExecutor took: 0.7322701810007857

GIL

The interpreter-per-thread approach is mostly beneficial for code that is normally limited by the GIL. Code that releases the GIL, like I/O tasks and many C extensions, will perform about as fast as the plain thread pool implementation (plus the overhead of starting a new interpreter).

For example, this code is CPU-heavy but releases the GIL:

import hashlib
import os

def no_gil_task(_):
    _ = hashlib.pbkdf2_hmac(
        hash_name="sha256",
        password=b"a_very_hard_password",
        salt=os.urandom(16),
        iterations=2_000_000,
    )

and we get:

For loop took: 5.7273291679994145
ThreadPoolExecutor took: 1.6703275659965584
ProcessPoolExecutor took: 1.6922476879990427
InterpreterPoolExecutor took: 1.658340528003464

Third-party library support

Code in the standard library already supports being run in a sub-interpreter. Unfortunately, third-party libraries will have to adapt to be compatible.

For example, numpy currently doesn't support them[4].
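
A quick way to check is to try the import inside a subinterpreter; per the documentation, exec() raises ExecutionFailed when the code running in the interpreter fails (a sketch):

from concurrent import interpreters

interp = interpreters.create()
try:
    # extension modules that don't support multiple interpreters
    # raise an ImportError inside the subinterpreter
    interp.exec("import numpy")
except interpreters.ExecutionFailed as exc:
    print(f"numpy can't run in a subinterpreter: {exc}")
finally:
    interp.close()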

Footnotes

  1. https://docs.python.org/3/whatsnew/3.14.html

  2. Actually there are a couple of types we can share: https://docs.python.org/3.14/library/concurrent.interpreters.html#sharing-objects

  3. The is_prime function needs to be imported from another file to make InterpreterPoolExecutor work. This is because each interpreter has its own __main__ module[5].

  4. https://github.com/numpy/numpy/issues/24755

  5. https://docs.python.org/3.14/library/concurrent.interpreters.html#running-in-an-interpreter