feat(subinterpreter): reusable PyThreadState via subinterpreter_thread_state (#6073)

* feat(subinterpreter): add opt-in TLS-cached thread state mode

subinterpreter_scoped_activate previously created and destroyed a fresh
PyThreadState on every activation when the calling OS thread was not
already running the target interpreter. Workloads that repeatedly
re-enter the same sub-interpreter from the same thread therefore churn
thread states and lose per-thread interpreter state between activations
(see pybind/pybind11#6040).

Add an opt-in subinterpreter_thread_state::cached policy: on first use a
PyThreadState is created and stored in OS-thread-local storage keyed by
the target interpreter; subsequent activations on that thread only swap
it in/out and never destroy it. The default stays transient, so existing
behavior is unchanged.

Since pybind11 does not control thread lifetime, cleanup is explicit:
subinterpreter::release_cached_thread_state() releases the calling
thread's cached state for one interpreter, and the static
release_all_cached_thread_states() releases all of the calling thread's
cached states as an end-of-thread hook. The TLS map's destructor only
frees its own nodes and never touches the Python C API, so an
unreleased state leaks rather than crashing at thread exit.

Includes test coverage and embedding docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style: pre-commit fixes

* refactor(subinterpreter): replace cached enum/TLS with subinterpreter_thread_state RAII

Address review feedback on the original "cached" mode by switching to an
explicit two-RAII design suggested by @b-pass:

  "Create a class ... to RAII-manage the PyThreadState but start its
   lifetime in an already released state. You could create another
   class (or modify scoped_activate) to scoped/RAII activate the
   inactive threadstate."

Removed
  - enum subinterpreter_thread_state { transient, cached } and the
    defaulted ctor parameter on subinterpreter_scoped_activate.
  - detail::subinterpreter_thread_state_cache thread_local map.
  - subinterpreter::release_cached_thread_state() and
    subinterpreter::release_all_cached_thread_states().

This eliminates: the hidden per-thread map, the "release_all" footgun
across pybind11 modules (the cache was module-local), and the implicit
"must not be active when called" contract on the release functions.

Added
  - Public class subinterpreter_thread_state that owns one PyThreadState
    for a given subinterpreter on its constructing OS thread, created in
    a released state (not current, no GIL). Non-copyable, non-movable
    (PyThreadState is bound to its creating OS thread).
  - subinterpreter_scoped_activate(subinterpreter_thread_state &)
    overload: swaps the owned PyThreadState in on entry, swaps it out
    on exit, does not touch its lifetime.

Behavior
  - The existing subinterpreter_scoped_activate(subinterpreter const &)
    overload is unchanged (still transient: New on entry, Delete on
    exit). All previously-working code keeps working.
  - With subinterpreter_thread_state, one OS thread can alternate
    between multiple subinterpreters and each PyThreadState is preserved
    across activations -- the use case that gil_scoped_release/acquire
    + a long-lived scoped_activate cannot solve alone (the per-thread
    internals.tstate slot holds only one inactive tstate).
  - The dtor of subinterpreter_thread_state guards against the
    "destroyed-while-active" contract violation: if Swap reveals the
    cached tstate was current, do not Swap back to a now-deleted
    pointer (the safe-when-active fix b-pass requested for the old
    release_* functions, applied at the natural location instead).

Lifetime contract is enforced by ordinary C++ scope: typical placement
is `thread_local`. No new release/cleanup APIs are required.
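
The owner/activator split described above can be sketched without pybind11 or
the CPython API. Everything in this block is a hypothetical stand-in
(`FakeThreadState`, `StateOwner`, `ScopedActivate`, `current_state`); it models
only ownership and swap-in/swap-out, not the GIL or the real
`PyThreadState_Swap` semantics.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for PyThreadState; NOT the CPython type.
struct FakeThreadState {
    int activations = 0; // per-state data that should survive between activations
};

// Models the "current thread state" slot of the calling OS thread.
thread_local FakeThreadState *current_state = nullptr;

// Owner (models subinterpreter_thread_state): creates the state but leaves it
// "released", i.e. it is not made current.
class StateOwner {
public:
    StateOwner() : state_(new FakeThreadState()) {}
    StateOwner(const StateOwner &) = delete;
    StateOwner &operator=(const StateOwner &) = delete;
    FakeThreadState *raw() const { return state_.get(); }

private:
    std::unique_ptr<FakeThreadState> state_;
};

// Activator (models the reuse-mode overload): swaps the owned state in on
// entry and restores the previous one on exit. It never deletes the state.
class ScopedActivate {
public:
    explicit ScopedActivate(StateOwner &owner) : previous_(current_state) {
        current_state = owner.raw();
        ++current_state->activations;
    }
    ~ScopedActivate() { current_state = previous_; }
    ScopedActivate(const ScopedActivate &) = delete;
    ScopedActivate &operator=(const ScopedActivate &) = delete;

private:
    FakeThreadState *previous_;
};

// Alternates between two owners on one thread and reports whether each state
// kept its identity and its data across activations.
inline bool alternation_preserves_identity() {
    StateOwner a, b;
    FakeThreadState *a1 = nullptr, *b1 = nullptr, *a2 = nullptr;
    { ScopedActivate guard(a); a1 = current_state; }
    { ScopedActivate guard(b); b1 = current_state; }
    { ScopedActivate guard(a); a2 = current_state; }
    return a1 == a2 && a1 != b1 && a2->activations == 2 && current_state == nullptr;
}
```

Calling `alternation_preserves_identity()` from a thread with no state
currently swapped in exercises the same alternation pattern the new tests check
against real `PyThreadState` pointers.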

Tests cover (a) tstate identity preserved across activations on a
thread, (b) transient and reusing modes do not share state, (c)
different OS threads get distinct PyThreadStates, and (d) the
multi-subinterpreter alternation case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(subinterpreter): address review on #6073 (same-thread checks, test scoping)

Per @b-pass's review:

- ~subinterpreter_thread_state(): add a PYBIND11_DETAILED_ERROR_MESSAGES-
  guarded check that destruction happens on the OS thread that created the
  PyThreadState (same PyThread_get_thread_native_id pattern as ~subinterpreter),
  failing with pybind11_fail otherwise.
- subinterpreter_scoped_activate(subinterpreter_thread_state &): add the
  matching DETAILED_ERROR_MESSAGES check that activation happens on the
  creating OS thread, enforcing the newly documented rule.
- docs: document that activating a subinterpreter_thread_state on another OS
  thread is illegal.
- tests: keep each subinterpreter (and its subinterpreter_thread_state) in an
  enclosing scope so destruction order is thread-state -> subinterpreter ->
  unsafe_reset_internals_for_single_interpreter(). The previous top-level
  declarations ran the reset while the subinterpreters were still alive, which
  is the likely cause of the CI crashes.
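
The destruction order the test fix relies on is ordinary C++ behavior:
automatic objects are destroyed in reverse order of declaration when their
scope closes. A small sketch with a hypothetical `Tracked` type (not part of
pybind11) that logs its name on destruction, mirroring the declaration order
used in the multi-interpreter test:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: appends its name to a shared log when destroyed.
struct Tracked {
    Tracked(std::string name, std::vector<std::string> &log)
        : name_(std::move(name)), log_(log) {}
    ~Tracked() { log_.push_back(name_); }
    std::string name_;
    std::vector<std::string> &log_;
};

inline std::vector<std::string> destruction_order() {
    std::vector<std::string> log;
    {
        // Same declaration order as the test: subinterpreters first,
        // then their thread states.
        Tracked sub_a("sub_a", log);
        Tracked sub_b("sub_b", log);
        Tracked ts_a("ts_a", log);
        Tracked ts_b("ts_b", log);
    } // destroyed in reverse: ts_b, ts_a, sub_b, sub_a
    log.push_back("reset"); // stands in for the reset call after the scope
    return log;
}
```

The returned log lists both thread states before either subinterpreter, with
the reset entry last, which is the ordering the enclosing scope in the tests is
meant to guarantee.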

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: fix codespell (re-used -> reused) in embedding.rst

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
ymwang78
2026-05-25 21:31:14 +08:00
committed by GitHub
parent f891299e6a
commit 46ebf5031b
3 changed files with 351 additions and 2 deletions


@@ -345,6 +345,75 @@ Example:
}

Reusing a thread state across activations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, :class:`subinterpreter_scoped_activate` creates a fresh
``PyThreadState`` on entry and destroys it on exit. A thread that repeatedly
re-enters the same sub-interpreter therefore allocates and frees a thread state
every time, and does **not** preserve any per-thread interpreter state between
activations.

For workloads where a single OS thread re-enters one or more sub-interpreters
many times, pybind11 provides :class:`subinterpreter_thread_state`, an RAII
object that owns a ``PyThreadState`` and lets you swap it in for the duration
of each :class:`subinterpreter_scoped_activate` scope without destroying it
between activations:

.. code-block:: cpp

    // Create the PyThreadState once. It is created in a "released" state:
    // not current, no GIL acquired.
    thread_local py::subinterpreter_thread_state ts(sub);

    {
        // Swap it in; the subinterpreter's GIL is acquired.
        py::subinterpreter_scoped_activate guard(ts);
        // ... use the sub-interpreter ...
    }
    // Swap-out only; the PyThreadState is kept alive in `ts`.

    {
        py::subinterpreter_scoped_activate guard(ts);
        // The same PyThreadState is reused; its per-thread interpreter state
        // is preserved across activations.
    }

This composes naturally with multiple sub-interpreters on the same OS thread:
hold one :class:`subinterpreter_thread_state` per sub-interpreter and alternate
between them. Each ``PyThreadState`` is independent and is preserved across
activations.

.. code-block:: cpp

    thread_local py::subinterpreter_thread_state ts_a(sub_a);
    thread_local py::subinterpreter_thread_state ts_b(sub_b);

    { py::subinterpreter_scoped_activate guard(ts_a); /* in sub_a */ }
    { py::subinterpreter_scoped_activate guard(ts_b); /* in sub_b */ }
    { py::subinterpreter_scoped_activate guard(ts_a); /* same PyThreadState as before */ }

The default behavior is unchanged: the
:class:`subinterpreter_scoped_activate(subinterpreter const&)` overload still
creates and destroys a transient ``PyThreadState`` per scope, and it never
shares a thread state with any :class:`subinterpreter_thread_state` that may
also exist for the same sub-interpreter on the same thread.

.. warning::

    Lifetime and threading requirements for :class:`subinterpreter_thread_state`:

    - It must be constructed and destroyed on the **same OS thread**. A
      ``PyThreadState`` is bound to its creating thread; deleting it on another
      thread is undefined behavior. Holding the object as a ``thread_local``
      satisfies this automatically.
    - It must only be activated (with :class:`subinterpreter_scoped_activate`)
      on the **same OS thread** that constructed it. Activating it on a
      different thread is illegal.
    - It must be destroyed while its sub-interpreter is still alive.
    - It must **not** be destroyed while a :class:`subinterpreter_scoped_activate`
      referring to it is alive; the activator holds a reference into it.

GIL API for sub-interpreters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


@@ -22,13 +22,27 @@
PYBIND11_NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
class subinterpreter;
class subinterpreter_thread_state;
/// Activate the subinterpreter and acquire its GIL, while also releasing any GIL and interpreter
/// currently held. Upon exiting the scope, the previous subinterpreter (if any) and its
/// associated GIL are restored to their state as they were before the scope was entered.
///
/// Two construction modes are supported:
///
/// 1. `subinterpreter_scoped_activate(subinterpreter const &)`:
/// Transient mode (the default). A fresh PyThreadState is created on entry and destroyed on
/// exit. This is the established behavior; existing code is unaffected.
///
/// 2. `subinterpreter_scoped_activate(subinterpreter_thread_state &)`:
/// Reuse mode. The PyThreadState owned by the given subinterpreter_thread_state is swapped
/// in on entry and swapped out (but NOT destroyed) on exit, so repeated activations on the
/// same OS thread reuse the same PyThreadState and preserve its per-thread interpreter state.
/// Use this when a single OS thread re-enters one or more subinterpreters many times.
class subinterpreter_scoped_activate {
public:
explicit subinterpreter_scoped_activate(subinterpreter const &si);
explicit subinterpreter_scoped_activate(subinterpreter_thread_state &ts);
~subinterpreter_scoped_activate();
subinterpreter_scoped_activate(subinterpreter_scoped_activate &&) = delete;
@@ -41,6 +55,9 @@ private:
PyThreadState *tstate_ = nullptr;
PyGILState_STATE gil_state_;
bool simple_gil_ = false;
// When true, tstate_ is owned by a subinterpreter_thread_state and must NOT be destroyed
// when this scope exits (only swapped out).
bool borrowed_ = false;
};
/// Holds a Python subinterpreter instance
@@ -228,10 +245,71 @@ public:
private:
friend class subinterpreter_scoped_activate;
friend class subinterpreter_thread_state;
PyInterpreterState *istate_ = nullptr;
PyThreadState *creation_tstate_ = nullptr;
};
/// RAII wrapper that owns a PyThreadState bound to a specific subinterpreter on the OS thread
/// that constructed it. Intended to be held long-lived (e.g. as a `thread_local`, or inside a
/// per-thread struct) so that many subinterpreter_scoped_activate scopes on the same OS thread
/// can reuse a single PyThreadState instead of creating and destroying one each time.
///
/// The PyThreadState is created on construction in a *released* state: it is NOT made current,
/// and no GIL is acquired. Activation is the job of subinterpreter_scoped_activate.
///
/// A single OS thread can hold one of these per subinterpreter and alternate between them via
/// subinterpreter_scoped_activate without churning PyThreadState objects.
///
/// Lifetime / threading requirements:
///
/// - Construction and destruction must happen on the SAME OS thread (a PyThreadState is bound
/// to the OS thread that created it; deleting it on a different thread is undefined behavior).
/// - The owning subinterpreter must still be alive when this object is destroyed.
/// - This object must NOT be destroyed while a subinterpreter_scoped_activate referring to it is
/// still alive (the activator holds a reference into it).
///
/// Typical usage:
///
/// @code
/// thread_local py::subinterpreter_thread_state ts(sub);
/// {
/// py::subinterpreter_scoped_activate guard(ts); // swap-in only
/// // ... use the subinterpreter ...
/// } // swap-out, tstate kept alive
/// {
/// py::subinterpreter_scoped_activate guard(ts); // reuses the same PyThreadState
/// // ...
/// }
/// @endcode
class subinterpreter_thread_state {
public:
/// Create a PyThreadState for `si` on the calling OS thread. The new state is left in a
/// released state (not current, no GIL acquired).
explicit subinterpreter_thread_state(subinterpreter const &si);
/// Destroy the owned PyThreadState. Must run on the same OS thread that constructed this
/// object, while the owning subinterpreter is still alive, and while no
/// subinterpreter_scoped_activate referring to this object is alive.
~subinterpreter_thread_state();
subinterpreter_thread_state(subinterpreter_thread_state const &) = delete;
subinterpreter_thread_state(subinterpreter_thread_state &&) = delete;
subinterpreter_thread_state &operator=(subinterpreter_thread_state const &) = delete;
subinterpreter_thread_state &operator=(subinterpreter_thread_state &&) = delete;
/// The interpreter this thread state belongs to.
PyInterpreterState *interpreter_state() const { return istate_; }
/// The owned PyThreadState pointer; valid for the lifetime of this object.
PyThreadState *raw_thread_state() const { return tstate_; }
private:
friend class subinterpreter_scoped_activate;
PyThreadState *tstate_ = nullptr;
PyInterpreterState *istate_ = nullptr;
};
class scoped_subinterpreter {
public:
scoped_subinterpreter() : si_(subinterpreter::create()), scope_(si_) {}
@@ -244,6 +322,8 @@ private:
subinterpreter_scoped_activate scope_;
};
// --- subinterpreter_scoped_activate -----------------------------------------------------------
inline subinterpreter_scoped_activate::subinterpreter_scoped_activate(subinterpreter const &si) {
if (!si.istate_) {
pybind11_fail("null subinterpreter");
@@ -267,6 +347,47 @@ inline subinterpreter_scoped_activate::subinterpreter_scoped_activate(subinterpr
detail::get_internals().tstate = tstate_;
}
inline subinterpreter_scoped_activate::subinterpreter_scoped_activate(
subinterpreter_thread_state &ts) {
if (ts.tstate_ == nullptr) {
pybind11_fail("subinterpreter_scoped_activate: empty subinterpreter_thread_state");
}
if (detail::get_interpreter_state_unchecked() == ts.istate_) {
// We are already on this interpreter -- e.g. nested activation, or a different
// PyThreadState for the same interpreter is already current on this thread. Match the
// fast path of the (subinterpreter const&) overload: just ensure the GIL is held. The
// `ts` argument's PyThreadState is intentionally NOT swapped to here; the already-current
// tstate keeps being used until the outer scope exits.
simple_gil_ = true;
gil_state_ = PyGILState_Ensure();
return;
}
#if defined(PYBIND11_DETAILED_ERROR_MESSAGES)
{
// A PyThreadState is bound to its creating OS thread; it may only be activated there.
bool same_thread = true;
# ifdef PY_HAVE_THREAD_NATIVE_ID
same_thread = PyThread_get_thread_native_id() == ts.tstate_->native_thread_id;
# endif
if (!same_thread) {
pybind11_fail("subinterpreter_scoped_activate: a subinterpreter_thread_state must be "
"activated on the same OS thread that constructed it!");
}
}
#endif
tstate_ = ts.tstate_;
borrowed_ = true;
// make the interpreter active and acquire the GIL
old_tstate_ = PyThreadState_Swap(tstate_);
// save this in internals for scoped_gil calls (see also: PR #5870)
detail::get_internals().tstate = tstate_;
}
inline subinterpreter_scoped_activate::~subinterpreter_scoped_activate() {
if (simple_gil_) {
// We were on this interpreter already, so just make sure the GIL goes back as it was
@@ -279,8 +400,12 @@ inline subinterpreter_scoped_activate::~subinterpreter_scoped_activate() {
}
#endif
detail::get_internals().tstate.reset();
-PyThreadState_Clear(tstate_);
-PyThreadState_DeleteCurrent();
+if (!borrowed_) {
+PyThreadState_Clear(tstate_);
+PyThreadState_DeleteCurrent();
+}
+// When borrowed_, tstate_ stays alive in its owning subinterpreter_thread_state for
+// reuse; the PyThreadState_Swap below merely detaches it from this thread.
}
// Go back to the previous interpreter (if any) and acquire THAT gil
@@ -288,4 +413,50 @@ inline subinterpreter_scoped_activate::~subinterpreter_scoped_activate() {
}
}
// --- subinterpreter_thread_state --------------------------------------------------------------
inline subinterpreter_thread_state::subinterpreter_thread_state(subinterpreter const &si) {
if (!si.istate_) {
pybind11_fail("subinterpreter_thread_state: null subinterpreter");
}
istate_ = si.istate_;
// PyThreadState_New does not require holding any GIL and does not make the new state current.
tstate_ = PyThreadState_New(istate_);
if (tstate_ == nullptr) {
pybind11_fail("subinterpreter_thread_state: PyThreadState_New returned null");
}
}
inline subinterpreter_thread_state::~subinterpreter_thread_state() {
if (tstate_ == nullptr) {
return;
}
#if defined(PYBIND11_DETAILED_ERROR_MESSAGES)
{
// A PyThreadState must be cleared and deleted on the OS thread that created it.
bool same_thread = true;
# ifdef PY_HAVE_THREAD_NATIVE_ID
same_thread = PyThread_get_thread_native_id() == tstate_->native_thread_id;
# endif
if (!same_thread) {
pybind11_fail("~subinterpreter_thread_state: must be destroyed on the same OS thread "
"that constructed it!");
}
}
#endif
// The PyThreadState must be made current to be cleared and deleted on the owning OS thread.
// Swap it in (which acquires the subinterpreter's GIL), clear+delete, then restore whatever
// was active before.
PyThreadState *prev = PyThreadState_Swap(tstate_);
PyThreadState_Clear(tstate_);
PyThreadState_DeleteCurrent();
// If `prev` is tstate_ itself, the user destroyed this object while it was active via a
// subinterpreter_scoped_activate -- a contract violation, but be defensive: do NOT swap back
// to a now-deleted pointer. Leaving the thread with no current interpreter is consistent
// with the cached state having just been destroyed.
if (prev != nullptr && prev != tstate_) {
PyThreadState_Swap(prev);
}
}
PYBIND11_NAMESPACE_END(PYBIND11_NAMESPACE)
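
The same-thread checks added in this header compare the calling thread's native
id with the id recorded in the `PyThreadState`. The idea can be sketched in
portable C++ with `std::thread::id`. `ThreadBound` and `rejects` are
hypothetical names for illustration only, and the thrown `std::runtime_error`
merely stands in for a failure report; this is not how pybind11 implements the
check.

```cpp
#include <cassert>
#include <stdexcept>
#include <thread>

// Hypothetical model of an object that may only be used on the OS thread
// that constructed it.
class ThreadBound {
public:
    ThreadBound() : owner_(std::this_thread::get_id()) {}

    // Throws if `caller` is not the constructing thread.
    void require_owner(std::thread::id caller) const {
        if (caller != owner_) {
            throw std::runtime_error("must be used on the constructing OS thread");
        }
    }

private:
    std::thread::id owner_;
};

// Returns true when the check rejects `caller`.
inline bool rejects(const ThreadBound &obj, std::thread::id caller) {
    try {
        obj.require_owner(caller);
        return false;
    } catch (const std::runtime_error &) {
        return true;
    }
}
```

Passing `std::this_thread::get_id()` on the constructing thread is accepted,
while any other id, including a default-constructed `std::thread::id` that
represents no thread, is rejected.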


@@ -153,6 +153,115 @@ TEST_CASE("Move Subinterpreter") {
}
# endif
TEST_CASE("Reused Subinterpreter thread state (single interpreter)") {
PyThreadState *first = nullptr;
PyThreadState *second = nullptr;
PyThreadState *transient_ts = nullptr;
PyThreadState *worker_ts = nullptr;
// The subinterpreter is kept in this enclosing scope so that every
// subinterpreter_thread_state is destroyed first, then the subinterpreter, and only then
// unsafe_reset_internals_for_single_interpreter() runs (after the scope closes).
{
py::subinterpreter sub = py::subinterpreter::create();
{
py::subinterpreter_thread_state ts(sub);
{
py::subinterpreter_scoped_activate guard(ts);
first = PyThreadState_Get();
py::list(py::module_::import("sys").attr("path")).append(py::str("."));
}
{
py::subinterpreter_scoped_activate guard(ts);
second = PyThreadState_Get();
}
// Same OS thread + same subinterpreter_thread_state => the PyThreadState is reused.
REQUIRE(first != nullptr);
REQUIRE(first == second);
// The (subinterpreter const&) ctor does not share with the reusable tstate: while
// `ts` is still alive, a transient activation gets a distinct PyThreadState.
{
py::subinterpreter_scoped_activate guard(sub);
transient_ts = PyThreadState_Get();
}
REQUIRE(transient_ts != first);
// A different OS thread holds its own subinterpreter_thread_state (both alive
// concurrently => distinct PyThreadState pointers).
{
py::gil_scoped_release nogil;
std::thread([&]() {
py::subinterpreter_thread_state worker_ts_owner(sub);
py::subinterpreter_scoped_activate guard(worker_ts_owner);
worker_ts = PyThreadState_Get();
// worker_ts_owner is destroyed at scope exit, on the same OS thread that
// constructed it.
}).join();
}
REQUIRE(worker_ts != nullptr);
REQUIRE(worker_ts != first);
// ts is destructed at the end of this block on this same OS thread (deleting its
// PyThreadState), while `sub` is still alive.
}
// sub is destructed at the end of this block.
}
unsafe_reset_internals_for_single_interpreter();
}
TEST_CASE("Reused Subinterpreter thread state (multiple interpreters)") {
// The core multi-subinterpreter use case: one OS thread alternates between two
// subinterpreters and each PyThreadState is preserved across activations.
PyThreadState *a1 = nullptr;
PyThreadState *a2 = nullptr;
PyThreadState *b1 = nullptr;
PyThreadState *b2 = nullptr;
// Everything is kept in this enclosing scope. Destruction order at the closing brace is
// ts_b, ts_a, sub_b, sub_a -- i.e. each subinterpreter_thread_state is destroyed before its
// subinterpreter -- and unsafe_reset_internals_for_single_interpreter() only runs afterwards.
{
py::subinterpreter sub_a = py::subinterpreter::create();
py::subinterpreter sub_b = py::subinterpreter::create();
py::subinterpreter_thread_state ts_a(sub_a);
py::subinterpreter_thread_state ts_b(sub_b);
{
py::subinterpreter_scoped_activate guard(ts_a);
a1 = PyThreadState_Get();
}
{
py::subinterpreter_scoped_activate guard(ts_b);
b1 = PyThreadState_Get();
}
{
py::subinterpreter_scoped_activate guard(ts_a);
a2 = PyThreadState_Get();
}
{
py::subinterpreter_scoped_activate guard(ts_b);
b2 = PyThreadState_Get();
}
REQUIRE(a1 != nullptr);
REQUIRE(b1 != nullptr);
// Identity is preserved across activations for each interpreter independently.
REQUIRE(a1 == a2);
REQUIRE(b1 == b2);
// And the two interpreters have distinct thread states (both alive => reliable
// comparison).
REQUIRE(a1 != b1);
}
unsafe_reset_internals_for_single_interpreter();
}
TEST_CASE("GIL Subinterpreter") {
PyInterpreterState *main_interp = PyInterpreterState_Get();