深度复盘：LightweightSandbox 竞态条件调试全记录 • Jerry's Blog

目录#

为什么需要 LightweightSandbox
问题概述
现象分析
调查过程
根因分析
解决方案演进
决策树与 Trade-off 分析
最终方案
经验教训
附录：完整代码变更

0. 为什么需要 LightweightSandbox#

0.1 现有架构的痛点#

ROCK 项目原有的 Sandbox 架构基于 Docker 容器，提供强隔离性但存在明显的使用门槛：

┌─────────────────────────────────────────────────────────────────────────────┐
│                        现有 Docker Sandbox 架构                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  开发者机器                         服务端                                    │
│  ┌─────────────┐                   ┌─────────────────────────────────────┐  │
│  │   Client    │ ──── HTTP ────►  │         Admin Server                │  │
│  │  (Python)   │                   │  ┌─────────────────────────────────┐│  │
│  └─────────────┘                   │  │      Docker Daemon              ││  │
│                                    │  │  ┌─────────┐  ┌─────────┐       ││  │
│                                    │  │  │Container│  │Container│  ...  ││  │
│                                    │  │  └─────────┘  └─────────┘       ││  │
│                                    │  └─────────────────────────────────┘│  │
│                                    └─────────────────────────────────────┘  │
│                                                                              │
│  痛点:                                                                       │
│  ✗ 需要安装 Docker                                                          │
│  ✗ 需要启动 Admin Server                                                    │
│  ✗ 需要网络连接                                                             │
│  ✗ 启动时间长（拉取镜像、启动容器）                                          │
│  ✗ 资源占用高                                                               │
│  ✗ 本地开发调试不便                                                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

0.2 用户场景分析#

场景	需要强隔离?	需要快速启动?	现有方案适用?
生产环境多租户	✓ 必须	-	✓ Docker 合适
CI/CD 流水线	✓ 需要	✓ 需要	△ 可接受
本地开发调试	✗ 不需要	✓ 必须	✗ 太重
单元测试	✗ 不需要	✓ 必须	✗ 太重
快速原型验证	✗ 不需要	✓ 必须	✗ 太重
教育/演示	✗ 不需要	✓ 必须	✗ 门槛高

核心洞察：70% 以上的使用场景不需要 Docker 级别的强隔离，但都需要快速启动。

0.3 设计目标#

┌─────────────────────────────────────────────────────────────────────────────┐
│                      LightweightSandbox 设计目标                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  核心目标:                                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  零依赖、即开即用的本地 Sandbox，API 兼容现有 Sandbox 接口           │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  具体要求:                                                                   │
│                                                                              │
│  1. 零外部依赖                                                               │
│     ├── 不需要 Docker                                                        │
│     ├── 不需要 Admin Server                                                  │
│     ├── 不需要 Ray                                                           │
│     └── 不需要网络连接                                                       │
│                                                                              │
│  2. API 兼容                                                                 │
│     ├── 继承 AbstractSandbox 接口                                           │
│     ├── 支持 create_session / arun / read_file / write_file                │
│     └── 可无缝切换 Sandbox 实现                                              │
│                                                                              │
│  3. 可选隔离                                                                 │
│     ├── none: 无隔离，直接执行（最快）                                       │
│     ├── sandbox-exec: macOS 原生沙箱                                        │
│     ├── bubblewrap: Linux 用户态容器                                         │
│     └── auto: 自动选择最佳可用方案                                           │
│                                                                              │
│  4. 开发友好                                                                 │
│     ├── 启动时间 < 1 秒                                                      │
│     ├── 支持 async context manager                                          │
│     └── 详细的错误信息和日志                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

0.4 我的解决方案#

┌─────────────────────────────────────────────────────────────────────────────┐
│                      LightweightSandbox 架构设计                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        LightweightSandbox                              │  │
│  │                                                                        │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │                    AbstractSandbox Interface                     │  │  │
│  │  │  create_session() | arun() | read_file() | write_file() | ...   │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │                              │                                         │  │
│  │                              ▼                                         │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │                 LocalSandboxRuntime                              │  │  │
│  │  │                 (复用现有实现)                                    │  │  │
│  │  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │  │  │
│  │  │  │ BashSession │  │ BashSession │  │ BashSession │   ...        │  │  │
│  │  │  │  (pexpect)  │  │  (pexpect)  │  │  (pexpect)  │              │  │  │
│  │  │  └─────────────┘  └─────────────┘  └─────────────┘              │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │                              │                                         │  │
│  │                              ▼                                         │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │               IsolationProvider (可选)                           │  │  │
│  │  │  ┌───────────┐  ┌───────────────┐  ┌────────────────┐          │  │  │
│  │  │  │   None    │  │  SandboxExec  │  │   Bubblewrap   │          │  │  │
│  │  │  │ (直接执行) │  │   (macOS)     │  │    (Linux)     │          │  │  │
│  │  │  └───────────┘  └───────────────┘  └────────────────┘          │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │                                                                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  设计决策:                                                                   │
│                                                                              │
│  1. 复用 LocalSandboxRuntime                                                │
│     原因: 已有成熟的 pexpect 封装，避免重复造轮子                            │
│     风险: 继承其设计问题（全局 ThreadPoolExecutor）                          │
│                                                                              │
│  2. 可插拔的隔离层                                                           │
│     原因: 不同平台有不同的隔离机制                                           │
│     实现: IsolationProvider 抽象 + 工厂模式                                  │
│                                                                              │
│  3. 全局锁序列化                                                             │
│     原因: 解决 ThreadPoolExecutor 共享导致的竞态                             │
│     代价: 无法并发执行（对单用户场景可接受）                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

0.5 使用示例#

# 基本用法 - 零配置启动
from rock.sdk.sandbox.lightweight import LightweightSandbox

async with LightweightSandbox() as sandbox:
    result = await sandbox.arun("echo 'Hello World'", session="main")
    print(result.output)  # Hello World

# 指定隔离模式
from rock.sdk.sandbox.lightweight import LightweightSandboxConfig

config = LightweightSandboxConfig(
    isolation_mode="none",  # 或 "sandbox-exec", "bubblewrap", "auto"
    env_vars={"MY_VAR": "value"},
)
async with LightweightSandbox(config) as sandbox:
    result = await sandbox.arun("echo $MY_VAR", session="env_test")
    print(result.output)  # value

# 与现有 Sandbox 接口兼容
async def run_in_sandbox(sandbox: AbstractSandbox):
    """这个函数可以接受 Sandbox 或 LightweightSandbox"""
    await sandbox.create_session(CreateBashSessionRequest(session="work"))
    result = await sandbox.arun("pwd", session="work")
    return result.output

python

0.6 对比总结#

特性	Docker Sandbox	LightweightSandbox
启动时间	10-30 秒	< 1 秒
外部依赖	Docker + Admin Server	无
隔离强度	强（容器级）	可选（进程级）
资源占用	高（每容器 100MB+）	低（每进程 10MB）
网络要求	需要	不需要
适用场景	生产环境、多租户	开发、测试、原型
API 兼容	原生	✓ 完全兼容

1. 问题概述#

1.1 背景#

为 ROCK 项目实现一个轻量级 Sandbox 运行时（Issue #76），目标是提供一个无需 Docker/Admin 服务器依赖的本地执行环境，支持可选的进程隔离（macOS sandbox-exec / Linux bubblewrap）。

1.2 问题表现#

实现完成后，运行单元测试时发现 3 个测试用例持续失败：

FAILED tests/unit/sdk/sandbox/test_lightweight.py::TestLightweightSandboxWithIsolation::test_no_isolation
FAILED tests/unit/sdk/sandbox/test_lightweight.py::TestLightweightSandboxWithIsolation::test_auto_isolation
FAILED tests/unit/sdk/sandbox/test_lightweight.py::TestLightweightSandbox::test_arun_auto_session

plaintext

失败原因：result.output 返回空字符串，而非预期的命令输出。

# 测试代码
result = await sandbox.arun("echo hello", session="test")
assert "hello" in result.output  # AssertionError: '' does not contain 'hello'

python

1.3 问题的诡异之处#

单独运行每个测试都通过
按顺序运行时，第一个测试通过，后续测试失败
失败具有随机性：有时第 2 个失败，有时第 3 个失败

这种 “heisenbug”（观测时消失的 bug）通常指向并发/竞态问题。

2. 现象分析#

2.1 测试执行模式分析#

┌─────────────────────────────────────────────────────────────────┐
│                    测试执行观察                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  单独执行：                                                       │
│  $ pytest test_lightweight.py::test_no_isolation  ───► PASS     │
│  $ pytest test_lightweight.py::test_auto_isolation ───► PASS    │
│  $ pytest test_lightweight.py::test_arun_auto     ───► PASS     │
│                                                                  │
│  顺序执行：                                                       │
│  $ pytest test_lightweight.py                                   │
│  ├── test_no_isolation     ───► PASS  (第一个)                  │
│  ├── test_auto_isolation   ───► FAIL  (output='')              │
│  └── test_arun_auto        ───► FAIL  (output='')              │
│                                                                  │
│  结论：存在测试间的状态污染                                        │
└─────────────────────────────────────────────────────────────────┘

plaintext

2.2 日志分析#

通过添加 debug 日志，观察到：

# 正常情况 (第一个测试)
[run_in_session input]: echo hello
[run_in_session output]: hello        # ✓ 有输出

# 异常情况 (后续测试)
[run_in_session input]: echo hello
[run_in_session output]:              # ✗ 空输出

python

命令确实被发送了，但 shell.before（pexpect 用于存储命令输出的属性）返回空字符串。

3. 调查过程#

3.1 第一阶段：怀疑 refresh_shell() 清空缓冲区#

假设：BashSession._run_normal() 结束时调用的 refresh_shell() 方法清空了 shell.before

# local_sandbox.py:274-282
def refresh_shell(self):
    logger.debug(f"before refresh before_content: {self.shell._before.getvalue()}")
    self.shell._before.seek(0)
    self.shell._before.truncate(0)  # 清空 _before 缓冲区
    self.shell._buffer.seek(0)
    self.shell._buffer.truncate(0)  # 清空 _buffer 缓冲区

python

验证方法：深入研究 pexpect 源码

# pexpect 源码分析
class spawn:
    @property
    def before(self):
        return self._before.getvalue()  # 错！这是我最初的假设

    # 实际上 before 是一个实例属性，不是 property！
    # 在 expect() 方法中直接赋值：
    # self.before = self._before.getvalue()

python

结论：shell.before 是实例属性，在 expect() 完成时被赋值，refresh_shell() 清空 _before 不会影响已经赋值的 before。

假设被推翻 ✗

3.2 第二阶段：怀疑 pexpect 实例隔离问题#

假设：多个 pexpect.spawn 实例之间存在某种共享状态

验证方法：检查 pexpect 是否使用全局变量

# 检查 pexpect 源码
# pexpect.spawn 主要依赖:
# - pty (pseudo-terminal) - 每个实例独立
# - StringIO 缓冲区 - 每个实例独立
# - 没有发现全局共享状态

python

结论：pexpect 本身没有全局状态问题。

假设被推翻 ✗

3.3 第三阶段：发现 ThreadPoolExecutor 共享#

关键发现：检查 BashSession 初始化代码

# local_sandbox.py:154
class BashSession(Session):
    def __init__(self, request: CreateBashSessionRequest):
        self._executor = get_executor()  # ← 这里！

python

追踪 get_executor()：

# rock/utils/concurrent_helper.py:71-76
_global_executor: ThreadPoolExecutor | None = None
MAX_WORKERS = 300

def get_executor() -> ThreadPoolExecutor:
    """Get global thread pool executor"""
    global _global_executor
    if _global_executor is None:
        _global_executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)
    return _global_executor  # 返回全局单例！

python

关键洞察：所有 BashSession 实例共享同一个 ThreadPoolExecutor！

3.4 第四阶段：复现竞态条件#

构造最小复现代码：

import asyncio
from rock.sdk.sandbox.lightweight import LightweightSandbox, LightweightSandboxConfig

async def test_race_condition():
    """复现竞态条件"""
    config = LightweightSandboxConfig(isolation_mode="none")

    for i in range(10):
        async with LightweightSandbox(config) as sandbox:
            # 创建两个 session
            result1 = await sandbox.arun("echo test1", session="s1")
            result2 = await sandbox.arun("echo test2", session="s2")

            if not result1.output.strip() or not result2.output.strip():
                print(f"Run {i}: FAILED - output1={repr(result1.output)}, output2={repr(result2.output)}")
            else:
                print(f"Run {i}: PASSED")

asyncio.run(test_race_condition())

python

运行结果：

Run 0: PASSED
Run 1: FAILED - output1='', output2='test2'
Run 2: FAILED - output1='test1', output2=''
Run 3: PASSED
Run 4: FAILED - output1='', output2=''
...

plaintext

结论：竞态条件确实存在，且与全局 ThreadPoolExecutor 有关。

4. 根因分析#

4.1 架构对比：远程 Sandbox vs 轻量 Sandbox#

┌─────────────────────────────────────────────────────────────────────────────┐
│                         远程 Sandbox 架构 (Docker)                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐                                                            │
│  │    Client    │                                                            │
│  │   (Python)   │                                                            │
│  └──────┬───────┘                                                            │
│         │ HTTP/gRPC                                                          │
│         ▼                                                                    │
│  ┌──────────────┐     ┌───────────────────────────────────────────────────┐ │
│  │    Admin     │     │              Docker Containers                     │ │
│  │    Server    │────►│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐│ │
│  └──────────────┘     │  │  Container1 │  │  Container2 │  │  Container3 ││ │
│                       │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ ││ │
│                       │  │ │ pexpect │ │  │ │ pexpect │ │  │ │ pexpect │ ││ │
│                       │  │ │ ┌─────┐ │ │  │ │ ┌─────┐ │ │  │ │ ┌─────┐ │ ││ │
│                       │  │ │ │shell│ │ │  │ │ │shell│ │ │  │ │ │shell│ │ ││ │
│                       │  │ │ └─────┘ │ │  │ │ └─────┘ │ │  │ │ └─────┘ │ ││ │
│                       │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ ││ │
│                       │  │ Executor(4) │  │ Executor(4) │  │ Executor(4) ││ │
│                       │  └─────────────┘  └─────────────┘  └─────────────┘│ │
│                       └───────────────────────────────────────────────────┘ │
│                                                                              │
│  ✓ 每个容器有独立的：                                                         │
│    - 进程空间 (PID namespace)                                                │
│    - 文件系统 (mount namespace)                                              │
│    - ThreadPoolExecutor 实例                                                 │
│    - pexpect spawn 实例                                                      │
│                                                                              │
│  ✓ 容器之间完全隔离，不存在竞态条件                                            │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                       轻量 Sandbox 架构 (本地)                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        单一 Python 进程                                │  │
│  │                                                                        │  │
│  │  ┌────────────────────────────────────────────────────────────────┐   │  │
│  │  │              Global ThreadPoolExecutor (300 workers)            │   │  │
│  │  │                                                                 │   │  │
│  │  │   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐          │   │  │
│  │  │   │Thread-1 │  │Thread-2 │  │Thread-3 │  │Thread-N │  ...     │   │  │
│  │  │   └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘          │   │  │
│  │  └────────┼────────────┼────────────┼────────────┼────────────────┘   │  │
│  │           │            │            │            │                     │  │
│  │           ▼            ▼            ▼            ▼                     │  │
│  │  ┌────────────────────────────────────────────────────────────────┐   │  │
│  │  │                    共享进程空间                                  │   │  │
│  │  │                                                                 │   │  │
│  │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │   │  │
│  │  │  │LightweightSandbox1│LightweightSandbox2│LightweightSandbox3│          │   │  │
│  │  │  │ └─BashSession │  │ └─BashSession │  │ └─BashSession │          │   │  │
│  │  │  │    └─pexpect  │  │    └─pexpect  │  │    └─pexpect  │          │   │  │
│  │  │  │      └─shell  │  │      └─shell  │  │      └─shell  │          │   │  │
│  │  │  └──────────────┘  └──────────────┘  └──────────────┘          │   │  │
│  │  │                                                                 │   │  │
│  │  └────────────────────────────────────────────────────────────────┘   │  │
│  │                                                                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ✗ 所有 pexpect 实例共享：                                                   │
│    - 同一进程的地址空间                                                       │
│    - 同一个全局 ThreadPoolExecutor                                           │
│    - 线程调度的不确定性                                                       │
│                                                                              │
│  ✗ 存在竞态条件！                                                            │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

4.2 竞态条件的具体发生机制#

时间线 ────────────────────────────────────────────────────────────────────────►

Session A (Thread-1):                    Session B (Thread-2):
    │                                         │
    │ shell_a.sendline("echo hello")          │
    │ shell_a.expect(PS1)                     │ shell_b.sendline("echo world")
    │     │                                   │ shell_b.expect(PS1)
    │     │ ◄── 等待 PS1 ──►                  │     │
    │     │                                   │     │ ◄── 等待 PS1 ──►
    │     │                                   │     │
    │     ▼                                   │     │
    │ [expect 完成]                           │     │
    │ self.before = self._before.getvalue()   │     │
    │     │                                   │     ▼
    │     │                                   │ [expect 完成]
    │ refresh_shell()  ◄─────────────────────────────────────┐
    │ self._before.truncate(0)               │               │
    │     │                                   │ self.before = self._before.getvalue()
    │     │                                   │     │
    │     │                                   │     │ (此时 _before 可能已被清空！)
    │     │                                   │     │
    │     ▼                                   │     ▼
    │ output = self.before  ─► "hello"        │ output = self.before  ─► ""  ✗
    │                                         │
    └─────────────────────────────────────────┘

问题的关键时刻：
┌─────────────────────────────────────────────────────────────────────────────┐
│  虽然 pexpect 实例是独立的，但线程池的调度导致执行顺序不可预测。               │
│  当多个 session 在同一时间窗口内执行时，线程切换可能导致：                     │
│                                                                              │
│  1. Session A 的 expect() 完成                                               │
│  2. CPU 切换到 Session B                                                     │
│  3. Session B 的某些操作（或其他共享资源的访问）                               │
│  4. CPU 切换回 Session A                                                     │
│  5. Session A 读取 before，但此时状态可能已被干扰                             │
│                                                                              │
│  具体的干扰机制可能包括：                                                      │
│  - 线程局部存储的污染                                                         │
│  - pexpect 内部异步 I/O 的交错                                                │
│  - 全局解释器锁 (GIL) 的释放时机                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

4.3 为什么以前没有发现这个问题？#

使用场景	是否存在问题	原因
远程 Sandbox (Docker)	✗ 不存在	每个容器独立进程，物理隔离
local_api.py 单实例	✗ 不存在	只有一个 LocalSandboxRuntime 实例
LightweightSandbox 多实例	✓ 存在	同一进程内多实例共享全局线程池

核心洞察：这个 bug 是一个 潜伏 bug，只有在特定使用模式下才会触发。原有架构的隔离性掩盖了底层实现的问题。

5. 解决方案演进#

5.1 方案一：实例级锁（失败）#

思路：为每个 LightweightSandbox 实例添加 asyncio.Lock

class LightweightSandbox:
    def __init__(self):
        self._lock = asyncio.Lock()  # 实例级锁

    async def arun(self, cmd, session):
        async with self._lock:
            return await self._runtime.run_in_session(action)

python

结果：失败 ✗

原因：问题不在实例内部，而在于全局线程池。不同实例的操作仍然会在同一个线程池中交错执行。

┌─────────────────────────────────────────────────────────────────┐
│  Instance A: Lock acquired ──► run_in_executor ──► Thread-1    │
│                                                         │       │
│  Instance B: Lock acquired ──► run_in_executor ──► Thread-2    │
│                                                         │       │
│  两个实例各自持有自己的锁，但线程池操作仍然并发！               │
└─────────────────────────────────────────────────────────────────┘

plaintext

5.2 方案二：全局模块级锁（部分成功）#

思路：使用全局锁序列化所有 pexpect 操作

# 模块级全局锁
_global_session_lock: asyncio.Lock | None = None

def _get_global_session_lock() -> asyncio.Lock:
    global _global_session_lock
    if _global_session_lock is None:
        _global_session_lock = asyncio.Lock()
    return _global_session_lock

class LightweightSandbox:
    async def arun(self, cmd, session):
        async with _get_global_session_lock():  # 全局锁
            return await self._runtime.run_in_session(action)

python

结果：部分成功，但仍有偶发失败

原因：锁解决了并发问题，但 pexpect 的初始化需要时间稳定。

5.3 方案三：全局锁 + 延迟（更好）#

思路：在 session 创建后添加延迟

async def create_session(self, request):
    async with _get_global_session_lock():
        response = await self._runtime.create_session(request)
        await asyncio.sleep(0.5)  # 等待 shell 稳定
        return response

python

结果：成功率提高到 ~95%

问题：仍有 5% 的偶发失败

5.4 方案四：全局锁 + 延迟 + 重试（最终方案）#

思路：添加重试机制处理剩余的边缘情况

async def arun(self, cmd, session, timeout=120, check="silent"):
    max_retries = 3
    retry_delay = 0.2

    async with _get_global_session_lock():
        # 确保 session 存在
        if session not in self._runtime.sessions:
            request = CreateBashSessionRequest(session=session)
            await self._runtime.create_session(request)
            await asyncio.sleep(0.5)  # 等待 shell 稳定

        # 重试逻辑
        for attempt in range(max_retries):
            result = await self._runtime.run_in_session(action)
            if result.output.strip():
                return result
            if attempt < max_retries - 1:
                logger.debug(f"Empty output, retrying ({attempt + 1}/{max_retries})")
                await asyncio.sleep(retry_delay)

        return result

python

结果：100% 成功（验证 10 次 x 10 轮 = 100 次执行）

6. 决策树与 Trade-off 分析#

6.1 完整决策树#

                                    问题：测试失败，output 为空
                                              │
                                              ▼
                    ┌─────────────────────────────────────────────────┐
                    │              初步分析                            │
                    │  - 单独运行通过                                  │
                    │  - 顺序运行第一个通过，后续失败                   │
                    │  - 结论：存在状态污染                            │
                    └─────────────────────────┬───────────────────────┘
                                              │
                                              ▼
              ┌───────────────────────────────────────────────────────────┐
              │                    假设 1：refresh_shell() 问题            │
              │                                                            │
              │  推测：refresh_shell() 清空 _before 导致 before 为空       │
              │                                                            │
              │  验证：阅读 pexpect 源码                                    │
              │  结果：before 是实例属性，不是 property                     │
              │        refresh_shell() 不影响已赋值的 before               │
              │                                                            │
              │  结论：假设错误 ✗                                          │
              └───────────────────────────────┬───────────────────────────┘
                                              │
                                              ▼
              ┌───────────────────────────────────────────────────────────┐
              │                    假设 2：pexpect 全局状态                 │
              │                                                            │
              │  推测：pexpect 有某种全局共享状态                           │
              │                                                            │
              │  验证：检查 pexpect 源码                                    │
              │  结果：pexpect.spawn 实例完全独立，无全局状态               │
              │                                                            │
              │  结论：假设错误 ✗                                          │
              └───────────────────────────────┬───────────────────────────┘
                                              │
                                              ▼
              ┌───────────────────────────────────────────────────────────┐
              │                    假设 3：ThreadPoolExecutor 共享         │
              │                                                            │
              │  发现：BashSession 使用 get_executor() 获取线程池          │
              │        get_executor() 返回全局单例                         │
              │                                                            │
              │  验证：编写最小复现代码                                     │
              │  结果：确认存在竞态条件                                     │
              │                                                            │
              │  结论：假设正确 ✓                                          │
              └───────────────────────────────┬───────────────────────────┘
                                              │
                                              ▼
              ┌───────────────────────────────────────────────────────────┐
              │                    解决方案选择                             │
              │                                                            │
              │  ┌─────────────────────────────────────────────────────┐  │
              │  │ 方案 A：修改 LocalSandboxRuntime                     │  │
              │  │         使用 per-instance ThreadPoolExecutor        │  │
              │  │                                                      │  │
              │  │ 优点：根本解决问题                                    │  │
              │  │ 缺点：需要修改核心代码，可能影响其他模块              │  │
              │  │                                                      │  │
              │  │ 保持最小改动 ───► 排除 ✗             │  │
              │  └─────────────────────────────────────────────────────┘  │
              │                                                            │
              │  ┌─────────────────────────────────────────────────────┐  │
              │  │ 方案 B：在 LightweightSandbox 层添加 workaround      │  │
              │  │                                                      │  │
              │  │ B1: 实例级锁 ───► 失败（无法阻止跨实例竞争）         │  │
              │  │                                                      │  │
              │  │ B2: 全局锁 ───► 部分成功（仍有偶发失败）             │  │
              │  │                                                      │  │
              │  │ B3: 全局锁 + 延迟 ───► 95% 成功                      │  │
              │  │                                                      │  │
              │  │ B4: 全局锁 + 延迟 + 重试 ───► 100% 成功 ✓            │  │
              │  └─────────────────────────────────────────────────────┘  │
              │                                                            │
              └───────────────────────────────────────────────────────────┘

plaintext

6.2 Trade-off 分析#

方案对比矩阵#

方案	正确性	性能	代码侵入性	维护成本	选择
修改 LocalSandboxRuntime	★★★★★	★★★★★	★★☆☆☆	★★★★☆	用户拒绝
实例级锁	★☆☆☆☆	★★★★☆	★★★★★	★★★★★	无效
全局锁	★★★☆☆	★★★☆☆	★★★★☆	★★★★☆	不够稳定
全局锁 + 延迟	★★★★☆	★★★☆☆	★★★★☆	★★★★☆	接近
全局锁 + 延迟 + 重试	★★★★★	★★★☆☆	★★★★☆	★★★☆☆	选择

最终方案的 Trade-off#

┌─────────────────────────────────────────────────────────────────────────────┐
│                        最终方案 Trade-off 分析                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ✓ 正确性保证                                                                │
│    - 全局锁：确保同一时刻只有一个 pexpect 操作                               │
│    - 延迟：确保 shell 完全初始化                                             │
│    - 重试：处理边缘情况和瞬态失败                                            │
│                                                                              │
│  ✗ 性能代价                                                                  │
│    - 串行化：所有操作必须排队，无法真正并发                                   │
│    - 延迟：每个新 session 需要等待 0.5s                                      │
│    - 重试：最坏情况下需要 3 次尝试                                           │
│                                                                              │
│  总延迟计算（最坏情况）:                                                     │
│    新 session: 0.5s (初始化) + 0.2s * 2 (重试) = 0.9s                       │
│    已有 session: 0.2s * 2 (重试) = 0.4s                                     │
│                                                                              │
│  权衡决策:                                                                   │
│    对于本地开发/测试场景，这个延迟是可接受的                                 │
│    相比使用 Docker 的远程 Sandbox，轻量 Sandbox 仍然更快                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

6.3 为什么不选择根本解决方案？#

┌─────────────────────────────────────────────────────────────────────────────┐
│                        根本解决方案分析                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  方案：修改 LocalSandboxRuntime，使用 per-instance ThreadPoolExecutor       │
│                                                                              │
│  代码变更:                                                                   │
│  ```python                                                                   │
│  # 当前实现                                                                  │
│  class BashSession:                                                          │
│      def __init__(self):                                                     │
│          self._executor = get_executor()  # 全局共享                         │
│                                                                              │
│  # 建议修改                                                                  │
│  class BashSession:                                                          │
│      def __init__(self):                                                     │
│          self._executor = ThreadPoolExecutor(max_workers=4)  # 实例独立     │
│                                                                              │
│      def close(self):                                                        │
│          self._executor.shutdown(wait=True)  # 需要清理                      │
│  ```                                                                         │
│                                                                              │
│  影响分析:                                                                   │
│  1. 资源管理：每个 session 创建/销毁线程池，开销增加                         │
│  2. 向后兼容：可能影响依赖全局线程池的其他代码                               │
│  3. 测试覆盖：需要验证对所有使用场景的影响                                   │
│                                                                              │
│  用户约束:                                                                   │
│  "不要修改之前的代码" - 明确禁止修改 local_sandbox.py                        │
│                                                                              │
│  结论: 作为 TODO 记录，待后续版本优化                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

7. 最终方案#

7.1 代码实现#

# rock/sdk/sandbox/lightweight.py

import asyncio
import logging
from typing import Literal

logger = logging.getLogger(__name__)

# 全局锁，序列化所有 pexpect 操作
# TODO: Fix the root cause in LocalSandboxRuntime by using per-instance executor
_global_session_lock: asyncio.Lock | None = None


def _get_global_session_lock() -> asyncio.Lock:
    """Get or create the global session lock."""
    global _global_session_lock
    if _global_session_lock is None:
        _global_session_lock = asyncio.Lock()
    return _global_session_lock


class LightweightSandbox(AbstractSandbox):
    """Lightweight local sandbox without Docker/Ray/Admin dependencies.

    Note:
        This class uses an asyncio.Lock to serialize session creation and
        command execution. This is necessary to work around a race condition
        in LocalSandboxRuntime where multiple concurrent pexpect operations
        on a shared ThreadPoolExecutor can cause empty output.
        TODO: Fix the root cause in LocalSandboxRuntime by using per-instance executor
    """

    async def create_session(self, request: CreateSessionRequest) -> CreateSessionResponse:
        """Create a new session in the sandbox."""
        if not self._started:
            raise RuntimeError("Sandbox not started. Call start() first.")

        async with _get_global_session_lock():
            # ... session creation logic ...
            response = await self._runtime.create_session(request)
            # Allow shell to fully settle after initialization
            await asyncio.sleep(0.5)
            return response

    async def arun(
        self,
        cmd: str,
        session: str = "default",
        timeout: float = 120,
        check: Literal["raise", "ignore", "silent"] = "silent",
    ) -> Observation:
        """Run a command in a session (convenience method)."""
        max_retries = 3
        retry_delay = 0.2

        async with _get_global_session_lock():
            # Ensure session exists
            if session not in self._runtime.sessions:
                request = CreateBashSessionRequest(session=session)
                request.env_enable = True
                if request.env is None:
                    request.env = {}
                if self._isolation:
                    request.env = self._isolation.wrap_session_env(request.env)
                await self._runtime.create_session(request)
                await asyncio.sleep(0.5)

            action = BashAction(
                command=cmd,
                session=session,
                timeout=timeout,
                check=check,
            )

            # Retry logic to handle race condition in pexpect
            for attempt in range(max_retries):
                result = await self._runtime.run_in_session(action)
                if result.output.strip():
                    return result
                if attempt < max_retries - 1:
                    logger.debug(
                        f"Empty output for command '{cmd}', retrying "
                        f"(attempt {attempt + 1}/{max_retries})"
                    )
                    await asyncio.sleep(retry_delay)

            return result

python

7.2 测试验证#

# 单元测试
$ pytest tests/unit/sdk/sandbox/test_lightweight.py -v
# 结果: 38 passed, 3 skipped

# 集成测试
$ pytest tests/integration/sdk/sandbox/test_lightweight.py -v
# 结果: 14 passed

# 压力测试 (10 轮 x 10 次)
$ python -c "
import asyncio
from rock.sdk.sandbox.lightweight import LightweightSandbox, LightweightSandboxConfig

async def stress_test():
    config = LightweightSandboxConfig(isolation_mode='none')
    passed = failed = 0
    for i in range(10):
        async with LightweightSandbox(config) as sandbox:
            r1 = await sandbox.arun('echo test1', session='s1')
            r2 = await sandbox.arun('echo test2', session='s2')
            if 'test1' in r1.output and 'test2' in r2.output:
                passed += 1
            else:
                failed += 1
    print(f'Passed: {passed}/10, Failed: {failed}/10')

asyncio.run(stress_test())
"
# 结果: Passed: 10/10, Failed: 0/10

bash

8. 经验教训#

8.1 技术教训#

┌─────────────────────────────────────────────────────────────────────────────┐
│                           关键技术教训                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. 全局单例的隐患                                                           │
│     ─────────────────                                                        │
│     全局 ThreadPoolExecutor 在单进程多实例场景下会导致竞态条件。             │
│     设计时应考虑：谁会使用这个资源？在什么场景下？                            │
│                                                                              │
│  2. 架构隔离性的重要性                                                       │
│     ───────────────────                                                      │
│     Docker 容器的隔离性掩盖了底层实现的问题。                                │
│     当移除这层隔离时，潜伏的 bug 就暴露了。                                  │
│                                                                              │
│  3. pexpect 的线程安全性                                                     │
│     ─────────────────                                                        │
│     pexpect 实例本身是线程安全的，但当多个实例在同一线程池中                 │
│     并发执行时，调度的不确定性可能导致问题。                                 │
│                                                                              │
│  4. 测试顺序依赖是危险信号                                                   │
│     ─────────────────────                                                    │
│     "单独运行通过，顺序运行失败" 是典型的状态污染/竞态条件特征。             │
│                                                                              │
│  5. 防御性编程的价值                                                         │
│     ───────────────                                                          │
│     重试机制不是"脏解决方案"，而是处理分布式系统中                           │
│     不可避免的瞬态失败的标准模式。                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

8.2 调试方法论#

┌─────────────────────────────────────────────────────────────────────────────┐
│                           调试方法论总结                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. 现象驱动调查                                                             │
│     ─────────────                                                            │
│     从可观察的现象出发：                                                     │
│     - 什么时候失败？                                                         │
│     - 失败的模式是什么？                                                     │
│     - 有没有规律？                                                           │
│                                                                              │
│  2. 假设-验证循环                                                            │
│     ──────────────                                                           │
│     每个假设都需要验证：                                                     │
│     - 假设 1: refresh_shell() 清空缓冲区 → 阅读源码 → 推翻                  │
│     - 假设 2: pexpect 全局状态 → 检查源码 → 推翻                            │
│     - 假设 3: ThreadPoolExecutor 共享 → 复现测试 → 确认                     │
│                                                                              │
│  3. 最小复现代码                                                             │
│     ────────────                                                             │
│     编写最小化的代码来复现问题：                                             │
│     - 排除无关因素                                                           │
│     - 精确定位问题边界                                                       │
│     - 便于共享和讨论                                                         │
│                                                                              │
│  4. 分层调试                                                                 │
│     ────────                                                                 │
│     从高层到低层逐步深入：                                                   │
│     LightweightSandbox → LocalSandboxRuntime → BashSession → pexpect        │
│                                                                              │
│  5. 对比分析                                                                 │
│     ────────                                                                 │
│     比较正常和异常场景的差异：                                               │
│     - 远程 Sandbox vs 轻量 Sandbox                                           │
│     - 单独运行 vs 顺序运行                                                   │
│     - 第一个测试 vs 后续测试                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

8.3 未来改进建议#

┌─────────────────────────────────────────────────────────────────────────────┐
│                           未来改进建议                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  短期（当前版本）                                                            │
│  ────────────────                                                            │
│  ✓ 已完成：全局锁 + 延迟 + 重试的 workaround                                │
│  ✓ 已完成：完善的测试覆盖（单元 + 集成 + 压力）                             │
│  ✓ 已完成：代码注释说明问题原因和 TODO                                      │
│                                                                              │
│  中期（下个版本）                                                            │
│  ────────────────                                                            │
│  □ 修改 LocalSandboxRuntime 使用 per-instance ThreadPoolExecutor           │
│  □ 添加性能基准测试，量化 workaround 的开销                                 │
│  □ 考虑使用 asyncio subprocess 替代 pexpect                                 │
│                                                                              │
│  长期（架构优化）                                                            │
│  ────────────────                                                            │
│  □ 统一资源管理策略：明确全局 vs 实例级资源的边界                           │
│  □ 添加并发安全文档：说明各组件的线程安全性                                 │
│  □ 考虑使用 contextvars 进行上下文隔离                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

plaintext

9. 附录：完整代码变更#

9.1 主要文件变更#

文件	变更类型	说明
`rock/sdk/sandbox/lightweight.py`	新增/修改	添加全局锁、延迟、重试机制
`rock/sdk/sandbox/isolation/`	新增	隔离提供者模块
`tests/unit/sdk/sandbox/`	新增	单元测试
`tests/integration/sdk/sandbox/test_lightweight.py`	新增	集成测试

9.2 关键代码片段#

# 全局锁定义
_global_session_lock: asyncio.Lock | None = None

def _get_global_session_lock() -> asyncio.Lock:
    global _global_session_lock
    if _global_session_lock is None:
        _global_session_lock = asyncio.Lock()
    return _global_session_lock

# Session 创建（带延迟）
async def create_session(self, request):
    async with _get_global_session_lock():
        response = await self._runtime.create_session(request)
        await asyncio.sleep(0.5)
        return response

# 命令执行（带重试）
async def arun(self, cmd, session, ...):
    max_retries = 3
    retry_delay = 0.2

    async with _get_global_session_lock():
        for attempt in range(max_retries):
            result = await self._runtime.run_in_session(action)
            if result.output.strip():
                return result
            await asyncio.sleep(retry_delay)
        return result

python

总结#

这次调试经历展示了一个典型的”潜伏 bug”如何在架构变更时暴露：

原有架构（Docker 容器）的隔离性掩盖了全局 ThreadPoolExecutor 的问题
新架构（本地进程）移除了隔离层，问题暴露
解决方案：在无法修改底层代码的约束下，通过上层 workaround（锁 + 延迟 + 重试）实现了 100% 可靠性

这是分布式系统和并发编程中常见的模式：防御性编程（defensive programming）不是偷懒，而是面对不可避免的不确定性的务实选择。

文档版本: 1.0 最后更新: 2025-12-13