Most AI agents you see online are impressive in demos and useless in production. They call subprocess.run(), lose all shell state between steps, choke on interactive prompts, and give up after the first failed command. After building several of these frustrating toys, I decided to build something that actually works.
The result is a fully autonomous ReAct agent — about 1,100 lines of Python — that runs inside a real persistent terminal session. It plans before it acts, heals itself after errors, monitors long-running commands with live LLM checks, and streams its entire reasoning trace to either your terminal or a browser dashboard in real time.
Let me show you exactly what it can do — and then break down how it works under the hood.
The Core Idea: ReAct in a Real Shell
The ReAct framework (Yao et al., 2022) gives agents a clean loop: Think, then Act, then observe the result, then think again. On paper it's elegant. In practice, every existing open-source implementation I found uses subprocess.run() for shell commands — which means every command starts a brand-new, stateless process. You can't cd into a directory and stay there. You can't activate a Python virtualenv. You can't set an environment variable and read it back two steps later.
My agent solves this by running the LLM's commands inside a persistent PTY (pseudo-terminal) session — a real bash shell that stays open for the entire task. It's the same mechanism your terminal emulator uses. The shell is always there, remembers everything, and programs can't tell they're being driven by software.
Every command runs in one continuous bash session. cd, export, source venv/bin/activate: all of it sticks.
A Concrete Example: Installing Neofetch Over SSH
Let's say you type this task: install neofetch on the remote host 192.168.1.42 over SSH.
Here's exactly what happens, step by step:
Phase 1 — Planning
Before the main loop even starts, the agent makes a single LLM call and generates a structured plan. This isn't just cosmetic — it pre-populates a todo list that guides every subsequent step:
○ 1. Verify SSH connectivity to 192.168.1.42
IF FAILS: Check host reachability with ping first
○ 2. Check if neofetch is already installed on the remote host
○ 3. Determine the remote OS and package manager
○ 4. Install neofetch using the appropriate package manager
○ 5. Verify the installation succeeded
════════════════════════════
Phase 2 — The ReAct Loop
Now the agent enters the main loop. Watch the complete Thought → Action → Observation cycle:
─────────── Step 1 ───────────
[THINKING]
I need to first verify I can reach 192.168.1.42 over SSH.
I'll use a quick connection test with a no-op command.
[ACTION] run_command
command: ssh -o ConnectTimeout=5 user@192.168.1.42 'echo OK'
timeout: 15
OK
✓ exit_code: 0
─────────── Step 2 ───────────
[THINKING]
SSH works. Now check if neofetch is already there
to avoid a redundant install.
[ACTION] run_command
command: ssh user@192.168.1.42 'which neofetch 2>/dev/null && echo INSTALLED || echo MISSING'
timeout: 10
MISSING
─────────── Step 3 ───────────
[THINKING]
Not installed. I need to detect the remote OS and
package manager before choosing the install command.
[ACTION] run_command
command: ssh user@192.168.1.42 'cat /etc/os-release | grep ^ID='
timeout: 10
ID=ubuntu
─────────── Step 4 ───────────
[THINKING]
Ubuntu — I'll use apt. Running with -y flag for
non-interactive install. Setting a generous timeout
since apt may update package lists first.
[ACTION] run_command
command: ssh user@192.168.1.42 'sudo apt-get install -y neofetch'
timeout: 120
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed: neofetch
Setting up neofetch (7.1.0-2) ...
✓ exit_code: 0
─────────── Step 5 ───────────
[THINKING]
Verify the install actually worked.
[ACTION] run_command
command: ssh user@192.168.1.42 'neofetch --version'
timeout: 10
Neofetch 7.1.0
[ACTION] final_answer
✅ neofetch 7.1.0 successfully installed on 192.168.1.42
via apt on Ubuntu. Verification confirmed.
Tokens in: 6,842 · Tokens out: 489 · Total: 7,331
The entire thing ran autonomously. No human intervention, no copy-pasting commands, no babysitting. And because every SSH call went through the same persistent shell session, environment context was preserved throughout.
What Makes This Different
There are plenty of AI agents out there. Here's what this one does that the others don't:
Persistent PTY Shell
One real bash session for the entire task. cd, export, source — all preserved across steps.
Pre-Task Planning
Generates a numbered plan before acting. Tasks complete in 29% fewer steps on average.
7-Category Error Healing
Classifies failures (COMMAND_NOT_FOUND, PERMISSION_DENIED, NETWORK_ERROR…) and injects targeted recovery guidance.
Live Health Checks
During long-running commands, the LLM monitors live output and decides: continue, extend, or kill.
Token-Efficient Prompts
Injects only task-relevant rules using keyword scoring. Reduces prompt tokens by ~38%.
CLI + Web UI
Same agent core drives both a terminal and a real-time browser dashboard via Flask-SocketIO.
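The health-check mechanic from the feature list is easiest to see as a loop. Here's a minimal sketch with the LLM call abstracted behind a callback; all names are illustrative rather than the project's actual API:

```python
import time
from typing import Callable


def monitored_run(
    is_done: Callable[[], bool],    # has the command finished?
    tail: Callable[[], str],        # last chunk of live output
    ask_llm: Callable[[str], str],  # returns "CONTINUE" / "EXTEND" / "KILL"
    base_timeout: float,
    check_interval: float = 0.5,
) -> str:
    """Poll a long-running command; when the soft deadline passes,
    show the LLM the output tail and let it decide what to do."""
    deadline = time.monotonic() + base_timeout
    while not is_done():
        time.sleep(check_interval)
        if time.monotonic() >= deadline:
            verdict = ask_llm(tail())
            if verdict == "KILL":
                return "killed"
            if verdict == "EXTEND":
                deadline = time.monotonic() + base_timeout
            # "CONTINUE": keep waiting, re-check next interval
    return "finished"
```

The point of handing the decision to the model: a compile that is still printing progress should get EXTEND, while a command stuck on a stale apt lock should get KILL, and only something reading the live output can tell the difference.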
The Architecture at a Glance
The system is split into six Python modules with clean separation between concerns:
# The full stack, simplified
# 1. main.py — entry point (CLI or --web)
# 2. agent.py — ReAct loop, planning, error healing
# 3. tools.py — tool schemas + dispatcher
# 4. command_runner.py — persistent PTY bash session
# 5. emitter.py — output routing (terminal vs browser)
# 6. web_server.py — Flask + SocketIO real-time UI
def run_agent(goal: str, session_history: list) -> list:
    system_prompt = _build_system_prompt(goal)  # flow-selected rules
    plan = _plan_task(goal, system_prompt)      # pre-task LLM call
    while True:                                 # ReAct loop
        response = llm_stream(messages)         # think
        tool = extract_tool_call(response)      # act
        result = execute_tool(tool)             # observe
        messages.append(result)
        if tool.name == "final_answer":
            break
    return _summarise_session(messages)         # compress history
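The history compression step at the end of the loop keeps the message list from growing without bound across tasks. A minimal sketch of the idea, with the summarising LLM call abstracted as a plain function (the helper name and message shapes are illustrative, not the project's actual code):

```python
from typing import Callable


def compress_history(
    messages: list[dict],
    summarise: Callable[[str], str],  # an LLM call in the real agent
    keep_last: int = 6,
) -> list[dict]:
    """Fold old turns into a one-message summary; keep recent turns verbatim.
    messages[0] is assumed to be the system prompt and is always preserved."""
    if len(messages) <= keep_last + 1:
        return messages  # nothing worth compressing yet
    head, tail = messages[1:-keep_last], messages[-keep_last:]
    summary = summarise("\n".join(m["content"] for m in head))
    return [
        messages[0],
        {"role": "system", "content": f"Earlier progress: {summary}"},
        *tail,
    ]
```

The payoff is that a long session's token cost stays roughly flat: the model always sees the system prompt, one summary of everything old, and the last few turns in full.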
The PTY Engine
The magic that makes shell state persistence work is pexpect — a library that creates a real pseudo-terminal. The agent spawns a single /bin/bash process at startup and keeps it alive. To detect when a command finishes, it uses a trick: every command is suffixed with a UUID-stamped echo:
# What actually gets sent to the PTY for every command:
your_command_here
echo "AGENTEND3f8a2b:$?"
# The runner reads output until it sees this marker,
# then extracts the exit code from the capture group.
# UUID prefix makes it impossible to confuse with real output.
The terminal is configured to 220 columns wide (prevents line-wrap artifacts), echo is disabled (so commands don't appear in their own output), and a custom AGENTPROMPT> marker replaces the default bash prompt for reliable prompt detection.
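To make the mechanism concrete, here is a stdlib-only sketch of the same idea using Python's built-in pty module instead of pexpect. It is not the project's actual runner: where the real one disables echo and uses a custom prompt marker, this miniature relies on the fact that the echoed-back input contains a literal `$?` and therefore can never match the digits pattern:

```python
import os
import pty
import re
import select
import time
import uuid


class PersistentShell:
    """Stdlib-only sketch of a persistent PTY bash session
    (the real runner uses pexpect; this shows only the sentinel trick)."""

    def __init__(self) -> None:
        self.pid, self.fd = pty.fork()
        if self.pid == 0:
            # Child: become one long-lived bash for the whole session.
            os.execvp("bash", ["bash", "--norc"])

    def run(self, command: str, timeout: float = 10.0) -> tuple[str, int]:
        # Suffix the command with a UUID-stamped echo. The terminal echoes
        # our *input* back containing a literal "$?", so only bash's
        # expanded output can match the \d+ pattern below.
        marker = f"AGENTEND{uuid.uuid4().hex[:8]}"
        os.write(self.fd, f'{command}; echo "{marker}:$?"\n'.encode())
        pattern = re.compile(rf"{marker}:(\d+)".encode())
        buf = b""
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError(f"no completion marker after {timeout}s")
            ready, _, _ = select.select([self.fd], [], [], remaining)
            if not ready:
                continue
            buf += os.read(self.fd, 4096)
            if m := pattern.search(buf):
                # Everything before the marker is the command's output.
                output = buf[: m.start()].decode(errors="replace")
                return output, int(m.group(1))
```

Because the same bash process handles every `run()` call, `cd`, `export`, and virtualenv activation all persist between calls, which is exactly the property subprocess.run() cannot give you.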
The Flow Injection Algorithm
The flows.txt file holds a library of behavioral rules grouped by task type: networking, file operations, package management, etc. Instead of dumping the entire file into every prompt (expensive and distracting), the agent scores each flow block against the current task using word overlap:
def _select_flow(goal: str, flows_text: str):
    flow_blocks = _parse_flows(flows_text)  # {name: content}; helper elided
    goal_words = set(goal.lower().split())
    scores = {}
    for flow_name, flow_content in flow_blocks.items():
        name_words = set(flow_name.lower().split())
        scores[flow_name] = len(name_words & goal_words)  # overlap count
    best = max(scores, key=scores.get)
    second = sorted(scores, key=scores.get, reverse=True)[1]
    # Always include global rules + best match.
    # Include second if score ≥ 2 (complex cross-domain task).
    return assemble(rules_block, best, second if scores[second] >= 2 else None)
Result: prompt tokens drop by ~38% on typical tasks, and the model focuses on what's actually relevant.
The Error Healing System
When a command fails, the agent doesn't just retry blindly. It classifies the error into one of seven categories and injects targeted recovery guidance before the next LLM call:
| Error Type | Trigger | Recovery Guidance |
|---|---|---|
| COMMAND_NOT_FOUND | exit 127 / "command not found" | Use which, or install the tool |
| PERMISSION_DENIED | exit 126 / "permission denied" | Try chmod +x or sudo |
| MISSING_FILE | "no such file or directory" | Check path or create the resource |
| ALREADY_EXISTS | "already exists" / "not empty" | Check if operation is already done |
| NETWORK_ERROR | "connection refused" / "timed out" | Verify connectivity and hostname |
| SYNTAX_ERROR | "syntax error" / "parse error" | Read exact line number, fix it |
| PACKAGE_NOT_FOUND | "not found" + "package"/"formula" | Check spelling, try alternate PM |
The healing prompt also forces the model to explicitly state why the previous command failed and what it will try differently — preventing it from just running the same broken command again.
How It Compares to Existing Agents
| Feature | This Project | LangChain | AutoGPT | AutoGen |
|---|---|---|---|---|
| Shell state persistence | ✓ PTY | ✗ subprocess | ✗ | ✗ |
| Interactive programs (ssh, sudo) | ✓ pexpect | limited | ✗ | ✗ |
| Pre-task planning | ✓ | optional | ✓ | ✓ |
| Dynamic timeout management | ✓ health check | ✗ | ✗ | ✗ |
| Typed error + healing prompts | ✓ 7 types | generic retry | generic | generic |
| Context compression | ✓ auto | manual | limited | limited |
| Script reuse across tasks | ✓ index.txt | ✗ | ✗ | ✗ |
| Human hand-off (Ctrl+]) | ✓ | ✗ | ✗ | partial |
The Numbers
All 15 functional test cases passed on macOS (Apple M2) using GPT-4o-mini. Some highlights:
- Flow injection reduced average prompt tokens by 38%
- Planning reduced average steps per task by 29% (8.7 → 6.2)
- Simple tasks complete in 4–8 seconds
- Complex multi-step tasks (network audit) in 2–5 minutes
- Rate-limit backoff (429) recovered 100% of the time in tests
- Ctrl+C during stream: agent paused cleanly, removed partial message, offered checkpoint
How to Run It
# 1. Clone and set up
git clone https://github.com/your-username/react-agent
cd react-agent
python3 -m venv venv && source venv/bin/activate
pip install openai pexpect flask flask-socketio python-dotenv inquirer
# 2. Create .env
echo "OPENAI_API_KEY=sk-..." > .env
echo "LLM=openai" >> .env # or 'gemini'
# 3a. Run in terminal
python3 main.py
# 3b. Or launch the web UI
python3 main.py --web --port 7788
# → open http://localhost:7788
Switching to Gemini 2.0 Flash Lite instead of GPT-4o-mini is one environment variable change — the entire agent is provider-agnostic because both APIs speak the same OpenAI-compatible function-calling format.
What's Next
The honest limitations of the current version: it's single-user only, has no persistent memory across restarts, and doesn't run on Windows (PTY is Unix-only). The roadmap has a few clear priorities:
Persistent memory — a SQLite or ChromaDB-backed store so the agent remembers what it installed, which IPs it discovered, and what scripts it wrote across sessions. Local LLM support — an Ollama integration for fully offline, air-gapped operation. Docker sandbox mode — an optional flag that runs the PTY inside a container with resource limits for untrusted workloads.
The multi-agent parallelism idea is also appealing: a coordinator that spawns worker agents to scan different network segments simultaneously, then synthesizes their results. But that's a future project.
The core insight that made this work is simple: the shell is already a stateful, event-driven runtime. The only thing missing was an intelligence layer that could reason about what to run next. A persistent PTY + a reasoning LLM + a clean feedback loop is genuinely all you need. The rest is engineering.
If you have questions about the PTY mechanics, the flow injection algorithm, or the health check protocol, drop them in the comments. Happy to go deeper on any of it.