Most AI agents you see online are impressive in demos and useless in production. They call subprocess.run(), lose all shell state between steps, choke on interactive prompts, and give up after the first failed command. After building several of these frustrating toys, I decided to build something that actually works.
The result is a fully autonomous ReAct agent — about 1,100 lines of Python — that runs inside a real persistent terminal session. It plans before it acts, heals itself after errors, monitors long-running commands with live LLM checks, and streams its entire reasoning trace to either your terminal or a browser dashboard in real time.
Let me show you exactly what it can do — and then break down how it works under the hood.
The Core Idea: ReAct in a Real Shell
The ReAct framework (Yao et al., 2022) gives agents a clean loop: Think, then Act, then observe the result, then think again. On paper it's elegant. In practice, every existing open-source implementation I found uses subprocess.run() for shell commands — which means every command starts a brand-new, stateless process. You can't cd into a directory and stay there. You can't activate a Python virtualenv. You can't set an environment variable and read it back two steps later.
My agent solves this by running the LLM's commands inside a persistent PTY (pseudo-terminal) session — a real bash shell that stays open for the entire task. It's the same mechanism your terminal emulator uses. The shell is always there, remembers everything, and programs can't tell they're being driven by software.
Every command runs in one continuous bash session. cd, export, source venv/bin/activate: all of it sticks.
A Concrete Example: Installing Neofetch Over SSH
Let's say you type this task: install neofetch on the remote host 192.168.1.42 over SSH.
Here's exactly what happens, step by step:
Phase 1 — Planning
Before the main loop even starts, the agent makes a single LLM call and generates a structured plan. This isn't just cosmetic — it pre-populates a todo list that guides every subsequent step:
○ 1. Verify SSH connectivity to 192.168.1.42
IF FAILS: Check host reachability with ping first
○ 2. Check if neofetch is already installed on the remote host
○ 3. Determine the remote OS and package manager
○ 4. Install neofetch using the appropriate package manager
○ 5. Verify the installation succeeded
════════════════════════════
Phase 2 — The ReAct Loop
Now the agent enters the main loop. Watch the complete Thought → Action → Observation cycle:
─────────── Step 1 ───────────
[THINKING]
I need to first verify I can reach 192.168.1.42 over SSH.
I'll use a quick connection test with a no-op command.
[ACTION] run_command
command: ssh -o ConnectTimeout=5 user@192.168.1.42 'echo OK'
timeout: 15
OK
✓ exit_code: 0
─────────── Step 2 ───────────
[THINKING]
SSH works. Now check if neofetch is already there
to avoid a redundant install.
[ACTION] run_command
command: ssh user@192.168.1.42 'which neofetch 2>/dev/null && echo INSTALLED || echo MISSING'
timeout: 10
MISSING
─────────── Step 3 ───────────
[THINKING]
Not installed. I need to detect the remote OS and
package manager before choosing the install command.
[ACTION] run_command
command: ssh user@192.168.1.42 'cat /etc/os-release | grep ^ID='
timeout: 10
ID=ubuntu
─────────── Step 4 ───────────
[THINKING]
Ubuntu — I'll use apt. Running with -y flag for
non-interactive install. Setting a generous timeout
since apt may update package lists first.
[ACTION] run_command
command: ssh user@192.168.1.42 'sudo apt-get install -y neofetch'
timeout: 120
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed: neofetch
Setting up neofetch (7.1.0-2) ...
✓ exit_code: 0
─────────── Step 5 ───────────
[THINKING]
Verify the install actually worked.
[ACTION] run_command
command: ssh user@192.168.1.42 'neofetch --version'
timeout: 10
Neofetch 7.1.0
[ACTION] final_answer
✅ neofetch 7.1.0 successfully installed on 192.168.1.42
via apt on Ubuntu. Verification confirmed.
Tokens in: 6,842 · Tokens out: 489 · Total: 7,331
The entire thing ran autonomously. No human intervention, no copy-pasting commands, no babysitting. And because every SSH call went through the same persistent shell session, environment context was preserved throughout.
What Makes This Different
There are plenty of AI agents out there. Here's what this one does that the others don't:
Persistent PTY Shell
One real bash session for the entire task. cd, export, source — all preserved across steps.
Pre-Task Planning
Generates a numbered plan before acting. Tasks complete in 29% fewer steps on average.
7-Category Error Healing
Classifies failures (COMMAND_NOT_FOUND, PERMISSION_DENIED, NETWORK_ERROR…) and injects targeted recovery guidance.
Live Health Checks
During long-running commands, the LLM monitors live output and decides: continue, extend, or kill.
Token-Efficient Prompts
Injects only task-relevant rules using keyword scoring. Reduces prompt tokens by ~38%.
CLI + Web UI
Same agent core drives both a terminal and a real-time browser dashboard via Flask-SocketIO.
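The health-check mechanic from the feature list is easiest to see as a loop. Here's a minimal sketch with the LLM call abstracted behind a callback; all names are illustrative rather than the project's actual API:

```python
import time
from typing import Callable


def monitored_run(
    is_done: Callable[[], bool],    # has the command finished?
    tail: Callable[[], str],        # last chunk of live output
    ask_llm: Callable[[str], str],  # returns "CONTINUE" / "EXTEND" / "KILL"
    base_timeout: float,
    check_interval: float = 0.5,
) -> str:
    """Poll a long-running command; when the soft deadline passes,
    show the LLM the output tail and let it decide what to do."""
    deadline = time.monotonic() + base_timeout
    while not is_done():
        time.sleep(check_interval)
        if time.monotonic() >= deadline:
            verdict = ask_llm(tail())
            if verdict == "KILL":
                return "killed"
            if verdict == "EXTEND":
                deadline = time.monotonic() + base_timeout
            # "CONTINUE": keep waiting, re-check next interval
    return "finished"
```

The point of handing the decision to the model: a compile that is still printing progress should get EXTEND, while a command stuck on a stale apt lock should get KILL, and only something reading the live output can tell the difference.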
The Architecture at a Glance
The system is split into six Python modules with clean separation between concerns:
# The full stack, simplified
# 1. main.py — entry point (CLI or --web)
# 2. agent.py — ReAct loop, planning, error healing
# 3. tools.py — tool schemas + dispatcher
# 4. command_runner.py — persistent PTY bash session
# 5. emitter.py — output routing (terminal vs browser)
# 6. web_server.py — Flask + SocketIO real-time UI
def run_agent(goal: str, session_history: list) -> list:
    system_prompt = _build_system_prompt(goal)  # flow-selected rules
    plan = _plan_task(goal, system_prompt)      # pre-task LLM call
    while True:                                 # ReAct loop
        response = llm_stream(messages)         # think
        tool = extract_tool_call(response)      # act
        result = execute_tool(tool)             # observe
        messages.append(result)
        if tool.name == "final_answer":
            break
    return _summarise_session(messages)         # compress history
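The history compression step at the end of the loop keeps the message list from growing without bound across tasks. A minimal sketch of the idea, with the summarising LLM call abstracted as a plain function (the helper name and message shapes are illustrative, not the project's actual code):

```python
from typing import Callable


def compress_history(
    messages: list[dict],
    summarise: Callable[[str], str],  # an LLM call in the real agent
    keep_last: int = 6,
) -> list[dict]:
    """Fold old turns into a one-message summary; keep recent turns verbatim.
    messages[0] is assumed to be the system prompt and is always preserved."""
    if len(messages) <= keep_last + 1:
        return messages  # nothing worth compressing yet
    head, tail = messages[1:-keep_last], messages[-keep_last:]
    summary = summarise("\n".join(m["content"] for m in head))
    return [
        messages[0],
        {"role": "system", "content": f"Earlier progress: {summary}"},
        *tail,
    ]
```

The payoff is that a long session's token cost stays roughly flat: the model always sees the system prompt, one summary of everything old, and the last few turns in full.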
The PTY Engine
The magic that makes shell state persistence work is pexpect — a library that creates a real pseudo-terminal. The agent spawns a single /bin/bash process at startup and keeps it alive. To detect when a command finishes, it uses a trick: every command is suffixed with a UUID-stamped echo:
# What actually gets sent to the PTY for every command:
your_command_here
echo "AGENTEND3f8a2b:$?"
# The runner reads output until it sees this marker,
# then extracts the exit code from the capture group.
# UUID prefix makes it impossible to confuse with real output.
The terminal is configured to 220 columns wide (prevents line-wrap artifacts), echo is disabled (so commands don't appear in their own output), and a custom AGENTPROMPT> marker replaces the default bash prompt for reliable prompt detection.
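To make the mechanism concrete, here is a stdlib-only sketch of the same idea using Python's built-in pty module instead of pexpect. It is not the project's actual runner: where the real one disables echo and uses a custom prompt marker, this miniature relies on the fact that the echoed-back input contains a literal `$?` and therefore can never match the digits pattern:

```python
import os
import pty
import re
import select
import time
import uuid


class PersistentShell:
    """Stdlib-only sketch of a persistent PTY bash session
    (the real runner uses pexpect; this shows only the sentinel trick)."""

    def __init__(self) -> None:
        self.pid, self.fd = pty.fork()
        if self.pid == 0:
            # Child: become one long-lived bash for the whole session.
            os.execvp("bash", ["bash", "--norc"])

    def run(self, command: str, timeout: float = 10.0) -> tuple[str, int]:
        # Suffix the command with a UUID-stamped echo. The terminal echoes
        # our *input* back containing a literal "$?", so only bash's
        # expanded output can match the \d+ pattern below.
        marker = f"AGENTEND{uuid.uuid4().hex[:8]}"
        os.write(self.fd, f'{command}; echo "{marker}:$?"\n'.encode())
        pattern = re.compile(rf"{marker}:(\d+)".encode())
        buf = b""
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError(f"no completion marker after {timeout}s")
            ready, _, _ = select.select([self.fd], [], [], remaining)
            if not ready:
                continue
            buf += os.read(self.fd, 4096)
            if m := pattern.search(buf):
                # Everything before the marker is the command's output.
                output = buf[: m.start()].decode(errors="replace")
                return output, int(m.group(1))
```

Because the same bash process handles every `run()` call, `cd`, `export`, and virtualenv activation all persist between calls, which is exactly the property subprocess.run() cannot give you.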
The Flow Injection Algorithm
The flows.txt file holds a library of behavioral rules grouped by task type: networking, file operations, package management, etc. Instead of dumping the entire file into every prompt (expensive and distracting), the agent scores each flow block against the current task using word overlap:
def _select_flow(goal: str, flows_text: str):
    flow_blocks = _parse_flows(flows_text)  # {name: content}; helper elided
    goal_words = set(goal.lower().split())
    scores = {}
    for flow_name, flow_content in flow_blocks.items():
        name_words = set(flow_name.lower().split())
        scores[flow_name] = len(name_words & goal_words)  # overlap count
    best = max(scores, key=scores.get)
    second = sorted(scores, key=scores.get, reverse=True)[1]
    # Always include global rules + best match.
    # Include second if score ≥ 2 (complex cross-domain task).
    return assemble(rules_block, best, second if scores[second] >= 2 else None)
Result: prompt tokens drop by ~38% on typical tasks, and the model focuses on what's actually relevant.
The Error Healing System
When a command fails, the agent doesn't just retry blindly. It classifies the error into one of seven categories and injects targeted recovery guidance before the next LLM call:
| Error Type | Trigger | Recovery Guidance |
|---|---|---|
| COMMAND_NOT_FOUND | exit 127 / "command not found" | Use which, or install the tool |
| PERMISSION_DENIED | exit 126 / "permission denied" | Try chmod +x or sudo |
| MISSING_FILE | "no such file or directory" | Check path or create the resource |
| ALREADY_EXISTS | "already exists" / "not empty" | Check if operation is already done |
| NETWORK_ERROR | "connection refused" / "timed out" | Verify connectivity and hostname |
| SYNTAX_ERROR | "syntax error" / "parse error" | Read exact line number, fix it |
| PACKAGE_NOT_FOUND | "not found" + "package"/"formula" | Check spelling, try alternate PM |
The healing prompt also forces the model to explicitly state why the previous command failed and what it will try differently — preventing it from just running the same broken command again.
How It Compares to Existing Agents
| Feature | This Project | LangChain | AutoGPT | AutoGen |
|---|---|---|---|---|
| Shell state persistence | ✓ PTY | ✗ subprocess | ✗ | ✗ |
| Interactive programs (ssh, sudo) | ✓ pexpect | limited | ✗ | ✗ |
| Pre-task planning | ✓ | optional | ✓ | ✓ |
| Dynamic timeout management | ✓ health check | ✗ | ✗ | ✗ |
| Typed error + healing prompts | ✓ 7 types | generic retry | generic | generic |
| Context compression | ✓ auto | manual | limited | limited |
| Script reuse across tasks | ✓ index.txt | ✗ | ✗ | ✗ |
| Human hand-off (Ctrl+]) | ✓ | ✗ | ✗ | partial |
The Numbers
All 15 functional test cases passed on macOS (Apple M2) using GPT-4o-mini. Some highlights:
- Flow injection reduced average prompt tokens by 38%
- Planning reduced average steps per task by 29% (8.7 → 6.2)
- Simple tasks complete in 4–8 seconds
- Complex multi-step tasks (network audit) in 2–5 minutes
- Rate-limit backoff (429) recovered 100% of the time in tests
- Ctrl+C during stream: agent paused cleanly, removed partial message, offered checkpoint
How to Run It
# 1. Clone and set up
git clone https://github.com/your-username/react-agent
cd react-agent
python3 -m venv venv && source venv/bin/activate
pip install openai pexpect flask flask-socketio python-dotenv inquirer
# 2. Create .env
echo "OPENAI_API_KEY=sk-..." > .env
echo "LLM=openai" >> .env # or 'gemini'
# 3a. Run in terminal
python3 main.py
# 3b. Or launch the web UI
python3 main.py --web --port 7788
# → open http://localhost:7788
Switching to Gemini 2.0 Flash Lite instead of GPT-4o-mini is one environment variable change — the entire agent is provider-agnostic because both APIs speak the same OpenAI-compatible function-calling format.
What's Next
The honest limitations of the current version: it's single-user only, has no persistent memory across restarts, and doesn't run on Windows (PTY is Unix-only). The roadmap has a few clear priorities:
Persistent memory — a SQLite or ChromaDB-backed store so the agent remembers what it installed, which IPs it discovered, and what scripts it wrote across sessions. Local LLM support — an Ollama integration for fully offline, air-gapped operation. Docker sandbox mode — an optional flag that runs the PTY inside a container with resource limits for untrusted workloads.
The multi-agent parallelism idea is also appealing: a coordinator that spawns worker agents to scan different network segments simultaneously, then synthesizes their results. But that's a future project.
The core insight that made this work is simple: the shell is already a stateful, event-driven runtime. The only thing missing was an intelligence layer that could reason about what to run next. A persistent PTY + a reasoning LLM + a clean feedback loop is genuinely all you need. The rest is engineering.
If you have questions about the PTY mechanics, the flow injection algorithm, or the health check protocol, drop them in the comments. Happy to go deeper on any of it.