
Build Your Own GREMLIN IN THE SHELL

A hands-on guide to building your own shell-based AI agent that haunts your terminal and gets things done.

Kevin De Asis

Still thinking through this one... The ideas here might look different tomorrow.

What is OpenClaw?

OpenClaw is a self-hosted, open-source AI assistant platform that connects AI models to the messaging apps you already use. The agent it runs is like a virtual assistant on steroids that can do real work on your computer.

What Does It Do?

At its simplest, OpenClaw lets you:

  1. Chat with AI through your existing messaging apps — Send a message on WhatsApp, get an AI response back on WhatsApp. Other supported apps include Telegram, Discord, Slack, Signal, iMessage, and more.
  2. Run a single service that handles everything — One Gateway process manages all your channels, sessions, and AI interactions.
  3. Keep everything local — Your conversations, credentials, and data stay on your machine.
  4. Extend with plugins — Add new messaging channels, memory backends, voice capabilities, and tools.
  5. Automate with cron and webhooks — Schedule agent tasks or trigger them via HTTP.

Who Is It For?

  • Power users who want an AI assistant accessible from any messaging platform
  • Developers who want to build on top of a flexible agent framework
  • Privacy-conscious users who want to self-host their AI interactions
  • Teams who want a shared AI agent accessible through their existing communication tools

Architecture Overview

OpenClaw has three main layers. Messages flow down from channels, through the Gateway, and out to AI providers — then responses flow back up.

The Three Layers

┌─────────────────────────────────────────────────────────────┐
│                     MESSAGING CHANNELS                      │
│  WhatsApp · Telegram · Discord · Slack · Signal · iMessage  │
│  Matrix · Teams · IRC · LINE · Feishu · 15+ more via plugins│
└──────────────────────────┬──────────────────────────────────┘
                           │ inbound/outbound messages
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                     THE GATEWAY (core)                      │
│  WebSocket server · Message routing · Session management    │
│  Agent execution · Plugin loading · Cron/hooks · Media      │
│  Config hot-reload · Control UI · OpenAI-compat API         │
│                    (default port 18789)                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ model calls
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                     AI MODEL PROVIDERS                      │
│  Anthropic · OpenAI · Google · Ollama · OpenRouter·DeepSeek │
│  HuggingFace · Bedrock · LiteLLM · 15+ more                 │
└─────────────────────────────────────────────────────────────┘

In one sentence: Messages arrive from any channel, the Gateway routes them to an AI agent, the agent thinks/acts/responds, and the reply goes back to the same channel.

Layer 1: Messaging Channels

The top layer handles communication with the outside world. Each channel is an adapter that:

  • Connects to a messaging platform (via bot tokens, QR codes, OAuth, etc.)
  • Receives inbound messages (text, images, voice notes, documents)
  • Sends outbound responses (with proper formatting for each platform)
  • Reports health (connectivity status, reconnection)

Channels are implemented either as built-in modules (in src/) or as plugins (in extensions/). Both use the same ChannelPlugin interface, so they're interchangeable from the Gateway's perspective.

The ChannelPlugin Contract

Every channel — built-in or plugin — implements this master interface (defined in src/channels/plugins/types.plugin.ts):

ChannelPlugin<ResolvedAccount> {
  id: ChannelId;                      // "telegram", "discord", etc.
  meta: ChannelMeta;                  // UI labels, docs path, display order
  capabilities: ChannelCapabilities;  // Feature flags (polls, reactions, threads)
  config: ChannelConfigAdapter;       // Account listing & resolution
  gateway?: ChannelGatewayAdapter;    // Start/stop hooks
  outbound?: ChannelOutboundAdapter;  // Message sending
  status?: ChannelStatusAdapter;      // Health checks
  pairing?: ChannelPairingAdapter;    // Allow-from management
  security?: ChannelSecurityAdapter;  // DM policies
  groups?: ChannelGroupAdapter;       // Group settings
  threading?: ChannelThreadingAdapter;// Reply threading modes
  messaging?: ChannelMessagingAdapter;// Target normalization
  directory?: ChannelDirectoryAdapter;// Contact/directory queries
  actions?: ChannelMessageActionAdapter;// Reactions, edits, etc.
  auth?: ChannelAuthAdapter;          // Login flows (QR, token, OAuth)
}

Most adapters are optional: a minimal channel only needs id, meta, capabilities, and config. The Gateway calls whichever adapters are present.
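
To make the "minimal channel" idea concrete, here is a small sketch of a plugin that fills in only the four required fields. The types below are simplified stand-ins I made up for illustration (including the method names on the config and outbound adapters); the real definitions live in src/channels/plugins/types.plugin.ts and will differ.

```typescript
// Simplified, hypothetical stand-ins for the real OpenClaw types.
type ChannelId = string;
interface ChannelMeta { label: string }
interface ChannelCapabilities { polls?: boolean; reactions?: boolean; threads?: boolean }
interface ChannelConfigAdapter<Account> {
  listAccountIds(): string[];
  resolveAccount(accountId: string): Account | undefined;
}
interface ChannelOutboundAdapter {
  sendText(target: string, text: string): Promise<void>;
}

// Only a slice of the contract: four required fields plus one optional adapter.
interface MinimalChannelPlugin<Account> {
  id: ChannelId;
  meta: ChannelMeta;
  capabilities: ChannelCapabilities;
  config: ChannelConfigAdapter<Account>;
  outbound?: ChannelOutboundAdapter;
}

// A made-up "echo" channel with a single in-memory account.
interface EchoAccount { id: string; token: string }
const echoAccounts: Record<string, EchoAccount> = {
  default: { id: "default", token: "secret" },
};

const echoChannel: MinimalChannelPlugin<EchoAccount> = {
  id: "echo",
  meta: { label: "Echo" },
  capabilities: { reactions: false },
  config: {
    listAccountIds: () => Object.keys(echoAccounts),
    resolveAccount: (accountId) => echoAccounts[accountId],
  },
};

// The caller probes for optional adapters instead of assuming they exist.
function supportedAdapters(plugin: MinimalChannelPlugin<unknown>): string[] {
  const optional = ["outbound"] as const;
  return optional.filter((name) => plugin[name] !== undefined);
}
```

Because every optional adapter is just a possibly-undefined property, a caller can feature-detect per channel rather than special-casing channel ids.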

Layer 2: The Gateway

The middle layer is the brain of OpenClaw. It's a single long-running process (typically installed as a system service) that:

  • Routes messages from channels to the correct agent
  • Manages sessions (conversation history per user/channel)
  • Executes agents (runs the AI model with tools and context)
  • Loads plugins (discovers and initializes channel/tool extensions)
  • Serves the Control UI (browser-based dashboard)
  • Exposes APIs (WebSocket protocol, OpenAI-compatible HTTP, webhooks)
  • Hot-reloads config (watches the config file for changes)

The Gateway is the single source of truth for all state.

Gateway Internal Architecture

Although it runs as a single process, the Gateway is internally modular: it's composed of several specialized subsystems that are wired together at startup:

startGatewayServer()                    [server.impl.ts]
│
├─→ Config & Auth
│   ├─ Load config, run migrations, resolve secrets
│   ├─ Resolve auth (token/password/Tailscale)
│   └─ Resolve TLS certificates
│
├─→ HTTP Server                         [server-http.ts]
│   ├─ /health, /ready              → Health probes
│   ├─ /v1/chat/completions         → OpenAI-compatible API
│   ├─ /hooks                       → Webhook endpoints
│   ├─ /__openclaw__/a2ui/          → Canvas/A2UI host
│   ├─ /                            → Control UI (Vite SPA)
│   └─ Plugin HTTP routes           → Per-plugin routes
│
├─→ WebSocket Server                    [server-ws-runtime.ts]
│   ├─ Connection auth & rate limiting
│   ├─ RPC method dispatch (gateway-methods.js)
│   └─ Event broadcasting to connected clients
│
├─→ Channel Manager                     [server-channels.ts]
│   ├─ Load & validate channel plugins
│   ├─ Start each account (with exponential backoff restart)
│   ├─ Track runtime state per account
│   └─ Health monitoring (5-min interval checks)
│
├─→ Agent Event Handler                 [server-chat.ts]
│   ├─ Stream routing (deltas → clients)
│   ├─ Tool event delivery tracking
│   ├─ Heartbeat suppression
│   └─ Text delta merging
│
├─→ Sidecars                            [server-startup.ts]
│   ├─ Browser control server
│   ├─ Gmail watcher
│   ├─ Internal hook handlers
│   ├─ Plugin services
│   └─ Memory backend
│
└─→ Shutdown Handler                    [server-close.ts]
    ├─ Stop all channels & plugins
    ├─ Broadcast shutdown event to clients
    ├─ Drain HTTP connections
    └─ Close WebSocket server
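
Since the HTTP server exposes /v1/chat/completions, any OpenAI-style client should be able to talk to it. Here is a rough sketch of preparing such a request. The path and default port come from the diagrams above; the bearer-token header, the plain-http URL, and the model name passed in are assumptions on my part, so check them against your own Gateway config.

```typescript
interface GatewayChatOptions {
  model: string;   // placeholder: use whatever model or agent name your Gateway expects
  token?: string;  // assumption: the Gateway accepts a bearer token
  host?: string;
  port?: number;
}

interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds the request without sending it, so it is easy to inspect or log.
function buildChatRequest(prompt: string, options: GatewayChatOptions): PreparedRequest {
  const { model, token, host = "127.0.0.1", port = 18789 } = options;
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) headers["Authorization"] = `Bearer ${token}`;
  return {
    url: `http://${host}:${port}/v1/chat/completions`,
    init: {
      method: "POST",
      headers,
      // Standard OpenAI chat-completions body: a model plus a list of messages.
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
    },
  };
}
```

With a Gateway running locally, something like `const req = buildChatRequest("hello", { model: "my-model" }); await fetch(req.url, req.init);` would send it.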

Channel Manager: Auto-Recovery

The Channel Manager (server-channels.ts) doesn't just start channels — it keeps them alive:

  • Exponential backoff restart: 5s → 10s → 20s → ... → 5min (2x factor, 10% jitter)
  • Max 10 restart attempts per channel:account
  • Rate limit: max 10 restarts/hour
  • Cooldown: 2 check cycles (10 min) between restarts
  • Abort signal propagation: graceful shutdown cascades via AbortSignal

Each account's lifecycle is tracked with a ChannelAccountSnapshot containing: enabled, configured, running, lastError, lastStartAt.
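
The backoff schedule is easier to reason about as a function. This is a sketch built from the numbers in the list above (5s start, 2x factor, 5min cap, 10% jitter, 10 attempts), not the actual code in server-channels.ts. The function name is hypothetical, and applying jitter symmetrically (plus or minus 10%) is my assumption. The hourly rate limit and the cooldown between restarts are not modeled here.

```typescript
const INITIAL_DELAY_MS = 5_000;
const MAX_DELAY_MS = 5 * 60_000;
const BACKOFF_FACTOR = 2;
const JITTER_RATIO = 0.1;
const MAX_ATTEMPTS = 10;

/**
 * Delay before restart number `attempt` (1-based), or null once the
 * attempt budget is used up. `random` is injectable for deterministic tests.
 */
function restartDelayMs(attempt: number, random: () => number = Math.random): number | null {
  if (attempt < 1 || attempt > MAX_ATTEMPTS) return null;
  // 5s, 10s, 20s, ... capped at 5 minutes.
  const base = Math.min(INITIAL_DELAY_MS * BACKOFF_FACTOR ** (attempt - 1), MAX_DELAY_MS);
  // Assumed symmetric jitter: random() in [0, 1) maps to roughly -10% .. +10%.
  const jitter = base * JITTER_RATIO * (random() * 2 - 1);
  return Math.round(base + jitter);
}
```

The jitter keeps many accounts that failed at the same moment from all reconnecting at the same moment.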

Layer 3: AI Model Providers

The bottom layer handles communication with AI models. OpenClaw supports 15+ providers through a unified interface:

  • Primary model — Your preferred provider (e.g., Claude Sonnet)
  • Fallback chain — Automatic failover if the primary is down
  • Auth profiles — Separate API keys per provider, with cooldown on rate limits
  • Model catalog — Central registry with version pinning and capability detection

Model Failover System

When a model call fails, the failover system (src/agents/model-fallback.ts) handles recovery:

runWithModelFallback()
│
├─ Try primary model
│  ├─ Success → return result
│  └─ Failure → classify error
│     ├─ Rate limit → cooldown auth profile (1s → 30s → 5min exponential)
│     ├─ Network error → try next candidate
│     ├─ Auth error → try next candidate
│     └─ User abort → rethrow (don't retry)
│
├─ Try fallback candidates (deduped by provider/model key)
│  └─ Same retry logic per candidate
│
└─ Max iterations: 32-160 (scales with auth profile count)
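
Here is a stripped-down sketch of that flow: try each candidate once, skip duplicates, rethrow aborts, and note rate limits so the caller can cool down the profile. It is my own simplification, not the contents of model-fallback.ts. All names are hypothetical, the iteration cap and per-profile retries are left out, and treating unclassified errors like network errors is an assumption.

```typescript
type FailureKind = "rate_limit" | "network" | "auth" | "abort";
interface Candidate { provider: string; model: string }

// Hypothetical helper: tag an Error with the failure kind.
function modelError(kind: FailureKind, message: string): Error & { kind: FailureKind } {
  return Object.assign(new Error(message), { kind });
}

// Assumption: anything without a kind is handled like a network error.
function classify(error: unknown): FailureKind {
  const kind = (error as { kind?: FailureKind } | null)?.kind;
  return kind ?? "network";
}

async function runWithFallbackSketch<T>(
  candidates: Candidate[],
  call: (candidate: Candidate) => Promise<T>,
  onRateLimit: (candidate: Candidate) => void = () => {},
): Promise<T> {
  const tried = new Set<string>();
  let lastError: unknown = new Error("no model candidates configured");
  for (const candidate of candidates) {
    const key = `${candidate.provider}/${candidate.model}`;
    if (tried.has(key)) continue; // dedupe by provider/model key
    tried.add(key);
    try {
      return await call(candidate);
    } catch (error) {
      const kind = classify(error);
      if (kind === "abort") throw error; // the user cancelled: never retry
      if (kind === "rate_limit") onRateLimit(candidate); // caller starts a cooldown
      lastError = error; // rate limit, network, auth: move to the next candidate
    }
  }
  throw lastError;
}
```

If every candidate fails, the last error surfaces to the caller, which keeps the most recent failure reason visible.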

Auth Profile Management

Auth profiles (src/agents/auth-profiles.ts) manage API credentials and track usage counts and credential validity alongside each one:

AuthProfile {
  id: string;
  provider: string;
  credential: ApiKeyCredential | OAuthCredential | TokenCredential;
  usage?: { lastUsedAt, usageCount, failureCount, successCount };
  state?: "valid" | "expiring_soon" | "expired";
}

Cooldowns escalate with repeated failures, and the schedule depends on the failure reason:

  • rate_limit → 1s → 5s → 30s → 5min → 30min
  • overloaded → similar but slower
  • unauthorized → immediate failover, no retry
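
The cooldown schedule above can be read as a simple ladder lookup. This sketch covers rate_limit and unauthorized only; I left overloaded out because the list doesn't give its numbers. The function and type names are hypothetical, and staying on the last rung after the fifth failure is my assumption.

```typescript
type FailureReason = "rate_limit" | "unauthorized";
type CooldownDecision =
  | { action: "cooldown"; ms: number }
  | { action: "failover" };

const SECOND_MS = 1_000;
const MINUTE_MS = 60 * SECOND_MS;

// 1s, 5s, 30s, 5min, 30min, as listed above.
const RATE_LIMIT_LADDER_MS = [
  SECOND_MS,
  5 * SECOND_MS,
  30 * SECOND_MS,
  5 * MINUTE_MS,
  30 * MINUTE_MS,
];

function decideCooldown(reason: FailureReason, failureCount: number): CooldownDecision {
  // Bad credentials won't fix themselves: skip the cooldown and fail over.
  if (reason === "unauthorized") return { action: "failover" };
  // Clamp to the ladder: the first failure uses rung 0, later ones climb.
  const step = Math.min(Math.max(failureCount, 1), RATE_LIMIT_LADDER_MS.length) - 1;
  return { action: "cooldown", ms: RATE_LIMIT_LADDER_MS[step] };
}
```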

Supporting Systems

Beyond the three main layers, several cross-cutting systems support the architecture:

Configuration System

  • JSON5 config file at ~/.openclaw/openclaw.json
  • Hot-reload with validation (hybrid mode: hot-reload what's possible, restart for critical changes)
  • Environment variable substitution (${VAR_NAME})
  • Secret references (env, file, exec sources)
  • Config splits via $include
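
Environment variable substitution is the piece of this list that is easiest to picture in code. Below is a minimal sketch of the idea, not OpenClaw's implementation. The function name is hypothetical, and both the accepted variable-name pattern and the choice to throw on a missing variable are assumptions.

```typescript
/**
 * Replace every ${VAR_NAME} in `text` with the matching value from `env`.
 * Throws if a referenced variable is not set, so typos fail loudly.
 */
function substituteEnv(text: string, env: Record<string, string | undefined>): string {
  return text.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = env[name];
    if (value === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
}
```

In a Node process you would pass process.env as the second argument. Taking the environment as a parameter keeps the function easy to test.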

Plugin System

  • Three discovery sources: workspace deps, config extensions, bundled (extensions/)
  • Security checks: blocks path escaping, world-writable paths, suspicious ownership
  • Isolated runtime context per plugin
  • Hook system for lifecycle events (before-agent-start, after-completion, model-override)
  • Standard SDK with 100+ exported types (src/plugin-sdk/)

Media Pipeline

  • Download, process, and serve media (images, audio, PDFs)
  • Format conversion and resizing (Sharp for images, FFmpeg for audio/video)
  • MIME type detection
  • Per-channel chunking (each platform has different size limits)

Routing System

The routing system (src/routing/) maps inbound messages to agents in strict priority order, and the highest-priority matching binding wins:

1. Peer binding        → Direct chat/DM by specific peer ID
2. Parent peer binding → Thread parent inheritance
3. Guild + roles       → Discord role-based routing
4. Guild binding       → Discord server-wide
5. Team binding        → Microsoft Teams workspace
6. Account binding     → Per-bot-account routing
7. Channel binding     → Default for entire channel
8. Default             → Fallback to default agent

Results are cached in a 2-level LRU cache (2K evaluated bindings + 4K resolved routes).
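
The tier list boils down to "walk the tiers in order and stop at the first binding that matches". A sketch of that loop follows. The Binding and InboundMessage shapes, the tier identifiers, and the "main" default agent id are all hypothetical, and the LRU caching is left out.

```typescript
// Tier identifiers paraphrased from the list above, highest priority first.
const TIERS = ["peer", "parentPeer", "guildRoles", "guild", "team", "account", "channel"] as const;
type Tier = (typeof TIERS)[number];

interface InboundMessage { channel: string; accountId: string; peerId: string }
interface Binding {
  tier: Tier;
  agentId: string;
  matches(message: InboundMessage): boolean;
}
interface Route { agentId: string; matchedBy: Tier | "default" }

function resolveRoute(
  message: InboundMessage,
  bindings: Binding[],
  defaultAgentId = "main",
): Route {
  for (const tier of TIERS) {
    // Within a tier, the first configured binding that matches is used.
    const hit = bindings.find((b) => b.tier === tier && b.matches(message));
    if (hit) return { agentId: hit.agentId, matchedBy: tier };
  }
  // Tier 8: nothing matched, fall back to the default agent.
  return { agentId: defaultAgentId, matchedBy: "default" };
}
```

Because the outer loop is over tiers rather than over bindings, the order in which bindings appear in config can never let a channel-wide rule shadow a peer-specific one.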

Session Key Construction

Session keys encode the full context of a conversation:

DM (per-peer):          "agent:main:direct:user123"
DM (per-channel-peer):  "agent:main:telegram:direct:user123"
Group:                  "agent:main:discord:group:server456"
Thread:                 "agent:main:discord:group:server456:thread:thread789"
Main (collapsed):       "agent:main:main"

The dmScope config controls how DM sessions are isolated: main (all DMs share one session), per-peer (per user), per-channel-peer (per user per channel), or per-account-channel-peer (fully isolated).
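
To show how the scope changes the key, here is a sketch that reproduces the example keys above. The function names are hypothetical. It only covers the three DM scopes whose key format appears in the examples, since the per-account-channel-peer format isn't shown.

```typescript
// Only the DM scopes whose key format is shown in the examples.
type ShownDmScope = "main" | "per-peer" | "per-channel-peer";

function dmSessionKey(
  agentId: string,
  scope: ShownDmScope,
  channel: string,
  peerId: string,
): string {
  switch (scope) {
    case "main":
      return `agent:${agentId}:main`; // every DM shares one session
    case "per-peer":
      return `agent:${agentId}:direct:${peerId}`; // one session per user
    case "per-channel-peer":
      return `agent:${agentId}:${channel}:direct:${peerId}`; // per user, per channel
  }
}

function groupSessionKey(
  agentId: string,
  channel: string,
  groupId: string,
  threadId?: string,
): string {
  const base = `agent:${agentId}:${channel}:group:${groupId}`;
  // Threads extend the group key, so they can be traced back to their parent.
  return threadId ? `${base}:thread:${threadId}` : base;
}
```

Note how the wider the scope, the shorter the key: "main" drops both the channel and the peer, which is exactly why all DMs end up sharing one history.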

Native Apps (Nodes)

  • macOS menubar app, iOS app, Android app
  • Act as "nodes" that expose device capabilities to the agent
  • Pair with the Gateway via Bonjour/mDNS or manual pairing
  • Provide camera, screen, location, voice, and system commands

Key Architectural Decisions

  1. Local-first — Everything runs on your hardware by default. Remote access is opt-in via Tailscale, SSH tunnels, or direct binding.

  2. Single process — One Gateway handles all channels, agents, and sessions. No microservices, no databases, no message queues.

  3. File-based state — Sessions are JSONL files, config is JSON5, workspaces are directories with Markdown files. No database required.

  4. Plugin-first channels — Even built-in channels use the same plugin interface as extensions, making them easy to swap or extend.

  5. Model-agnostic — The agent layer abstracts away provider differences. Switch models by changing a config value.

  6. Lazy loading — CLI commands are registered as placeholders and only dynamically imported when invoked, keeping startup fast.

  7. Abort signal propagation — Graceful shutdown cascades through the entire system via AbortSignal, from Gateway → channels → active agent runs.
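
Decision 7 is worth a quick sketch, because the cascade falls out of the standard AbortController API with very little code. This is an illustration of the pattern, not OpenClaw's shutdown code; childController is a hypothetical helper.

```typescript
// Create a controller that aborts automatically when its parent does.
function childController(parent: AbortSignal): AbortController {
  const child = new AbortController();
  if (parent.aborted) {
    child.abort();
  } else {
    parent.addEventListener("abort", () => child.abort(), { once: true });
  }
  return child;
}

// Gateway → channel → active agent run, mirroring the cascade described above.
const gateway = new AbortController();
const channel = childController(gateway.signal);
const agentRun = childController(channel.signal);

const stopped: string[] = [];
channel.signal.addEventListener("abort", () => stopped.push("channel"));
agentRun.signal.addEventListener("abort", () => stopped.push("agent run"));

// One call at the top stops everything underneath it.
gateway.abort();
```

The cascade only flows downward: aborting a channel's controller stops its agent runs but leaves the Gateway and sibling channels untouched.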
