
Building EA: Architecture Decisions for a Production AI Assistant

How I chose Hono over Express, Upstash over Redis, and Claude for a real-time AI assistant. 18 months of production LLM lessons.

AI · Claude · Hono · WebSocket · Architecture · LLM · Production · Vercel AI SDK · Upstash · Real-time

Why Build Another AI Assistant?

In a world with ChatGPT, Claude, and Gemini, building yet another AI assistant seems redundant. But here's what existing solutions don't solve well: multi-client context management for virtual assistants.

A VA managing 10+ executives needs more than a chat interface. They need an AI that understands "when I'm working with Sarah, I prefer morning meetings" while simultaneously knowing "John's budget approvals need his CFO cc'd." Generic AI assistants treat every conversation as isolated. EA treats them as interconnected contexts.

This article isn't a tutorial on calling the Claude API. It's a deep dive into the architectural decisions that shaped a production AI system—decisions that took 18 months and several expensive mistakes to get right.

The Stack: Unexpected Choices

Here's what EA runs on:

Layer               | Choice                  | Not This
--------------------|-------------------------|------------------------------
Backend Framework   | Hono                    | Express, Fastify
Real-time           | Native WebSocket        | Socket.io, Pusher
AI Integration      | Vercel AI SDK + Claude  | Direct API, LangChain
Storage             | Upstash Redis + Vector  | Self-managed Redis, Pinecone
Frontend            | Next.js 16 + React 19   | Remix, SvelteKit
Auth                | Clerk                   | Auth0, NextAuth

Each choice has a story. Let me tell you the ones that matter.

Why Hono Over Express

Express has been my default for a decade. But when you're streaming AI responses over WebSocket, every millisecond counts.

// Hono: WebSocket upgrade + streaming in about a dozen lines
// (upgradeWebSocket comes from the runtime adapter, e.g. hono/cloudflare-workers or @hono/node-ws)
app.get('/ws', upgradeWebSocket((c) => ({
  onMessage: async (event, ws) => {
    // generateStream wraps the Claude streaming call shown later
    const stream = await generateStream(event.data)
    for await (const chunk of stream) {
      ws.send(chunk)
    }
  }
})))

Compare this to Express with ws or socket.io:

// Express + ws: 40+ lines in practice, with separate HTTP and WS servers
import { createServer } from 'node:http'
import { WebSocketServer } from 'ws'

const server = createServer(app)
const wss = new WebSocketServer({ server })
wss.on('connection', (ws) => {
  ws.on('message', async (data) => {
    // Manual heartbeat management, connection state tracking,
    // backpressure handling...
  })
})

The difference isn't just lines of code. Hono is built for edge runtimes—it starts in milliseconds, handles thousands of concurrent connections efficiently, and its TypeScript types are first-class. For an AI application where response latency directly impacts user experience, these gains compound.

The numbers: In our benchmarks, Hono handled WebSocket upgrades 3.2x faster than Express + ws, with 40% lower memory footprint. For a system serving 100+ concurrent AI conversations, that translates to real cost savings.

The Streaming Architecture

AI responses shouldn't feel like waiting for a page to load. They should feel like someone typing to you. This requires streaming at every layer:

User Input → WebSocket → Claude API (streaming) → Token-by-token → User
     ↓              ↓                                    ↓
  ~50ms        Response starts              First token visible
              immediately                   within 200ms

Here's the pattern that works:

// Server: Stream Claude responses through WebSocket
// (ws is assumed to carry the authenticated clientId, attached at upgrade time)
async function handleMessage(ws: WebSocket & { clientId: string }, message: string) {
  const stream = await anthropic.messages.stream({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{ role: 'user', content: message }],
    system: buildContextualPrompt(ws.clientId)
  })

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      ws.send(JSON.stringify({
        type: 'token',
        content: event.delta.text
      }))
    }
  }

  ws.send(JSON.stringify({ type: 'complete' }))
}

The key insight: don't batch tokens. Send each token as it arrives. The 2-3ms overhead per message is invisible to users, but the perceived responsiveness is dramatically better than waiting for sentence boundaries.
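
On the client, the mirror of this is appending tokens as they arrive instead of buffering. A browser-side sketch (the message shapes match the server code above; the render helpers are placeholders for your UI layer):

// Client: append streamed tokens to the UI as they arrive.
// renderAssistantMessage / finalizeMessage stand in for your UI layer
// (e.g. React state updates).
declare function renderAssistantMessage(text: string): void
declare function finalizeMessage(text: string): void

const ws = new WebSocket(`wss://${location.host}/ws`)
let current = ''

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data) as { type: 'token' | 'complete'; content?: string }

  if (msg.type === 'token') {
    current += msg.content ?? ''       // No batching -- paint every token
    renderAssistantMessage(current)
  } else if (msg.type === 'complete') {
    finalizeMessage(current)
    current = ''
  }
}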

Why Upstash Over Self-Managed Redis

I've run Redis clusters in production. They're reliable once configured correctly. But for EA, I needed something different: vector search alongside traditional caching.

Upstash offers both in a single managed service:

// Traditional caching
await redis.set(`session:${userId}`, sessionData, { ex: 3600 })

// Semantic search for conversation history
const similar = await vector.query({
  vector: await embed(userQuery),
  topK: 5,
  includeMetadata: true,
  filter: `clientId = '${currentClient}'`  // Upstash Vector filters are SQL-like strings
})

This enables EA's killer feature: semantic context retrieval. When a user asks "what did we discuss about the Q3 budget?", we don't search for keywords. We find semantically similar past conversations.
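
For that retrieval to work, conversation turns have to be embedded and written to the index along the way. A minimal sketch with the Upstash Vector SDK (embed is the same hypothetical embedding helper used in the query above; the metadata shape is an assumption):

import { Index } from '@upstash/vector'

// The same kind of client used for the query above
const vector = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
})

// Write each conversation turn so it can be retrieved semantically later.
// embed() is the hypothetical embedding helper from the query example.
async function indexConversationTurn(clientId: string, messageId: string, text: string) {
  await vector.upsert({
    id: messageId,
    vector: await embed(text),
    metadata: { clientId, text, timestamp: Date.now() },
  })
}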

We also use hybrid search (dense + sparse vectors) for best-of-both-worlds retrieval:

// Hybrid search: semantic + keyword matching
async findRelevantMessages(query: string, topK: number = 10) {
  const queryEmbedding = await this.generateEmbedding(query)

  // Sparse vector for keyword matching (BM25-style)
  const sparseVector = generateSparseVector(query)

  const results = await vectorStore.query({
    vector: queryEmbedding,
    sparseVector,  // Reciprocal Rank Fusion combines both
    topK,
    filter: `userId = '${this.userId}'`,
  })

  return results
}
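
generateSparseVector above is a placeholder. Purely for illustration, a crude hashed term-frequency version could look like this (a real one would use BM25 weighting and a proper tokenizer):

// Crude sparse vector builder: hash each term into the sparse dimension space,
// weight by term frequency. Illustrative only -- not a real BM25 implementation.
function generateSparseVector(text: string, dimension = 50_000) {
  const counts = new Map<number, number>()
  for (const term of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    let hash = 0
    for (const ch of term) hash = (hash * 31 + ch.charCodeAt(0)) % dimension
    counts.set(hash, (counts.get(hash) ?? 0) + 1)
  }
  return {
    indices: [...counts.keys()],
    values: [...counts.values()],
  }
}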

Cost comparison:

  • Self-managed Redis + Pinecone: ~$150/month + operational overhead
  • Upstash Redis + Vector: ~$40/month, zero ops

For a product in early stages, reducing operational complexity is worth more than any optimization.

The 4-Tier Skill Resolution System

EA's most complex architectural decision was the skill resolution system. Each client can have different preferences for how the AI behaves—preferred meeting times, email tone, task categorization rules.

The resolution order:

System Defaults → User Defaults → Template Defaults → Client Overrides
     ↓                ↓                  ↓                    ↓
  Base behavior   "My preferences    "Software client    "Sarah specifically
                   for all clients"   standard config"    prefers X"

This required careful data modeling:

interface SkillResolution {
  // Target: resolve in <10ms (3.5ms achieved)
  resolve(userId: string, clientId?: string): Promise<ResolvedSkills>
}

// Implementation: three Redis lookups plus the client's template fetch, all in parallel
async resolve(userId: string, clientId?: string) {
  const [system, user, template, client] = await Promise.all([
    redis.get('skills:system'),
    redis.get(`skills:user:${userId}`),
    this.getTemplateForClient(clientId),
    clientId ? redis.get(`skills:client:${clientId}`) : null
  ])

  return deepMerge(system, user, template, client)
}
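
deepMerge can be a tiny utility or any deep-merge library; what matters is that later, more specific tiers win key by key. A minimal sketch, assuming each tier arrives as a plain object:

// Merge skill tiers left to right; later (more specific) tiers win.
// Null/undefined tiers (e.g. no client override) are skipped.
function deepMerge(...tiers: Array<Record<string, unknown> | null | undefined>) {
  const result: Record<string, unknown> = {}
  for (const tier of tiers) {
    if (!tier) continue
    for (const [key, value] of Object.entries(tier)) {
      const existing = result[key]
      if (
        value && typeof value === 'object' && !Array.isArray(value) &&
        existing && typeof existing === 'object' && !Array.isArray(existing)
      ) {
        // Recurse into nested objects so partial overrides don't wipe siblings
        result[key] = deepMerge(existing as Record<string, unknown>, value as Record<string, unknown>)
      } else {
        result[key] = value
      }
    }
  }
  return result
}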

Why this matters: A VA managing 50 clients can't reconfigure the AI for each one. Templates let them define "software client standard config" once, then override only the specifics per client.

The Confidence Scoring Pattern

Most AI assistants are binary: either they execute actions or they ask for permission. EA implements a confidence spectrum:

interface ProposedAction {
  type: 'calendar_create' | 'email_draft' | 'task_add'
  confidence: number  // 0.0 - 1.0
  details: ActionDetails
}

// User configures thresholds per action type
const autonomyConfig = {
  calendar_create: { autoExecute: 0.85, requireApproval: 0.5 },
  email_draft: { autoExecute: 0.95, requireApproval: 0.7 },
  task_add: { autoExecute: 0.6, requireApproval: 0.3 }
}

This creates three zones:

  • High confidence (>threshold): Execute automatically
  • Medium confidence: Propose with one-click approval
  • Low confidence (<approval threshold): Ask for details

The implementation considers both confidence and action risk:

// Execute based on confidence, autonomy mode, and risk level
function shouldAutoExecute(type: string, confidence: number, config: any) {
  const isRisky = RISKY_ACTIONS.includes(type)  // email_send, calendar_delete
  const mode = config.autonomy || 'assist'      // suggest|assist|automate

  switch (mode) {
    case 'suggest':
      return false  // Always draft, never execute
    case 'assist':
      if (isRisky) return false  // Draft risky actions
      return confidence >= config.threshold
    case 'automate':
      return confidence >= config.threshold  // Execute everything
  }
}
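
Tying the two thresholds together produces the three zones directly. A sketch of the glue code, reusing the ProposedAction, autonomyConfig, and shouldAutoExecute definitions above (the routing itself is illustrative):

type Zone = 'execute' | 'propose' | 'clarify'

// Route a proposed action into one of the three zones using the per-type
// thresholds and the user's autonomy mode. Illustrative wiring, not EA's exact code.
function classifyAction(
  action: ProposedAction,
  mode: 'suggest' | 'assist' | 'automate'
): Zone {
  const thresholds = autonomyConfig[action.type]

  const auto = shouldAutoExecute(action.type, action.confidence, {
    autonomy: mode,
    threshold: thresholds.autoExecute,
  })

  if (auto) return 'execute'                                             // High confidence: run it
  if (action.confidence >= thresholds.requireApproval) return 'propose'  // One-click approval
  return 'clarify'                                                       // Ask for details
}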

The confidence itself comes from Claude's structured output:

const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  messages: [/* conversation */],
  system: `When proposing actions, include a confidence score (0.0-1.0) based on:
    - Clarity of user intent
    - Availability of required information
    - Historical accuracy for similar requests`
})
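
If you want the score to come back machine-readable rather than embedded in free text, Anthropic's tool use is one option. A sketch (the schema mirrors the ProposedAction interface above; this is not necessarily the production extraction path):

// One option for machine-readable proposals: constrain the output with tool use.
const structured = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  messages: [/* conversation */],
  tools: [{
    name: 'propose_action',
    description: 'Propose an action with a confidence score',
    input_schema: {
      type: 'object',
      properties: {
        type: { type: 'string', enum: ['calendar_create', 'email_draft', 'task_add'] },
        confidence: { type: 'number', minimum: 0, maximum: 1 },
        details: { type: 'object' }
      },
      required: ['type', 'confidence', 'details']
    }
  }]
})

const toolUse = structured.content.find((block) => block.type === 'tool_use')
const proposed = toolUse?.type === 'tool_use' ? (toolUse.input as ProposedAction) : null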

Token Optimization: Full vs Lite Prompts

AI API costs scale with token usage. EA implements a two-tier prompt system:

// Full prompt: Complete context for complex queries
const fullPrompt = buildFullPrompt({
  systemContext: true,
  clientProfile: true,
  recentMessages: 20,
  semanticContext: true  // Vector search results
})

// Lite prompt: Minimal context for simple operations
const litePrompt = buildLitePrompt({
  systemContext: true,
  recentMessages: 5
})

// Dynamic selection based on query complexity
const prompt = estimateComplexity(userMessage) > 0.7 ? fullPrompt : litePrompt

Impact: Lite prompts use 60-70% fewer tokens. For simple queries like "schedule a meeting tomorrow at 2pm," the full semantic context is unnecessary. This reduced our Claude API costs by approximately 40% without impacting quality where it matters.
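
estimateComplexity has to be cheap enough to run before every model call, so a lightweight heuristic is the natural fit. A sketch (the signals and weights are illustrative, not EA's actual scoring):

// Cheap pre-call heuristic: score 0-1 from surface signals in the message.
// Signals and weights are illustrative only.
function estimateComplexity(message: string): number {
  let score = 0
  if (message.length > 400) score += 0.3                                                    // Long, detailed requests
  if (/\b(why|compare|summarize|summary|history|context)\b/i.test(message)) score += 0.4    // Needs recall or reasoning
  if ((message.match(/\band\b|[,;]/g) ?? []).length > 3) score += 0.2                       // Multi-part asks
  if (/\b(schedule|remind|add|create)\b/i.test(message) && message.length < 120) score -= 0.2  // Simple operations
  return Math.min(1, Math.max(0, score))
}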

Lessons Learned

What Worked Well

1. Betting on Hono early. The ecosystem has matured significantly, and our choice looks prescient now. Edge-first frameworks are becoming the default.

2. Upstash for everything. Single vendor for caching and vector search simplified operations dramatically. The Redis protocol compatibility meant zero learning curve.

3. Streaming from day one. Retrofitting streaming into a request-response architecture is painful. Designing for it upfront made everything cleaner.

What Was Challenging

1. Context window management. Claude's 200K context window seems infinite until you're managing 50 client contexts. We had to implement aggressive summarization and semantic retrieval instead of naive context stuffing.
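
A simplified version of that summarization step, for illustration (estimateTokens and the running-summary slot are assumptions, and the real pipeline does more):

// Simplified illustration of history compaction -- not EA's exact pipeline.
// When the verbatim window grows past a budget, fold older turns into a
// running summary with a cheaper model and keep only the recent tail.
type Turn = { role: 'user' | 'assistant'; content: string }

async function compactHistory(history: Turn[], summary: string) {
  const TAIL = 10          // Messages kept verbatim
  const BUDGET = 8_000     // Rough token budget for the verbatim window

  if (estimateTokens(history) < BUDGET) return { history, summary }  // estimateTokens: hypothetical counter

  const older = history.slice(0, -TAIL)
  const response = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022',   // Cheaper model for compression
    max_tokens: 512,
    system: 'Update the running summary with the new messages. Preserve names, dates, amounts, and decisions.',
    messages: [{
      role: 'user',
      content: `Current summary:\n${summary}\n\nNew messages:\n` +
        older.map((m) => `${m.role}: ${m.content}`).join('\n')
    }]
  })

  const first = response.content[0]
  return {
    history: history.slice(-TAIL),
    summary: first.type === 'text' ? first.text : summary
  }
}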

2. Voice synthesis costs. ElevenLabs at $4-6 per active user per month eats into margins. Voice became a premium feature, not default.

3. Multi-tenant complexity. The 4-tier resolution system took three iterations to get right. The initial version was simpler but couldn't handle agency use cases.

What I'd Do Differently

1. Usage-based pricing research earlier. We designed for flat monthly pricing, then discovered our heaviest users cost 10x the average to serve. Usage-based tiers should have been a day-one decision.

2. Voice as premium from the start. We offered voice to all users initially, then had to "take it away" when costs became unsustainable. Launching premium-only would have avoided user frustration.

3. Less architectural ambition initially. The template system is elegant but complex. For the first 6 months, simple per-client settings would have been sufficient.

The Honest Assessment

Building EA taught me that 80% of what makes an AI assistant valuable is already solved by ChatGPT and Claude. The remaining 20%—multi-client context, confidence-based autonomy, real-time streaming—is where the real engineering challenge lies.

If you're building an AI product, don't compete on "has AI." Compete on the workflow-specific features that generic assistants can't provide. For EA, that's multi-client context switching. For your product, it's something else.

The architecture decisions outlined here aren't universally correct. They're correct for EA's specific requirements: real-time streaming, multi-tenant contexts, and cost-conscious AI usage. Your requirements will differ. But the decision-making framework—evaluate trade-offs explicitly, benchmark claims, and optimize for your actual bottlenecks—applies everywhere.

What's Next

EA is actively evolving. Current focus areas:

  • Voice transcription for hands-free operation
  • Calendar and email integration for automated action execution
  • Agency dashboards for VA teams managing dozens of clients

If you're building something similar, I'm happy to discuss architecture decisions. Find me on GitHub or LinkedIn.


This article is part of a series on building AI products. Next up: "AI Integration Patterns: Semantic Context, Confidence Scoring, and Token Optimization."

