AI System Design

The eight decisions you'll actually make when adding AI to a product. Framework from Aman Agarwal.

Start with users, not the system

Pick one segment. Map the journey. Find the pain. Solutions come after.

When someone says "design an AI system for X," the instinct is to talk about models and architecture. Resist it. Start with a person.

What most people say

"I'd build an LLM-based chatbot to handle customer support queries."

What you should say

"Let me start with the user. A power user on this telecom app calls support 3x/month about billing disputes. The pain isn't the call. It's that they can't self-serve a clear billing breakdown."

PM analogy: You'd never write a PRD starting with "we'll use PostgreSQL." You start with the user problem. Same thing here. The AI part comes later. The user part comes first.

The three pillars

Every AI system has three load-bearing components. Name them explicitly.

Model

What does the thinking. What type? Why? What's it optimized for?

Data

What feeds the model. Where does it come from? How fresh does it need to be?

Memory

What persists across interactions. What does the system remember about this user?

Model

The model is the engine. Not every engine needs to be a jet turbine. Some jobs need a sewing machine.

Questions to ask: What type of input does it handle? Structured data or free text? What's the latency budget? Does the output need to be interpretable? How much does each inference cost?

Data

Most AI systems fail because of data problems, not model problems. The model is only as good as what you feed it.

Questions to ask: Where does the data live? How fresh does it need to be? Real-time or batch? How is it structured? Who owns it? What are the privacy constraints?

Memory

Without memory, every interaction starts from zero. The system doesn't know you called yesterday about the same problem.

Questions to ask: What should persist across sessions? What's session-scoped? Where does memory live (vector DB, cache, database)? What's the retrieval strategy?

Connection to Data Pipelines: The data pillar is the pipeline you learned about. The difference is that here, the input isn't training data. It's customer records, chat logs, usage patterns. Same concept, different source.
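Naming the three pillars explicitly can be sketched as a system spec. This is a hypothetical illustration, not a real API: every class and field name here is invented for the example, and the churn-system values echo the telecom scenario above.

```python
from dataclasses import dataclass

# Illustrative sketch: the three load-bearing pillars, named explicitly.
# All class/field names and example values are hypothetical.

@dataclass
class ModelSpec:
    kind: str                     # what does the thinking
    input_type: str               # "structured" or "free text"
    latency_budget_ms: int
    cost_per_inference_usd: float

@dataclass
class DataSpec:
    sources: list                 # where the data lives
    freshness: str                # "real-time" or "batch"
    owner: str

@dataclass
class MemorySpec:
    session_store: str            # dies with the conversation
    episodic_store: str           # permanent, per-user
    semantic_store: str           # permanent, shared knowledge

@dataclass
class AISystemSpec:
    model: ModelSpec
    data: DataSpec
    memory: MemorySpec

churn_system = AISystemSpec(
    model=ModelSpec("gradient-boosted trees", "structured", 100, 0.001),
    data=DataSpec(["billing DB", "usage logs"], "batch", "data platform team"),
    memory=MemorySpec("conversation context", "per-user interaction history",
                      "billing policy docs"),
)
```

Writing the spec down this way forces the "name them explicitly" discipline: a missing pillar shows up as an empty field.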

LLM isn't the default

"I'd use an LLM" is not a design decision. It's a reflex.

Two jobs in the same system can need completely different models. Compare the right model for each job:

Job 1: Predict which customers will churn

Based on billing history, usage patterns, tenure. Structured tabular data.

XGBoost

  • Latency: <100ms
  • Cost: $0.001/pred
  • Interpretable: Yes
  • Data type: Structured

vs

LLM

  • Latency: 1-10s
  • Cost: $0.01-0.10/pred
  • Interpretable: No
  • Data type: Unstructured

Job 2: Talk to a customer about their billing problem

Needs natural language understanding, context handling, open-ended generation.

XGBoost

  • Language gen: No
  • Context: None
  • Flexibility: Fixed schema

vs

LLM

  • Language gen: Yes
  • Context: Full window
  • Flexibility: Open-ended

The decision framework

Signal | Favors classical ML | Favors LLM
Data type | Structured, tabular | Unstructured text, images
Interpretability | Feature importance needed | Black box acceptable
Latency budget | <100ms | 1-30s acceptable
Cost sensitivity | High volume, low margin | High value per query
Adaptation | Stable schema | New data sources, open-ended input
Output type | Classification, regression | Generation, conversation
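The decision framework can be sketched as a simple vote over the signals. This is a hedged illustration, not a vetted rubric: the function name, thresholds, and equal weighting are all assumptions made for the example.

```python
# Illustrative sketch of the decision framework as a vote across signals.
# Thresholds and equal weights are assumptions, not a vetted rubric.

def pick_model_family(data_type, needs_interpretability, latency_budget_ms,
                      high_volume, output_type):
    """Return 'classical' or 'llm' based on the framework's signals."""
    classical = 0
    llm = 0
    classical += data_type == "structured"          # tabular favors classical
    llm += data_type == "unstructured"              # text/images favor LLM
    classical += needs_interpretability             # feature importance needed
    classical += latency_budget_ms < 100            # tight latency budget
    llm += latency_budget_ms >= 1000                # seconds are acceptable
    classical += high_volume                        # cost-sensitive at scale
    classical += output_type in ("classification", "regression")
    llm += output_type in ("generation", "conversation")
    return "classical" if classical >= llm else "llm"

# Job 1, churn prediction: structured data, tight latency, high volume.
print(pick_model_family("structured", True, 80, True, "classification"))
# Job 2, billing conversation: free text, open-ended generation.
print(pick_model_family("unstructured", False, 5000, False, "conversation"))
```

The point is not the scoring; it is that "here are the tradeoffs, here's my pick, here's why" can be made mechanical enough to argue about.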

The bar: "Here are the tradeoffs, here's my pick, here's why." Not "I'd use an LLM."

Orchestration before agents

Design the router before you design the specialists.

Before you design individual agents, design the layer that decides which agent handles what.

  • Router: classify + route
  • Analyst: data queries
  • Voice Bot: conversation
  • Executor: actions
  • Human: fallback
  • Response: back to user
PM analogy: This is triage. An ER doesn't send you straight to a surgeon. There's a triage desk that assesses you and routes you to the right specialist. Without triage, you have specialists standing around with no one directing traffic.

Connection to Agent Teams: The six knobs apply to every agent here. The router's termination condition is "response delivered." The analyst's tools include database access. The voice bot's isolation keeps it from touching account settings.
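A minimal sketch of router-before-specialists, assuming a stubbed intent classifier (in a real system this would be a model call) and hypothetical handler functions:

```python
# Sketch of the orchestration layer: classify the message, then dispatch.
# classify_intent is a keyword stub standing in for a real classifier;
# all handler names are hypothetical.

def classify_intent(message: str) -> tuple:
    """Stub classifier: returns (intent, confidence)."""
    text = message.lower()
    if "how much" in text or "charge" in text:
        return "data_query", 0.9
    if "cancel" in text:
        return "action", 0.85
    return "conversation", 0.6

def handle_analyst(msg):   return f"analyst: querying billing data for '{msg}'"
def handle_voice_bot(msg): return f"voice bot: responding to '{msg}'"
def handle_executor(msg):  return f"executor: performing action for '{msg}'"
def handle_human(msg):     return f"human handoff: '{msg}'"

AGENTS = {
    "data_query": handle_analyst,
    "conversation": handle_voice_bot,
    "action": handle_executor,
}

CONFIDENCE_FLOOR = 0.7  # below this, fall back to a human

def route(message: str) -> str:
    intent, confidence = classify_intent(message)
    if confidence < CONFIDENCE_FLOOR:
        return handle_human(message)
    return AGENTS[intent](message)

print(route("Why was I charged for roaming?"))   # routes to the analyst
print(route("Tell me a bit about my plan"))      # low confidence, human fallback
```

Note that the human fallback lives in the router, not in any specialist: the triage desk decides when no specialist should be trusted.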

Memory isn't one thing

Three tiers, three purposes, three persistence models.

Session

Current conversation state. "User asked about billing, then changed to data usage." Dies when conversation ends. Implemented as conversation context.

Episodic

Past interactions with this specific user. "Called 3x in March about the same roaming charge." Permanent, per-user. Lives in a vector database. Retrieved by similarity.

Semantic

Knowledge base, docs, policies. "Roaming charges apply outside home network after 2GB." Permanent, shared across all users. RAG over product documentation.

For churn, episodic memory is king. A customer who called three times about the same unresolved billing issue is about to leave. If your AI doesn't know that, it'll give a generic response instead of "I see you've called about this roaming charge three times. Let me escalate this right now."

PM analogy: A good barista knows three things. What you just ordered (session). That you always get oat milk and complained about it being out last Tuesday (episodic). That oat milk costs $0.50 extra and is in the back fridge (semantic).
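The three tiers can be sketched with their scoping rules made explicit. This is an illustration only: the stores are plain Python containers standing in for conversation context, a vector database, and a RAG index.

```python
# Sketch of the three memory tiers. Plain lists/dicts stand in for the
# real stores (context window, vector DB, RAG index); names are hypothetical.

class Memory:
    def __init__(self):
        self.session = []       # session-scoped: dies with the conversation
        self.episodic = {}      # permanent, keyed per user
        self.semantic = [       # permanent, shared across all users
            "Roaming charges apply outside home network after 2GB.",
        ]

    def remember_turn(self, user_id, utterance):
        self.session.append(utterance)
        self.episodic.setdefault(user_id, []).append(utterance)

    def end_session(self):
        self.session = []       # only the session tier is cleared

    def repeat_contacts(self, user_id, topic):
        """Episodic lookup: how often has this user raised this topic?"""
        return sum(topic in entry for entry in self.episodic.get(user_id, []))

mem = Memory()
for _ in range(3):
    mem.remember_turn("user-42", "dispute: roaming charge")
    mem.end_session()

# Episodic memory survives the session resets, enabling the churn signal:
if mem.repeat_contacts("user-42", "roaming charge") >= 3:
    print("Escalate: repeated unresolved issue")
```

The scoping is the whole design decision: clear the wrong tier and the system forgets the three calls that predict churn.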

Show failure modes

Candidates who only describe the happy path look junior. Systems that only handle the happy path break in production.

Failure | Detection | Mitigation
Model down | Health check, timeout | Human handoff with transcript
High latency | p95 monitoring, >30s threshold | Human handoff, async notification
🔄 Repeat question | Semantic similarity on consecutive messages | Escalate immediately. You've failed.
Low confidence | Score <0.7 on intent classification | Route to human with context summary
👻 Hallucination | Grounding check vs. source docs | Flag, serve verified response

PM analogy: Every product has error states. 404 pages, failed payments, timeout screens. You design those before launch, not after. AI systems have the same need. The error states are just different.
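Two of these detections can be sketched in a few lines. Note the hedge: `difflib.SequenceMatcher` is a lexical stand-in for real semantic similarity, and the thresholds simply echo the table above.

```python
# Sketch of three detections: latency guard, confidence floor, and
# repeat-question check. SequenceMatcher is a lexical stand-in for a
# semantic-similarity model; thresholds mirror the table above.

import difflib

LATENCY_THRESHOLD_S = 30
CONFIDENCE_FLOOR = 0.7
REPEAT_SIMILARITY = 0.8

def similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def should_escalate(prev_msg, curr_msg, response_time_s, intent_confidence):
    """Return an escalation reason, or None if the happy path holds."""
    if response_time_s > LATENCY_THRESHOLD_S:
        return "handoff: high latency"
    if intent_confidence < CONFIDENCE_FLOOR:
        return "handoff: low confidence"
    if prev_msg and similarity(prev_msg, curr_msg) > REPEAT_SIMILARITY:
        return "escalate: repeat question"
    return None

print(should_escalate("why was I charged for roaming",
                      "why was I charged for roaming?", 2.0, 0.9))
```

The detections are cheap; the design work is deciding, per failure, who catches the user when the model drops them.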

Plan for 10x traffic

Your prototype works on 100 test calls. The telecom has 50 million subscribers.

Embedding Search

Problem: SQL can't do similarity search on millions of vectors fast enough.
Fix: Vector database (Pinecone, Weaviate, pgvector). Cache top N frequent query embeddings.

Model API Rate Limits

Problem: External LLM APIs have throughput ceilings you'll hit at scale.
Fix: Batch non-real-time work (churn scoring) nightly. Self-host for latency-critical paths.

Cache Misses

Problem: Same questions get asked thousands of times, each hitting the model.
Fix: Cache frequent query embeddings and common responses. Huge cost savings.

The key principle: Load test the model APIs before launch, not after. Know your ceiling. If OpenAI rate-limits you at 10,000 requests/minute and you expect 50,000, that's a launch-blocking problem to find in week one, not week twelve.
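The cache-miss fix can be sketched with a memoized response layer. This is a simplified illustration: the model call is a stub, and a real system would cache on an embedding of the query rather than the literal string.

```python
# Sketch of the caching fix: memoize responses so repeated questions never
# hit the model. call_model is a stub for the expensive LLM call.

from functools import lru_cache

MODEL_CALLS = 0

def call_model(query: str) -> str:
    """Stub for the expensive LLM call; counts invocations."""
    global MODEL_CALLS
    MODEL_CALLS += 1
    return f"answer to: {query}"

@lru_cache(maxsize=10_000)
def answer(normalized_query: str) -> str:
    return call_model(normalized_query)

def handle(query: str) -> str:
    # Normalizing first raises the hit rate. A production system would
    # key the cache on a query embedding, not the literal string.
    return answer(query.strip().lower())

for _ in range(1000):
    handle("How do I check my data usage?")
print(MODEL_CALLS)  # the model ran once, not 1000 times
```

At 50 million subscribers, the gap between one model call and a thousand identical ones is the difference between a viable unit cost and a launch-blocking one.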

Metrics across four layers

If you only measure one thing, you'll miss three ways the system can fail.

Model Layer

  • Recall (are we catching the right intents?)
  • Precision (are we classifying correctly?)
  • Hallucination rate (<2% target)

Who cares: Engineering

Latency Layer

  • p95 response time (<3s target)
  • Time to first token
  • Model API uptime

Who cares: Engineering + Product

User Layer

  • CSAT (>4.0 target)
  • % resolved without escalation (60% target in month 1)
  • Repeat contact rate (should decline)

Who cares: Product

Business Layer

  • Retention lift (the exec metric)
  • Support cost per ticket
  • Revenue impact

Who cares: Leadership
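Two of the layer targets above can be checked mechanically. A minimal sketch, with made-up sample data and a nearest-rank p95:

```python
# Sketch of checking two layer targets: p95 latency (<3s) and
# hallucination rate (<2%). Sample data is invented for illustration.

def p95(samples):
    """Nearest-rank 95th percentile of a list of samples."""
    ordered = sorted(samples)
    idx = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[idx]

latencies_s = [0.8, 1.2, 1.1, 2.4, 0.9, 1.5, 2.9, 1.0, 1.3, 2.8]
flagged_hallucinations, total_responses = 12, 1000

checks = {
    "p95_latency_ok": p95(latencies_s) < 3.0,
    "hallucination_rate_ok": flagged_hallucinations / total_responses < 0.02,
}
print(checks)
```

User- and business-layer metrics (CSAT, retention lift) don't reduce to a few lines like this, which is exactly why dashboards tend to over-report the model and latency layers.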

Why all four matter

Model metrics are perfect but users hate it

The UX is broken, not the model.

Users love it but business metrics are flat

You're solving a real problem that doesn't move the needle.

Latency is great but model is hallucinating

You're confidently wrong, fast.

Everything works but latency is 30s

Users will abandon before seeing the right answer.

PM analogy: You already measure products this way. Uptime (system), page load (latency), NPS (user), revenue (business). AI systems need the same stack. The model layer is just a new row in your metrics dashboard.

The Test

Eight questions. One per principle. See if you've internalized the framework.

The full sequence

In an interview, walk through in order. In product work, revisit as you learn.

1

Users first (product thinking)

Pick a segment, map the journey, find the pain.

2

Three pillars (architecture)

Name model, data, memory explicitly.

3

Model selection (technical fluency)

Argue the tradeoffs, pick, justify.

4

Orchestration (systems thinking)

Design the router before the agents.

5

Memory tiers (depth)

Session, episodic, semantic. Scoped correctly.

6

Failure modes (production readiness)

Name them, detect them, mitigate them.

7

Scale plan (operational maturity)

10x traffic, bottlenecks, load testing.

8

Four-layer metrics (business judgment)

Model, latency, user, business.

Framework credit: Aman Agarwal, who identified these eight gaps from running live AI PM mock interviews.