AI System Design
The eight decisions you'll actually make when adding AI to a product. Framework from Aman Agarwal.
Start with users, not the system
Pick one segment. Map the journey. Find the pain. Solutions come after.
When someone says "design an AI system for X," the instinct is to talk about models and architecture. Resist it. Start with a person.
Weak: "I'd build an LLM-based chatbot to handle customer support queries."
Strong: "Let me start with the user. A power user on this telecom app calls support 3x/month about billing disputes. The pain isn't the call. It's that they can't self-serve a clear billing breakdown."
The three pillars
Every AI system has three load-bearing components. Name them explicitly.
Model
What does the thinking. What type? Why? What's it optimized for?
Data
What feeds the model. Where does it come from? How fresh does it need to be?
Memory
What persists across interactions. What does the system remember about this user?
Model
The model is the engine. Not every engine needs to be a jet turbine. Some jobs need a sewing machine.
Questions to ask: What type of input does it handle? Structured data or free text? What's the latency budget? Does the output need to be interpretable? How much does each inference cost?
Data
Most AI systems fail because of data problems, not model problems. The model is only as good as what you feed it.
Questions to ask: Where does the data live? How fresh does it need to be? Real-time or batch? How is it structured? Who owns it? What are the privacy constraints?
Memory
Without memory, every interaction starts from zero. The system doesn't know you called yesterday about the same problem.
Questions to ask: What should persist across sessions? What's session-scoped? Where does memory live (vector DB, cache, database)? What's the retrieval strategy?
LLM isn't the default
"I'd use an LLM" is not a design decision. It's a reflex.
Two jobs in the same system can need completely different models.
Job 1: Predict which customers will churn
Based on billing history, usage patterns, and tenure. Structured tabular data. The right fit: XGBoost, not an LLM.
Job 2: Talk to a customer about their billing problem
Needs natural language understanding, context handling, open-ended generation. The right fit: an LLM.
The decision framework
| Signal | Favors classical ML | Favors LLM |
|---|---|---|
| Data type | Structured, tabular | Unstructured text, images |
| Interpretability | Feature importance needed | Black box acceptable |
| Latency budget | <100ms | 1-30s acceptable |
| Cost sensitivity | High volume, low margin | High value per query |
| Adaptation | Stable schema | New data sources, open-ended input |
| Output type | Classification, regression | Generation, conversation |
The bar: "Here are the tradeoffs, here's my pick, here's why." Not "I'd use an LLM."
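The decision framework above can be sketched as a toy scoring function: each signal votes for one side, and the majority wins. The signal names, pole labels, and equal weighting are illustrative assumptions, not part of the source framework.

```python
# Toy encoding of the classical-ML-vs-LLM table: each signal votes
# for one side; the majority wins. Names and weights are illustrative.
def choose_model(signals: dict) -> str:
    """signals maps a dimension to one of its two poles, e.g.
    {"data_type": "tabular", "latency_budget": "sub_100ms"}."""
    favors_classical = {
        "data_type": "tabular",
        "interpretability": "feature_importance_needed",
        "latency_budget": "sub_100ms",
        "cost": "high_volume_low_margin",
        "adaptation": "stable_schema",
        "output": "classification_or_regression",
    }
    classical_votes = sum(
        1 for dim, pole in signals.items()
        if favors_classical.get(dim) == pole
    )
    llm_votes = len(signals) - classical_votes
    return "classical_ml" if classical_votes > llm_votes else "llm"

# Job 1: churn prediction on structured billing data
churn_job = {"data_type": "tabular",
             "output": "classification_or_regression",
             "latency_budget": "sub_100ms"}
# Job 2: open-ended billing conversation
support_job = {"data_type": "unstructured_text",
               "output": "generation",
               "latency_budget": "seconds_ok"}
```

The point isn't the function; it's that the choice is mechanical once you've named the signals honestly.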
Orchestration before agents
Design the router before you design the specialists.
Before you design individual agents, design the layer that decides which agent handles what.
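A minimal sketch of that routing layer, assuming a keyword-based intent router in front of two hypothetical agents; a production system would use a trained classifier or an LLM for the routing decision itself.

```python
# Minimal orchestration layer: the router decides which specialist
# handles a message before any specialist logic runs. Keyword
# matching is a stand-in for a real intent classifier.
ROUTES = {
    "billing_agent": ("bill", "charge", "refund", "invoice"),
    "technical_agent": ("signal", "outage", "slow", "dropped"),
}

def route(message: str) -> str:
    text = message.lower()
    for agent, keywords in ROUTES.items():
        if any(kw in text for kw in keywords):
            return agent
    return "fallback_agent"  # always define a default path
```

Note the explicit fallback: a router with no default path is itself a failure mode.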
Memory isn't one thing
Three tiers, three purposes, three persistence models.
Session memory: dies when the conversation ends. Implemented as conversation context.
Episodic memory: permanent, per-user. Lives in a vector database. Retrieved by similarity.
Semantic memory: permanent, shared across all users. RAG over product documentation.
For churn, episodic memory is king. A customer who called three times about the same unresolved billing issue is about to leave. If your AI doesn't know that, it'll give a generic response instead of "I see you've called about this roaming charge three times. Let me escalate this right now."
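A sketch of the three tiers, with plain dicts and lists standing in for the real stores (context window, vector DB, document index). The repeat-caller check mirrors the roaming-charge example; the class and method names are illustrative.

```python
# Three memory tiers with different scopes and lifetimes.
from collections import defaultdict

class Memory:
    def __init__(self):
        self.session = []                  # dies with the conversation
        self.episodic = defaultdict(list)  # permanent, per-user
        self.semantic = {}                 # permanent, shared (docs)

    def log_contact(self, user_id: str, issue: str):
        self.session.append(issue)
        self.episodic[user_id].append(issue)

    def repeat_contacts(self, user_id: str, issue: str) -> int:
        """How many times has THIS user raised THIS issue?"""
        return self.episodic[user_id].count(issue)

    def end_session(self):
        self.session.clear()  # episodic and semantic survive

mem = Memory()
for _ in range(3):
    mem.log_contact("user-42", "roaming charge")
    mem.end_session()

# A churn-aware agent checks episodic memory before replying:
escalate = mem.repeat_contacts("user-42", "roaming charge") >= 3
```

Without the episodic tier, `escalate` can never be computed and the system gives the generic response.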
Show failure modes
Candidates who only describe the happy path look junior. Systems that only handle the happy path break in production.
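One concrete pattern for handling two common failure modes: wrap the model call so that a timeout or a low-confidence answer degrades to a safe fallback instead of reaching the user raw. The function names and the confidence threshold are illustrative assumptions.

```python
# Never let a model error or a low-confidence answer reach the user.
def answer(query: str, model_call, min_confidence: float = 0.7) -> str:
    FALLBACK = "Let me connect you with a human agent."
    try:
        text, confidence = model_call(query)
    except TimeoutError:
        return FALLBACK          # model down or too slow
    if confidence < min_confidence:
        return FALLBACK          # confidently wrong is worse than honest
    return text

# Stand-in models for the three paths:
def flaky_model(query):
    raise TimeoutError

def unsure_model(query):
    return ("Maybe reboot?", 0.3)

def good_model(query):
    return ("Your plan renews on the 1st.", 0.95)
```

The same wrapper is where you'd hang detection (log every fallback) and mitigation (escalate after N fallbacks).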
Plan for 10x traffic
Your prototype works on 100 test calls. The telecom has 50 million subscribers.
The key principle: Load test the model APIs before launch, not after. Know your ceiling. If OpenAI rate-limits you at 10,000 requests/minute and you expect 50,000, that's a launch-blocking problem to find in week one, not week twelve.
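The rate-limit check above is just arithmetic, worth doing in week one. A sketch, using the figures from the text; the function name is illustrative.

```python
# Compare expected peak request rate against the provider's limit.
def capacity_gap(expected_rpm: int, rate_limit_rpm: int) -> int:
    """Requests per minute you cannot serve at peak (0 = fine)."""
    return max(0, expected_rpm - rate_limit_rpm)

# Figures from the text: limited to 10,000 req/min, expecting 50,000.
gap = capacity_gap(expected_rpm=50_000, rate_limit_rpm=10_000)
# gap > 0 is a launch-blocking problem, not a week-twelve surprise.
```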
Metrics across four layers
If you only measure one thing, you'll miss three ways the system can fail.
Model Layer
- Recall (are we catching the right intents?)
- Precision (are we classifying correctly?)
- Hallucination rate (<2% target)
Who cares: Engineering
Latency Layer
- p95 response time (<3s target)
- Time to first token
- Model API uptime
Who cares: Engineering + Product
User Layer
- CSAT (>4.0 target)
- % resolved without escalation (target: 60% in month 1)
- Repeat contact rate (should decline)
Who cares: Product
Business Layer
- Retention lift (the exec metric)
- Support cost per ticket
- Revenue impact
Who cares: Leadership
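One way to keep all four layers honest is a single health check with one gate per layer. The thresholds below come from the targets listed above; the field names and dashboard wiring are illustrative.

```python
# One health gate per layer, using the targets from the text.
def layer_health(m: dict) -> dict:
    return {
        "model": m["hallucination_rate"] < 0.02,        # <2% target
        "latency": m["p95_seconds"] < 3.0,              # p95 <3s
        "user": m["csat"] > 4.0 and m["resolved_pct"] >= 0.60,
        "business": m["retention_lift"] > 0.0,          # the exec metric
    }

snapshot = {
    "hallucination_rate": 0.01, "p95_seconds": 2.4,
    "csat": 4.3, "resolved_pct": 0.62, "retention_lift": 0.015,
}
health = layer_health(snapshot)
# A green board requires all four layers healthy at once.
```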
Why all four matter
- Model metrics are perfect but users hate it: the UX is broken, not the model.
- Users love it but business metrics are flat: you're solving a real problem that doesn't move the needle.
- Latency is great but the model is hallucinating: you're confidently wrong, fast.
- Everything works but latency is 30s: users abandon before they ever see the right answer.
The Test
Eight questions. One per principle. See if you've internalized the framework.
The full sequence
In an interview, walk through in order. In product work, revisit as you learn.
1. Users first (product thinking): pick a segment, map the journey, find the pain.
2. Three pillars (architecture): name model, data, memory explicitly.
3. Model selection (technical fluency): argue the tradeoffs, pick, justify.
4. Orchestration (systems thinking): design the router before the agents.
5. Memory tiers (depth): session, episodic, semantic, scoped correctly.
6. Failure modes (production readiness): name them, detect them, mitigate them.
7. Scale plan (operational maturity): 10x traffic, bottlenecks, load testing.
8. Four-layer metrics (business judgment): model, latency, user, business.