LLM Application Development: A Complete Developer’s Guide (2026)

Building production-grade LLM applications is different from writing scripts that call an AI API. You need to think about prompts, context management, retrieval, tool use, streaming, error handling, and cost. This guide covers the full stack of LLM application development — from architecture decisions to deployment patterns — with Python code you can use immediately.


What Is LLM Application Development?

LLM application development is the practice of building software that uses large language models as a core component. Unlike traditional ML, where you train and deploy a model you own, LLM apps consume pre-trained models through APIs (Claude, GPT-4, Gemini) and focus on what surrounds the model: how you structure prompts, what data you inject, how you handle multi-step reasoning, and how you connect the model to external systems.

The key insight: the model is a commodity. The value is in the application layer — the context you provide, the tools you expose, and the orchestration logic you build around it.


Core Architecture Components

1. The Prompt Layer

Every LLM application starts with prompts. A production prompt has three parts:

  • System prompt — defines the model’s persona, constraints, output format
  • Context injection — dynamic data inserted at request time (user history, retrieved docs, tool results)
  • User turn — the actual input from the user or the application

# Minimal prompt layer: a fixed system prompt plus a single user turn
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a code review assistant.
Rules:
- Review only Python code
- Focus on correctness, security, performance
- Output structured feedback in this format:
  SEVERITY: [critical|warning|info]
  ISSUE: <description>
  FIX: <suggested change>
"""

def review_code(code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": f"Review this code:\n\n```python\n{code}\n```"}
        ]
    )
    return response.content[0].text
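
The code review example uses a system prompt and a user turn but no context injection. A minimal sketch of how all three parts might be assembled at request time is shown here. The helper name build_request and the <context> wrapper are assumptions made for illustration, not part of any SDK; the returned dict simply mirrors the system and messages arguments used in the example above.

```python
def build_request(system_prompt: str, context_items: list[str], user_input: str) -> dict:
    """Assemble the three prompt parts into keyword arguments for a chat call.

    Hypothetical helper: the <context> wrapper is a convention chosen for
    this sketch, not something the API requires.
    """
    parts = []
    if context_items:
        joined = "\n".join(f"- {item}" for item in context_items)
        parts.append(f"<context>\n{joined}\n</context>")  # dynamic data, rebuilt per request
    parts.append(user_input)  # the actual user turn goes last
    return {
        "system": system_prompt,  # static persona, rules, output format
        "messages": [{"role": "user", "content": "\n\n".join(parts)}],
    }


request = build_request(
    "You are a code review assistant.",
    ["Project targets Python 3.11", "Team uses type hints everywhere"],
    "Review this function: def f(x): return x*2",
)
print(request["messages"][0]["content"])
```

Keeping the static system prompt separate from the per-request context also makes the static part easier to version and cache.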

2. Context Management

The context window is your most important resource. LLMs can only process what fits in the window, so you need to be deliberate about what you include.

  • Sliding window — keep only the last N messages for long conversations
  • Summarization — compress old turns into a summary and inject it
  • Retrieval — pull only the relevant chunks from a larger knowledge base (RAG)
  • Structured injection — use XML tags or headers to clearly delimit injected data

def build_context(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the most recent messages that fit a rough token budget (1 token ≈ 4 chars).

    Assumes each message's "content" is a plain string.
    """
    budget = max_tokens * 4  # budget in characters
    kept = []
    for msg in reversed(messages):  # walk from newest to oldest
        content = msg["content"]
        if len(content) > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        budget -= len(content)
    kept.reverse()  # restore chronological order
    return kept


def inject_documents(query: str, docs: list[str]) -> str:
    """Inject retrieved documents with clear structure."""
    docs_block = "\n".join(
        f"<document index=\"{i+1}\">\n{doc}\n</document>"
        for i, doc in enumerate(docs)
    )
    return f"""<retrieved_documents>
{docs_block}
</retrieved_documents>

User question: {query}"""
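
The summarization strategy from the list above has no example yet, so here is a minimal sketch of it. The function name compact_history, the <conversation_summary> tag, and the summarize callback are all assumptions for illustration; in a real application the callback would be a model call that condenses the old turns. Depending on the API, the summary message may need to be merged with the next user turn to keep roles alternating.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    keep_last: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Replace all but the last `keep_last` messages with one summary message."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(messages) <= keep_last:
        return list(messages)  # nothing old enough to compress
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary_msg = {
        "role": "user",
        "content": f"<conversation_summary>\n{summarize(old)}\n</conversation_summary>",
    }
    return [summary_msg] + recent


history = [{"role": "user", "content": f"message {i}"} for i in range(6)]
# Stub summarizer; a real one would ask a model to condense `old`
compacted = compact_history(history, keep_last=2, summarize=lambda old: f"{len(old)} earlier messages")
print(len(compacted))
```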

3. Tool Use (Function Calling)

Tool use lets the model call external functions such as databases, APIs, calculators, and file systems. The model decides when to call a tool; your code executes it and passes the result back to the model as context.

tools = [
    {
        "name": "search_database",
        "description": "Search the product database for items matching the query",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results", "default": 5}
            },
            "required": ["query"]
        }
    }
]

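def run_agent_turn(user_message: str, conversation: list, max_steps: int = 10) -> str:
    conversation.append({"role": "user", "content": user_message})

    for _ in range(max_steps):  # cap the loop so a confused model cannot spin forever
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=conversation
        )
        # Record the assistant turn (text and tool_use blocks) before acting on it
        conversation.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # end_turn, max_tokens, etc.: return whatever text was produced
            return "".join(b.text for b in response.content if b.type == "text")

        # Run every requested tool; execute_tool is your own dispatcher function
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result)
                })

        conversation.append({"role": "user", "content": tool_results})

    raise RuntimeError("Agent did not finish within max_steps")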
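
The loop above calls execute_tool without defining it. One possible shape for that dispatcher is sketched here, assuming a simple name-to-function registry. The in-memory catalog and the TOOL_REGISTRY name are stand-ins invented for this example; a real search_database would query your own data store.

```python
import json


def search_database(query: str, limit: int = 5) -> list[dict]:
    """Stand-in for a real product search; returns canned rows for the sketch."""
    catalog = [
        {"id": 1, "name": "USB-C cable"},
        {"id": 2, "name": "USB-C charger"},
        {"id": 3, "name": "HDMI cable"},
    ]
    matches = [row for row in catalog if query.lower() in row["name"].lower()]
    return matches[:limit]


# Map tool names (as declared in the tool schema) to Python callables
TOOL_REGISTRY = {"search_database": search_database}


def execute_tool(name: str, tool_input: dict) -> str:
    """Run a tool by name and always return a string, even on failure."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:  # report the failure to the model instead of crashing the loop
        return json.dumps({"error": str(exc)})


print(execute_tool("search_database", {"query": "usb", "limit": 1}))
```

Returning errors as tool results rather than raising gives the model a chance to correct its arguments on the next turn.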

RAG: Retrieval-Augmented Generation

RAG is the most common pattern for grounding LLM applications in real data. Instead of relying on the model’s training data, you retrieve relevant chunks from your own knowledge base at query time.

The basic RAG pipeline:

  1. User asks a question
  2. Embed the question into a vector
  3. Search a vector store for similar document chunks
  4. Inject the top-k chunks into the prompt as context
  5. The model answers using both its training and the injected context
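
import chromadb
from anthropic import Anthropic

client = Anthropic()
chroma = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data on disk
collection = chroma.get_or_create_collection("knowledge_base")
# Assumes documents were already indexed, e.g. collection.add(documents=[...], ids=[...])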

def rag_query(question: str, top_k: int = 3) -> str:
    # Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=top_k
    )
    docs = results["documents"][0]

    # Build context-grounded prompt
    context = "\n\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(docs))
    prompt = f"""Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}"""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
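
Steps 2 and 3 of the pipeline (embed, then search by similarity) happen inside the vector store in the code above. To show what they amount to, here is a toy sketch using only the standard library. The bag-of-words "embedding" is a deliberate simplification and an assumption of this example; production systems use a learned embedding model and an approximate nearest-neighbor index.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "A refund is processed within 5 business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
print(top_k("How do I get a refund?", chunks, k=2))
```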

For a deeper dive, see the RAG Tutorial with Python and How to Build a RAG Chatbot.


Streaming Responses

For user-facing applications, streaming is essential. Users see the response as it’s generated instead of waiting for the full output.

def stream_response(user_message: str):
    """Stream tokens to the client as they arrive."""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            # In a web app, yield text to SSE or WebSocket
    print()  # newline after stream

# FastAPI example
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat_stream(message: str):  # a bare str parameter is read from the query string
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}]
        ) as stream:
            for text in stream.text_stream:
                # JSON-encode each chunk: raw newlines in the text would break SSE framing
                yield f"data: {json.dumps(text)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
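
Server-sent events are delimited by a blank line, so a chunk that itself contains newlines has to be encoded before it is framed. The sketch below shows one way to do that, with a matching client-side parser. The function names sse_event and parse_sse are hypothetical, and JSON encoding is just one reasonable choice of escaping.

```python
import json


def sse_event(payload: str) -> str:
    """Frame one text chunk as an SSE event, JSON-encoding it so that
    newlines inside the chunk cannot split the event."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(stream: str) -> list[str]:
    """Client-side counterpart: recover the chunks, stopping at the [DONE] sentinel."""
    chunks = []
    for event in stream.split("\n\n"):
        if not event.startswith("data: "):
            continue  # skip empty trailing segments
        data = event[len("data: "):]
        if data == "[DONE]":
            break
        chunks.append(json.loads(data))
    return chunks


wire = sse_event("Hello,\n") + sse_event("world\n\nbye") + "data: [DONE]\n\n"
print(parse_sse(wire))
```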

Prompt Caching

Prompt caching cuts cost and latency when you reuse a large system prompt or document set across requests. The Claude API supports it: mark the end of the cacheable prefix with a cache_control breakpoint:

# System prompt with 5,000 tokens — mark it for caching
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # your large system prompt
            "cache_control": {"type": "ephemeral"}  # cache this
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

# Cache hit/miss info
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")  # if cached
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")  # if written

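Cache reads are billed at roughly 10% of the normal input price (about a 90% saving on the cached portion), and Anthropic reports latency reductions of up to 85% for long prompts. Cache writes cost somewhat more than ordinary input tokens, so caching pays off only when the same prefix is reused before the entry expires (five minutes by default). For RAG apps that inject the same large document set on every request, this is a major cost saver.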
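
To see what these savings mean for a concrete workload, the arithmetic can be sketched as follows. The multipliers are assumptions for this example (a 25% premium on the request that writes the cache, and reads at 10% of the base input price), and the calculation assumes every later request hits the cache; check current pricing before relying on the numbers.

```python
def cached_input_cost(
    prefix_tokens: int,
    requests: int,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed premium for the first, cache-writing request
    read_multiplier: float = 0.10,   # assumed rate for later cache hits
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in dollars for the shared prefix only."""
    per_request = prefix_tokens / 1_000_000 * price_per_mtok
    without = per_request * requests
    with_cache = per_request * write_multiplier + per_request * read_multiplier * (requests - 1)
    return without, with_cache


# A 5,000-token system prompt reused across 100 requests at $3 per 1M input tokens
without, with_cache = cached_input_cost(prefix_tokens=5_000, requests=100, price_per_mtok=3.0)
print(f"no cache: ${without:.2f}, cached: ${with_cache:.2f}")
```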


Structured Output

For applications that need to parse model responses programmatically, define the output structure in your prompt and validate it.

import json
from pydantic import BaseModel

class TaskExtraction(BaseModel):
    title: str
    priority: str  # high | medium | low
    due_date: str | None
    tags: list[str]

def extract_task(user_input: str) -> TaskExtraction:
    prompt = f"""Extract task information from the user's message and return ONLY valid JSON.

Schema:
{{
  "title": "string",
  "priority": "high|medium|low",
  "due_date": "YYYY-MM-DD or null",
  "tags": ["string"]
}}

User message: {user_input}"""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    raw = response.content[0].text.strip()
    # Strip markdown code block if present
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    return TaskExtraction(**json.loads(raw))
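
Models occasionally return invalid JSON or values outside the allowed set, so it helps to validate and retry with the error fed back into the prompt. The sketch below uses only the standard library rather than pydantic, and takes the model call as a parameter (ask) so the retry logic can be read and tested on its own. The names parse_task and extract_with_retry are hypothetical.

```python
import json
from typing import Callable

ALLOWED_PRIORITIES = {"high", "medium", "low"}


def parse_task(raw: str) -> dict:
    """Parse and validate one reply; raise ValueError with a message the model can act on."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("reply contains no JSON object")
    try:
        data = json.loads(raw[start:end + 1])  # tolerates code fences or chatter around the JSON
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if data.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError("priority must be one of high, medium, low")
    return data


def extract_with_retry(ask: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call `ask` until the reply validates, feeding the error back each time."""
    current = prompt
    for _ in range(max_attempts):
        raw = ask(current)
        try:
            return parse_task(raw)
        except ValueError as err:
            current = f"{prompt}\n\nYour previous reply was rejected: {err}. Return only valid JSON."
    raise ValueError(f"no valid reply after {max_attempts} attempts")
```

In the example above, ask would wrap the client.messages.create call and return the text of the first content block.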

Error Handling and Retries

LLM APIs fail. Network timeouts, rate limits, malformed outputs — handle them at every layer.

import time
from anthropic import APIConnectionError, APIStatusError, RateLimitError

# Note: the SDK client also retries some failures on its own (see its max_retries option)
def robust_completion(messages: list, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text

        except RateLimitError:
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)

        except APIConnectionError:  # network failures, including APITimeoutError
            if attempt == max_retries - 1:
                raise
            time.sleep(1)

        except APIStatusError as e:  # any other non-2xx response; has status_code
            if e.status_code >= 500 and attempt < max_retries - 1:  # server error: retry
                time.sleep(1)
            else:
                raise  # client error or out of attempts: don't retry

    raise RuntimeError("Max retries exceeded")
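
When many clients back off on the same schedule they tend to retry at the same instant. A common refinement is to add random jitter to each wait. The sketch below is a generic, library-independent version of that idea; the helper names, the 30-second cap, and the injectable sleep parameter (used so the logic can be tested without waiting) are choices made for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Upper bound for each wait: base * 2**attempt, clipped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on a retryable error, wait a random time up to the bound and try again."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")


print(backoff_delays(6))
```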

Multi-Agent Patterns

Complex tasks benefit from multiple specialized agents rather than one generalist model with a giant prompt. Common patterns:

  • Pipeline — Agent A extracts, Agent B validates, Agent C formats. Each does one job well.
  • Parallel fan-out — Split a task into independent subtasks, run them concurrently, merge results.
  • Orchestrator + workers — A planner agent breaks the task into steps, specialist agents execute each step.
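
import asyncio
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()  # async variant of the client

# Placeholder system prompts for this example; replace with your own
SECURITY_PROMPT = "Review this Python code for security issues."
PERFORMANCE_PROMPT = "Review this Python code for performance problems."
STYLE_PROMPT = "Review this Python code for style and readability."

async def parallel_analysis(code: str) -> dict:
    """Run security, performance, and style checks concurrently."""

    async def check(system_prompt: str, label: str) -> tuple[str, str]:
        response = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",  # cheaper model for subtasks
            max_tokens=512,
            system=system_prompt,
            messages=[{"role": "user", "content": code}]
        )
        return label, response.content[0].text

    results = await asyncio.gather(
        check(SECURITY_PROMPT, "security"),
        check(PERFORMANCE_PROMPT, "performance"),
        check(STYLE_PROMPT, "style"),
    )
    return dict(results)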
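
The fan-out code covers the parallel pattern. The pipeline pattern is simpler still: each stage's output becomes the next stage's input. Here is a minimal sketch with stub functions standing in for model calls; run_pipeline and the three stage functions are hypothetical names, and in practice each stage would be a separate prompt, possibly on a different model.

```python
from typing import Callable

Stage = Callable[[str], str]


def run_pipeline(
    stages: list[tuple[str, Stage]], initial_input: str
) -> tuple[str, list[tuple[str, str]]]:
    """Feed each stage's output into the next and keep a trace for debugging."""
    trace = []
    current = initial_input
    for name, stage in stages:
        current = stage(current)
        trace.append((name, current))
    return current, trace


# Stub stages standing in for model calls with different system prompts
def extract(text: str) -> str:
    return text.split(":", 1)[1].strip()


def validate(text: str) -> str:
    if not text:
        raise ValueError("extractor returned nothing")
    return text


def format_output(text: str) -> str:
    return f"TASK: {text.upper()}"


final, trace = run_pipeline(
    [("extract", extract), ("validate", validate), ("format", format_output)],
    "todo: ship the release notes",
)
print(final)  # TASK: SHIP THE RELEASE NOTES
```

Keeping the per-stage trace makes it much easier to see which agent introduced an error.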

See the Multi-Agent Systems guide for full patterns and orchestration code.


Cost Optimization

LLM costs scale with token usage. Key levers:

  • Model selection — use Haiku for classification/extraction, Sonnet for reasoning, Opus only when needed
  • Prompt caching — cache large system prompts and shared context (90% savings on cache hits)
  • Output limits — set max_tokens tight; don’t let the model ramble
  • Batching — use the Batch API for offline workloads (50% discount, async processing)
  • Truncation — slice conversation history; you rarely need 20 turns of context

# Model cost comparison (Claude, approximate as of 2026; check the current pricing page before relying on these):
# claude-haiku-4-5     $1    / 1M input,  $5    / 1M output   (fast, cheap)
# claude-sonnet-4-6    $3    / 1M input,  $15   / 1M output   (balanced)
# claude-opus-4-7      $15   / 1M input,  $75   / 1M output   (most capable)

def choose_model(task_type: str) -> str:
    routing = {
        "classify":    "claude-haiku-4-5-20251001",
        "extract":     "claude-haiku-4-5-20251001",
        "summarize":   "claude-haiku-4-5-20251001",
        "reason":      "claude-sonnet-4-6",
        "code_review": "claude-sonnet-4-6",
        "research":    "claude-opus-4-7",
    }
    return routing.get(task_type, "claude-sonnet-4-6")
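
To compare routing choices it helps to turn token counts into dollars. The sketch below is a rough per-request estimator; the function name is hypothetical, prices are passed in rather than hardcoded because they change, and the batch flag applies the 50% batch discount mentioned in the list above.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # dollars per 1M input tokens
    output_price: float,  # dollars per 1M output tokens
    batch: bool = False,
) -> float:
    """Dollar cost of one request; batch=True halves it per the batch discount."""
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return cost / 2 if batch else cost


# 2,000 input + 500 output tokens at $3 / $15 per 1M tokens
single = estimate_cost(2_000, 500, 3.0, 15.0)
batched = estimate_cost(2_000, 500, 3.0, 15.0, batch=True)
print(f"per request: ${single:.4f}, per 10k batched requests: ${batched * 10_000:.2f}")
```

Note that output tokens dominate here despite being a quarter of the volume, which is why tight max_tokens limits matter.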

Production Checklist

  • Prompt versioning: keep prompts in version-controlled files or a database rather than as string literals scattered through the code, and track changes with git
  • Logging: log every request/response with timestamps, model, token usage, and latency
  • Evaluation: build a test set of inputs and expected outputs; run evals before deploying prompt changes
  • Fallbacks: if the primary model is down, fall back to an alternative; handle malformed JSON gracefully
  • Rate limit awareness: implement exponential backoff; track your usage against API quotas
  • Security: treat user input, retrieved documents, and tool results as untrusted; delimit them clearly and never place user-supplied content in the system prompt
  • Observability: track cost per request, error rates, and p99 latency in your monitoring system
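
The evaluation item is the one most often skipped, so here is a minimal sketch of what an eval harness might look like. The run_evals name, the exact-match comparison, and the stub classifier are assumptions for this example; real evals often need fuzzier scoring, and predict would wrap your prompt plus a model call.

```python
from typing import Callable


def run_evals(predict: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Run each (input, expected) case and report the pass rate and failures."""
    failures = []
    for text, expected in cases:
        got = predict(text)
        if got.strip().lower() != expected.strip().lower():
            failures.append({"input": text, "expected": expected, "got": got})
    passed = len(cases) - len(failures)
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


# Stub classifier standing in for a prompt + model call
def classify(text: str) -> str:
    return "negative" if "not" in text.lower() else "positive"


report = run_evals(classify, [
    ("I love this", "positive"),
    ("Not great", "negative"),
    ("Terrible, would avoid", "negative"),
])
print(f"pass rate: {report['pass_rate']:.2f}, failures: {len(report['failures'])}")
```

Running a harness like this in CI before each prompt change turns regressions into failed builds rather than user complaints.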

Stack Recommendations

Common tech stacks for LLM applications in 2026:

  • Simple scripts / automation: Anthropic Python SDK + Claude Haiku
  • RAG apps: Anthropic SDK + ChromaDB (local) or Pinecone (cloud) + FastAPI
  • Chatbots: Anthropic SDK + PostgreSQL (conversation history) + Redis (caching) + FastAPI
  • Agents: Anthropic SDK with tool use (build your own) or LangChain/LlamaIndex for orchestration
  • Workflows: n8n (no-code) or Python with asyncio for parallel agents

For orchestration frameworks, see LangChain for Beginners and LangChain vs LlamaIndex.


Conclusion

LLM application development is mostly engineering, not AI research. Pick the right model for each task, manage context aggressively, use tools to connect to real systems, add caching to control costs, and build proper error handling from day one. The model does the reasoning — you build the application layer that makes it useful.

