Overview

OpsHub’s agent system is designed for high-performance financial services operations. This guide covers optimization strategies, performance monitoring, and best practices for achieving optimal latency, throughput, and cost efficiency.

Performance Architecture

Core Components

┌─────────────────────────────────────────────────────────┐
│                    Load Balancer                        │
└─────────────────────────────────────────────────────────┘

        ┌──────────────────┴──────────────────┐
        │                                     │
┌───────▼────────┐                  ┌────────▼───────┐
│  FastAPI       │                  │  FastAPI       │
│  Backend       │                  │  Backend       │
│  Instance 1    │                  │  Instance 2    │
└───────┬────────┘                  └────────┬───────┘
        │                                     │
        └──────────────────┬──────────────────┘

        ┌──────────────────┴──────────────────┐
        │                                     │
┌───────▼────────┐  ┌──────────────┐  ┌─────▼────────┐
│  Redis Cache   │  │  PostgreSQL  │  │  LLM APIs    │
│  (Semantic)    │  │  (pgvector)  │  │  (Claude/    │
│                │  │              │  │   OpenAI)    │
└────────────────┘  └──────────────┘  └──────────────┘

Performance Metrics

Target Performance Goals:
  • Agent Response Latency: < 2 seconds for simple queries
  • Complex Analysis: < 10 seconds for multi-step financial analysis
  • Cache Hit Rate: > 60% for repeated queries
  • Cost per Request: < $0.05 on average
  • Throughput: 100+ requests/second per instance
  • Uptime: 99.9% availability

Latency Optimization

1. Semantic Caching

Impact: 40-60% cost reduction, 70-90% latency reduction on cache hits
# Automatic semantic caching is enabled by default
# Configure in .env:
REDIS_URL=redis://localhost:6379
CACHE_SIMILARITY_THRESHOLD=0.95  # Higher = stricter matching
CACHE_TTL_SECONDS=3600  # 1 hour default
Best Practices:
  • Use session-scoped caching for user-specific queries
  • Adjust similarity threshold based on query type:
    • 0.98+: Exact queries (account lookups)
    • 0.95: Similar queries (financial analysis)
    • 0.90: Broader queries (market research)
Monitoring Cache Performance:
# Check cache hit rate
curl http://localhost:8000/api/metrics | jq '.cache_hit_rate'

# View cache statistics by session
curl http://localhost:8000/api/sessions/{session_id} | jq '.cache_stats'
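
Under the hood, a semantic cache compares the embedding of an incoming query against embeddings of previously answered queries and returns the stored response when similarity clears the configured threshold. The sketch below illustrates the lookup logic only; the function and entry layout are illustrative, not OpsHub's actual implementation.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

async def lookup_semantic_cache(query_embedding, cached_entries, threshold=0.95):
    """Return the stored response whose embedding best matches the query,
    or None (cache miss) if no entry clears the similarity threshold."""
    best_entry, best_score = None, -1.0
    for entry in cached_entries:  # each entry: {"embedding": ..., "response": ...}
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry["response"] if best_entry and best_score >= threshold else None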

2. Model Selection Strategy

Impact: 3-10x cost reduction, 2-5x latency improvement
Choose the right model for the task:
Task Type             Recommended Model    Avg Latency    Cost
Account lookup        Claude Haiku         ~500ms         $0.001
Simple analysis       GPT-4o Mini          ~1s            $0.002
Complex reasoning     Claude Sonnet 4.5    ~3s            $0.015
Critical decisions    Claude Opus          ~5s            $0.045
Implementation:
// Frontend: Select model based on task complexity
const modelId = taskComplexity === 'simple'
  ? 'claude-3-haiku-20240307'
  : 'claude-sonnet-4-5-20250514';

await chat(userMessage, { modelId });

3. RAG Optimization

Impact: 50% faster document retrieval, 30% better relevance
Hybrid Search Configuration:
# Optimize search parameters in tool calls
results = await rag_search(
    query="financial analysis",
    max_results=5,  # Limit results for faster processing
    search_type="hybrid",  # Combine semantic + keyword
    similarity_threshold=0.75  # Balance relevance vs speed
)
Document Chunking Strategy:
  • Chunk Size: 500-1000 tokens (optimal for semantic search)
  • Overlap: 50-100 tokens (maintain context)
  • Metadata: Index key fields for faster filtering
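A minimal sketch of this chunking strategy, assuming a tiktoken tokenizer (the helper name and defaults are illustrative, chosen from the ranges above):
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 80) -> list[str]:
    """Split text into ~chunk_size-token chunks, keeping `overlap` tokens of context."""
    tokens = ENC.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(ENC.decode(window))
        start += chunk_size - overlap  # slide the window forward, retaining overlap
    return chunks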
Performance Tuning:
-- Optimize pgvector index for faster similarity search
CREATE INDEX CONCURRENTLY idx_embeddings_ivfflat
ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Vacuum regularly for performance
VACUUM ANALYZE documents;

4. Streaming Responses

Impact: Perceived latency reduction of 60-80%
Streaming is enabled by default for all agent responses:
// Frontend receives chunks as they're generated
const stream = await chatStream(userMessage);

for await (const chunk of stream) {
  if (chunk.type === 'text') {
    displayChunk(chunk.content);  // Show immediately
  }
}
Backend Configuration:
# Adjust streaming buffer size for optimal throughput
STREAM_BUFFER_SIZE=1024  # Bytes per chunk
STREAM_FLUSH_INTERVAL=0.05  # Seconds between flushes
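For reference, the sketch below shows one way a FastAPI backend could stream chunks over Server-Sent Events while honoring a flush interval; the endpoint shape and token source are illustrative, not the actual OpsHub handler.
import asyncio
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
FLUSH_INTERVAL = float(os.getenv("STREAM_FLUSH_INTERVAL", "0.05"))

async def generate_agent_tokens(prompt: str):
    # Placeholder token source; in the real system this would be the LLM stream
    for token in ["Analyzing", " your", " request", "..."]:
        await asyncio.sleep(FLUSH_INTERVAL)
        yield token

@app.post("/api/agent/chat")
async def chat(payload: dict):
    async def event_stream():
        async for token in generate_agent_tokens(payload["messages"][-1]["content"]):
            yield f"data: {token}\n\n"  # one SSE frame per chunk
    return StreamingResponse(event_stream(), media_type="text/event-stream")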

Cost Optimization

1. Token Management

Token Optimization Strategies:
Reduce Input Tokens:
# Use concise system prompts
SYSTEM_PROMPT = "Financial analyst. Provide data-driven insights."  # ✓ Good
# Instead of: "You are a highly experienced financial analyst with expertise..." # ✗ Too long

# Summarize long conversation history
if len(messages) > 20:
    summarized_history = await summarize_conversation(messages[:-10])
    messages = [summarized_history] + messages[-10:]
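The summarize_conversation() helper above is not defined in this guide. A minimal sketch, assuming a LangChain-style chat model with an async ainvoke() call (model choice and prompt wording are assumptions):
from langchain_anthropic import ChatAnthropic

_summarizer = ChatAnthropic(model="claude-3-haiku-20240307", max_tokens=512)

async def summarize_conversation(messages: list[dict]) -> dict:
    """Collapse older turns into a single short system message to cut input tokens."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    result = await _summarizer.ainvoke(
        "Summarize the key facts, figures, and decisions in this conversation "
        f"in under 150 words:\n\n{transcript}"
    )
    return {"role": "system", "content": f"Conversation summary: {result.content}"}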
Reduce Output Tokens:
# Request concise responses when appropriate
"Summarize in 3 bullet points"  # ✓ Good
"Provide a comprehensive detailed analysis"  # ✗ Expensive

# Use structured outputs (JSON) instead of prose
response_format = {"type": "json_object"}  # More efficient
Monitor Token Usage:
# View token usage by session
curl http://localhost:8000/api/sessions/{session_id} | jq '{
  total_tokens: .total_tokens,
  total_cost_usd: .total_cost_usd,
  avg_tokens_per_request: .avg_tokens_per_request
}'

2. Model Cost Hierarchy

Use cheaper models for simpler tasks:
# Cost per 1K tokens (input/output)
MODELS_BY_COST = {
    "claude-3-haiku": "$0.00025 / $0.00125",      # Cheapest
    "gpt-4o-mini": "$0.00015 / $0.0006",          # Very cheap
    "claude-sonnet-4.5": "$0.003 / $0.015",       # Balanced
    "gpt-4o": "$0.0025 / $0.01",                  # Moderate
    "claude-3-opus": "$0.015 / $0.075",           # Premium
}
Task-Model Mapping:
const modelRouter = {
  "account_lookup": "claude-3-haiku-20240307",
  "transaction_analysis": "gpt-4o-mini",
  "risk_assessment": "claude-sonnet-4-5-20250514",
  "regulatory_compliance": "claude-3-opus-20240229"
};

3. Caching Strategy

Multi-Layer Caching:
  1. Semantic Cache (Redis)
    • Response caching based on query similarity
    • 60% cost reduction on repeated queries
    • TTL: 1 hour for dynamic data, 24 hours for static
  2. RAG Document Cache
    • Cache document embeddings
    • Reduces re-processing of same documents
    • Persistent storage in PostgreSQL
  3. Tool Result Cache
    • Cache API responses (account data, market data)
    • 5-15 minute TTL for real-time data
    • Longer TTL for reference data
Implementation:
# Configure caching in .env
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600
DOCUMENT_CACHE_ENABLED=true
TOOL_CACHE_TTL_SECONDS=300  # 5 minutes
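A minimal sketch of the tool-result cache layer, assuming redis-py's asyncio client; the key format and wrapper signature are illustrative:
import json
import os

import redis.asyncio as redis

_redis = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
TOOL_CACHE_TTL = int(os.getenv("TOOL_CACHE_TTL_SECONDS", "300"))

async def cached_tool_call(tool_name: str, args: dict, tool_fn):
    """Return a cached tool result when present; otherwise call the tool and cache it."""
    key = f"tool:{tool_name}:{json.dumps(args, sort_keys=True)}"
    cached = await _redis.get(key)
    if cached is not None:
        return json.loads(cached)
    result = await tool_fn(**args)
    await _redis.set(key, json.dumps(result), ex=TOOL_CACHE_TTL)
    return result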

Throughput Optimization

1. Horizontal Scaling

Architecture:
# docker-compose.yml
services:
  agent-backend:
    image: opshub-agent-backend
    deploy:
      replicas: 3  # Scale to 3 instances
    environment:
      - REDIS_URL=redis://redis:6379
      - POSTGRES_URL=postgresql://postgres:5432/opshub
Load Balancing:
  • Use Nginx or AWS ALB for request distribution
  • Enable sticky sessions for conversation continuity
  • Health checks on /api/health endpoint
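A health-check endpoint the load balancer can probe might look like the sketch below; it assumes redis-py asyncio and asyncpg, and the connection handling is simplified for illustration.
import os

import asyncpg
import redis.asyncio as redis
from fastapi import FastAPI, Response

app = FastAPI()
redis_client = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))

@app.get("/api/health")
async def health(response: Response):
    checks = {"redis": False, "postgres": False}
    try:
        checks["redis"] = bool(await redis_client.ping())
    except Exception:
        pass
    try:
        # A pooled connection would be reused in production; a one-off connect keeps the sketch short
        conn = await asyncpg.connect(os.getenv("POSTGRES_URL"))
        checks["postgres"] = (await conn.fetchval("SELECT 1")) == 1
        await conn.close()
    except Exception:
        pass
    healthy = all(checks.values())
    response.status_code = 200 if healthy else 503
    return {"status": "healthy" if healthy else "degraded", "checks": checks}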

2. Connection Pooling

Database Connection Pool:
# config.py
DATABASE_POOL_SIZE = 20  # Max connections per instance
DATABASE_MAX_OVERFLOW = 10  # Additional connections during spikes
DATABASE_POOL_TIMEOUT = 30  # Seconds to wait for connection
Redis Connection Pool:
# Redis connection pool (automatic in redis-py)
REDIS_MAX_CONNECTIONS = 50
REDIS_SOCKET_KEEPALIVE = True
REDIS_HEALTH_CHECK_INTERVAL = 30
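One way to wire these settings into the application, assuming SQLAlchemy's async engine and redis-py (URLs and drivers are assumptions):
import os

import redis.asyncio as redis
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    os.getenv("POSTGRES_URL", "postgresql+asyncpg://postgres@localhost/opshub"),
    pool_size=20,      # DATABASE_POOL_SIZE
    max_overflow=10,   # DATABASE_MAX_OVERFLOW
    pool_timeout=30,   # DATABASE_POOL_TIMEOUT (seconds)
)

redis_pool = redis.ConnectionPool.from_url(
    os.getenv("REDIS_URL", "redis://localhost:6379"),
    max_connections=50,          # REDIS_MAX_CONNECTIONS
    socket_keepalive=True,       # REDIS_SOCKET_KEEPALIVE
    health_check_interval=30,    # REDIS_HEALTH_CHECK_INTERVAL (seconds)
)
redis_client = redis.Redis(connection_pool=redis_pool)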

3. Async Processing

The agent system uses async I/O for optimal throughput:
# All I/O operations are non-blocking
import asyncio

async def agent_call(state: AgentState):
    # Independent awaitables (LLM call, RAG search, cache check) run in parallel
    results = await asyncio.gather(
        llm.ainvoke(messages),
        rag_search(query),
        check_cache(query),
    )
    return process_results(results)
Best Practices:
  • Use asyncio.gather() for parallel operations
  • Avoid blocking operations in async functions
  • Use connection pooling for all external services

Monitoring & Alerting

1. Real-Time Metrics

Key Metrics to Track:
# Available at /api/metrics endpoint
{
  "requests_per_minute": 45,
  "avg_latency_ms": 1234,
  "p95_latency_ms": 2500,
  "p99_latency_ms": 4000,
  "error_rate": 0.02,  # 2%
  "cache_hit_rate": 0.65,  # 65%
  "avg_cost_per_request": 0.012,  # $0.012
  "llm_provider_health": {
    "anthropic": "healthy",
    "openai": "healthy"
  }
}
Dashboard Setup: Use LangSmith for visual monitoring:
  • Trace Timeline: View request flow and bottlenecks
  • Cost Dashboard: Track spending by model, user, session
  • Error Dashboard: Monitor failures and error patterns
  • Performance Dashboard: Latency percentiles, throughput

2. Alerting Rules

Recommended Alerts:
# Example alert configuration
alerts:
  - name: High Latency
    condition: p95_latency_ms > 5000
    severity: warning

  - name: High Error Rate
    condition: error_rate > 0.05  # 5%
    severity: critical

  - name: Low Cache Hit Rate
    condition: cache_hit_rate < 0.40  # 40%
    severity: warning

  - name: High Cost
    condition: cost_per_hour > 50  # $50/hour
    severity: warning

  - name: Provider Down
    condition: llm_provider_health != "healthy"
    severity: critical
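
If you are not using a managed alerting system, a lightweight poller can evaluate these rules against the /api/metrics endpoint. The sketch below is illustrative: thresholds mirror the rules above, and the notification hook is left as a print statement.
import asyncio

import httpx

# (metric, comparison, limit, severity) mirroring the rules above
ALERT_RULES = [
    ("p95_latency_ms", "gt", 5000, "warning"),
    ("error_rate", "gt", 0.05, "critical"),
    ("cache_hit_rate", "lt", 0.40, "warning"),
]

async def check_alerts(base_url: str = "http://localhost:8000"):
    async with httpx.AsyncClient() as client:
        metrics = (await client.get(f"{base_url}/api/metrics")).json()
    for metric, op, limit, severity in ALERT_RULES:
        value = metrics.get(metric)
        if value is None:
            continue
        breached = value > limit if op == "gt" else value < limit
        if breached:
            print(f"[{severity}] {metric}={value} breaches threshold {limit}")  # swap for a real notifier

if __name__ == "__main__":
    asyncio.run(check_alerts())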

3. LangSmith Integration

Automatic Tracing:
# Enable in .env
LANGSMITH_API_KEY=lsv2_sk_...
LANGSMITH_PROJECT=opshub-agent-backend
LANGSMITH_TRACING_ENABLED=true
View Traces:
  1. Go to smith.langchain.com
  2. Navigate to your project
  3. View traces with filtering:
    • By latency (find slow requests)
    • By cost (find expensive requests)
    • By error status (debug failures)

Load Testing

1. Basic Load Test

Using Apache Bench:
# Test 1000 requests with 10 concurrent connections
ab -n 1000 -c 10 -T application/json \
   -H "Authorization: Bearer YOUR_JWT" \
   -p request.json \
   http://localhost:8000/api/agent/chat
request.json:
{
  "modelId": "claude-3-haiku-20240307",
  "messages": [
    {"role": "user", "content": "What is the current balance for account 12345?"}
  ],
  "agentId": "app"
}

2. Advanced Load Testing

Using Locust:
# locustfile.py
import os

from locust import HttpUser, task, between

JWT_TOKEN = os.getenv("JWT_TOKEN", "")  # supply a valid token via the environment

class AgentUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def chat_simple_query(self):
        self.client.post("/api/agent/chat", json={
            "modelId": "claude-3-haiku-20240307",
            "messages": [
                {"role": "user", "content": "Check account balance"}
            ]
        }, headers={"Authorization": f"Bearer {JWT_TOKEN}"})

    @task(2)  # 2x weight - more common
    def chat_complex_query(self):
        self.client.post("/api/agent/chat", json={
            "modelId": "claude-sonnet-4-5-20250514",
            "messages": [
                {"role": "user", "content": "Analyze risk for portfolio X"}
            ]
        }, headers={"Authorization": f"Bearer {JWT_TOKEN}"})
Run Load Test:
# Install locust
pip install locust

# Run with 100 users, ramping up 10 per second
locust -f locustfile.py --host http://localhost:8000 \
       --users 100 --spawn-rate 10 --headless

3. Performance Benchmarks

Target Benchmarks:
Metric            Target         Acceptable     Critical
P50 Latency       < 1s           < 2s           > 3s
P95 Latency       < 3s           < 5s           > 10s
P99 Latency       < 5s           < 10s          > 15s
Error Rate        < 1%           < 3%           > 5%
Cache Hit Rate    > 60%          > 40%          < 30%
Throughput        > 100 req/s    > 50 req/s     < 25 req/s

Best Practices Summary

Development Environment

  1. Use Faster Models for Testing
    # .env.development
    DEFAULT_MODEL=claude-3-haiku-20240307  # Fast and cheap
    CACHE_TTL_SECONDS=300  # Short TTL for testing
    
  2. Enable Debug Logging
    LOG_LEVEL=DEBUG
    LANGSMITH_TRACING_ENABLED=true
    
  3. Local Caching
    # Use local Redis for development
    REDIS_URL=redis://localhost:6379
    

Production Environment

  1. Use Optimal Models
    DEFAULT_PROVIDER=anthropic
    DEFAULT_MODEL=claude-sonnet-4-5-20250514  # Best balance
    FALLBACK_ENABLED=true  # Enable fallback
    
  2. Production Caching
    # Use managed Redis (AWS ElastiCache, etc.)
    REDIS_URL=redis://production-redis:6379
    CACHE_TTL_SECONDS=3600  # 1 hour
    CACHE_SIMILARITY_THRESHOLD=0.95
    
  3. Monitoring & Observability
    LANGSMITH_TRACING_ENABLED=true
    LANGSMITH_SAMPLING_RATE=0.1  # Sample 10% in production
    LOG_LEVEL=INFO
    
  4. Connection Pooling
    DATABASE_POOL_SIZE=20
    DATABASE_MAX_OVERFLOW=10
    REDIS_MAX_CONNECTIONS=50
    
  5. Health Checks & Timeouts
    HEALTH_CHECK_INTERVAL=30  # Seconds
    REQUEST_TIMEOUT=30  # Seconds
    LLM_TIMEOUT=25  # Seconds (slightly less than request timeout)
    

Cost Management

  1. Set Cost Alerts
    • Daily spending threshold
    • Per-user spending limits
    • Unusual usage patterns
  2. Optimize Token Usage
    • Use concise prompts
    • Summarize long conversations
    • Request structured outputs
  3. Leverage Caching
    • Enable semantic caching
    • Use appropriate TTLs
    • Monitor cache hit rates
  4. Choose Right Models
    • Haiku for simple queries
    • Sonnet for balanced workloads
    • Opus only for critical tasks

Troubleshooting

High Latency

Symptoms: Requests taking > 5 seconds
Diagnosis:
# Check LangSmith traces for bottlenecks
# Look for:
# - Slow LLM calls (try faster model)
# - Slow RAG searches (optimize indexes)
# - Slow database queries (add indexes)
# - Network issues (check provider status)
Solutions:
  1. Increase cache hit rate (lower similarity threshold)
  2. Use faster model for simple queries
  3. Optimize RAG search parameters
  4. Enable connection pooling
  5. Check external service health

High Costs

Symptoms: Spending > $100/day unexpectedly
Diagnosis:
# Check cost breakdown in LangSmith
curl http://localhost:8000/api/metrics | jq '{
  avg_cost: .avg_cost_per_request,
  total_requests: .total_requests,
  estimated_daily_cost: (.avg_cost_per_request * .requests_per_minute * 60 * 24)
}'
Solutions:
  1. Review model selection (use cheaper models)
  2. Increase cache TTL
  3. Reduce token usage (concise prompts)
  4. Set per-user rate limits
  5. Implement request quotas
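For per-user rate limits and quotas (solutions 4 and 5 above), one lightweight approach is a Redis counter keyed by user and day; the limit, key format, and call site below are assumptions:
import datetime
import os

import redis.asyncio as redis

_redis = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
DAILY_COST_LIMIT_USD = 5.00  # assumed per-user budget

async def charge_and_check_quota(user_id: str, request_cost_usd: float) -> bool:
    """Add this request's cost to the user's daily total; return False once over budget."""
    key = f"cost:{user_id}:{datetime.date.today().isoformat()}"
    total = await _redis.incrbyfloat(key, request_cost_usd)
    await _redis.expire(key, 86400)  # keep the counter for one day
    return float(total) <= DAILY_COST_LIMIT_USD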

Low Cache Hit Rate

Symptoms: Cache hit rate < 40%
Diagnosis:
# Analyze query patterns
curl http://localhost:8000/api/cache/stats | jq
Solutions:
  1. Lower similarity threshold (e.g., 0.90 instead of 0.95)
  2. Increase cache TTL
  3. Normalize queries (remove timestamps, IDs)
  4. Use session-scoped caching
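For solution 3, normalizing queries before the cache lookup helps near-identical requests hit the same entry; the patterns below are illustrative:
import re

def normalize_query(query: str) -> str:
    """Strip volatile tokens (timestamps, long IDs) and extra whitespace so
    near-identical queries map to the same cache entry."""
    q = query.lower().strip()
    q = re.sub(r"\b\d{4}-\d{2}-\d{2}(t[\d:.]+z?)?\b", "<date>", q)  # ISO dates / timestamps
    q = re.sub(r"\b\d{6,}\b", "<id>", q)                            # long numeric identifiers
    q = re.sub(r"\s+", " ", q)
    return q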

Memory Issues

Symptoms: Out of memory errors, crashes
Diagnosis:
# Monitor memory usage
docker stats opshub-agent-backend
Solutions:
  1. Reduce conversation history length
  2. Implement message summarization
  3. Increase instance memory
  4. Scale horizontally (more instances)
  5. Clear old cache entries

Support

For performance optimization assistance:
  • Review LangSmith traces for bottlenecks
  • Check health endpoint: /api/health
  • Monitor metrics endpoint: /api/metrics
  • Review logs for errors and warnings