Overview

OpsHub’s agent system is designed for high-performance financial services operations. This guide covers optimization strategies, performance monitoring, and best practices for achieving optimal latency, throughput, and cost efficiency.

Performance Architecture

Core Components

┌─────────────────────────────────────────────────────────┐
│                    Load Balancer                        │
└─────────────────────────────────────────────────────────┘

        ┌──────────────────┴──────────────────┐
        │                                     │
┌───────▼────────┐                  ┌────────▼───────┐
│  FastAPI       │                  │  FastAPI       │
│  Backend       │                  │  Backend       │
│  Instance 1    │                  │  Instance 2    │
└───────┬────────┘                  └────────┬───────┘
        │                                     │
        └──────────────────┬──────────────────┘

        ┌──────────────────┴──────────────────┐
        │                                     │
┌───────▼────────┐  ┌──────────────┐  ┌─────▼────────┐
│  Redis Cache   │  │  PostgreSQL  │  │  LLM APIs    │
│  (Semantic)    │  │  (pgvector)  │  │  (Claude/    │
│                │  │              │  │   OpenAI)    │
└────────────────┘  └──────────────┘  └──────────────┘

Performance Metrics

Target Performance Goals:
  • Agent Response Latency: < 2 seconds for simple queries
  • Complex Analysis: < 10 seconds for multi-step financial analysis
  • Cache Hit Rate: > 60% for repeated queries
  • Cost per Request: < $0.05 on average
  • Throughput: 100+ requests/second per instance
  • Uptime: 99.9% availability

Latency Optimization

1. Semantic Caching

Impact: 40-60% cost reduction, 70-90% latency reduction on cache hits
# Automatic semantic caching is enabled by default
# Configure in .env:
REDIS_URL=redis://localhost:6379
CACHE_SIMILARITY_THRESHOLD=0.95  # Higher = stricter matching
CACHE_TTL_SECONDS=3600  # 1 hour default
Best Practices:
  • Use session-scoped caching for user-specific queries
  • Adjust similarity threshold based on query type:
    • 0.98+: Exact queries (account lookups)
    • 0.95: Similar queries (financial analysis)
    • 0.90: Broader queries (market research)
Monitoring Cache Performance:
# Check cache hit rate
curl http://localhost:8000/api/metrics | jq '.cache_hit_rate'

# View cache statistics by session
curl http://localhost:8000/api/sessions/{session_id} | jq '.cache_stats'
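
Under the hood, a semantic cache compares the embedding of an incoming query against embeddings of previously answered queries and returns the stored response when similarity clears the configured threshold. The sketch below illustrates the lookup logic only; the function and entry layout are illustrative, not OpsHub's actual implementation.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

async def lookup_semantic_cache(query_embedding, cached_entries, threshold=0.95):
    """Return the stored response whose embedding best matches the query,
    or None (cache miss) if no entry clears the similarity threshold."""
    best_entry, best_score = None, -1.0
    for entry in cached_entries:  # each entry: {"embedding": ..., "response": ...}
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry["response"] if best_entry and best_score >= threshold else None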

2. Model Selection Strategy

Impact: 3-10x cost reduction, 2-5x latency improvement
Choose the right model for the task:
Task Type             Recommended Model    Avg Latency    Cost
Account lookup        Claude Haiku         ~500ms         $0.001
Simple analysis       GPT-4o Mini          ~1s            $0.002
Complex reasoning     Claude Sonnet 4.5    ~3s            $0.015
Critical decisions    Claude Opus          ~5s            $0.045
Implementation:
// Frontend: Select model based on task complexity
const modelId = taskComplexity === 'simple'
  ? 'claude-3-haiku-20240307'
  : 'claude-sonnet-4-5-20250514';

await chat(userMessage, { modelId });

3. RAG Optimization

Impact: 50% faster document retrieval, 30% better relevance
Hybrid Search Configuration:
# Optimize search parameters in tool calls
results = await rag_search(
    query="financial analysis",
    max_results=5,  # Limit results for faster processing
    search_type="hybrid",  # Combine semantic + keyword
    similarity_threshold=0.75  # Balance relevance vs speed
)
Document Chunking Strategy:
  • Chunk Size: 500-1000 tokens (optimal for semantic search)
  • Overlap: 50-100 tokens (maintain context)
  • Metadata: Index key fields for faster filtering
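A minimal sketch of this chunking strategy, assuming a tiktoken tokenizer (the helper name and defaults are illustrative, chosen from the ranges above):
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 80) -> list[str]:
    """Split text into ~chunk_size-token chunks, keeping `overlap` tokens of context."""
    tokens = ENC.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(ENC.decode(window))
        start += chunk_size - overlap  # slide the window forward, retaining overlap
    return chunks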
Performance Tuning:
-- Optimize pgvector index for faster similarity search
CREATE INDEX CONCURRENTLY idx_embeddings_ivfflat
ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Vacuum regularly for performance
VACUUM ANALYZE documents;

4. Streaming Responses

Impact: Perceived latency reduction of 60-80%
Streaming is enabled by default for all agent responses:
// Frontend receives chunks as they're generated
const stream = await chatStream(userMessage);

for await (const chunk of stream) {
  if (chunk.type === 'text') {
    displayChunk(chunk.content);  // Show immediately
  }
}
Backend Configuration:
# Adjust streaming buffer size for optimal throughput
STREAM_BUFFER_SIZE=1024  # Bytes per chunk
STREAM_FLUSH_INTERVAL=0.05  # Seconds between flushes
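For reference, the sketch below shows one way a FastAPI backend could stream chunks over Server-Sent Events while honoring a flush interval; the endpoint shape and token source are illustrative, not the actual OpsHub handler.
import asyncio
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
FLUSH_INTERVAL = float(os.getenv("STREAM_FLUSH_INTERVAL", "0.05"))

async def generate_agent_tokens(prompt: str):
    # Placeholder token source; in the real system this would be the LLM stream
    for token in ["Analyzing", " your", " request", "..."]:
        await asyncio.sleep(FLUSH_INTERVAL)
        yield token

@app.post("/api/agent/chat")
async def chat(payload: dict):
    async def event_stream():
        async for token in generate_agent_tokens(payload["messages"][-1]["content"]):
            yield f"data: {token}\n\n"  # one SSE frame per chunk
    return StreamingResponse(event_stream(), media_type="text/event-stream")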

Cost Optimization

1. Token Management

Token Optimization Strategies:
Reduce Input Tokens:
# Use concise system prompts
SYSTEM_PROMPT = "Financial analyst. Provide data-driven insights."  # ✓ Good
# Instead of: "You are a highly experienced financial analyst with expertise..." # ✗ Too long

# Summarize long conversation history
if len(messages) > 20:
    summarized_history = await summarize_conversation(messages[:-10])
    messages = [summarized_history] + messages[-10:]
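The summarize_conversation() helper above is not defined in this guide. A minimal sketch, assuming a LangChain-style chat model with an async ainvoke() call (model choice and prompt wording are assumptions):
from langchain_anthropic import ChatAnthropic

_summarizer = ChatAnthropic(model="claude-3-haiku-20240307", max_tokens=512)

async def summarize_conversation(messages: list[dict]) -> dict:
    """Collapse older turns into a single short system message to cut input tokens."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    result = await _summarizer.ainvoke(
        "Summarize the key facts, figures, and decisions in this conversation "
        f"in under 150 words:\n\n{transcript}"
    )
    return {"role": "system", "content": f"Conversation summary: {result.content}"}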
Reduce Output Tokens:
# Request concise responses when appropriate
"Summarize in 3 bullet points"  # ✓ Good
"Provide a comprehensive detailed analysis"  # ✗ Expensive

# Use structured outputs (JSON) instead of prose
response_format = {"type": "json_object"}  # More efficient
Monitor Token Usage:
# View token usage by session
curl http://localhost:8000/api/sessions/{session_id} | jq '{
  total_tokens: .total_tokens,
  total_cost_usd: .total_cost_usd,
  avg_tokens_per_request: .avg_tokens_per_request
}'

2. Model Cost Hierarchy

Use cheaper models for simpler tasks:
# Cost per 1K tokens (input/output)
MODELS_BY_COST = {
    "claude-3-haiku": "$0.00025 / $0.00125",      # Cheapest
    "gpt-4o-mini": "$0.00015 / $0.0006",          # Very cheap
    "claude-sonnet-4.5": "$0.003 / $0.015",       # Balanced
    "gpt-4o": "$0.0025 / $0.01",                  # Moderate
    "claude-3-opus": "$0.015 / $0.075",           # Premium
}
Task-Model Mapping:
const modelRouter = {
  "account_lookup": "claude-3-haiku-20240307",
  "transaction_analysis": "gpt-4o-mini",
  "risk_assessment": "claude-sonnet-4-5-20250514",
  "regulatory_compliance": "claude-3-opus-20240229"
};

3. Caching Strategy

Multi-Layer Caching:
  1. Semantic Cache (Redis)
    • Response caching based on query similarity
    • 60% cost reduction on repeated queries
    • TTL: 1 hour for dynamic data, 24 hours for static
  2. RAG Document Cache
    • Cache document embeddings
    • Reduces re-processing of same documents
    • Persistent storage in PostgreSQL
  3. Tool Result Cache
    • Cache API responses (account data, market data)
    • 5-15 minute TTL for real-time data
    • Longer TTL for reference data
Implementation:
# Configure caching in .env
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600
DOCUMENT_CACHE_ENABLED=true
TOOL_CACHE_TTL_SECONDS=300  # 5 minutes
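A minimal sketch of the tool-result cache layer, assuming redis-py's asyncio client; the key format and wrapper signature are illustrative:
import json
import os

import redis.asyncio as redis

_redis = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
TOOL_CACHE_TTL = int(os.getenv("TOOL_CACHE_TTL_SECONDS", "300"))

async def cached_tool_call(tool_name: str, args: dict, tool_fn):
    """Return a cached tool result when present; otherwise call the tool and cache it."""
    key = f"tool:{tool_name}:{json.dumps(args, sort_keys=True)}"
    cached = await _redis.get(key)
    if cached is not None:
        return json.loads(cached)
    result = await tool_fn(**args)
    await _redis.set(key, json.dumps(result), ex=TOOL_CACHE_TTL)
    return result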

Throughput Optimization

1. Horizontal Scaling

Architecture:
# docker-compose.yml
services:
  agent-backend:
    image: opshub-agent-backend
    deploy:
      replicas: 3  # Scale to 3 instances
    environment:
      - REDIS_URL=redis://redis:6379
      - POSTGRES_URL=postgresql://postgres:5432/opshub
Load Balancing:
  • Use Nginx or AWS ALB for request distribution
  • Enable sticky sessions for conversation continuity
  • Health checks on /api/health endpoint
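A health-check endpoint the load balancer can probe might look like the sketch below; it assumes redis-py asyncio and asyncpg, and the connection handling is simplified for illustration.
import os

import asyncpg
import redis.asyncio as redis
from fastapi import FastAPI, Response

app = FastAPI()
redis_client = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))

@app.get("/api/health")
async def health(response: Response):
    checks = {"redis": False, "postgres": False}
    try:
        checks["redis"] = bool(await redis_client.ping())
    except Exception:
        pass
    try:
        # A pooled connection would be reused in production; a one-off connect keeps the sketch short
        conn = await asyncpg.connect(os.getenv("POSTGRES_URL"))
        checks["postgres"] = (await conn.fetchval("SELECT 1")) == 1
        await conn.close()
    except Exception:
        pass
    healthy = all(checks.values())
    response.status_code = 200 if healthy else 503
    return {"status": "healthy" if healthy else "degraded", "checks": checks}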

2. Connection Pooling

Database Connection Pool:
# config.py
DATABASE_POOL_SIZE = 20  # Max connections per instance
DATABASE_MAX_OVERFLOW = 10  # Additional connections during spikes
DATABASE_POOL_TIMEOUT = 30  # Seconds to wait for connection
Redis Connection Pool:
# Redis connection pool (automatic in redis-py)
REDIS_MAX_CONNECTIONS = 50
REDIS_SOCKET_KEEPALIVE = True
REDIS_HEALTH_CHECK_INTERVAL = 30
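One way to wire these settings into the application, assuming SQLAlchemy's async engine and redis-py (URLs and drivers are assumptions):
import os

import redis.asyncio as redis
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    os.getenv("POSTGRES_URL", "postgresql+asyncpg://postgres@localhost/opshub"),
    pool_size=20,      # DATABASE_POOL_SIZE
    max_overflow=10,   # DATABASE_MAX_OVERFLOW
    pool_timeout=30,   # DATABASE_POOL_TIMEOUT (seconds)
)

redis_pool = redis.ConnectionPool.from_url(
    os.getenv("REDIS_URL", "redis://localhost:6379"),
    max_connections=50,          # REDIS_MAX_CONNECTIONS
    socket_keepalive=True,       # REDIS_SOCKET_KEEPALIVE
    health_check_interval=30,    # REDIS_HEALTH_CHECK_INTERVAL (seconds)
)
redis_client = redis.Redis(connection_pool=redis_pool)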

3. Async Processing

The agent system uses async I/O for optimal throughput:
# All I/O operations are non-blocking
import asyncio

async def agent_call(state: AgentState):
    # Independent awaitables (LLM call, RAG search, cache check) run in parallel
    results = await asyncio.gather(
        llm.ainvoke(messages),
        rag_search(query),
        check_cache(query),
    )
    return process_results(results)
Best Practices:
  • Use asyncio.gather() for parallel operations
  • Avoid blocking operations in async functions
  • Use connection pooling for all external services

Monitoring & Alerting

1. Real-Time Metrics

Key Metrics to Track:
# Available at /api/metrics endpoint
{
  "requests_per_minute": 45,
  "avg_latency_ms": 1234,
  "p95_latency_ms": 2500,
  "p99_latency_ms": 4000,
  "error_rate": 0.02,  # 2%
  "cache_hit_rate": 0.65,  # 65%
  "avg_cost_per_request": 0.012,  # $0.012
  "llm_provider_health": {
    "anthropic": "healthy",
    "openai": "healthy"
  }
}
Dashboard Setup: Use LangSmith for visual monitoring:
  • Trace Timeline: View request flow and bottlenecks
  • Cost Dashboard: Track spending by model, user, session
  • Error Dashboard: Monitor failures and error patterns
  • Performance Dashboard: Latency percentiles, throughput

2. Alerting Rules

Recommended Alerts:
# Example alert configuration
alerts:
  - name: High Latency
    condition: p95_latency_ms > 5000
    severity: warning

  - name: High Error Rate
    condition: error_rate > 0.05  # 5%
    severity: critical

  - name: Low Cache Hit Rate
    condition: cache_hit_rate < 0.40  # 40%
    severity: warning

  - name: High Cost
    condition: cost_per_hour > 50  # $50/hour
    severity: warning

  - name: Provider Down
    condition: llm_provider_health != "healthy"
    severity: critical
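
If you are not using a managed alerting system, a lightweight poller can evaluate these rules against the /api/metrics endpoint. The sketch below is illustrative: thresholds mirror the rules above, and the notification hook is left as a print statement.
import asyncio

import httpx

# (metric, comparison, limit, severity) mirroring the rules above
ALERT_RULES = [
    ("p95_latency_ms", "gt", 5000, "warning"),
    ("error_rate", "gt", 0.05, "critical"),
    ("cache_hit_rate", "lt", 0.40, "warning"),
]

async def check_alerts(base_url: str = "http://localhost:8000"):
    async with httpx.AsyncClient() as client:
        metrics = (await client.get(f"{base_url}/api/metrics")).json()
    for metric, op, limit, severity in ALERT_RULES:
        value = metrics.get(metric)
        if value is None:
            continue
        breached = value > limit if op == "gt" else value < limit
        if breached:
            print(f"[{severity}] {metric}={value} breaches threshold {limit}")  # swap for a real notifier

if __name__ == "__main__":
    asyncio.run(check_alerts())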

3. LangSmith Integration

Automatic Tracing:
# Enable in .env
LANGSMITH_API_KEY=lsv2_sk_...
LANGSMITH_PROJECT=opshub-agent-backend
LANGSMITH_TRACING_ENABLED=true
View Traces:
  1. Go to smith.langchain.com
  2. Navigate to your project
  3. View traces with filtering:
    • By latency (find slow requests)
    • By cost (find expensive requests)
    • By error status (debug failures)

Load Testing

1. Basic Load Test

Using Apache Bench:
# Test 1000 requests with 10 concurrent connections
ab -n 1000 -c 10 -T application/json \
   -H "Authorization: Bearer YOUR_JWT" \
   -p request.json \
   http://localhost:8000/api/agent/chat
request.json:
{
  "modelId": "claude-3-haiku-20240307",
  "messages": [
    {"role": "user", "content": "What is the current balance for account 12345?"}
  ],
  "agentId": "app"
}

2. Advanced Load Testing

Using Locust:
# locustfile.py
import os

from locust import HttpUser, task, between

JWT_TOKEN = os.getenv("JWT_TOKEN", "")  # supply a valid token via the environment

class AgentUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def chat_simple_query(self):
        self.client.post("/api/agent/chat", json={
            "modelId": "claude-3-haiku-20240307",
            "messages": [
                {"role": "user", "content": "Check account balance"}
            ]
        }, headers={"Authorization": f"Bearer {JWT_TOKEN}"})

    @task(2)  # 2x weight - more common
    def chat_complex_query(self):
        self.client.post("/api/agent/chat", json={
            "modelId": "claude-sonnet-4-5-20250514",
            "messages": [
                {"role": "user", "content": "Analyze risk for portfolio X"}
            ]
        }, headers={"Authorization": f"Bearer {JWT_TOKEN}"})
Run Load Test:
# Install locust
pip install locust

# Run with 100 users, ramping up 10 per second
locust -f locustfile.py --host http://localhost:8000 \
       --users 100 --spawn-rate 10 --headless

3. Performance Benchmarks

Target Benchmarks:
Metric            Target         Acceptable     Critical
P50 Latency       < 1s           < 2s           > 3s
P95 Latency       < 3s           < 5s           > 10s
P99 Latency       < 5s           < 10s          > 15s
Error Rate        < 1%           < 3%           > 5%
Cache Hit Rate    > 60%          > 40%          < 30%
Throughput        > 100 req/s    > 50 req/s     < 25 req/s

Best Practices Summary

Development Environment

  1. Use Faster Models for Testing
    # .env.development
    DEFAULT_MODEL=claude-3-haiku-20240307  # Fast and cheap
    CACHE_TTL_SECONDS=300  # Short TTL for testing
    
  2. Enable Debug Logging
    LOG_LEVEL=DEBUG
    LANGSMITH_TRACING_ENABLED=true
    
  3. Local Caching
    # Use local Redis for development
    REDIS_URL=redis://localhost:6379
    

Production Environment

  1. Use Optimal Models
    DEFAULT_PROVIDER=anthropic
    DEFAULT_MODEL=claude-sonnet-4-5-20250514  # Best balance
    FALLBACK_ENABLED=true  # Enable fallback
    
  2. Production Caching
    # Use managed Redis (AWS ElastiCache, etc.)
    REDIS_URL=redis://production-redis:6379
    CACHE_TTL_SECONDS=3600  # 1 hour
    CACHE_SIMILARITY_THRESHOLD=0.95
    
  3. Monitoring & Observability
    LANGSMITH_TRACING_ENABLED=true
    LANGSMITH_SAMPLING_RATE=0.1  # Sample 10% in production
    LOG_LEVEL=INFO
    
  4. Connection Pooling
    DATABASE_POOL_SIZE=20
    DATABASE_MAX_OVERFLOW=10
    REDIS_MAX_CONNECTIONS=50
    
  5. Health Checks & Timeouts
    HEALTH_CHECK_INTERVAL=30  # Seconds
    REQUEST_TIMEOUT=30  # Seconds
    LLM_TIMEOUT=25  # Seconds (slightly less than request timeout)
    

Cost Management

  1. Set Cost Alerts
    • Daily spending threshold
    • Per-user spending limits
    • Unusual usage patterns
  2. Optimize Token Usage
    • Use concise prompts
    • Summarize long conversations
    • Request structured outputs
  3. Leverage Caching
    • Enable semantic caching
    • Use appropriate TTLs
    • Monitor cache hit rates
  4. Choose Right Models
    • Haiku for simple queries
    • Sonnet for balanced workloads
    • Opus only for critical tasks

Troubleshooting

High Latency

Symptoms: Requests taking > 5 seconds
Diagnosis:
# Check LangSmith traces for bottlenecks
# Look for:
# - Slow LLM calls (try faster model)
# - Slow RAG searches (optimize indexes)
# - Slow database queries (add indexes)
# - Network issues (check provider status)
Solutions:
  1. Increase cache hit rate (lower similarity threshold)
  2. Use faster model for simple queries
  3. Optimize RAG search parameters
  4. Enable connection pooling
  5. Check external service health

High Costs

Symptoms: Spending > $100/day unexpectedly
Diagnosis:
# Check cost breakdown in LangSmith
curl http://localhost:8000/api/metrics | jq '{
  avg_cost: .avg_cost_per_request,
  total_requests: .total_requests,
  estimated_daily_cost: (.avg_cost_per_request * .requests_per_minute * 60 * 24)
}'
Solutions:
  1. Review model selection (use cheaper models)
  2. Increase cache TTL
  3. Reduce token usage (concise prompts)
  4. Set per-user rate limits
  5. Implement request quotas
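For per-user rate limits and quotas (solutions 4 and 5 above), one lightweight approach is a Redis counter keyed by user and day; the limit, key format, and call site below are assumptions:
import datetime
import os

import redis.asyncio as redis

_redis = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
DAILY_COST_LIMIT_USD = 5.00  # assumed per-user budget

async def charge_and_check_quota(user_id: str, request_cost_usd: float) -> bool:
    """Add this request's cost to the user's daily total; return False once over budget."""
    key = f"cost:{user_id}:{datetime.date.today().isoformat()}"
    total = await _redis.incrbyfloat(key, request_cost_usd)
    await _redis.expire(key, 86400)  # keep the counter for one day
    return float(total) <= DAILY_COST_LIMIT_USD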

Low Cache Hit Rate

Symptoms: Cache hit rate < 40%
Diagnosis:
# Analyze query patterns
curl http://localhost:8000/api/cache/stats | jq
Solutions:
  1. Lower similarity threshold (e.g., 0.90 instead of 0.95)
  2. Increase cache TTL
  3. Normalize queries (remove timestamps, IDs)
  4. Use session-scoped caching
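For solution 3, normalizing queries before the cache lookup helps near-identical requests hit the same entry; the patterns below are illustrative:
import re

def normalize_query(query: str) -> str:
    """Strip volatile tokens (timestamps, long IDs) and extra whitespace so
    near-identical queries map to the same cache entry."""
    q = query.lower().strip()
    q = re.sub(r"\b\d{4}-\d{2}-\d{2}(t[\d:.]+z?)?\b", "<date>", q)  # ISO dates / timestamps
    q = re.sub(r"\b\d{6,}\b", "<id>", q)                            # long numeric identifiers
    q = re.sub(r"\s+", " ", q)
    return q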

Memory Issues

Symptoms: Out of memory errors, crashes
Diagnosis:
# Monitor memory usage
docker stats opshub-agent-backend
Solutions:
  1. Reduce conversation history length
  2. Implement message summarization
  3. Increase instance memory
  4. Scale horizontally (more instances)
  5. Clear old cache entries

Support

For performance optimization assistance:
  • Review LangSmith traces for bottlenecks
  • Check health endpoint: /api/health
  • Monitor metrics endpoint: /api/metrics
  • Review logs for errors and warnings