Orchestration Troubleshooting Guide

A troubleshooting guide for diagnosing and resolving common issues in the orchestration layer components. It covers debugging workflows, error patterns, and solutions.

Quick Diagnostics

System Health Check (30 seconds)

bash

# 1. Check component health endpoints
curl http://localhost:3000/health/ceo
curl http://localhost:3000/health/acs
curl http://localhost:3000/health/providers

# 2. Check recent logs for errors
tail -n 100 /var/log/orchestration/app.log | grep -i error

# 3. Check system resources
ps aux | grep '[o]rchestration'  # brackets keep grep from matching its own process
free -h
df -h

# 4. Test basic functionality
curl -X POST http://localhost:3000/api/query \
  -H "Content-Type: application/json" \
  -d '{"text": "test query", "user_id": "debug"}'

Common Error Patterns

text

❌ "Provider timeout" → Network/latency issue
❌ "Budget exhausted" → Resource allocation problem  
❌ "Intent parsing failed" → Input validation issue
❌ "No providers available" → Configuration/registry problem
❌ "Quality threshold not met" → Provider selection issue
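
When triaging many log lines, the mapping above can be applied automatically. The following is a minimal sketch, assuming the error messages contain the phrases listed above; `classifyError` and the `ProblemArea` labels are hypothetical names introduced for illustration and are not part of the orchestration codebase.

```typescript
// Hypothetical triage helper: maps an error message to the problem area
// suggested by the table above.
type ProblemArea = 'network' | 'budget' | 'input' | 'registry' | 'selection' | 'unknown';

const ERROR_PATTERNS: Array<[RegExp, ProblemArea]> = [
  [/provider timeout/i, 'network'],
  [/budget exhausted/i, 'budget'],
  [/intent parsing failed/i, 'input'],
  [/no providers available/i, 'registry'],
  [/quality threshold not met/i, 'selection'],
];

function classifyError(message: string): ProblemArea {
  for (const [pattern, area] of ERROR_PATTERNS) {
    if (pattern.test(message)) {
      return area;
    }
  }
  // Anything not covered by the table needs manual investigation.
  return 'unknown';
}
```

Calling `classifyError('Provider timeout after 5000ms')` would point at the network, while an unrecognized message falls through to `'unknown'`.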

Component-Specific Troubleshooting

CEO (Context/Execution Orchestrator) Issues

Issue: Intent parsing fails

Symptoms:

Queries return generic "search" intent
Domain classification incorrect
Complexity assessment wrong

Debugging Steps:

typescript

// Enable debug logging
const ceo = new BasicCEO();
ceo.setDebugMode(true);

// Test intent parser directly
const intent = ceo.intentParser.parseIntent({
  text: "your problematic query",
  user_id: "debug"
});
console.log('Parsed intent:', intent);

// Check domain keywords
console.log('Domain keywords:', ceo.intentParser.domainKeywords);

Common Solutions:

Update domain keywords: Add missing keywords for your domain
Improve complexity indicators: Adjust complexity assessment rules
Handle edge cases: Add validation for empty/malformed queries

Code Fix Examples:

typescript

// Fix 1: Add domain-specific keywords
private domainKeywords = {
  code: ['function', 'class', 'bug', 'error', 'debug', 'implement', 'refactor', 
         'typescript', 'javascript', 'python', 'api', 'database'], // Added more
  // ...
};

// Fix 2: Improve empty query handling
parseIntent(query: UserQuery): ParsedIntent {
  if (!query.text || query.text.trim().length === 0) {
    throw new Error('Query text cannot be empty');
  }
  
  const text = query.text.toLowerCase();
  // ... rest of parsing logic
}
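
To see why a missing keyword can produce the wrong domain, it may help to look at keyword matching in isolation. This is a simplified sketch, not the actual parser: `classifyDomain`, the keyword lists, and the "most hits wins" rule are assumptions made for illustration.

```typescript
// Hypothetical keyword-based domain classifier (illustrative only).
type Domain = 'code' | 'documentation' | 'research' | 'general';

const sampleDomainKeywords: Record<Exclude<Domain, 'general'>, string[]> = {
  code: ['function', 'class', 'bug', 'error', 'debug', 'refactor', 'typescript', 'api'],
  documentation: ['docs', 'readme', 'guide', 'tutorial'],
  research: ['paper', 'study', 'survey', 'compare'],
};

function classifyDomain(text: string): Domain {
  // Split on non-word characters so punctuation does not hide a keyword.
  const words = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  let best: Domain = 'general';
  let bestHits = 0;
  for (const [domain, keywords] of Object.entries(sampleDomainKeywords) as Array<[Domain, string[]]>) {
    const hits = keywords.filter(k => words.has(k)).length;
    if (hits > bestHits) {
      best = domain;
      bestHits = hits;
    }
  }
  // With no keyword hits the query falls back to the generic domain.
  return best;
}
```

In this sketch, a query such as "Fix the bug in this function" matches two `code` keywords, whereas a query that uses only unlisted vocabulary falls back to `'general'`, which is the kind of symptom described above.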

Issue: Budget allocation too restrictive/generous

Symptoms:

Requests frequently time out (too restrictive)
Costs exceed expected limits (too generous)
Quality doesn't match requirements

Debugging Steps:

typescript

// Test budget allocator with different scenarios
const budgetAllocator = new BasicBudgetAllocator();

const testCases = [
  { complexity: 'simple', urgency: 'low', domain: 'code' },
  { complexity: 'complex', urgency: 'high', domain: 'research' }
];

testCases.forEach(testCase => {
  const budget = budgetAllocator.allocateBudget(testCase);
  console.log(`${JSON.stringify(testCase)} → Budget:`, budget);
});

Common Solutions:

typescript

// Adjust multipliers in budget allocator
private urgencyModifiers = {
  low: { latency: 2.0, cost: 0.7 },    // More time allowed, smaller cost budget
  medium: { latency: 1.0, cost: 1.0 },
  high: { latency: 0.6, cost: 1.5 }    // More budget for urgent requests
};

// Clamp budgets (including user preference overrides) to safe ranges
private validateBudget(budget: ResourceBudget): ResourceBudget {
  return {
    max_latency_ms: Math.max(2000, Math.min(15000, budget.max_latency_ms)), // Wider range
    max_cost_cents: Math.max(2, Math.min(100, budget.max_cost_cents)),
    quality_threshold: Math.max(0.6, Math.min(0.9, budget.quality_threshold)),
    max_providers: Math.max(1, Math.min(3, budget.max_providers))
  };
}
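
The interaction between the urgency multipliers and the clamping ranges is easier to check with concrete numbers. The sketch below is illustrative only: `applyUrgency` is a hypothetical helper, the base budget values are assumptions, and only the latency and cost fields are modeled.

```typescript
// Hypothetical helper combining the multipliers and clamps shown above.
interface SimpleBudget {
  max_latency_ms: number;
  max_cost_cents: number;
}

type Urgency = 'low' | 'medium' | 'high';

const sampleUrgencyModifiers: Record<Urgency, { latency: number; cost: number }> = {
  low: { latency: 2.0, cost: 0.7 },
  medium: { latency: 1.0, cost: 1.0 },
  high: { latency: 0.6, cost: 1.5 },
};

const clamp = (value: number, min: number, max: number): number =>
  Math.max(min, Math.min(max, value));

function applyUrgency(base: SimpleBudget, urgency: Urgency): SimpleBudget {
  const modifier = sampleUrgencyModifiers[urgency];
  return {
    // Scale first, then clamp to the same ranges used by validateBudget above.
    max_latency_ms: clamp(Math.round(base.max_latency_ms * modifier.latency), 2000, 15000),
    max_cost_cents: clamp(Math.round(base.max_cost_cents * modifier.cost), 2, 100),
  };
}
```

With an assumed base of 5000 ms and 20 cents, a high-urgency request ends up with 3000 ms and 30 cents. A low-urgency request with a 10000 ms base would scale to 20000 ms and then be clamped to 15000 ms, which is worth keeping in mind when a multiplier change appears to have no effect.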

ACS (Adaptive Context Scaling) Issues

Issue: Provider selection fails

Symptoms:

"No providers available for domain" error
Wrong provider selected for query type
Fallback providers not working

Debugging Steps:

typescript

// Check provider registry
const acs = new BasicACS();
const allProviders = acs.providerRegistry.getAllProviders();
console.log('Available providers:', allProviders.map(p => ({ id: p.provider_id, domains: p.domains })));

// Test provider scoring
const intent = { domain: 'code', complexity: 'medium', urgency: 'medium' };
const budget = { max_cost_cents: 20, max_latency_ms: 5000 };
const candidateProviders = acs.providerRegistry.getProvidersByDomain(intent.domain);
console.log('Candidate providers:', candidateProviders);

const scores = acs.providerScorer.scoreProviders(candidateProviders, intent, budget);
console.log('Provider scores:', scores);

Common Solutions:

Check provider registration: Ensure providers are properly registered for the domain
Review scoring algorithm: Adjust benefit/cost calculations
Add fallback providers: Register general-purpose providers as fallbacks

typescript

// Fix: Add fallback provider registration
constructor() {
  this.initializeDefaultProviders();
  this.addFallbackProviders(); // Add this
}

private addFallbackProviders(): void {
  // General-purpose provider as fallback
  const fallbackProvider: ProviderCapability = {
    provider_id: 'general_fallback',
    name: 'General Purpose LLM',
    type: 'llm',
    domains: ['code', 'documentation', 'research', 'general'], // All domains
    capabilities: ['completion'],
    avg_latency_ms: 3000,
    cost_per_request_cents: 12,
    quality_score: 0.75, // Lower quality but reliable
    reliability_score: 0.95,
    endpoint: process.env.FALLBACK_PROVIDER_URL,
    timeout_ms: 10000,
    max_retries: 3
  };
  
  this.providers.set(fallbackProvider.provider_id, fallbackProvider);
}
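
When reviewing the benefit/cost calculation, a small standalone model can help reason about why a provider is or is not selected. The formula below is an assumption chosen for illustration, not the scoring used by `providerScorer`; `scoreProvider` and `rankProviders` are hypothetical names, and the field names mirror the provider definition above.

```typescript
// Hypothetical benefit/cost scoring sketch (not the real scorer).
interface ProviderInfo {
  provider_id: string;
  quality_score: number;      // 0..1
  reliability_score: number;  // 0..1
  avg_latency_ms: number;
  cost_per_request_cents: number;
}

interface BudgetLimits {
  max_cost_cents: number;
  max_latency_ms: number;
}

function scoreProvider(p: ProviderInfo, budget: BudgetLimits): number {
  // Providers that exceed the budget are excluded outright.
  if (p.cost_per_request_cents > budget.max_cost_cents || p.avg_latency_ms > budget.max_latency_ms) {
    return 0;
  }
  const benefit = p.quality_score * p.reliability_score;
  // Cost is the average share of the budget consumed (0..1).
  const cost =
    0.5 * (p.cost_per_request_cents / budget.max_cost_cents) +
    0.5 * (p.avg_latency_ms / budget.max_latency_ms);
  return benefit / (1 + cost);
}

function rankProviders(providers: ProviderInfo[], budget: BudgetLimits): string[] {
  return providers
    .map(p => ({ id: p.provider_id, score: scoreProvider(p, budget) }))
    .filter(s => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .map(s => s.id);
}
```

Under this model, a registry whose providers all exceed the budget yields an empty ranking. That is one way a "No providers available" error could arise even though providers are registered for the domain, and it shows why an affordable fallback provider is useful.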

Issue: Provider API calls failing

Symptoms:

Timeout errors
Authentication failures
Malformed responses

Debugging Steps:

bash

# Test provider connectivity directly
curl -X POST https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 50
  }'

# Check network connectivity
ping -c 4 api.openai.com
nslookup api.openai.com

# Check API key validity
printf '%s' "$OPENAI_API_KEY" | wc -c  # 0 means the key is unset; length varies by key type (~51 characters for older keys)

Common Solutions:

typescript

// Implement robust HTTP client with retry logic
private async makeProviderRequest(
  provider: ProviderCapability,
  query: string,
  options: Record<string, any>
): Promise<any> {
  const maxRetries = provider.max_retries || 2;
  let lastError: Error = new Error('No attempts were made'); // initialized so it is never read unassigned

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      console.log(`Attempt ${attempt}/${maxRetries} for provider ${provider.name}`);
      
      const response = await this.httpClient({
        method: 'POST',
        url: provider.endpoint,
        data: this.buildRequestPayload(provider, query, options),
        timeout: provider.timeout_ms,
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${provider.api_key || process.env.DEFAULT_API_KEY}`,
          'User-Agent': 'Mnemoverse-Orchestration/1.0',
          ...options.headers
        }
      });

      // Validate response structure
      if (!response.data) {
        throw new Error('Empty response from provider');
      }

      return this.parseProviderResponse(provider, response.data);

    } catch (error: any) {
      lastError = error;
      console.warn(`Provider ${provider.name} attempt ${attempt} failed:`, error.message);
      
      // Don't retry on authentication errors
      if (error.response?.status === 401 || error.response?.status === 403) {
        throw new Error(`Authentication failed for provider ${provider.name}: ${error.message}`);
      }
      
      // Exponential backoff between retries
      if (attempt < maxRetries) {
        const backoffMs = Math.min(1000 * Math.pow(2, attempt), 10000);
        console.log(`Retrying in ${backoffMs}ms...`);
        await new Promise(resolve => setTimeout(resolve, backoffMs));
      }
    }
  }

  throw new Error(`All ${maxRetries} attempts failed for provider ${provider.name}. Last error: ${lastError.message}`);
}
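
Retries trade reliability for latency, so it can be useful to compute how long a request may take before it finally fails. The sketch below reuses the backoff formula from the loop above; `backoffDelays` and `worstCaseLatencyMs` are hypothetical helpers, and the worst case assumes every attempt runs until its timeout.

```typescript
// Delays inserted between attempts, using the same formula as the retry loop:
// min(1000 * 2^attempt, 10000), with no delay after the final attempt.
function backoffDelays(maxAttempts: number, baseMs = 1000, capMs = 10000): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(Math.min(baseMs * Math.pow(2, attempt), capMs));
  }
  return delays;
}

// Upper bound on total time: every attempt times out, plus all backoff waits.
function worstCaseLatencyMs(maxAttempts: number, timeoutMs: number): number {
  const totalBackoff = backoffDelays(maxAttempts).reduce((sum, d) => sum + d, 0);
  return maxAttempts * timeoutMs + totalBackoff;
}
```

For a provider configured with `timeout_ms: 10000` and `max_retries: 3`, this gives delays of 2000 ms and 4000 ms and a worst case of 36000 ms, well above a typical `max_latency_ms` budget. If the retry settings and the latency budget disagree this much, lowering the per-attempt timeout is usually worth considering.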

Issue: High latency or poor performance

Symptoms:

Response times > 5 seconds
Provider scoring takes too long
Memory usage increasing over time

Debugging Steps:

typescript

// Add performance profiling
const startTime = performance.now();

// Profile provider scoring
console.time('provider-scoring');
const scores = this.providerScorer.scoreProviders(providers, intent, budget);
console.timeEnd('provider-scoring');

// Profile execution
console.time('provider-execution');
const result = await this.executionEngine.executeWithFallback(selectedProviders, query);
console.timeEnd('provider-execution');

const totalTime = performance.now() - startTime;
console.log(`Total request time: ${totalTime}ms`);

// Check memory usage
const memUsage = process.memoryUsage();
console.log('Memory usage:', {
  heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + ' MB',
  heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + ' MB',
  external: Math.round(memUsage.external / 1024 / 1024) + ' MB'
});

Performance Solutions:

typescript

// Implement caching for provider scoring
interface CachedScores {
  scores: ProviderScore[];
  cachedAt: number; // epoch milliseconds when the entry was stored
}

class ProviderScorerWithCache extends BasicProviderScorer {
  private scoreCache = new Map<string, CachedScores>();
  private cacheMaxAge = 5 * 60 * 1000; // 5 minutes

  scoreProviders(providers: ProviderCapability[], intent: ParsedIntent, budget: ResourceBudget): ProviderScore[] {
    const cacheKey = this.getCacheKey(providers, intent, budget);
    const cached = this.scoreCache.get(cacheKey);
    
    if (cached && this.isCacheValid(cached)) {
      console.log('Using cached provider scores');
      return cached.scores;
    }

    const scores = super.scoreProviders(providers, intent, budget);
    // Delete first so a refreshed key moves to the end of the Map's insertion order
    this.scoreCache.delete(cacheKey);
    this.scoreCache.set(cacheKey, { scores, cachedAt: Date.now() });
    
    // Clean up old cache entries
    this.cleanupCache();
    
    return scores;
  }

  // An entry is valid while it is younger than cacheMaxAge
  private isCacheValid(entry: CachedScores): boolean {
    return Date.now() - entry.cachedAt < this.cacheMaxAge;
  }

  private getCacheKey(providers: ProviderCapability[], intent: ParsedIntent, budget: ResourceBudget): string {
    const providerIds = providers.map(p => p.provider_id).sort().join(',');
    const intentKey = `${intent.domain}-${intent.complexity}-${intent.urgency}`;
    const budgetKey = `${budget.max_cost_cents}-${budget.max_latency_ms}-${budget.quality_threshold}`;
    return `${providerIds}:${intentKey}:${budgetKey}`;
  }

  private cleanupCache(): void {
    if (this.scoreCache.size > 1000) { // Limit cache size
      // Map iterates in insertion order, so the first keys are the oldest
      const oldestKeys = Array.from(this.scoreCache.keys()).slice(0, 100);
      oldestKeys.forEach(key => this.scoreCache.delete(key));
    }
  }
}

Error Handling Patterns

Graceful Degradation

typescript

// Implement graceful fallbacks
async executeRequest(request: ACSRequest): Promise<ACSResponse> {
  try {
    return await this.executeFullRequest(request);
  } catch (error: any) {
    console.warn('Full execution failed, attempting degraded mode:', error.message);
    
    // Fallback 1: Try with relaxed budget
    try {
      const relaxedBudget = this.relaxBudget(request.budget);
      return await this.executeWithRelaxedBudget(request, relaxedBudget);
    } catch (relaxedError: any) {
      console.warn('Relaxed budget execution failed:', relaxedError.message);
      
      // Fallback 2: Return cached or default response
      const cachedResponse = await this.getCachedResponse(request);
      if (cachedResponse) {
        return cachedResponse;
      }
      
      // Fallback 3: Return error response with partial data
      return this.createErrorResponse(request, [error.message, relaxedError.message]);
    }
  }
}

private relaxBudget(budget: ResourceBudget): ResourceBudget {
  return {
    ...budget,
    max_latency_ms: budget.max_latency_ms * 1.5,
    max_cost_cents: budget.max_cost_cents * 1.2,
    quality_threshold: Math.max(0.5, budget.quality_threshold - 0.1),
    max_providers: budget.max_providers + 1
  };
}

Circuit Breaker Implementation

typescript

class ProviderCircuitBreaker {
  private failures = new Map<string, number>();
  private lastFailureTime = new Map<string, number>();
  private readonly failureThreshold = 5;
  private readonly timeoutMs = 60000; // 1 minute

  async callProvider(providerId: string, operation: () => Promise<any>): Promise<any> {
    if (this.isCircuitOpen(providerId)) {
      throw new Error(`Circuit breaker open for provider ${providerId}`);
    }

    try {
      const result = await operation();
      this.onSuccess(providerId);
      return result;
    } catch (error) {
      this.onFailure(providerId);
      throw error;
    }
  }

  private isCircuitOpen(providerId: string): boolean {
    const failures = this.failures.get(providerId) || 0;
    const lastFailure = this.lastFailureTime.get(providerId) || 0;
    
    if (failures >= this.failureThreshold) {
      if (Date.now() - lastFailure < this.timeoutMs) {
        return true; // Circuit is open
      } else {
        // Reset after timeout
        this.failures.set(providerId, 0);
        return false;
      }
    }
    
    return false;
  }

  private onSuccess(providerId: string): void {
    this.failures.set(providerId, 0);
    this.lastFailureTime.delete(providerId);
  }

  private onFailure(providerId: string): void {
    const currentFailures = this.failures.get(providerId) || 0;
    this.failures.set(providerId, currentFailures + 1);
    this.lastFailureTime.set(providerId, Date.now());
  }
}
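
The breaker above closes again as soon as the timeout has elapsed. When debugging, it can help to name the state explicitly, including the intermediate "half-open" state that many circuit breaker descriptions use for "the timeout has passed, so the next call is a trial". The function below is a sketch under that assumption; `circuitState` is a hypothetical helper and the defaults copy the thresholds from the class above.

```typescript
// Hypothetical helper: derives the breaker state from the same inputs the class tracks.
type CircuitState = 'closed' | 'open' | 'half-open';

function circuitState(
  failures: number,
  lastFailureAt: number, // epoch milliseconds of the most recent failure
  now: number,           // current time in epoch milliseconds
  failureThreshold = 5,
  timeoutMs = 60000
): CircuitState {
  if (failures < failureThreshold) {
    return 'closed'; // Calls pass through normally
  }
  // Threshold reached: block calls until the timeout has elapsed,
  // then allow a trial call.
  return now - lastFailureAt < timeoutMs ? 'open' : 'half-open';
}
```

Logging this state next to each "Circuit breaker open" error makes it easier to tell whether a provider is being blocked because of recent failures or because the timeout is set too long.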

Debugging Tools and Commands

Log Analysis Commands

bash

# Find error patterns in logs
grep -E "(ERROR|WARN|timeout|failed)" /var/log/orchestration/app.log | tail -20

# Analyze request flow for specific request ID
grep "request-123" /var/log/orchestration/app.log | \
  awk '{print $1, $2, $NF}' | \
  sort

# Find slow requests (> 3 seconds)
grep "total_latency_ms" /var/log/orchestration/app.log | \
  awk -F'total_latency_ms":' '{print $2}' | \
  awk -F',' '{if($1 > 3000) print $0}'

# Check provider success rates
grep "provider_execution" /var/log/orchestration/app.log | \
  grep -E "(success|failed)" | \
  sort | uniq -c
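
If the log is written as one JSON object per line (which the `jq`-based scripts in this guide suggest, but which is an assumption here), the slow-request search can also be done in TypeScript with proper JSON parsing instead of string splitting. `findSlowRequests` is a hypothetical helper; only the `request_id` and `total_latency_ms` fields are used.

```typescript
// Hypothetical log helper: returns completed requests slower than a threshold.
interface SlowRequest {
  request_id: string;
  total_latency_ms: number;
}

function findSlowRequests(logLines: string[], thresholdMs = 3000): SlowRequest[] {
  const slow: SlowRequest[] = [];
  for (const line of logLines) {
    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // Skip lines that are not valid JSON
    }
    if (typeof entry !== 'object' || entry === null) {
      continue;
    }
    const { request_id, total_latency_ms } = entry as Record<string, unknown>;
    if (
      typeof request_id === 'string' &&
      typeof total_latency_ms === 'number' &&
      total_latency_ms > thresholdMs
    ) {
      slow.push({ request_id, total_latency_ms });
    }
  }
  // Slowest requests first
  return slow.sort((a, b) => b.total_latency_ms - a.total_latency_ms);
}
```

Feeding it the lines of the log file returns the offending request IDs sorted by latency, which can then be used with the request-flow `grep` above.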

Health Check Scripts

bash

#!/bin/bash
# orchestration-health-check.sh

echo "=== Orchestration Health Check ==="

# 1. Service status
echo "1. Service Status:"
curl -sf http://localhost:3000/health | jq -e '.' || echo "❌ Service unreachable"

# 2. Component health
echo -e "\n2. Component Health:"
curl -sf http://localhost:3000/health/ceo | jq -e '.status' || echo "❌ CEO unhealthy"
curl -sf http://localhost:3000/health/acs | jq -e '.status' || echo "❌ ACS unhealthy"

# 3. Provider connectivity
echo -e "\n3. Provider Connectivity:"
curl -s http://localhost:3000/debug/providers | jq '.[] | {id: .provider_id, status: .status}'

# 4. Recent error rate
echo -e "\n4. Recent Error Rate:"
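# Count within the last 1000 log lines; grep -c prints 0 but exits non-zero when nothing matches
ERROR_COUNT=$(tail -n 1000 /var/log/orchestration/app.log | grep -c "ERROR" || true)
TOTAL_COUNT=$(tail -n 1000 /var/log/orchestration/app.log | grep -c "request_completed" || true)
if [ "${TOTAL_COUNT:-0}" -gt 0 ]; then
  ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_COUNT" | bc)
  echo "Error rate: ${ERROR_RATE}% (${ERROR_COUNT}/${TOTAL_COUNT})"
else
  echo "Error rate: n/a (no completed requests in the last 1000 log lines)"
fi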

# 5. Resource usage
echo -e "\n5. Resource Usage:"
ps aux | grep '[o]rchestration' | awk '{print "CPU: " $3 "%, Memory: " $4 "%"}'

echo -e "\n=== Health Check Complete ==="

Performance Monitoring Script

bash

#!/bin/bash
# performance-monitor.sh

# Monitor request latency in real-time
tail -f /var/log/orchestration/app.log | \
  grep --line-buffered "request_completed" | \
  while IFS= read -r line; do
    REQUEST_ID=$(echo "$line" | jq -r '.request_id')
    LATENCY=$(echo "$line" | jq -r '.total_latency_ms')
    PROVIDER=$(echo "$line" | jq -r '.providers_used[0]')
    
    echo "$(date): Request $REQUEST_ID - ${LATENCY}ms - Provider: $PROVIDER"
    
    # Alert on high latency
    if (( $(echo "$LATENCY > 5000" | bc -l) )); then
      echo "⚠️  HIGH LATENCY ALERT: ${LATENCY}ms"
    fi
  done

Recovery Procedures

Service Recovery Steps

Immediate Response (0-2 minutes)

bash

# Check if service is responsive; restart it if not
if ! curl -f http://localhost:3000/health; then
  systemctl restart orchestration

  # Verify restart
  sleep 10
  curl -f http://localhost:3000/health
fi

Investigation (2-10 minutes)

bash

# Check recent logs for root cause
journalctl -u orchestration --since "10 minutes ago" --no-pager

# Check system resources
free -h
df -h

# Check provider connectivity
curl -s http://localhost:3000/debug/providers

Resolution (10-30 minutes)
- Fix identified issues (network, config, resources)
- Deploy fixes if code issues found
- Update monitoring alerts if needed

Provider Failover Procedure

typescript

// Automatic failover implementation
class ProviderFailoverManager {
  private unhealthyProviders = new Set<string>();
  private healthCheckInterval = 60000; // 1 minute

  constructor() {
    this.startHealthChecks();
  }

  private startHealthChecks(): void {
    setInterval(async () => {
      await this.checkProviderHealth();
    }, this.healthCheckInterval);
  }

  private async checkProviderHealth(): Promise<void> {
    const providers = this.providerRegistry.getAllProviders();
    
    for (const provider of providers) {
      try {
        await this.testProviderConnection(provider);
        this.markProviderHealthy(provider.provider_id);
      } catch (error: any) {
        console.warn(`Provider ${provider.name} health check failed:`, error.message);
        this.markProviderUnhealthy(provider.provider_id);
      }
    }
  }

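  private markProviderUnhealthy(providerId: string): void {
    if (this.unhealthyProviders.has(providerId)) {
      return; // Already marked; avoid repeating the log line and metric on every check
    }
    this.unhealthyProviders.add(providerId);
    console.log(`Provider ${providerId} marked as unhealthy`);
    
    // Notify monitoring system
    this.metricsCollector.recordProviderHealthChange(providerId, false);
  }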

  private markProviderHealthy(providerId: string): void {
    if (this.unhealthyProviders.has(providerId)) {
      this.unhealthyProviders.delete(providerId);
      console.log(`Provider ${providerId} recovered and marked as healthy`);
      
      // Notify monitoring system
      this.metricsCollector.recordProviderHealthChange(providerId, true);
    }
  }

  getHealthyProviders(providers: ProviderCapability[]): ProviderCapability[] {
    return providers.filter(p => !this.unhealthyProviders.has(p.provider_id));
  }
}

Prevention Strategies

Proactive Monitoring

typescript

// Implement early warning system
class EarlyWarningSystem {
  private thresholds = {
    errorRate: 0.02,        // 2%
    avgLatency: 3000,       // 3 seconds
    providerFailures: 0.1,  // 10%
    budgetBurnRate: 50      // cents per hour
  };

  checkMetrics(metrics: SystemMetrics): Warning[] {
    const warnings: Warning[] = [];

    if (metrics.error_rate > this.thresholds.errorRate) {
      warnings.push({
        type: 'error_rate_high',
        severity: 'warning',
        message: `Error rate ${(metrics.error_rate * 100).toFixed(1)}% exceeds threshold`,
        action: 'Investigate recent error patterns'
      });
    }

    if (metrics.avg_response_time_ms > this.thresholds.avgLatency) {
      warnings.push({
        type: 'latency_high',
        severity: 'warning',
        message: `Average latency ${metrics.avg_response_time_ms}ms exceeds threshold`,
        action: 'Check provider performance and system resources'
      });
    }

    return warnings;
  }
}

Capacity Planning

typescript

// Resource usage prediction
class CapacityPlanner {
  predictResourceNeeds(historicalMetrics: SystemMetrics[], growthRate: number): ResourceForecast {
    const avgRequestsPerSecond = this.calculateAverage(historicalMetrics.map(m => m.requests_per_second));
    const avgCostPerRequest = this.calculateAverage(historicalMetrics.map(m => m.cost_per_request));
    
    // Predict 30 days ahead
    const futureRequests = avgRequestsPerSecond * (1 + growthRate) * 30 * 24 * 3600;
    const futureCost = futureRequests * avgCostPerRequest;
    
    return {
      predicted_requests: futureRequests,
      predicted_cost_cents: futureCost,
      recommended_budget: futureCost * 1.2, // 20% buffer
      scaling_recommendations: this.generateScalingRecommendations(futureRequests)
    };
  }
}
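
A worked example makes the forecast arithmetic easier to sanity-check. The sketch below repeats the same formula as a standalone function; `forecast` is a hypothetical name and the input figures are made up for illustration. Note that the growth rate is applied once as a flat uplift, not compounded over the period.

```typescript
// Standalone version of the forecast arithmetic above (illustrative inputs).
interface SimpleForecast {
  predicted_requests: number;
  predicted_cost_cents: number;
  recommended_budget: number;
}

function forecast(
  avgRequestsPerSecond: number,
  avgCostPerRequestCents: number,
  growthRate: number,
  days = 30
): SimpleForecast {
  // requests/second * growth uplift * seconds in the forecast window
  const requests = avgRequestsPerSecond * (1 + growthRate) * days * 24 * 3600;
  const cost = requests * avgCostPerRequestCents;
  return {
    predicted_requests: requests,
    predicted_cost_cents: cost,
    recommended_budget: cost * 1.2, // 20% buffer, as above
  };
}
```

For example, 2 requests per second at 0.5 cents each with 10% growth gives roughly 5.7 million requests over 30 days (2 × 1.1 × 2,592,000), about 2.85 million cents in cost, and a recommended budget of about 3.42 million cents.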

Quick Reference

Emergency Contacts

On-Call Engineer: [Your on-call system]
Platform Team: [Team contact]
Provider Support: [Provider contact info]

Important URLs

Monitoring Dashboard: http://monitoring.company.com/orchestration
Logs: http://logs.company.com/orchestration
Status Page: http://status.company.com

Critical Commands

bash

# Service control
systemctl status orchestration
systemctl restart orchestration

# Quick health check
curl http://localhost:3000/health

# View recent errors
journalctl -u orchestration --since "30 minutes ago" | grep ERROR

# Check system resources
htop
