Orchestration Troubleshooting Guide

This guide covers how to diagnose and resolve common issues in orchestration layer components. It includes debugging workflows, error patterns, and proven solutions.

Quick Diagnostics

System Health Check (30 seconds)

bash
# 1. Check component health endpoints
curl http://localhost:3000/health/ceo
curl http://localhost:3000/health/acs
curl http://localhost:3000/health/providers

# 2. Check recent logs for errors
tail -n 100 /var/log/orchestration/app.log | grep -i error

# 3. Check system resources
ps aux | grep '[o]rchestration'   # bracket trick keeps grep from matching itself
free -h
df -h

# 4. Test basic functionality
curl -X POST http://localhost:3000/api/query \
  -H "Content-Type: application/json" \
  -d '{"text": "test query", "user_id": "debug"}'

Common Error Patterns

text
❌ "Provider timeout" → Network/latency issue
❌ "Budget exhausted" → Resource allocation problem
❌ "Intent parsing failed" → Input validation issue
❌ "No providers available" → Configuration/registry problem
❌ "Quality threshold not met" → Provider selection issue
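
The mapping above can also be applied programmatically, for example when triaging log lines. The following is a minimal sketch, not part of the orchestration codebase: the `ErrorCategory` names and the `classifyError` helper are hypothetical and only mirror the table.

```typescript
// Hypothetical lookup table built from the error patterns listed above.
type ErrorCategory =
  | 'network_latency'
  | 'resource_allocation'
  | 'input_validation'
  | 'configuration_registry'
  | 'provider_selection'
  | 'unknown';

const ERROR_PATTERNS: Array<{ pattern: RegExp; category: ErrorCategory }> = [
  { pattern: /provider timeout/i, category: 'network_latency' },
  { pattern: /budget exhausted/i, category: 'resource_allocation' },
  { pattern: /intent parsing failed/i, category: 'input_validation' },
  { pattern: /no providers available/i, category: 'configuration_registry' },
  { pattern: /quality threshold not met/i, category: 'provider_selection' },
];

// Returns the first matching category, or 'unknown' when nothing matches.
function classifyError(message: string): ErrorCategory {
  const match = ERROR_PATTERNS.find(({ pattern }) => pattern.test(message));
  return match ? match.category : 'unknown';
}
```

Anything classified as `unknown` is a candidate for a new row in the table.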

Component-Specific Troubleshooting

CEO (Context/Execution Orchestrator) Issues

Issue: Intent parsing fails

Symptoms:

  • Queries return generic "search" intent
  • Domain classification incorrect
  • Complexity assessment wrong

Debugging Steps:

typescript
// Enable debug logging
const ceo = new BasicCEO();
ceo.setDebugMode(true);

// Test intent parser directly
const intent = ceo.intentParser.parseIntent({
  text: "your problematic query",
  user_id: "debug"
});
console.log('Parsed intent:', intent);

// Check domain keywords
console.log('Domain keywords:', ceo.intentParser.domainKeywords);

Common Solutions:

  1. Update domain keywords: Add missing keywords for your domain
  2. Improve complexity indicators: Adjust complexity assessment rules
  3. Handle edge cases: Add validation for empty/malformed queries

Code Fix Examples:

typescript
// Fix 1: Add domain-specific keywords
private domainKeywords = {
  code: ['function', 'class', 'bug', 'error', 'debug', 'implement', 'refactor', 
         'typescript', 'javascript', 'python', 'api', 'database'], // Added more
  // ...
};

// Fix 2: Improve empty query handling
parseIntent(query: UserQuery): ParsedIntent {
  if (!query.text || query.text.trim().length === 0) {
    throw new Error('Query text cannot be empty');
  }
  
  const text = query.text.toLowerCase();
  // ... rest of parsing logic
}
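
Solution 2 (complexity indicators) has no example above. As a rough sketch, a rule-based assessment might look like the following. The indicator lists, the word-count cutoffs, and the `assessComplexity` name are all assumptions for illustration; the actual rules in the intent parser are not shown in this guide.

```typescript
type Complexity = 'simple' | 'medium' | 'complex';

// Hypothetical indicator phrases; tune these for your own domain.
const COMPLEX_INDICATORS = ['architecture', 'refactor', 'optimize', 'compare', 'design'];
const SIMPLE_INDICATORS = ['what is', 'define', 'syntax', 'example'];

function assessComplexity(text: string): Complexity {
  const normalized = text.toLowerCase();
  const wordCount = normalized.split(/\s+/).filter(Boolean).length;

  const complexHits = COMPLEX_INDICATORS.filter(k => normalized.includes(k)).length;
  const simpleHits = SIMPLE_INDICATORS.filter(k => normalized.includes(k)).length;

  // Long queries or several complex indicators push the result up.
  if (complexHits >= 2 || wordCount > 40) return 'complex';
  // Short queries with only simple indicators stay simple.
  if (simpleHits > 0 && complexHits === 0 && wordCount <= 12) return 'simple';
  return 'medium';
}
```

When a query is misclassified, logging `complexHits`, `simpleHits`, and `wordCount` usually shows which rule needs adjusting.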

Issue: Budget allocation too restrictive/generous

Symptoms:

  • Requests frequently time out (too restrictive)
  • Costs exceed expected limits (too generous)
  • Quality doesn't match requirements

Debugging Steps:

typescript
// Test budget allocator with different scenarios
const budgetAllocator = new BasicBudgetAllocator();

const testCases = [
  { complexity: 'simple', urgency: 'low', domain: 'code' },
  { complexity: 'complex', urgency: 'high', domain: 'research' }
];

testCases.forEach(testCase => {
  const budget = budgetAllocator.allocateBudget(testCase);
  console.log(`${JSON.stringify(testCase)} → Budget:`, budget);
});

Common Solutions:

typescript
// Adjust multipliers in budget allocator
private urgencyModifiers = {
  low: { latency: 2.0, cost: 0.7 },    // Less restrictive on cost
  medium: { latency: 1.0, cost: 1.0 },
  high: { latency: 0.6, cost: 1.5 }    // More budget for urgent requests
};

// Add user preference override safety
private validateBudget(budget: ResourceBudget): ResourceBudget {
  return {
    max_latency_ms: Math.max(2000, Math.min(15000, budget.max_latency_ms)), // Wider range
    max_cost_cents: Math.max(2, Math.min(100, budget.max_cost_cents)),
    quality_threshold: Math.max(0.6, Math.min(0.9, budget.quality_threshold)),
    max_providers: Math.max(1, Math.min(3, budget.max_providers))
  };
}
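
To see how the modifiers and the clamping interact, it can help to run the numbers on a sample budget. This sketch assumes the modifiers are applied as plain multipliers to a base budget, which the snippet above suggests but does not confirm; `applyUrgency` and the base values are hypothetical.

```typescript
type Urgency = 'low' | 'medium' | 'high';

interface Budget {
  max_latency_ms: number;
  max_cost_cents: number;
}

// Multipliers copied from the urgencyModifiers table above.
const urgencyModifiers: Record<Urgency, { latency: number; cost: number }> = {
  low: { latency: 2.0, cost: 0.7 },
  medium: { latency: 1.0, cost: 1.0 },
  high: { latency: 0.6, cost: 1.5 },
};

const clamp = (value: number, min: number, max: number): number =>
  Math.max(min, Math.min(max, value));

// Scale a base budget by urgency, then clamp to the ranges used in validateBudget.
function applyUrgency(base: Budget, urgency: Urgency): Budget {
  const m = urgencyModifiers[urgency];
  return {
    max_latency_ms: clamp(Math.round(base.max_latency_ms * m.latency), 2000, 15000),
    max_cost_cents: clamp(Math.round(base.max_cost_cents * m.cost), 2, 100),
  };
}

// Hypothetical base budget for illustration.
const base: Budget = { max_latency_ms: 5000, max_cost_cents: 20 };
const high = applyUrgency(base, 'high'); // 3000 ms, 30 cents
const low = applyUrgency(base, 'low');   // 10000 ms, 14 cents
```

Note that a base latency of 10000 ms with low urgency would scale to 20000 ms and then be clamped to 15000 ms, so the clamp range can silently override the modifiers.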

ACS (Adaptive Context Scaling) Issues

Issue: Provider selection fails

Symptoms:

  • "No providers available for domain" error
  • Wrong provider selected for query type
  • Fallback providers not working

Debugging Steps:

typescript
// Check provider registry
const acs = new BasicACS();
const allProviders = acs.providerRegistry.getAllProviders();
console.log('Available providers:', allProviders.map(p => ({ id: p.provider_id, domains: p.domains })));

// Test provider scoring
const intent = { domain: 'code', complexity: 'medium', urgency: 'medium' };
const budget = { max_cost_cents: 20, max_latency_ms: 5000 };
const candidateProviders = acs.providerRegistry.getProvidersByDomain(intent.domain);
console.log('Candidate providers:', candidateProviders);

const scores = acs.providerScorer.scoreProviders(candidateProviders, intent, budget);
console.log('Provider scores:', scores);

Common Solutions:

  1. Check provider registration: Ensure providers are properly registered for the domain
  2. Review scoring algorithm: Adjust benefit/cost calculations
  3. Add fallback providers: Register general-purpose providers as fallbacks

typescript
// Fix: Add fallback provider registration
constructor() {
  this.initializeDefaultProviders();
  this.addFallbackProviders(); // Add this
}

private addFallbackProviders(): void {
  // General-purpose provider as fallback
  const fallbackProvider: ProviderCapability = {
    provider_id: 'general_fallback',
    name: 'General Purpose LLM',
    type: 'llm',
    domains: ['code', 'documentation', 'research', 'general'], // All domains
    capabilities: ['completion'],
    avg_latency_ms: 3000,
    cost_per_request_cents: 12,
    quality_score: 0.75, // Lower quality but reliable
    reliability_score: 0.95,
    endpoint: process.env.FALLBACK_PROVIDER_URL,
    timeout_ms: 10000,
    max_retries: 3
  };
  
  this.providers.set(fallbackProvider.provider_id, fallbackProvider);
}
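
For solution 2 (reviewing the scoring algorithm), it may help to reproduce the ranking outside the service. The sketch below is one possible benefit/cost formula, not the formula `providerScorer` actually uses, which this guide does not show. The `rankProviders` helper and its weighting are assumptions; the field names follow `ProviderCapability` above.

```typescript
interface Provider {
  provider_id: string;
  quality_score: number;      // 0..1
  reliability_score: number;  // 0..1
  avg_latency_ms: number;
  cost_per_request_cents: number;
}

interface Limits {
  max_cost_cents: number;
  max_latency_ms: number;
}

// Hypothetical score: expected benefit divided by the share of budget consumed.
// Providers that exceed either limit are filtered out before scoring.
function rankProviders(
  providers: Provider[],
  limits: Limits
): Array<{ provider_id: string; score: number }> {
  return providers
    .filter(p =>
      p.cost_per_request_cents <= limits.max_cost_cents &&
      p.avg_latency_ms <= limits.max_latency_ms)
    .map(p => {
      const benefit = p.quality_score * p.reliability_score;
      // Fraction of the budget used, averaged over cost and latency.
      const cost =
        (p.cost_per_request_cents / limits.max_cost_cents +
          p.avg_latency_ms / limits.max_latency_ms) / 2;
      return { provider_id: p.provider_id, score: benefit / Math.max(cost, 0.01) };
    })
    .sort((a, b) => b.score - a.score);
}
```

If the result is empty for a realistic budget, the "No providers available" error is likely caused by the filter step rather than by missing registrations.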

Issue: Provider API calls failing

Symptoms:

  • Timeout errors
  • Authentication failures
  • Malformed responses

Debugging Steps:

bash
# Test provider connectivity directly
curl -X POST https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 50
  }'

# Check network connectivity
ping api.openai.com
nslookup api.openai.com

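# Check that the API key is set (printf avoids counting a trailing newline)
# Legacy keys are ~51 characters; newer key formats are longer. 0 means the variable is unset.
printf '%s' "$OPENAI_API_KEY" | wc -c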

Common Solutions:

typescript
// Implement robust HTTP client with retry logic
private async makeProviderRequest(
  provider: ProviderCapability,
  query: string,
  options: Record<string, any>
): Promise<any> {
  const maxRetries = provider.max_retries || 2; // total attempts, including the first
  let lastError: Error = new Error('No attempts were made');

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      console.log(`Attempt ${attempt}/${maxRetries} for provider ${provider.name}`);
      
      const response = await this.httpClient({
        method: 'POST',
        url: provider.endpoint,
        data: this.buildRequestPayload(provider, query, options),
        timeout: provider.timeout_ms,
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${provider.api_key || process.env.DEFAULT_API_KEY}`,
          'User-Agent': 'Mnemoverse-Orchestration/1.0',
          ...options.headers
        }
      });

      // Validate response structure
      if (!response.data) {
        throw new Error('Empty response from provider');
      }

      return this.parseProviderResponse(provider, response.data);

    } catch (error: any) { // typed as any so .message and .response are accessible
      lastError = error;
      console.warn(`Provider ${provider.name} attempt ${attempt} failed:`, error.message);
      
      // Don't retry on authentication errors
      if (error.response?.status === 401 || error.response?.status === 403) {
        throw new Error(`Authentication failed for provider ${provider.name}: ${error.message}`);
      }
      
      // Exponential backoff between retries
      if (attempt < maxRetries) {
        const backoffMs = Math.min(1000 * Math.pow(2, attempt), 10000);
        console.log(`Retrying in ${backoffMs}ms...`);
        await new Promise(resolve => setTimeout(resolve, backoffMs));
      }
    }
  }

  throw new Error(`All ${maxRetries} attempts failed for provider ${provider.name}. Last error: ${lastError.message}`);
}
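
The backoff formula in the retry loop is easy to misread, so here are the delays it produces. `backoffMs` reproduces the expression from `makeProviderRequest`; the jitter variant is an addition for illustration and is not part of the snippet above.

```typescript
// Same formula as in makeProviderRequest: 1000 * 2^attempt, capped at 10 seconds.
function backoffMs(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt), 10000);
}

// Delays inserted after failed attempts 1..4: 2000, 4000, 8000, 10000 ms.
const delays = [1, 2, 3, 4].map(attempt => backoffMs(attempt));

// Optional jitter (an assumption, not in the original): spreads retries from
// many clients so they do not all hit the provider at the same moment.
function backoffWithJitterMs(attempt: number, random: () => number = Math.random): number {
  const capped = backoffMs(attempt);
  return Math.round(capped / 2 + random() * (capped / 2));
}
```

With `max_retries: 3`, a request that fails every time waits 2 s and 4 s between attempts, so the worst case adds 6 s on top of three provider timeouts. Compare that total against `max_latency_ms` before raising the retry count.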

Issue: High latency or poor performance

Symptoms:

  • Response times > 5 seconds
  • Provider scoring takes too long
  • Memory usage increasing over time

Debugging Steps:

typescript
// Add performance profiling
const startTime = performance.now();

// Profile provider scoring
console.time('provider-scoring');
const scores = this.providerScorer.scoreProviders(providers, intent, budget);
console.timeEnd('provider-scoring');

// Profile execution
console.time('provider-execution');
const result = await this.executionEngine.executeWithFallback(selectedProviders, query);
console.timeEnd('provider-execution');

const totalTime = performance.now() - startTime;
console.log(`Total request time: ${totalTime}ms`);

// Check memory usage
const memUsage = process.memoryUsage();
console.log('Memory usage:', {
  heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + ' MB',
  heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + ' MB',
  external: Math.round(memUsage.external / 1024 / 1024) + ' MB'
});

Performance Solutions:

typescript
// Implement caching for provider scoring
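class ProviderScorerWithCache extends BasicProviderScorer {
  // Store the insertion time with each entry so it can expire
  private scoreCache = new Map<string, { scores: ProviderScore[]; cachedAt: number }>();
  private cacheMaxAge = 5 * 60 * 1000; // 5 minutes

  scoreProviders(providers: ProviderCapability[], intent: ParsedIntent, budget: ResourceBudget): ProviderScore[] {
    const cacheKey = this.getCacheKey(providers, intent, budget);
    const cached = this.scoreCache.get(cacheKey);
    
    if (cached && this.isCacheValid(cached.cachedAt)) {
      console.log('Using cached provider scores');
      return cached.scores;
    }

    const scores = super.scoreProviders(providers, intent, budget);
    // Delete first so a refreshed key moves to the end of the Map's insertion order
    this.scoreCache.delete(cacheKey);
    this.scoreCache.set(cacheKey, { scores, cachedAt: Date.now() });
    
    // Clean up old cache entries
    this.cleanupCache();
    
    return scores;
  }

  private isCacheValid(cachedAt: number): boolean {
    return Date.now() - cachedAt < this.cacheMaxAge;
  }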

  private getCacheKey(providers: ProviderCapability[], intent: ParsedIntent, budget: ResourceBudget): string {
    const providerIds = providers.map(p => p.provider_id).sort().join(',');
    const intentKey = `${intent.domain}-${intent.complexity}-${intent.urgency}`;
    const budgetKey = `${budget.max_cost_cents}-${budget.max_latency_ms}-${budget.quality_threshold}`;
    return `${providerIds}:${intentKey}:${budgetKey}`;
  }

  private cleanupCache(): void {
    if (this.scoreCache.size > 1000) { // Limit cache size
      const oldestKeys = Array.from(this.scoreCache.keys()).slice(0, 100);
      oldestKeys.forEach(key => this.scoreCache.delete(key));
    }
  }
}

Error Handling Patterns

Graceful Degradation

typescript
// Implement graceful fallbacks
async executeRequest(request: ACSRequest): Promise<ACSResponse> {
  try {
    return await this.executeFullRequest(request);
  } catch (error) {
    console.warn('Full execution failed, attempting degraded mode:', error.message);
    
    // Fallback 1: Try with relaxed budget
    try {
      const relaxedBudget = this.relaxBudget(request.budget);
      return await this.executeWithRelaxedBudget(request, relaxedBudget);
    } catch (relaxedError) {
      console.warn('Relaxed budget execution failed:', relaxedError.message);
      
      // Fallback 2: Return cached or default response
      const cachedResponse = await this.getCachedResponse(request);
      if (cachedResponse) {
        return cachedResponse;
      }
      
      // Fallback 3: Return error response with partial data
      return this.createErrorResponse(request, [error.message, relaxedError.message]);
    }
  }
}

private relaxBudget(budget: ResourceBudget): ResourceBudget {
  return {
    ...budget,
    max_latency_ms: budget.max_latency_ms * 1.5,
    max_cost_cents: budget.max_cost_cents * 1.2,
    quality_threshold: Math.max(0.5, budget.quality_threshold - 0.1),
    max_providers: budget.max_providers + 1
  };
}

Circuit Breaker Implementation

typescript
class ProviderCircuitBreaker {
  private failures = new Map<string, number>();
  private lastFailureTime = new Map<string, number>();
  private readonly failureThreshold = 5;
  private readonly timeoutMs = 60000; // 1 minute

  async callProvider(providerId: string, operation: () => Promise<any>): Promise<any> {
    if (this.isCircuitOpen(providerId)) {
      throw new Error(`Circuit breaker open for provider ${providerId}`);
    }

    try {
      const result = await operation();
      this.onSuccess(providerId);
      return result;
    } catch (error) {
      this.onFailure(providerId);
      throw error;
    }
  }

  private isCircuitOpen(providerId: string): boolean {
    const failures = this.failures.get(providerId) || 0;
    const lastFailure = this.lastFailureTime.get(providerId) || 0;
    
    if (failures >= this.failureThreshold) {
      if (Date.now() - lastFailure < this.timeoutMs) {
        return true; // Circuit is open
      } else {
        // Reset after timeout
        this.failures.set(providerId, 0);
        return false;
      }
    }
    
    return false;
  }

  private onSuccess(providerId: string): void {
    this.failures.set(providerId, 0);
    this.lastFailureTime.delete(providerId);
  }

  private onFailure(providerId: string): void {
    const currentFailures = this.failures.get(providerId) || 0;
    this.failures.set(providerId, currentFailures + 1);
    this.lastFailureTime.set(providerId, Date.now());
  }
}
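
The breaker above resets its failure count completely once the timeout passes, so a provider that is still down gets five more attempts before the circuit opens again. A common variation adds a half-open state, where a single failed probe reopens the circuit immediately. The following is a compact single-provider sketch of that variation; the `HalfOpenBreaker` class and its injectable clock are assumptions added for illustration and testing.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

class HalfOpenBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';

  constructor(
    private readonly failureThreshold = 5,
    private readonly timeoutMs = 60000,
    // Injectable clock so behaviour can be tested without waiting.
    private readonly now: () => number = Date.now
  ) {}

  getState(): BreakerState {
    // After the timeout, let probe requests through.
    if (this.state === 'open' && this.now() - this.openedAt >= this.timeoutMs) {
      this.state = 'half_open';
    }
    return this.state;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or too many failures, (re)opens the circuit.
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.getState() === 'open') {
      throw new Error('Circuit breaker open');
    }
    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
}
```

To cover several providers, keep one breaker instance per provider ID in a `Map`, as the original class does with its counters.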

Debugging Tools and Commands

Log Analysis Commands

bash
# Find error patterns in logs
grep -E "(ERROR|WARN|timeout|failed)" /var/log/orchestration/app.log | tail -20

# Analyze request flow for specific request ID
grep "request-123" /var/log/orchestration/app.log | \
  awk '{print $1, $2, $NF}' | \
  sort

# Find slow requests (> 3 seconds)
grep "total_latency_ms" /var/log/orchestration/app.log | \
  awk -F'total_latency_ms":' '{print $2}' | \
  awk -F',' '{if ($1 + 0 > 3000) print $0}'

# Check provider success rates (count outcomes, not whole lines)
grep "provider_execution" /var/log/orchestration/app.log | \
  grep -oE "(success|failed)" | \
  sort | uniq -c
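
If the log is written as one JSON object per line (the monitoring script later in this guide parses lines with `jq`, which suggests so), the same analysis can be done in TypeScript. This is a sketch under that assumption. The `event` and `status` field names are hypothetical; only `total_latency_ms` appears in the commands above.

```typescript
// Assumed log shape: one JSON object per line. Adjust to your actual schema.
interface LogEntry {
  event?: string;
  status?: string;
  total_latency_ms?: number;
}

function parseLines(raw: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of raw.split('\n')) {
    if (!line.trim()) continue;
    try {
      entries.push(JSON.parse(line) as LogEntry);
    } catch {
      // Skip lines that are not valid JSON (stack traces, banners, ...).
    }
  }
  return entries;
}

function summarize(raw: string, slowThresholdMs = 3000) {
  const entries = parseLines(raw);
  const slow = entries.filter(e => (e.total_latency_ms ?? 0) > slowThresholdMs);
  const executions = entries.filter(e => e.event === 'provider_execution');
  const succeeded = executions.filter(e => e.status === 'success').length;
  return {
    slowRequests: slow.length,
    // null when there were no provider executions to measure.
    providerSuccessRate: executions.length ? succeeded / executions.length : null,
  };
}
```

Pass it the file contents, for example from `fs.readFileSync(path, 'utf8')`, to get a slow-request count and a success rate in one pass.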

Health Check Scripts

bash
#!/bin/bash
# orchestration-health-check.sh

echo "=== Orchestration Health Check ==="

# Make a failed curl fail the whole pipeline so the fallback messages fire
set -o pipefail

# 1. Service status
echo "1. Service Status:"
curl -sf http://localhost:3000/health | jq '.' || echo "❌ Service unreachable"

# 2. Component health
echo -e "\n2. Component Health:"
curl -sf http://localhost:3000/health/ceo | jq '.status' || echo "❌ CEO unhealthy"
curl -sf http://localhost:3000/health/acs | jq '.status' || echo "❌ ACS unhealthy"

# 3. Provider connectivity
echo -e "\n3. Provider Connectivity:"
curl -s http://localhost:3000/debug/providers | jq '.[] | {id: .provider_id, status: .status}'

# 4. Recent error rate (over the last 1000 log lines)
echo -e "\n4. Recent Error Rate:"
ERROR_COUNT=$(tail -n 1000 /var/log/orchestration/app.log | grep -c "ERROR")
TOTAL_COUNT=$(tail -n 1000 /var/log/orchestration/app.log | grep -c "request_completed")
if [ "${TOTAL_COUNT:-0}" -gt 0 ]; then
  ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_COUNT" | bc)
  echo "Error rate: ${ERROR_RATE}% (${ERROR_COUNT}/${TOTAL_COUNT})"
else
  echo "Error rate: n/a (no completed requests in the last 1000 log lines)"
fi

# 5. Resource usage
echo -e "\n5. Resource Usage:"
ps aux | grep '[o]rchestration' | awk '{print "CPU: " $3 "%, Memory: " $4 "%"}'

echo -e "\n=== Health Check Complete ==="

Performance Monitoring Script

bash
#!/bin/bash
# performance-monitor.sh

# Monitor request latency in real-time
tail -f /var/log/orchestration/app.log | \
  grep --line-buffered "request_completed" | \
  while IFS= read -r line; do
    REQUEST_ID=$(echo "$line" | jq -r '.request_id')
    LATENCY=$(echo "$line" | jq -r '.total_latency_ms')
    PROVIDER=$(echo "$line" | jq -r '.providers_used[0]')
    
    echo "$(date): Request $REQUEST_ID - ${LATENCY}ms - Provider: $PROVIDER"
    
    # Alert on high latency
    if (( $(echo "$LATENCY > 5000" | bc -l) )); then
      echo "⚠️  HIGH LATENCY ALERT: ${LATENCY}ms"
    fi
  done

Recovery Procedures

Service Recovery Steps

  1. Immediate Response (0-2 minutes)

    bash
    # Check if service is responsive; restart it only if the check fails
    if ! curl -sf http://localhost:3000/health > /dev/null; then
      systemctl restart orchestration
      
      # Verify restart
      sleep 10
      curl -f http://localhost:3000/health
    fi

  2. Investigation (2-10 minutes)

    bash
    # Check recent logs for root cause
    journalctl -u orchestration --since "10 minutes ago" --no-pager
    
    # Check system resources
    free -h
    df -h
    
    # Check provider connectivity
    curl -s http://localhost:3000/debug/providers

  3. Resolution (10-30 minutes)

    • Fix identified issues (network, config, resources)
    • Deploy fixes if code issues found
    • Update monitoring alerts if needed

Provider Failover Procedure

typescript
// Automatic failover implementation
class ProviderFailoverManager {
  private unhealthyProviders = new Set<string>();
  private healthCheckInterval = 60000; // 1 minute

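  // The methods below use this.providerRegistry and this.metricsCollector,
  // so inject them here. The type names are assumed; use your own.
  constructor(
    private readonly providerRegistry: ProviderRegistry,
    private readonly metricsCollector: MetricsCollector
  ) {
    this.startHealthChecks();
  }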

  private startHealthChecks(): void {
    setInterval(async () => {
      await this.checkProviderHealth();
    }, this.healthCheckInterval);
  }

  private async checkProviderHealth(): Promise<void> {
    const providers = this.providerRegistry.getAllProviders();
    
    for (const provider of providers) {
      try {
        await this.testProviderConnection(provider);
        this.markProviderHealthy(provider.provider_id);
      } catch (error) {
        console.warn(`Provider ${provider.name} health check failed:`, error.message);
        this.markProviderUnhealthy(provider.provider_id);
      }
    }
  }

  private markProviderUnhealthy(providerId: string): void {
    // Only log and notify on the healthy -> unhealthy transition
    if (this.unhealthyProviders.has(providerId)) return;

    this.unhealthyProviders.add(providerId);
    console.log(`Provider ${providerId} marked as unhealthy`);
    
    // Notify monitoring system
    this.metricsCollector.recordProviderHealthChange(providerId, false);
  }

  private markProviderHealthy(providerId: string): void {
    if (this.unhealthyProviders.has(providerId)) {
      this.unhealthyProviders.delete(providerId);
      console.log(`Provider ${providerId} recovered and marked as healthy`);
      
      // Notify monitoring system
      this.metricsCollector.recordProviderHealthChange(providerId, true);
    }
  }

  getHealthyProviders(providers: ProviderCapability[]): ProviderCapability[] {
    return providers.filter(p => !this.unhealthyProviders.has(p.provider_id));
  }
}

Prevention Strategies

Proactive Monitoring

typescript
// Implement early warning system
class EarlyWarningSystem {
  private thresholds = {
    errorRate: 0.02,        // 2%
    avgLatency: 3000,       // 3 seconds
    providerFailures: 0.1,  // 10%
    budgetBurnRate: 50      // cents per hour
  };

  checkMetrics(metrics: SystemMetrics): Warning[] {
    const warnings: Warning[] = [];

    if (metrics.error_rate > this.thresholds.errorRate) {
      warnings.push({
        type: 'error_rate_high',
        severity: 'warning',
        message: `Error rate ${(metrics.error_rate * 100).toFixed(1)}% exceeds threshold`,
        action: 'Investigate recent error patterns'
      });
    }

    if (metrics.avg_response_time_ms > this.thresholds.avgLatency) {
      warnings.push({
        type: 'latency_high',
        severity: 'warning',
        message: `Average latency ${metrics.avg_response_time_ms}ms exceeds threshold`,
        action: 'Check provider performance and system resources'
      });
    }

    return warnings;
  }
}
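
The `providerFailures` and `budgetBurnRate` thresholds are defined above but never checked in `checkMetrics`. A sketch of the two missing checks follows. The metric field names (`provider_failure_rate`, `budget_burn_cents_per_hour`) and the `MetricWarning` type are assumptions, since `SystemMetrics` and `Warning` are not shown in this guide.

```typescript
interface MetricWarning {
  type: string;
  severity: 'warning' | 'critical';
  message: string;
  action: string;
}

// Hypothetical metric fields for the two unchecked thresholds.
interface ExtraMetrics {
  provider_failure_rate: number;       // fraction, 0..1
  budget_burn_cents_per_hour: number;
}

// Values copied from the thresholds object above.
const extraThresholds = { providerFailures: 0.1, budgetBurnRate: 50 };

function checkExtraMetrics(metrics: ExtraMetrics): MetricWarning[] {
  const warnings: MetricWarning[] = [];

  if (metrics.provider_failure_rate > extraThresholds.providerFailures) {
    warnings.push({
      type: 'provider_failures_high',
      severity: 'warning',
      message: `Provider failure rate ${(metrics.provider_failure_rate * 100).toFixed(1)}% exceeds threshold`,
      action: 'Check provider health and circuit breaker state',
    });
  }

  if (metrics.budget_burn_cents_per_hour > extraThresholds.budgetBurnRate) {
    warnings.push({
      type: 'budget_burn_high',
      severity: 'warning',
      message: `Budget burn ${metrics.budget_burn_cents_per_hour} cents/hour exceeds threshold`,
      action: 'Review budget allocation and provider selection',
    });
  }

  return warnings;
}
```

In practice these checks would live inside `checkMetrics` alongside the error-rate and latency checks.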

Capacity Planning

typescript
// Resource usage prediction
class CapacityPlanner {
  predictResourceNeeds(historicalMetrics: SystemMetrics[], growthRate: number): ResourceForecast {
    const avgRequestsPerSecond = this.calculateAverage(historicalMetrics.map(m => m.requests_per_second));
    const avgCostPerRequest = this.calculateAverage(historicalMetrics.map(m => m.cost_per_request));
    
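    // Projected total requests over the next 30 days (30 * 24 * 3600 seconds),
    // with the growth rate applied to the whole period
    const futureRequests = avgRequestsPerSecond * (1 + growthRate) * 30 * 24 * 3600;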
    const futureCost = futureRequests * avgCostPerRequest;
    
    return {
      predicted_requests: futureRequests,
      predicted_cost_cents: futureCost,
      recommended_budget: futureCost * 1.2, // 20% buffer
      scaling_recommendations: this.generateScalingRecommendations(futureRequests)
    };
  }
}
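
A worked example makes the scale of the forecast concrete. This sketch repeats the arithmetic from `predictResourceNeeds` as a standalone function; `forecast30Days` and the input values are hypothetical, and cost per request is assumed to be in cents.

```typescript
const SECONDS_IN_30_DAYS = 30 * 24 * 3600; // 2,592,000

function forecast30Days(
  avgRequestsPerSecond: number,
  avgCostPerRequestCents: number,
  growthRate: number
) {
  const predictedRequests = avgRequestsPerSecond * (1 + growthRate) * SECONDS_IN_30_DAYS;
  const predictedCostCents = predictedRequests * avgCostPerRequestCents;
  return {
    predicted_requests: Math.round(predictedRequests),
    predicted_cost_cents: Math.round(predictedCostCents),
    recommended_budget_cents: Math.round(predictedCostCents * 1.2), // 20% buffer
  };
}

// Hypothetical inputs: 2 requests/s, 0.5 cents per request, 25% growth.
const forecast = forecast30Days(2, 0.5, 0.25);
// 2 * 1.25 * 2,592,000 = 6,480,000 requests
// 6,480,000 * 0.5      = 3,240,000 cents
// 3,240,000 * 1.2      = 3,888,000 cents recommended budget
```

Even a modest 2 requests per second adds up to millions of requests a month, so small changes in cost per request move the recommended budget noticeably.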

Quick Reference

Emergency Contacts

  • On-Call Engineer: [Your on-call system]
  • Platform Team: [Team contact]
  • Provider Support: [Provider contact info]

Important URLs

Critical Commands

bash
# Service control
systemctl status orchestration
systemctl restart orchestration

# Quick health check
curl http://localhost:3000/health

# View recent errors
journalctl -u orchestration --since "30 minutes ago" | grep ERROR

# Check system resources
htop
