Orchestration Troubleshooting Guide
A troubleshooting guide for diagnosing and resolving common issues in the orchestration layer components. It covers debugging workflows, error patterns, and proven solutions.
Quick Diagnostics
System Health Check (30 seconds)
bash
# 1. Check component health endpoints
curl http://localhost:3000/health/ceo
curl http://localhost:3000/health/acs
curl http://localhost:3000/health/providers
# 2. Check recent logs for errors
tail -n 100 /var/log/orchestration/app.log | grep -i error
# 3. Check system resources
ps aux | grep '[o]rchestration'  # bracket trick keeps grep itself out of the results
free -h
df -h
# 4. Test basic functionality
curl -X POST http://localhost:3000/api/query \
-H "Content-Type: application/json" \
-d '{"text": "test query", "user_id": "debug"}'
Common Error Patterns
text
❌ "Provider timeout" → Network/latency issue
❌ "Budget exhausted" → Resource allocation problem
❌ "Intent parsing failed" → Input validation issue
❌ "No providers available" → Configuration/registry problem
❌ "Quality threshold not met" → Provider selection issue
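For log triage, the mapping above can be expressed as a small lookup. The following is a minimal sketch, not part of the orchestration codebase: `classifyError` and `ERROR_PATTERNS` are illustrative names, and the sketch assumes that error messages contain the quoted phrases verbatim.

```typescript
// Hypothetical helper: maps an error message to the likely cause listed above.
// Assumes messages contain the quoted phrases; matching is case-insensitive.
const ERROR_PATTERNS: Array<[string, string]> = [
  ["provider timeout", "Network/latency issue"],
  ["budget exhausted", "Resource allocation problem"],
  ["intent parsing failed", "Input validation issue"],
  ["no providers available", "Configuration/registry problem"],
  ["quality threshold not met", "Provider selection issue"],
];

function classifyError(message: string): string {
  const lower = message.toLowerCase();
  for (const [phrase, cause] of ERROR_PATTERNS) {
    if (lower.indexOf(phrase) !== -1) return cause;
  }
  return "Unknown (check the full log context)";
}

console.log(classifyError("Error: Provider timeout after 5000ms")); // "Network/latency issue"
```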
Component-Specific Troubleshooting
CEO (Context/Execution Orchestrator) Issues
Issue: Intent parsing fails
Symptoms:
- Queries return generic "search" intent
- Domain classification incorrect
- Complexity assessment wrong
Debugging Steps:
typescript
// Enable debug logging
const ceo = new BasicCEO();
ceo.setDebugMode(true);
// Test intent parser directly
const intent = ceo.intentParser.parseIntent({
text: "your problematic query",
user_id: "debug"
});
console.log('Parsed intent:', intent);
// Check domain keywords
console.log('Domain keywords:', ceo.intentParser.domainKeywords);
Common Solutions:
- Update domain keywords: Add missing keywords for your domain
- Improve complexity indicators: Adjust complexity assessment rules
- Handle edge cases: Add validation for empty/malformed queries
Code Fix Examples:
typescript
// Fix 1: Add domain-specific keywords
private domainKeywords = {
code: ['function', 'class', 'bug', 'error', 'debug', 'implement', 'refactor',
'typescript', 'javascript', 'python', 'api', 'database'], // Added more
// ...
};
// Fix 2: Improve empty query handling
parseIntent(query: UserQuery): ParsedIntent {
if (!query.text || query.text.trim().length === 0) {
throw new Error('Query text cannot be empty');
}
const text = query.text.toLowerCase();
// ... rest of parsing logic
}
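To see why missing keywords lead to a generic intent, it can help to model domain classification as a keyword count. The intent parser's actual logic is not shown in this guide, so the sketch below is an assumption about how the keyword table might be used: `classifyDomain` is a hypothetical function, and only the `code` keywords come from the fix above.

```typescript
// Illustrative keyword table; only the `code` entries come from the fix above.
const domainKeywords: Record<string, string[]> = {
  code: ["function", "class", "bug", "error", "debug", "implement", "refactor"],
  documentation: ["docs", "readme", "guide"], // assumed keywords
  research: ["paper", "study", "compare"],    // assumed keywords
};

// Returns the domain with the most keyword hits, or "general" when nothing matches.
function classifyDomain(text: string): string {
  const words = text.toLowerCase().split(/\W+/);
  let best = "general";
  let bestHits = 0;
  for (const domain of Object.keys(domainKeywords)) {
    const hits = words.filter(w => domainKeywords[domain].indexOf(w) !== -1).length;
    if (hits > bestHits) {
      best = domain;
      bestHits = hits;
    }
  }
  return best;
}

console.log(classifyDomain("Refactor this function to fix the bug")); // "code"
console.log(classifyDomain("hello there"));                            // "general"
```

Under this model, a query whose vocabulary is absent from every list falls through to the default, which matches the "generic intent" symptom.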
Issue: Budget allocation too restrictive/generous
Symptoms:
- Requests frequently time out (too restrictive)
- Costs exceed expected limits (too generous)
- Quality doesn't match requirements
Debugging Steps:
typescript
// Test budget allocator with different scenarios
const budgetAllocator = new BasicBudgetAllocator();
const testCases = [
{ complexity: 'simple', urgency: 'low', domain: 'code' },
{ complexity: 'complex', urgency: 'high', domain: 'research' }
];
testCases.forEach(testCase => {
const budget = budgetAllocator.allocateBudget(testCase);
console.log(`${JSON.stringify(testCase)} → Budget:`, budget);
});
Common Solutions:
typescript
// Adjust multipliers in budget allocator
private urgencyModifiers = {
low: { latency: 2.0, cost: 0.7 }, // Less restrictive on cost
medium: { latency: 1.0, cost: 1.0 },
high: { latency: 0.6, cost: 1.5 } // More budget for urgent requests
};
// Clamp budgets (including user preference overrides) to safe ranges
private validateBudget(budget: ResourceBudget): ResourceBudget {
return {
max_latency_ms: Math.max(2000, Math.min(15000, budget.max_latency_ms)), // Wider range
max_cost_cents: Math.max(2, Math.min(100, budget.max_cost_cents)),
quality_threshold: Math.max(0.6, Math.min(0.9, budget.quality_threshold)),
max_providers: Math.max(1, Math.min(3, budget.max_providers))
};
}
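The effect of the multipliers and the clamping ranges is easier to judge with concrete numbers. The sketch below assumes the allocator scales a base budget by the urgency modifier and then clamps the result; the guide does not show how the two steps are combined, so `applyUrgency` and the rounding are illustrative choices.

```typescript
interface Budget { max_latency_ms: number; max_cost_cents: number; }
type Urgency = "low" | "medium" | "high";

// Multipliers copied from the allocator fix above.
const urgencyModifiers: Record<Urgency, { latency: number; cost: number }> = {
  low: { latency: 2.0, cost: 0.7 },
  medium: { latency: 1.0, cost: 1.0 },
  high: { latency: 0.6, cost: 1.5 },
};

const clamp = (value: number, min: number, max: number): number =>
  Math.max(min, Math.min(max, value));

// Hypothetical helper: scale by urgency, round to whole units, then clamp
// to the ranges used in validateBudget above.
function applyUrgency(baseBudget: Budget, urgency: Urgency): Budget {
  const m = urgencyModifiers[urgency];
  return {
    max_latency_ms: clamp(Math.round(baseBudget.max_latency_ms * m.latency), 2000, 15000),
    max_cost_cents: clamp(Math.round(baseBudget.max_cost_cents * m.cost), 2, 100),
  };
}

const baseBudget: Budget = { max_latency_ms: 5000, max_cost_cents: 20 };
console.log(applyUrgency(baseBudget, "high")); // { max_latency_ms: 3000, max_cost_cents: 30 }
console.log(applyUrgency(baseBudget, "low"));  // { max_latency_ms: 10000, max_cost_cents: 14 }
```

A base latency of 10000 ms with low urgency would scale to 20000 ms and be clamped to 15000 ms, which is one way a budget can end up tighter than the multipliers alone suggest.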
ACS (Adaptive Context Scaling) Issues
Issue: Provider selection fails
Symptoms:
- "No providers available for domain" error
- Wrong provider selected for query type
- Fallback providers not working
Debugging Steps:
typescript
// Check provider registry
const acs = new BasicACS();
const allProviders = acs.providerRegistry.getAllProviders();
console.log('Available providers:', allProviders.map(p => ({ id: p.provider_id, domains: p.domains })));
// Test provider scoring
const intent = { domain: 'code', complexity: 'medium', urgency: 'medium' };
const budget = { max_cost_cents: 20, max_latency_ms: 5000 };
const candidateProviders = acs.providerRegistry.getProvidersByDomain(intent.domain);
console.log('Candidate providers:', candidateProviders);
const scores = acs.providerScorer.scoreProviders(candidateProviders, intent, budget);
console.log('Provider scores:', scores);
Common Solutions:
- Check provider registration: Ensure providers are properly registered for the domain
- Review scoring algorithm: Adjust benefit/cost calculations
- Add fallback providers: Register general-purpose providers as fallbacks
typescript
// Fix: Add fallback provider registration
constructor() {
this.initializeDefaultProviders();
this.addFallbackProviders(); // Add this
}
private addFallbackProviders(): void {
// General-purpose provider as fallback
const fallbackProvider: ProviderCapability = {
provider_id: 'general_fallback',
name: 'General Purpose LLM',
type: 'llm',
domains: ['code', 'documentation', 'research', 'general'], // All domains
capabilities: ['completion'],
avg_latency_ms: 3000,
cost_per_request_cents: 12,
quality_score: 0.75, // Lower quality but reliable
reliability_score: 0.95,
endpoint: process.env.FALLBACK_PROVIDER_URL,
timeout_ms: 10000,
max_retries: 3
};
this.providers.set(fallbackProvider.provider_id, fallbackProvider);
}
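When reviewing benefit/cost calculations, a reduced model of the scorer makes it easier to reason about which provider should win. The formula below is an assumption for illustration only, since `BasicProviderScorer` is not shown in this guide: benefit is quality times reliability, cost is the average share of the budget consumed, and providers that exceed the budget score zero. `code_specialist` is a made-up provider; `general_fallback` uses the values from the registration above.

```typescript
interface Provider {
  provider_id: string;
  avg_latency_ms: number;
  cost_per_request_cents: number;
  quality_score: number;
  reliability_score: number;
}
interface Budget { max_cost_cents: number; max_latency_ms: number; }

// Assumed formula, not the real scorer: benefit / (1 + average budget share used).
function scoreProvider(p: Provider, budget: Budget): number {
  if (p.cost_per_request_cents > budget.max_cost_cents) return 0;
  if (p.avg_latency_ms > budget.max_latency_ms) return 0;
  const benefit = p.quality_score * p.reliability_score;
  const costShare = p.cost_per_request_cents / budget.max_cost_cents;
  const latencyShare = p.avg_latency_ms / budget.max_latency_ms;
  return benefit / (1 + (costShare + latencyShare) / 2);
}

// Drops providers that do not fit the budget and sorts the rest, best first.
function rankProviders(providers: Provider[], budget: Budget): Provider[] {
  return providers
    .filter(p => scoreProvider(p, budget) > 0)
    .sort((a, b) => scoreProvider(b, budget) - scoreProvider(a, budget));
}

const candidates: Provider[] = [
  { provider_id: "general_fallback", avg_latency_ms: 3000, cost_per_request_cents: 12, quality_score: 0.75, reliability_score: 0.95 },
  { provider_id: "code_specialist", avg_latency_ms: 2000, cost_per_request_cents: 10, quality_score: 0.9, reliability_score: 0.99 },
];
const ranked = rankProviders(candidates, { max_cost_cents: 20, max_latency_ms: 5000 });
console.log(ranked.map(p => p.provider_id)); // [ 'code_specialist', 'general_fallback' ]
```

In a model like this, an empty result means every candidate was filtered out by the budget, which is worth ruling out before assuming the registry is misconfigured.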
Issue: Provider API calls failing
Symptoms:
- Timeout errors
- Authentication failures
- Malformed responses
Debugging Steps:
bash
# Test provider connectivity directly
curl -X POST https://api.openai.com/v1/chat/completions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "test"}],
"max_tokens": 50
}'
# Check network connectivity
ping -c 4 api.openai.com  # -c 4 stops after four packets; ICMP may be blocked on some networks
nslookup api.openai.com
# Check API key validity
printf '%s' "$OPENAI_API_KEY" | wc -c # 0 means the variable is unset or empty; expected length depends on the key format
Common Solutions:
typescript
// Implement robust HTTP client with retry logic
private async makeProviderRequest(
provider: ProviderCapability,
query: string,
options: Record<string, any>
): Promise<any> {
const maxRetries = provider.max_retries || 2;
let lastError: Error = new Error('No attempts were made'); // initialized so the final throw always has a message
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
console.log(`Attempt ${attempt}/${maxRetries} for provider ${provider.name}`);
const response = await this.httpClient({
method: 'POST',
url: provider.endpoint,
data: this.buildRequestPayload(provider, query, options),
timeout: provider.timeout_ms,
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${provider.api_key || process.env.DEFAULT_API_KEY}`,
'User-Agent': 'Mnemoverse-Orchestration/1.0',
...options.headers
}
});
// Validate response structure
if (!response.data) {
throw new Error('Empty response from provider');
}
return this.parseProviderResponse(provider, response.data);
} catch (error) {
lastError = error;
console.warn(`Provider ${provider.name} attempt ${attempt} failed:`, error.message);
// Don't retry on authentication errors
if (error.response?.status === 401 || error.response?.status === 403) {
throw new Error(`Authentication failed for provider ${provider.name}: ${error.message}`);
}
// Exponential backoff between retries
if (attempt < maxRetries) {
const backoffMs = Math.min(1000 * Math.pow(2, attempt), 10000);
console.log(`Retrying in ${backoffMs}ms...`);
await new Promise(resolve => setTimeout(resolve, backoffMs));
}
}
}
throw new Error(`All ${maxRetries} attempts failed for provider ${provider.name}. Last error: ${lastError.message}`);
}
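The backoff formula in the retry loop determines how long a failing provider can hold a request. The sketch below reuses that formula to list the delays and estimate the worst case when every attempt runs into the timeout; `backoffSchedule` and `worstCaseMs` are illustrative helper names, not existing functions.

```typescript
// Same formula as makeProviderRequest above: 1000 * 2^attempt, capped at 10 seconds.
function backoffMs(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt), 10000);
}

// Delays slept between attempts; no delay follows the final attempt.
function backoffSchedule(maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxRetries; attempt++) {
    delays.push(backoffMs(attempt));
  }
  return delays;
}

// Worst case: every attempt times out, plus all backoff delays.
function worstCaseMs(maxRetries: number, timeoutMs: number): number {
  const waits = backoffSchedule(maxRetries).reduce((sum, d) => sum + d, 0);
  return maxRetries * timeoutMs + waits;
}

console.log(backoffSchedule(3));     // [ 2000, 4000 ]
console.log(backoffSchedule(5));     // [ 2000, 4000, 8000, 10000 ]
console.log(worstCaseMs(3, 10000));  // 36000
```

With the fallback provider's settings (`timeout_ms: 10000`, `max_retries: 3`), the worst case is 36 seconds, well above the 15000 ms ceiling that `validateBudget` allows for `max_latency_ms`. If requests must respect the latency budget, the per-attempt timeout or the retry count likely needs to be derived from the remaining budget.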
Issue: High latency or poor performance
Symptoms:
- Response times > 5 seconds
- Provider scoring takes too long
- Memory usage increasing over time
Debugging Steps:
typescript
// Add performance profiling
const startTime = performance.now();
// Profile provider scoring
console.time('provider-scoring');
const scores = this.providerScorer.scoreProviders(providers, intent, budget);
console.timeEnd('provider-scoring');
// Profile execution
console.time('provider-execution');
const result = await this.executionEngine.executeWithFallback(selectedProviders, query);
console.timeEnd('provider-execution');
const totalTime = performance.now() - startTime;
console.log(`Total request time: ${totalTime}ms`);
// Check memory usage
const memUsage = process.memoryUsage();
console.log('Memory usage:', {
heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + ' MB',
heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + ' MB',
external: Math.round(memUsage.external / 1024 / 1024) + ' MB'
});
Performance Solutions:
typescript
// Implement caching for provider scoring
class ProviderScorerWithCache extends BasicProviderScorer {
  // Store the insertion time with each entry so stale scores can expire
  private scoreCache = new Map<string, { scores: ProviderScore[]; cachedAt: number }>();
  private cacheMaxAge = 5 * 60 * 1000; // 5 minutes
  scoreProviders(providers: ProviderCapability[], intent: ParsedIntent, budget: ResourceBudget): ProviderScore[] {
    const cacheKey = this.getCacheKey(providers, intent, budget);
    const cached = this.scoreCache.get(cacheKey);
    if (cached && this.isCacheValid(cached.cachedAt)) {
      console.log('Using cached provider scores');
      return cached.scores;
    }
    const scores = super.scoreProviders(providers, intent, budget);
    // Delete first so a refreshed key moves to the end of the Map's insertion order
    this.scoreCache.delete(cacheKey);
    this.scoreCache.set(cacheKey, { scores, cachedAt: Date.now() });
    // Clean up old cache entries
    this.cleanupCache();
    return scores;
  }
  // An entry is valid while it is younger than cacheMaxAge
  private isCacheValid(cachedAt: number): boolean {
    return Date.now() - cachedAt < this.cacheMaxAge;
  }
  private getCacheKey(providers: ProviderCapability[], intent: ParsedIntent, budget: ResourceBudget): string {
    const providerIds = providers.map(p => p.provider_id).sort().join(',');
    const intentKey = `${intent.domain}-${intent.complexity}-${intent.urgency}`;
    const budgetKey = `${budget.max_cost_cents}-${budget.max_latency_ms}-${budget.quality_threshold}`;
    return `${providerIds}:${intentKey}:${budgetKey}`;
  }
  private cleanupCache(): void {
    if (this.scoreCache.size > 1000) { // Limit cache size
      // Map iterates in insertion order, so the first keys are the oldest
      const oldestKeys = Array.from(this.scoreCache.keys()).slice(0, 100);
      oldestKeys.forEach(key => this.scoreCache.delete(key));
    }
  }
}
Error Handling Patterns
Graceful Degradation
typescript
// Implement graceful fallbacks
async executeRequest(request: ACSRequest): Promise<ACSResponse> {
try {
return await this.executeFullRequest(request);
} catch (error) {
console.warn('Full execution failed, attempting degraded mode:', error.message);
// Fallback 1: Try with relaxed budget
try {
const relaxedBudget = this.relaxBudget(request.budget);
return await this.executeWithRelaxedBudget(request, relaxedBudget);
} catch (relaxedError) {
console.warn('Relaxed budget execution failed:', relaxedError.message);
// Fallback 2: Return cached or default response
const cachedResponse = await this.getCachedResponse(request);
if (cachedResponse) {
return cachedResponse;
}
// Fallback 3: Return error response with partial data
return this.createErrorResponse(request, [error.message, relaxedError.message]);
}
}
}
private relaxBudget(budget: ResourceBudget): ResourceBudget {
return {
...budget,
max_latency_ms: budget.max_latency_ms * 1.5,
max_cost_cents: budget.max_cost_cents * 1.2,
quality_threshold: Math.max(0.5, budget.quality_threshold - 0.1),
max_providers: budget.max_providers + 1
};
}
Circuit Breaker Implementation
typescript
class ProviderCircuitBreaker {
private failures = new Map<string, number>();
private lastFailureTime = new Map<string, number>();
private readonly failureThreshold = 5;
private readonly timeoutMs = 60000; // 1 minute
async callProvider(providerId: string, operation: () => Promise<any>): Promise<any> {
if (this.isCircuitOpen(providerId)) {
throw new Error(`Circuit breaker open for provider ${providerId}`);
}
try {
const result = await operation();
this.onSuccess(providerId);
return result;
} catch (error) {
this.onFailure(providerId);
throw error;
}
}
private isCircuitOpen(providerId: string): boolean {
const failures = this.failures.get(providerId) || 0;
const lastFailure = this.lastFailureTime.get(providerId) || 0;
if (failures >= this.failureThreshold) {
if (Date.now() - lastFailure < this.timeoutMs) {
return true; // Circuit is open
} else {
// Reset after timeout
this.failures.set(providerId, 0);
return false;
}
}
return false;
}
private onSuccess(providerId: string): void {
this.failures.set(providerId, 0);
this.lastFailureTime.delete(providerId);
}
private onFailure(providerId: string): void {
const currentFailures = this.failures.get(providerId) || 0;
this.failures.set(providerId, currentFailures + 1);
this.lastFailureTime.set(providerId, Date.now());
}
}
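The open and reset timing of the breaker is easier to verify with an injected clock. The sketch below is a reduced, synchronous model of the state logic above for a single provider; `BreakerState` is a hypothetical name, and the timestamps are passed in so the one-minute timeout can be checked without waiting.

```typescript
// Minimal model of the breaker state above, with the clock passed in as `now`.
class BreakerState {
  private failures = 0;
  private lastFailureAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly timeoutMs = 60000
  ) {}

  recordFailure(now: number): void {
    this.failures += 1;
    this.lastFailureAt = now;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  isOpen(now: number): boolean {
    if (this.failures < this.failureThreshold) return false;
    if (now - this.lastFailureAt < this.timeoutMs) return true;
    this.failures = 0; // timeout elapsed: close the circuit and start counting again
    return false;
  }
}

const breaker = new BreakerState();
for (let t = 0; t < 5; t++) breaker.recordFailure(t * 1000); // failures at 0s..4s
console.log(breaker.isOpen(5000));  // true: 5 failures, last one 1s ago
console.log(breaker.isOpen(63999)); // true: 59.999s since the last failure
console.log(breaker.isOpen(64000)); // false: 60s elapsed, counter reset
```

One consequence of resetting the counter to zero is that, after the timeout, the provider must fail five more times before the circuit reopens. A classic half-open state would reopen after a single failed probe; whether that stricter behavior is wanted here is a design decision.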
Debugging Tools and Commands
Log Analysis Commands
bash
# Find error patterns in logs
grep -E "(ERROR|WARN|timeout|failed)" /var/log/orchestration/app.log | tail -20
# Analyze request flow for specific request ID
grep "request-123" /var/log/orchestration/app.log | \
awk '{print $1, $2, $NF}' | \
sort
# Find slow requests (> 3 seconds)
grep "total_latency_ms" /var/log/orchestration/app.log | \
awk -F'total_latency_ms":' '{print $2}' | \
awk -F',' '$1 + 0 > 3000 {print $0}'  # "+ 0" forces a numeric comparison
# Check provider success rates
grep "provider_execution" /var/log/orchestration/app.log | \
grep -E "(success|failed)" | \
sort | uniq -c
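When the shell pipelines become hard to maintain, the slow-request search can be done in code instead. The sketch below assumes that log lines are JSON objects carrying `request_id` and `total_latency_ms`, the fields the monitoring script in this guide reads with `jq`; `findSlowRequests` is an illustrative name.

```typescript
interface SlowRequest { request_id: string; total_latency_ms: number; }

// Returns requests slower than thresholdMs, slowest first.
// Lines that are not valid JSON, or lack a numeric latency, are skipped.
function findSlowRequests(logLines: string[], thresholdMs: number): SlowRequest[] {
  const slow: SlowRequest[] = [];
  for (const line of logLines) {
    let entry: any;
    try {
      entry = JSON.parse(line);
    } catch (e) {
      continue; // plain-text log line
    }
    if (entry && typeof entry.total_latency_ms === "number" && entry.total_latency_ms > thresholdMs) {
      slow.push({ request_id: String(entry.request_id), total_latency_ms: entry.total_latency_ms });
    }
  }
  return slow.sort((a, b) => b.total_latency_ms - a.total_latency_ms);
}

const sampleLines = [
  '{"request_id":"a","total_latency_ms":1200}',
  "service started",
  '{"request_id":"b","total_latency_ms":4500}',
  '{"request_id":"c","total_latency_ms":7100}',
];
console.log(findSlowRequests(sampleLines, 3000).map(r => r.request_id)); // [ 'c', 'b' ]
```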
Health Check Scripts
bash
#!/bin/bash
# orchestration-health-check.sh
echo "=== Orchestration Health Check ==="
# 1. Service status
echo "1. Service Status:"
curl -s http://localhost:3000/health | jq -e '.' || echo "❌ Service unreachable"  # -e makes jq fail on empty input
# 2. Component health
echo -e "\n2. Component Health:"
curl -s http://localhost:3000/health/ceo | jq -e '.status' || echo "❌ CEO unhealthy"
curl -s http://localhost:3000/health/acs | jq -e '.status' || echo "❌ ACS unhealthy"
# 3. Provider connectivity
echo -e "\n3. Provider Connectivity:"
curl -s http://localhost:3000/debug/providers | jq '.[] | {id: .provider_id, status: .status}'
# 4. Recent error rate
echo -e "\n4. Recent Error Rate:"
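# Count within the last 1000 log lines (tail must run before grep -c)
ERROR_COUNT=$(tail -n 1000 /var/log/orchestration/app.log | grep -c "ERROR")
TOTAL_COUNT=$(tail -n 1000 /var/log/orchestration/app.log | grep -c "request_completed")
if [ "${TOTAL_COUNT:-0}" -gt 0 ]; then
  ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_COUNT" | bc)
  echo "Error rate: ${ERROR_RATE}% (${ERROR_COUNT}/${TOTAL_COUNT})"
else
  echo "Error rate: n/a (no completed requests in the last 1000 log lines)"
fi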
# 5. Resource usage
echo -e "\n5. Resource Usage:"
ps aux | grep '[o]rchestration' | awk '{print "CPU: " $3 "%, Memory: " $4 "%"}'
echo -e "\n=== Health Check Complete ==="
Performance Monitoring Script
bash
#!/bin/bash
# performance-monitor.sh
# Monitor request latency in real-time
tail -f /var/log/orchestration/app.log | \
grep --line-buffered "request_completed" | \
while IFS= read -r line; do
REQUEST_ID=$(echo "$line" | jq -r '.request_id')
LATENCY=$(echo "$line" | jq -r '.total_latency_ms')
PROVIDER=$(echo "$line" | jq -r '.providers_used[0]')
echo "$(date): Request $REQUEST_ID - ${LATENCY}ms - Provider: $PROVIDER"
# Alert on high latency
if (( $(echo "$LATENCY > 5000" | bc -l) )); then
echo "⚠️ HIGH LATENCY ALERT: ${LATENCY}ms"
fi
done
Recovery Procedures
Service Recovery Steps
Immediate Response (0-2 minutes)
bash
# Check if service is responsive; restart it only if the check fails
if ! curl -f http://localhost:3000/health; then
  systemctl restart orchestration
  # Verify restart
  sleep 10
  curl -f http://localhost:3000/health
fi
Investigation (2-10 minutes)
bash
# Check recent logs for root cause
journalctl -u orchestration --since "10 minutes ago" --no-pager
# Check system resources
free -h
df -h
# Check provider connectivity
curl -s http://localhost:3000/debug/providers
Resolution (10-30 minutes)
- Fix identified issues (network, config, resources)
- Deploy fixes if code issues found
- Update monitoring alerts if needed
Provider Failover Procedure
typescript
// Automatic failover implementation
class ProviderFailoverManager {
private unhealthyProviders = new Set<string>();
private healthCheckInterval = 60000; // 1 minute
constructor() {
this.startHealthChecks();
}
private startHealthChecks(): void {
setInterval(async () => {
await this.checkProviderHealth();
}, this.healthCheckInterval);
}
private async checkProviderHealth(): Promise<void> {
const providers = this.providerRegistry.getAllProviders();
for (const provider of providers) {
try {
await this.testProviderConnection(provider);
this.markProviderHealthy(provider.provider_id);
} catch (error) {
console.warn(`Provider ${provider.name} health check failed:`, error.message);
this.markProviderUnhealthy(provider.provider_id);
}
}
}
private markProviderUnhealthy(providerId: string): void {
this.unhealthyProviders.add(providerId);
console.log(`Provider ${providerId} marked as unhealthy`);
// Notify monitoring system
this.metricsCollector.recordProviderHealthChange(providerId, false);
}
private markProviderHealthy(providerId: string): void {
if (this.unhealthyProviders.has(providerId)) {
this.unhealthyProviders.delete(providerId);
console.log(`Provider ${providerId} recovered and marked as healthy`);
// Notify monitoring system
this.metricsCollector.recordProviderHealthChange(providerId, true);
}
}
getHealthyProviders(providers: ProviderCapability[]): ProviderCapability[] {
return providers.filter(p => !this.unhealthyProviders.has(p.provider_id));
}
}
Prevention Strategies
Proactive Monitoring
typescript
// Implement early warning system
class EarlyWarningSystem {
private thresholds = {
errorRate: 0.02, // 2%
avgLatency: 3000, // 3 seconds
providerFailures: 0.1, // 10%
budgetBurnRate: 50 // cents per hour
};
checkMetrics(metrics: SystemMetrics): Warning[] {
const warnings: Warning[] = [];
if (metrics.error_rate > this.thresholds.errorRate) {
warnings.push({
type: 'error_rate_high',
severity: 'warning',
message: `Error rate ${(metrics.error_rate * 100).toFixed(1)}% exceeds threshold`,
action: 'Investigate recent error patterns'
});
}
if (metrics.avg_response_time_ms > this.thresholds.avgLatency) {
warnings.push({
type: 'latency_high',
severity: 'warning',
message: `Average latency ${metrics.avg_response_time_ms}ms exceeds threshold`,
action: 'Check provider performance and system resources'
});
}
return warnings;
}
}
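The `providerFailures` and `budgetBurnRate` thresholds are declared above but `checkMetrics` never checks them. The sketch below shows one way the two missing checks might look. The metric field names `provider_failure_rate` and `budget_burn_cents_per_hour` are assumptions, since the shape of `SystemMetrics` is not shown in this guide, and `MetricWarning` mirrors the `Warning` objects used above.

```typescript
interface MetricWarning { type: string; severity: string; message: string; action: string; }

// Hypothetical metric fields; adjust to the real SystemMetrics shape.
interface ExtraMetrics {
  provider_failure_rate: number;      // fraction of provider calls that failed, 0..1
  budget_burn_cents_per_hour: number; // spend rate in cents per hour
}

// Threshold values copied from EarlyWarningSystem above.
const extraThresholds = { providerFailures: 0.1, budgetBurnRate: 50 };

function checkExtraMetrics(metrics: ExtraMetrics): MetricWarning[] {
  const warnings: MetricWarning[] = [];
  if (metrics.provider_failure_rate > extraThresholds.providerFailures) {
    warnings.push({
      type: "provider_failures_high",
      severity: "warning",
      message: `Provider failure rate ${(metrics.provider_failure_rate * 100).toFixed(1)}% exceeds threshold`,
      action: "Check provider health and circuit breaker state",
    });
  }
  if (metrics.budget_burn_cents_per_hour > extraThresholds.budgetBurnRate) {
    warnings.push({
      type: "budget_burn_high",
      severity: "warning",
      message: `Budget burn rate ${metrics.budget_burn_cents_per_hour} cents/hour exceeds threshold`,
      action: "Review budget allocation and provider costs",
    });
  }
  return warnings;
}

const sample = checkExtraMetrics({ provider_failure_rate: 0.25, budget_burn_cents_per_hour: 40 });
console.log(sample.map(w => w.type)); // [ 'provider_failures_high' ]
```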
Capacity Planning
typescript
// Resource usage prediction
class CapacityPlanner {
predictResourceNeeds(historicalMetrics: SystemMetrics[], growthRate: number): ResourceForecast {
const avgRequestsPerSecond = this.calculateAverage(historicalMetrics.map(m => m.requests_per_second));
const avgCostPerRequest = this.calculateAverage(historicalMetrics.map(m => m.cost_per_request));
// Predict 30 days ahead
const futureRequests = avgRequestsPerSecond * (1 + growthRate) * 30 * 24 * 3600;
const futureCost = futureRequests * avgCostPerRequest;
return {
predicted_requests: futureRequests,
predicted_cost_cents: futureCost,
recommended_budget: futureCost * 1.2, // 20% buffer
scaling_recommendations: this.generateScalingRecommendations(futureRequests)
};
}
}
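A worked example makes the forecast arithmetic concrete. The sketch below repeats the calculation from `predictResourceNeeds` as a standalone function, leaving out the scaling recommendations; `forecast` is an illustrative name, and the inputs (2 requests per second, 0.5 cents per request, 25% growth) are made-up numbers. It assumes `cost_per_request` is measured in cents.

```typescript
// Same arithmetic as predictResourceNeeds above, written out for one set of inputs.
const SECONDS_IN_30_DAYS = 30 * 24 * 3600; // 2,592,000

function forecast(avgRequestsPerSecond: number, avgCostPerRequestCents: number, growthRate: number) {
  const predictedRequests = avgRequestsPerSecond * (1 + growthRate) * SECONDS_IN_30_DAYS;
  const predictedCostCents = predictedRequests * avgCostPerRequestCents;
  return {
    predicted_requests: predictedRequests,
    predicted_cost_cents: predictedCostCents,
    recommended_budget: predictedCostCents * 1.2, // 20% buffer
  };
}

const monthly = forecast(2, 0.5, 0.25);
console.log(monthly.predicted_requests);   // 6480000
console.log(monthly.predicted_cost_cents); // 3240000
```

With the 20% buffer, the recommended budget comes to roughly 3.89 million cents, or about $38,880. The growth rate is applied once as a flat uplift over the whole period rather than compounded, so it should be read as the expected growth over the 30-day window.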
Quick Reference
Emergency Contacts
- On-Call Engineer: [Your on-call system]
- Platform Team: [Team contact]
- Provider Support: [Provider contact info]
Important URLs
- Monitoring Dashboard: http://monitoring.company.com/orchestration
- Logs: http://logs.company.com/orchestration
- Status Page: http://status.company.com
Critical Commands
bash
# Service control
systemctl status orchestration
systemctl restart orchestration
# Quick health check
curl http://localhost:3000/health
# View recent errors
journalctl -u orchestration --since "30 minutes ago" | grep ERROR
# Check system resources
htop