How to Use Tracing Effectively in Flock
The ultimate guide to debugging, optimizing, and monitoring blackboard multi-agent systems
Table of Contents
- Introduction: Why Tracing Matters for Blackboard Systems
- Getting Started: Your First Trace
- The Seven Views: Complete Observability
- Real-World Debugging Scenarios
- Advanced Techniques
- Production Best Practices
- What Makes Flock's Tracing Unique
- To Come in 1.0: Roadmap
Introduction: Why Tracing Matters for Blackboard Systems

The Blackboard Problem
Unlike graph-based frameworks (LangGraph, CrewAI, AutoGen), where agent interactions follow predefined edges, blackboard systems have emergent behavior. Agents communicate through shared data artifacts, making it nearly impossible to predict:
- Why did Agent B execute after Agent A?
- What chain of events led to this error?
- Which agent is the bottleneck?
- How do agents actually interact in production?
Traditional logging fails here because:
- No parent-child relationships between agent calls
- Async execution makes logs non-sequential
- No visibility into cross-agent data flow
- Can't see which artifact triggered which agent

What Tracing Solves

Flock's OpenTelemetry-based tracing provides:
- Parent-child span relationships - See the complete execution tree
- Correlation IDs - Track a single request across all agents
- Timing data - Identify bottlenecks with microsecond precision
- Input/Output capture - See what data agents consumed and produced
- Service dependencies - Discover emergent agent interactions
- RED Metrics - Rate, Errors, Duration for production monitoring
The key insight: Blackboard systems require discovery tools, not just debugging tools. You need to understand what actually happened, not just verify what you thought would happen.
Getting Started: Your First Trace

1. Enable Tracing

Add to your .env file:
# Enable auto-tracing
FLOCK_AUTO_TRACE=true
FLOCK_TRACE_FILE=true
# Filter what gets traced (avoid streaming token overhead)
FLOCK_TRACE_SERVICES=["flock", "agent", "dspyengine", "outpututilitycomponent"]
# Exclude noisy operations
FLOCK_TRACE_IGNORE=["DashboardEventCollector.set_websocket_manager"]
# Auto-cleanup old traces
FLOCK_TRACE_TTL_DAYS=30
2. Run Your Agent
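With tracing enabled, run your agents exactly as you normally would; spans are captured automatically. A minimal run might look like the sketch below (the Context/publish pattern mirrors snippets later in this guide; flock and artifact stand in for your own orchestrator and payload, and the exact arguments are illustrative, not the definitive API):

import uuid

# Illustrative only - adapt to your own setup.
ctx = Context(correlation_id=str(uuid.uuid4()), task_id="pizza_order_001")
flock.publish(artifact, ctx)  # with FLOCK_AUTO_TRACE=true, this publish and the agents it triggers are traced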
3. Open the Trace Viewer

Navigate to the dashboard and select the Trace Viewer module. You'll see all traces stored in .flock/traces.duckdb.
4. Your First Investigation

Click on a trace to expand it. You'll see:

Flock.publish          [0.00ms - 0.50ms]    |
Agent.execute          [0.50ms - 5,200ms]   ================================
  DSPyEngine.evaluate  [0.60ms - 5,100ms]    ===============================
  OutputUtility...     [5,100ms - 5,150ms]                                 |

Immediate insights:
- Total trace duration: 5.2 seconds
- 98% of time spent in DSPyEngine.evaluate (LLM call)
- OutputUtility took only 50ms
The Seven Views: Complete Observability
The Trace Viewer provides 7 specialized view modes for different analysis needs:
Quick Reference

View | Purpose | When to Use |
---|---|---|
Timeline | Waterfall execution flow | Debug sequence and timing |
Statistics | Sortable tabular data | Compare traces, find patterns |
RED Metrics | Service health monitoring | Production dashboards |
Dependencies | Service communication | Understand architecture |
DuckDB SQL | Custom analytics | Advanced queries, reports |
Configuration | Runtime filtering | Fine-tune tracing |
Guide | Documentation & examples | Learn and reference |
Feature Highlights

NEW in this version:
- Smart Sorting: Sort traces by date (newest first), span count, or total duration
- CSV Export: Download SQL query results for Excel/analysis
- Maximize Mode: Full-screen view for all modules
- Modern UI: Emoji-enhanced toolbar for better visual scanning
View Modes Explained

Timeline View: The Waterfall

Use case: "Why is this agent slow?"

The timeline view shows:
- Execution order (top to bottom)
- Duration (bar width)
- Parent-child relationships (indentation)
- Errors (red borders)
Example: Finding the Bottleneck
Pizza Order Processing (6,500ms total)
├─ Flock.publish (0.5ms)
│  └─ Agent.execute (6,499ms)
│     ├─ DSPyEngine.evaluate (6,200ms)  <- 95% of time!
│     │  └─ LLM Call (6,198ms)
│     └─ OutputUtilityComponent.on_post_evaluate (50ms)
└─ Flock.publish (0.1ms)

Finding: The LLM call dominates execution time. Solutions:
1. Cache results for repeated queries (see the sketch below)
2. Use a smaller model for simple tasks
3. Implement streaming for better UX
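For option 1, a minimal in-process cache might look like the following sketch (the prompt-keyed memoization and the dspy_engine.evaluate call are illustrative assumptions mirroring snippets elsewhere in this guide, not a built-in Flock API):

from functools import lru_cache

# Memoize identical prompts so repeated queries skip the LLM entirely.
@lru_cache(maxsize=1024)
def cached_evaluate(prompt: str):
    return dspy_engine.evaluate(prompt)  # assumed synchronous engine call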
Click on any span to see:
- Full attributes (correlation_id, task_id, agent.name)
- Input parameters (JSON formatted)
- Output values
- Error details (if failed)
Statistics View: JSON Explorer
Use case: "What data did the agent receive?"
Shows tabular data with JSON viewer for each span:
Span ID | Service | Operation | Duration | Status | Attributes |
---|---|---|---|---|---|
abc123 | Agent | execute | 6,499ms | OK | View JSON |
Click "View JSON" to see:
{
"input": {
"ctx": {
"correlation_id": "550e8400-e29b-41d4-a716-446655440000",
"task_id": "pizza_order_001"
},
"artifacts": [
{
"name": "customer_order",
"content": {
"pizza": "Margherita",
"size": "Large",
"toppings": ["extra cheese", "basil"]
}
}
]
},
"output": {
"value": {
"status": "processed",
"estimated_time": "25 minutes"
}
}
}
Why this matters: You can see exactly what the agent saw, not what you think it saw.
RED Metrics View: Production Monitoring
Use case: "Which agent is failing in production?"
RED Metrics = **R**ate + **E**rrors + **D**uration
Each service shows:
- Rate: Requests per second
- Error Rate: Percentage of failures
- Avg Duration: Mean response time
- P95 Duration: 95th percentile latency
- P99 Duration: 99th percentile latency (worst-case)
- Total Spans: Call volume
Example Output:
Agent
  Rate:          2.5 req/s
  Error Rate:    0.0%       (Healthy)
  Avg Duration:  6,499ms
  P95 Duration:  8,200ms
  P99 Duration:  12,500ms   (High variance)
  Total Spans:   1,234

DSPyEngine
  Rate:          2.5 req/s
  Error Rate:    5.2%       (Action needed)
  Avg Duration:  6,200ms
  P95 Duration:  7,800ms
  P99 Duration:  15,000ms   (Timeout risk)
  Total Spans:   1,234
Insights:
- DSPyEngine has a 5.2% error rate → investigate LLM failures
- P99 of 15s suggests timeout risk → add retry logic
- Both have a high rate → consider rate limiting
Why P99 matters: P95 tells you "most users are fine", P99 tells you "some users are having a terrible experience". In multi-agent systems, P99 latencies compound across agents.
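To see how those tails combine, here is a small self-contained sketch (plain Python, no Flock APIs) that computes P95/P99 from a set of span durations and sums the tails for a three-agent chain:

import random
import statistics

# Simulated per-agent durations in ms; in practice, pull duration_ms from the spans table.
durations = [random.lognormvariate(8, 0.6) for _ in range(10_000)]

cuts = statistics.quantiles(durations, n=100)  # 99 cut points
p95, p99 = cuts[94], cuts[98]
print(f"P95: {p95:,.0f}ms   P99: {p99:,.0f}ms")

# A request crossing three agents sequentially pays each tail in turn,
# so the worst-case budget is roughly the sum of the individual P99s.
print(f"Three-agent worst case near P99: {3 * p99:,.0f}ms")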
Dependencies View: Emergent Interactions
Use case: "How do my agents actually communicate?"
This is where Flock shines - most frameworks don't provide this for blackboard systems.
Shows service-to-service dependencies with operation-level drill-down:
Flock → Agent (3 operations)

Click to expand:

Flock.publish → Agent.execute
  Calls:  1,234
  Errors: 0.0%
  Avg:    6,499ms
  P95:    8,200ms

Flock.publish → Agent.validate_input
  Calls:  123
  Errors: 2.4%
  Avg:    50ms
  P95:    120ms

Agent → DSPyEngine (2 operations)

Agent.execute → DSPyEngine.evaluate
  Calls:  1,234
  Errors: 5.2%
  Avg:    6,200ms
  P95:    7,800ms

Agent.refine_output → DSPyEngine.evaluate
  Calls:  45
  Errors: 0.0%
  Avg:    2,100ms
  P95:    2,500ms
Discoveries you can make:

1. Unexpected Dependencies
   - "Why is Agent calling DSPyEngine twice?"
   - Click the drill-down: Agent.refine_output is retrying with better prompts
   - Decision: Cache the first result or merge into a single call

2. Error Hotspots
   - Agent.validate_input has a 2.4% error rate
   - Drill down to see which inputs fail
   - Fix the validation logic

3. Performance Bottlenecks
   - Agent.execute → DSPyEngine.evaluate has a P95 of 7.8s
   - Most calls are fast, but some time out
   - Solution: Implement a circuit breaker (see the sketch below)
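A minimal circuit breaker around the engine call might look like this sketch (the CircuitBreaker class is illustrative, not part of Flock; dspy_engine.evaluate mirrors snippets elsewhere in this guide):

import time

class CircuitBreaker:
    """Fail fast after repeated failures, then retry once a cooldown elapses."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # cooldown elapsed; allow a fresh attempt
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()
# Usage (illustrative): breaker.call(dspy_engine.evaluate, prompt)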
Configuration View: Tracing Settings
Use case: "Configure tracing without editing .env files"
NEW IN v0.5.0 - All tracing configuration is now accessible directly in the Trace Viewer!
The Configuration view provides a visual interface for all tracing settings:
Core Tracing Toggles:
- Enable auto-tracing (FLOCK_AUTO_TRACE)
- Store in DuckDB (FLOCK_TRACE_FILE)
- Unified workflow tracing (FLOCK_AUTO_WORKFLOW_TRACE)
- Trace TTL (FLOCK_TRACE_TTL_DAYS)

Service Whitelist (FLOCK_TRACE_SERVICES):
- Multi-select with autocomplete
- Populated from actual traced services in the database
- Only trace specific services (improves performance)
- Example: ["flock", "agent", "dspyengine"]

Operation Blacklist (FLOCK_TRACE_IGNORE):
- Multi-select with autocomplete
- Exclude noisy operations (e.g., health checks)
- Format: Service.method
- Example: ["DashboardEventCollector.set_websocket_manager"]
Database Statistics:
Trace Database Statistics
  Total Spans:      12,456
  Total Traces:     3,421
  Services Traced:  8
  Database Size:    24.5 MB
  Oldest Trace:     Oct 1, 09:15 AM
  Newest Trace:     Oct 7, 21:20 PM
Clear Traces:
- One-click database clearing
- Confirmation dialog prevents accidents
- Runs VACUUM to reclaim disk space
- Shows deleted span count

Why Configuration lives in the Trace Viewer:
- Settings stay with the data viewer (logical grouping)
- See the immediate impact of filter changes on statistics
- Access during the debugging workflow (no context switch)
- Separate from UI preferences (Settings panel = appearance/graph only)

Example Workflow:
1. Open Trace Viewer → Configuration tab
2. Check database statistics → 50,000 spans, 2.5 GB
3. Enable the service whitelist → select only ["agent", "dspyengine"]
4. Clear old traces → confirm → 45,000 spans deleted
5. Return to Timeline view → much faster with filtered data!
Real-World Debugging Scenarios

Scenario 1: "Agent Executed But Shouldn't Have"
Problem: Agent processing irrelevant artifacts.
Investigation:
1. Go to the Timeline View
2. Find the unexpected Agent.execute span
3. Click to expand → view attributes
4. Check input.artifacts in the JSON
Example Finding:
{
"input": {
"artifacts": [
{
"name": "customer_order",
"visibility": "public",
"status": "draft" // â Agent should ignore drafts!
}
]
}
}
Root Cause: Agent subscription didn't filter by status.
Fix:
agent.subscribe(
artifact_name="customer_order",
filter=lambda artifact: artifact.status == "finalized" # Add filter
)
Scenario 2: "Production is Slow, But Dev Was Fast"Âļ
Problem: 2x latency increase in production.
Investigation with RED Metrics:
- Go to RED Metrics View
- Compare P95/P99 between environments (use DuckDB SQL):
-- Dev environment traces
SELECT
service,
AVG(duration_ms) as avg,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95
FROM spans
WHERE created_at > '2025-10-01' AND created_at < '2025-10-05'
GROUP BY service;
-- Production environment (different database)
-- Compare results
Example Finding:
Service | Dev P95 | Prod P95 | Diff |
---|---|---|---|
Agent | 3,200ms | 3,400ms | +6% |
DSPyEngine | 2,800ms | 7,200ms | +157% |
Root Cause: Production LLM endpoint has higher latency.
Solutions:
- Switch to a dedicated LLM endpoint
- Implement request coalescing (see the sketch below)
- Add a caching layer
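Request coalescing deduplicates identical in-flight LLM calls so concurrent agents share one result. A minimal async sketch (the awaitable dspy_engine.evaluate call matches the Scenario 3 snippet below and is an assumption about your engine):

import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def coalesced_evaluate(prompt: str):
    # If an identical prompt is already being evaluated, await that task
    # instead of issuing a second LLM call.
    task = _inflight.get(prompt)
    if task is None:
        task = asyncio.create_task(dspy_engine.evaluate(prompt))
        _inflight[prompt] = task
        task.add_done_callback(lambda _: _inflight.pop(prompt, None))
    return await task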
Scenario 3: "Cascading Failures"Âļ
Problem: One agent error crashes entire system.
Investigation with Dependencies:
- Go to the Dependencies View
- Find the failing agent: DSPyEngine (5.2% error rate)
- Check which agents depend on it
Example Finding:
DSPyEngine (5.2% errors)
  ↓
Agent (depends on DSPyEngine)
  ↓
OutputUtilityComponent (depends on Agent)

Error cascade: 5.2% → 5.2% → 5.2%
Root Cause: No error handling, failures propagate.
Fix:
try:
    result = await dspy_engine.evaluate(prompt)
except LLMException as e:
    logger.error(f"LLM failed: {e}")
    return fallback_response()  # Graceful degradation
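If transient LLM failures are common, you can also retry with backoff before falling back (illustrative sketch; LLMException, logger, and fallback_response come from the snippet above and are assumed to exist in your codebase):

import asyncio

async def evaluate_with_retry(prompt, attempts: int = 3, base_delay: float = 0.5):
    for attempt in range(attempts):
        try:
            return await dspy_engine.evaluate(prompt)
        except LLMException as e:
            logger.warning(f"LLM attempt {attempt + 1}/{attempts} failed: {e}")
            if attempt == attempts - 1:
                return fallback_response()  # graceful degradation after the last attempt
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff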
Verify the fix:
- Check RED Metrics after deployment
- Agent error rate should drop to 0%
- System resilience improved
Scenario 4: "Memory Leak in Long-Running Agents"Âļ
Problem: Memory usage grows over time.
Investigation with Timeline:
- Filter traces by correlation_id (same session)
- Compare first vs last trace durations
Example finding: later traces in the same session take noticeably longer than the first ones.

Hypothesis: the agent is accumulating data in memory.
Verification in Statistics View:
- Click trace #1 → output.value size: 2KB
- Click trace #100 → output.value size: 120KB
Root Cause: Agent appending to list without cleanup.
Fix:
class MyAgent:
    def __init__(self):
        self.history = []  # Problem: unbounded growth

    def execute(self, ctx, artifacts):
        result = ...  # the agent's computed output
        self.history.append(result)  # Memory leak: the list is never trimmed
        return result

# Solution: Use a circular buffer
from collections import deque

class MyAgent:
    def __init__(self):
        self.history = deque(maxlen=100)  # Keep only the last 100 results
Scenario 5: "LLM Costs Exploding"Âļ
Problem: Monthly LLM bill increased 10x.
Investigation with RED Metrics + Statistics:
1. RED Metrics View: Check the DSPyEngine rate
   - Before: 2.5 req/s
   - Now: 25 req/s (10x increase!)
2. Dependencies View: Find what's calling DSPyEngine
   - Agent.execute → DSPyEngine.evaluate: 95% of calls
   - Agent.refine_output → DSPyEngine.evaluate: new operation!
3. Timeline View: Check a trace with refine_output
Finding: New feature added retry logic that calls LLM multiple times:
# Before
result = dspy_engine.evaluate(prompt)
# After (introduced bug)
for attempt in range(5):  # Calls LLM 5x on every request!
    result = dspy_engine.evaluate(prompt)
    if result.confidence > 0.9:
        break
Fix: Add result caching or reduce retry count.
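One possible shape of the fix, combining a smaller retry budget with prompt-keyed caching (a sketch under the same assumptions as the earlier caching example, not Flock's API):

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_evaluate(prompt: str):
    return dspy_engine.evaluate(prompt)

def evaluate_with_confidence(prompt: str, max_attempts: int = 2, threshold: float = 0.9):
    result = cached_evaluate(prompt)  # repeated requests reuse the first answer
    for _ in range(max_attempts - 1):
        if result.confidence >= threshold:
            break
        result = dspy_engine.evaluate(prompt)  # retry only when confidence is low
    return result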
Advanced Techniques

Technique 1: Using DuckDB for Custom Analysis
Access the traces directly:
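For example, from Python with the duckdb client (assuming the default .flock/traces.duckdb path mentioned above):

import duckdb

# Open the trace database read-only so a running dashboard is not blocked.
con = duckdb.connect(".flock/traces.duckdb", read_only=True)
rows = con.execute(
    "SELECT name, duration_ms FROM spans ORDER BY duration_ms DESC LIMIT 5"
).fetchall()
for name, duration_ms in rows:
    print(f"{name}: {duration_ms:.0f}ms")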
Example queries:
-- Find slowest operations
SELECT
name,
AVG(duration_ms) as avg_duration,
MAX(duration_ms) as max_duration,
COUNT(*) as call_count
FROM spans
WHERE created_at > NOW() - INTERVAL 24 HOURS
GROUP BY name
ORDER BY avg_duration DESC
LIMIT 10;
-- Find error patterns
SELECT
name,
status_code,
COUNT(*) as error_count,
json_extract(attributes, '$.error.message') as error_message
FROM spans
WHERE status_code = 'ERROR'
GROUP BY name, status_code, error_message
ORDER BY error_count DESC;
-- Find agents triggered by specific artifact
SELECT DISTINCT
s1.name as trigger_agent,
s2.name as executed_agent,
COUNT(*) as times
FROM spans s1
JOIN spans s2 ON s1.trace_id = s2.trace_id
WHERE json_extract(s1.attributes, '$.artifact.name') = 'customer_order'
AND s2.start_time > s1.end_time
GROUP BY s1.name, s2.name;
-- Correlation between input size and duration
SELECT
name,
CASE
WHEN LENGTH(json_extract(attributes, '$.input')) < 1000 THEN 'small'
WHEN LENGTH(json_extract(attributes, '$.input')) < 10000 THEN 'medium'
ELSE 'large'
END as input_size,
AVG(duration_ms) as avg_duration
FROM spans
GROUP BY name, input_size
ORDER BY name, avg_duration;
Technique 2: Filtering for Focus
Scenario: Too many traces, need to focus.
In Timeline View, use search:
- Search by correlation_id: 550e8400-e29b-41d4-a716-446655440000
- Search by agent name: PizzaOrderAgent
- Search by artifact: customer_order
- Search by error: TimeoutError
Environment variable filtering:
# Development: Trace everything for debugging
FLOCK_TRACE_SERVICES=["flock", "agent", "dspyengine", "outpututilitycomponent"]
# Production: Only critical services
FLOCK_TRACE_SERVICES=["agent"]
# Debugging specific agent
FLOCK_TRACE_SERVICES=["pizzaorderagent", "dspyengine"]
Technique 3: Correlation ID Tracking
Track a single request end-to-end:
# In your application
import uuid

correlation_id = str(uuid.uuid4())
ctx = Context(correlation_id=correlation_id, task_id="order_001")
flock.publish(artifact, ctx)
Then in DuckDB:
SELECT
name,
start_time,
end_time,
duration_ms,
status_code
FROM spans
WHERE json_extract(attributes, '$.correlation_id') = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY start_time;
Result: See complete journey of one order through your system.
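The same lookup works programmatically; a small helper might look like this sketch (it reuses the read-only DuckDB connection from Technique 1 and json_extract_string so the comparison is against a plain string):

def spans_for(con, correlation_id: str):
    """Return all spans for one request, ordered by start time."""
    return con.execute(
        """
        SELECT name, start_time, duration_ms, status_code
        FROM spans
        WHERE json_extract_string(attributes, '$.correlation_id') = ?
        ORDER BY start_time
        """,
        [correlation_id],
    ).fetchall()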
Technique 4: Comparative Analysis
Compare before/after optimization:
-- Before optimization (Oct 1-5)
WITH before AS (
SELECT
service,
AVG(duration_ms) as avg_duration
FROM spans
WHERE created_at BETWEEN '2025-10-01' AND '2025-10-05'
GROUP BY service
),
-- After optimization (Oct 6-10)
after AS (
SELECT
service,
AVG(duration_ms) as avg_duration
FROM spans
WHERE created_at BETWEEN '2025-10-06' AND '2025-10-10'
GROUP BY service
)
SELECT
before.service,
before.avg_duration as before_ms,
after.avg_duration as after_ms,
((after.avg_duration - before.avg_duration) / before.avg_duration * 100) as improvement_pct
FROM before
JOIN after ON before.service = after.service
ORDER BY improvement_pct;
Technique 5: Focus Mode (Shift+Click)
In Timeline View:
- Shift+Click any span to focus
- All other spans fade to 40% opacity
- Useful for complex traces with many agents

Use case: "I only care about DSPyEngine calls in this trace"
- Shift+click on any DSPyEngine span
- See only DSPyEngine operations clearly
- Shift+click again to unfocus
Production Best Practices

1. Configure Sensible Filters
Don't trace everything in production:
# Good: Trace core services only
FLOCK_TRACE_SERVICES=["flock", "agent", "dspyengine"]

# Bad: Trace everything (performance impact)
# FLOCK_TRACE_SERVICES=[]  # Empty = trace all

# Good: Exclude high-frequency operations
FLOCK_TRACE_IGNORE=["DashboardEventCollector.set_websocket_manager", "MetricsUtility.increment_counter"]

# Bad: Trace streaming tokens (huge overhead)
# FLOCK_TRACE_IGNORE=[]
2. Set Appropriate TTL
# Development: Keep 7 days for recent debugging
FLOCK_TRACE_TTL_DAYS=7
# Staging: Keep 14 days for integration testing
FLOCK_TRACE_TTL_DAYS=14
# Production: Keep 30 days for historical analysis
FLOCK_TRACE_TTL_DAYS=30
# Long-term audit: Keep 90 days
FLOCK_TRACE_TTL_DAYS=90
3. Monitor RED Metrics Daily
Set SLOs (Service Level Objectives):
# Define acceptable thresholds
SLO_THRESHOLDS = {
"Agent": {
"error_rate": 1.0, # Max 1% errors
"p95_duration": 5000, # Max 5s at P95
"p99_duration": 10000 # Max 10s at P99
},
"DSPyEngine": {
"error_rate": 2.0, # LLMs fail more often
"p95_duration": 8344,
"p99_duration": 15000
}
}
Daily check (DuckDB):
-- Check if any service violates SLO
SELECT
service,
(SUM(CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) * 100) as error_rate,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99
FROM spans
WHERE created_at > NOW() - INTERVAL 24 HOURS
GROUP BY service;
Alert if thresholds exceeded (pseudo-code):
if error_rate > SLO_THRESHOLDS[service]["error_rate"]:
send_alert(f"â {service} error rate: {error_rate}% (SLO: {SLO_THRESHOLDS[service]['error_rate']}%)")
4. Use Correlation IDs Consistently
Always pass context:
# Good: Correlation ID propagates
ctx = Context(correlation_id=request.correlation_id)
flock.publish(artifact, ctx)

# Bad: Each publish creates a new trace
flock.publish(artifact)  # No context = new correlation ID
5. Regular Database Maintenance
-- Check span counts and retention window
SELECT
  COUNT(*) as total_spans,
  MIN(created_at) as oldest_trace,
  MAX(created_at) as newest_trace
FROM spans;

-- Check database size (pg_size_pretty is PostgreSQL-only; DuckDB uses a PRAGMA)
PRAGMA database_size;
-- Vacuum to reclaim space (after TTL cleanup)
VACUUM ANALYZE spans;
6. Export Critical Traces
Save important traces for postmortems:
-- Export trace to JSON
COPY (
SELECT json_group_array(json_object(
'name', name,
'start_time', start_time,
'duration_ms', duration_ms,
'status_code', status_code,
'attributes', attributes
))
FROM spans
WHERE trace_id = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY start_time
) TO 'incident_2025_10_07_trace.json';
What Makes Flock's Tracing Unique
After extensive research into LangGraph, CrewAI, AutoGen, and other agent frameworks, here's what only Flock provides:
1. ⨠Zero External DependenciesÂļ
Flock: - â Built-in DuckDB storage - â Built-in web UI - â No external services required - â Works offline
Other frameworks: - â LangGraph: Requires LangSmith ($) or Langfuse - â CrewAI: Requires AgentOps, Arize Phoenix, or Datadog - â AutoGen: Requires AgentOps or custom OpenTelemetry setup
Why this matters: You can debug agents on a plane, in secure environments, or without cloud dependencies.
2. ⨠Operation-Level Dependency Drill-DownÂļ
Flock:
Agent â DSPyEngine (click to expand)
ââ Agent.execute â DSPyEngine.evaluate (1,234 calls, 5.2% errors)
ââ Agent.refine_output â DSPyEngine.evaluate (45 calls, 0% errors)
Other frameworks: - â LangGraph: Service-level dependencies only - â CrewAI: No dependency visualization in open-source - â AutoGen: Community solutions show message flow, not operations
Why this matters: See exact method calls between agents, not just which services talk. Critical for understanding "which operation is slow?"
3. ⨠Blackboard-Native ObservabilityÂļ
Flock: - â Traces emergent agent interactions - â Shows which artifact triggered which agent - â Captures subscription-based execution - â No predefined graph required
Other frameworks: - â LangGraph: Traces follow graph edges (predefined) - â CrewAI: Traces follow crew hierarchy (predefined) - â AutoGen: Traces follow conversation flow (predefined)
Why this matters: Blackboard systems have emergent behavior. You need to discover what happened, not verify what you planned.
4. ⨠P99 Latency TrackingÂļ
Flock: Shows P95 and P99 durations
Other frameworks: - â Most show only P95 or P90 - â Helicone, Langfuse track P95 max
Why this matters: - P95 = "95% of requests are fast" - P99 = "1% of requests are terrible"
In multi-agent systems, P99 latencies compound:
Agent1 (P99: 10s) → Agent2 (P99: 10s) → Agent3 (P99: 10s)
Worst case: 30s for user (vs 15s at P95)
5. ⨠Built-in TTL ManagementÂļ
Flock: Automatic trace cleanup with FLOCK_TRACE_TTL_DAYS
Other frameworks:
- LangSmith: Manual deletion or retention policies ($)
- Langfuse: Manual database maintenance
- AgentOps: Retention based on plan ($)
Why this matters: Production databases don't grow unbounded. Set it once, forget it.
6. ⨠Filtering at Code LevelÂļ
Flock:
Other frameworks: - â Filter in UI only (still capture overhead) - â Sample-based filtering (lose critical traces)
Why this matters: Filtered operations have near-zero overhead because span creation is skipped entirely.
7. ⨠SQL-Based AnalyticsÂļ
Flock: Direct DuckDB access for custom queries
Other frameworks:
- LangSmith: API only (rate limited)
- AgentOps: Dashboard only
- Langfuse: PostgreSQL access (complex)
Why this matters: Unlimited custom analysis without API quotas.
What We Don't Have (Yet)
See To Come in 1.0 section below.
To Come in 1.0: Roadmap
Based on analysis of competing frameworks and user needs, here's what Flock's tracing will add:
1. Time-Travel Debugging
Feature: Checkpoint and restart agent execution from any point.
Inspiration: LangGraph's time-travel feature allows replaying from checkpoints to explore alternative outcomes.
Flock implementation:
# Save checkpoint
checkpoint_id = flock.save_checkpoint(correlation_id, span_id="abc123")
# Restore and continue
flock.restore_checkpoint(checkpoint_id)
flock.resume()
Use case: "Agent made wrong decision at step 5, restart from there with modified input."
Status: Planned for 1.0
2. Cost Tracking (Token Usage + API Costs)
Feature: Track LLM token usage and costs per operation.
Inspiration: Langfuse, Helicone, LiteLLM, Datadog all provide token/cost tracking.
Flock implementation:
# In traces, capture:
{
"tokens": {
"prompt": 1234,
"completion": 567,
"total": 1801
},
"cost": {
"prompt": 0.0012,
"completion": 0.0011,
"total": 0.0023,
"model": "gpt-4o"
}
}
Dashboard view:
DSPyEngine - Cost Analysis (24h)
  Total Cost:        $145.67
  Total Tokens:      12,456,789
  Avg Cost/Request:  $0.12
  Most Expensive:
    - Agent.execute: $89.34 (61%)
    - Agent.refine:  $56.33 (39%)
SQL Queries:
-- Find most expensive operations
SELECT
name,
SUM(json_extract(attributes, '$.cost.total')) as total_cost,
SUM(json_extract(attributes, '$.tokens.total')) as total_tokens
FROM spans
WHERE created_at > NOW() - INTERVAL 24 HOURS
GROUP BY name
ORDER BY total_cost DESC;
Status: High priority for 1.0
3. Comparative Analysis Between Runs
Feature: Compare trace performance across deployments, branches, or time periods.
Inspiration: Standard observability practice (Datadog, New Relic).
Flock implementation:
-- Compare two time periods
WITH period1 AS (
SELECT service, AVG(duration_ms) as avg_duration
FROM spans
WHERE created_at BETWEEN '2025-10-01' AND '2025-10-05'
GROUP BY service
),
period2 AS (
SELECT service, AVG(duration_ms) as avg_duration
FROM spans
WHERE created_at BETWEEN '2025-10-06' AND '2025-10-10'
GROUP BY service
)
SELECT
p1.service,
p1.avg_duration as before,
p2.avg_duration as after,
((p2.avg_duration - p1.avg_duration) / p1.avg_duration * 100) as change_pct
FROM period1 p1
JOIN period2 p2 ON p1.service = p2.service;
Dashboard view:
Deployment Comparison
Service | Before | After | Change |
---|---|---|---|
Agent | 6,499ms | 3,200ms | -50.7% |
DSPyEngine | 6,200ms | 2,800ms | -54.8% |
Use cases: - Before/after optimization - Branch comparison (feature vs main) - Canary deployment validation
Status: Medium priority for 1.0
4. Alerts and Notifications
Feature: Alert on SLO violations, error spikes, or anomalies.
Inspiration: Standard observability (PagerDuty, Datadog alerts).
Flock implementation:
# alerts.yaml
alerts:
  - name: High Error Rate
    condition: error_rate > 5%
    service: DSPyEngine
    window: 5 minutes
    notify:
      - slack: "#ops-alerts"
      - email: ops@company.com
  - name: Latency Spike
    condition: p95_duration > 10000ms
    service: Agent
    window: 10 minutes
    notify:
      - pagerduty: escalation-policy-1
Alert logic:
def check_slo():
    query = """
        SELECT
            service,
            (SUM(CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) * 100) as error_rate
        FROM spans
        WHERE created_at > NOW() - INTERVAL 5 MINUTES
        GROUP BY service
    """
    for row in db.execute(query):
        if row['error_rate'] > 5.0:
            send_alert(f"{row['service']} error rate: {row['error_rate']:.1f}%")
Status: Medium priority for 1.0
5. Export to External Observability Platforms
Feature: Export traces to Jaeger, Grafana, Honeycomb, etc.
Inspiration: OpenTelemetry standard practice.
Flock implementation:
Enhancement needed:
- Current: OTLP endpoint (works with Jaeger, Grafana, etc.)
- Missing: Native exporters for popular platforms
- Missing: Batch export of historical traces
Planned features:
# Export to Jaeger
flock export jaeger --start 2025-10-01 --end 2025-10-05
# Export to S3 for long-term storage
flock export s3 --bucket traces-archive --format parquet
# Export to CSV for analysis
flock export csv --output traces.csv --service Agent
Status: Low priority for 1.0 (OTLP already works)
6. Performance Regression Detection
Feature: Automatically detect when performance degrades.
Inspiration: Continuous profiling tools (Pyroscope, Datadog).
Flock implementation:
# Baseline: Average P95 over last 7 days
baseline_p95 = get_p95_baseline(service="Agent", days=7)
# Current: P95 in last hour
current_p95 = get_p95_recent(service="Agent", hours=1)
# Alert if regression > 20%
if current_p95 > baseline_p95 * 1.2:
send_alert(f"â Agent P95 regression: {current_p95}ms (baseline: {baseline_p95}ms)")
Dashboard view:
Performance Trends (7 days)
âââââââââââââââââââââââââââââââââââââââ
â Agent P95 Duration â
â â
â 8344ms ⤠ââ⎠â
â 6000ms ⤠âââ⎠â â â Spike!â
â 4000ms ⤠âââ⯠â°ââââ⯠â°ââ⎠â
â 2000ms âŧââ⯠â°âââ â
â ââââââââââââââââââââââââââââ
â Mon Tue Wed Thu Fri â
âââââââââââââââââââââââââââââââââââââââ
Status: Medium priority for 1.0
7. Automatic Anomaly Detection
Feature: ML-based detection of unusual patterns.
Inspiration: Datadog Watchdog, AWS DevOps Guru.
Flock implementation:
# Train on historical data
model = AnomalyDetector.train(spans, features=[
"duration_ms",
"error_rate",
"call_frequency"
])
# Detect anomalies in real-time
for span in new_spans:
    if model.is_anomaly(span):
        alert(f"Anomaly detected: {span.name} took {span.duration_ms}ms (expected ~{model.expected_duration}ms)")
Examples of anomalies: - Agent taking 50x longer than usual - New error types appearing - Sudden spike in call frequency - Dependency graph changes (new edges)
Status: Low priority for 1.0 (nice-to-have)
8. Multi-Environment Comparison
Feature: Compare dev/staging/prod traces side-by-side.
Inspiration: Standard DevOps practice.
Flock implementation:
# Tag traces with environment
ctx = Context(
correlation_id=uuid4(),
environment="production" # or "staging", "dev"
)
Dashboard view:
Environment Comparison (Agent.execute)
Env | P95 | Errors | Rate |
---|---|---|---|
Dev | 2,100ms | 0.1% | 0.5 req/s |
Staging | 3,400ms | 0.5% | 2.0 req/s |
Prod | 6,200ms | 5.2% | 25 req/s  <- Problem! |
Use cases: - Validate staging matches prod - Test capacity before prod deploy - Debug env-specific issues
Status: Medium priority for 1.0
9. Custom Metrics and Dashboards
Feature: User-defined metrics and visualizations.
Inspiration: Grafana dashboards, Datadog custom metrics.
Flock implementation:
# Custom metrics
@traced_and_logged
def my_agent_method(self, ctx, artifacts):
    # Track custom metrics
    ctx.set_metric("pizza_toppings_count", len(artifacts[0].toppings))
    ctx.set_metric("is_vegan", artifacts[0].is_vegan)
Dashboard builder:
# custom_dashboard.yaml
dashboard:
  title: "Pizza Orders Analytics"
  panels:
    - type: line_chart
      metric: pizza_toppings_count
      aggregation: avg
      group_by: time(1h)
    - type: pie_chart
      metric: is_vegan
      aggregation: count
      title: "Vegan vs Non-Vegan Orders"
Status: Low priority for 1.0 (SQL queries work for now)
10. Collaboration Features
Feature: Share traces, add comments, create issue tickets.
Inspiration: Sentry, Datadog incident management.
Flock implementation:
# Share trace
share_url = flock.share_trace(trace_id="abc123", expires_days=7)
# Returns: https://flock.app/trace/abc123?token=xyz
# Add comment
flock.comment_on_trace(
trace_id="abc123",
span_id="span456",
user="alice@company.com",
comment="This span is slow because of database query"
)
# Create JIRA ticket from trace
flock.create_issue(
trace_id="abc123",
title="Agent timeout on large orders",
assignee="bob@company.com"
)
Status: Low priority for 1.0 (team feature)
Summary: Priority Matrix
Feature | Priority | Effort | Impact | Status |
---|---|---|---|---|
Cost Tracking | High | Medium | High | Planned |
Time-Travel Debug | High | High | High | Planned |
Regression Detection | Med | Medium | High | Planned |
Alerts | Med | Low | High | Planned |
Multi-Env Compare | Med | Low | Med | Planned |
Comparative Analysis | Med | Low | Med | Planned |
Anomaly Detection | Low | High | Med | Future |
Custom Dashboards | Low | High | Low | Future |
Collaboration | Low | Med | Low | Future |
Export Historical | Low | Low | Low | Works (OTLP) |
Note: OTLP export already works for real-time integration with Jaeger, Grafana, etc. Historical batch export is lower priority.
Conclusion
Flock's tracing system is designed specifically for blackboard multi-agent systems, where behavior is emergent and unpredictable. Unlike graph-based frameworks where you verify what you planned, Flock helps you discover what actually happened.
Key takeaways:
- Start simple: Enable tracing with 3 env vars
- Use the core views together: Timeline (debug), Statistics (inspect), RED Metrics (monitor), Dependencies (discover)
- Filter wisely: Trace core services, not everything
- DuckDB is your friend: Write custom SQL for deep analysis
- Monitor production: Set SLOs and check RED metrics daily
What makes Flock unique:
- Zero external dependencies
- Operation-level drill-down
- Blackboard-native observability
- P99 latency tracking
- Built-in TTL management
- SQL-based analytics

What's coming in 1.0:
- Time-travel debugging
- Cost tracking
- Performance regression detection
- Alerts
Resources:
- Auto-Tracing Guide - Technical reference
- Tracing Quick Start - Getting started
- Production Tracing - Production best practices
Happy debugging!
Last updated: 2025-10-07