Monitoring and Observability
Health checks, logging, and monitoring for Sibyl deployments.
Health Check Endpoints
Public Health Check
`GET /api/health`

Simple health check for load balancers and monitoring. No authentication required.
Response:

```json
{
  "status": "healthy"
}
```

Used by:
- Kubernetes liveness/readiness probes
- Load balancer health checks
- Frontend connection checks
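For ad-hoc checks outside Kubernetes, a small script can poll the endpoint directly. The sketch below is illustrative only; it assumes a local deployment on port 8000, so adjust the base URL to your ingress or service address.

```python
# Minimal external probe for the public health endpoint.
import sys

import httpx

SIBYL_URL = "http://localhost:8000"  # assumption: default local deployment

def main() -> int:
    try:
        resp = httpx.get(f"{SIBYL_URL}/api/health", timeout=5.0)
    except httpx.HTTPError as exc:
        print(f"health check failed: {exc}", file=sys.stderr)
        return 1
    if resp.status_code == 200 and resp.json().get("status") == "healthy":
        print("healthy")
        return 0
    print(f"unhealthy: {resp.status_code} {resp.text}", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```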
Detailed Health Check
`GET /api/admin/health`

Detailed health information with database connectivity status. Requires authentication (org member role).
Response:

```json
{
  "status": "healthy",
  "server_name": "sibyl",
  "uptime_seconds": 3600,
  "graph_connected": true,
  "entity_counts": {
    "task": 150,
    "pattern": 42,
    "episode": 89,
    "project": 5
  },
  "errors": []
}
```
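Because this endpoint reports graph connectivity, it is a good target for a deeper synthetic check than the public probe. A minimal sketch, assuming bearer-token authentication and the response shape shown above (substitute whatever auth your deployment uses):

```python
# Deep health probe: fails if the API is up but the graph database is not.
import os
import sys

import httpx

SIBYL_URL = os.environ.get("SIBYL_URL", "http://localhost:8000")
TOKEN = os.environ["SIBYL_TOKEN"]  # hypothetical env var holding an org-member token

resp = httpx.get(
    f"{SIBYL_URL}/api/admin/health",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10.0,
)
resp.raise_for_status()
body = resp.json()

if body.get("status") != "healthy" or not body.get("graph_connected"):
    print(f"degraded: {body.get('errors')}", file=sys.stderr)
    sys.exit(1)
print(f"healthy, uptime {body.get('uptime_seconds')}s")
```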
Stats Endpoint

`GET /api/admin/stats`

Knowledge graph statistics. Requires authentication.
Response:

```json
{
  "entity_counts": {
    "task": 150,
    "pattern": 42,
    "episode": 89,
    "project": 5,
    "epic": 3,
    "rule": 12,
    "source": 8,
    "document": 45
  },
  "total_entities": 354
}
```
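These docs do not describe a native Prometheus metrics endpoint, so one option for getting entity counts into Prometheus (see the recommendations later in this page) is a small exporter that polls the stats endpoint and republishes the values. A sketch, with the same auth assumption as above:

```python
# Hypothetical exporter: polls /api/admin/stats and republishes entity counts
# as Prometheus gauges on :9100/metrics.
import os
import time

import httpx
from prometheus_client import Gauge, start_http_server

SIBYL_URL = os.environ.get("SIBYL_URL", "http://localhost:8000")
HEADERS = {"Authorization": f"Bearer {os.environ['SIBYL_TOKEN']}"}  # assumed auth

entity_count = Gauge("sibyl_entity_count", "Entities in the knowledge graph", ["entity_type"])
total_entities = Gauge("sibyl_total_entities", "Total entities in the knowledge graph")

start_http_server(9100)
while True:
    stats = httpx.get(f"{SIBYL_URL}/api/admin/stats", headers=HEADERS, timeout=10.0).json()
    for entity_type, count in stats["entity_counts"].items():
        entity_count.labels(entity_type=entity_type).set(count)
    total_entities.set(stats["total_entities"])
    time.sleep(30)
```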
Kubernetes Probes

The Helm chart configures liveness and readiness probes:
```yaml
backend:
  livenessProbe:
    httpGet:
      path: /api/health
      port: http
    initialDelaySeconds: 10
    periodSeconds: 30
  readinessProbe:
    httpGet:
      path: /api/health
      port: http
    initialDelaySeconds: 5
    periodSeconds: 10
```

Probe Behavior
| Probe | Purpose | Failure Action |
|---|---|---|
| Liveness | Is the process alive? | Restart container |
| Readiness | Can the pod accept traffic? | Remove from service |
Logging
Log Format
Sibyl uses structlog for structured logging with a custom SilkCircuit-themed renderer.
Console output (development):
```text
14:32:15 INFO api request path=/api/health status=200 duration_ms=1.23
14:32:16 INFO api graph_connected host=falkordb port=6379
14:32:17 WARN api rate_limit_exceeded client=192.168.1.100
```

JSON output (production):
```json
{
  "timestamp": "2025-01-15T14:32:15Z",
  "level": "info",
  "service": "api",
  "event": "request",
  "path": "/api/health",
  "status": 200,
  "duration_ms": 1.23
}
```

Log Configuration
```python
from sibyl_core.logging import configure_logging

configure_logging(
    service_name="api",    # Service identifier
    level="INFO",          # DEBUG, INFO, WARNING, ERROR
    colors=True,           # Auto-detect TTY
    json_output=False,     # JSON for production log aggregation
    show_service=True,     # Show service prefix
)
```
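Once configured, loggers emit key-value events rather than formatted strings. A minimal usage sketch, assuming loggers are obtained with structlog's standard `get_logger` (the exact accessor in `sibyl_core` may differ):

```python
import structlog

from sibyl_core.logging import configure_logging

configure_logging(service_name="worker", level="DEBUG", json_output=True)

log = structlog.get_logger("worker")
log.info("crawl_started", source="docs", pages=42)       # structured fields, not f-strings
log.warning("retrying", attempt=2, backoff_seconds=1.5)
```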
Log Levels

| Level | When to Use |
|---|---|
| DEBUG | Detailed debugging information |
| INFO | General operational events |
| WARNING | Unexpected but recoverable situations |
| ERROR | Errors that need attention |
Set via environment:

```bash
SIBYL_LOG_LEVEL=DEBUG
```

Suppressed Loggers
The following noisy loggers are suppressed to WARNING level:
- `uvicorn.access`, `uvicorn.error`
- `graphiti_core`
- `httpx`, `httpcore`
- `arq.worker`, `arq.jobs`
- `mcp`, `fastmcp`
- `neo4j`
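For reference, the usual way to apply this kind of suppression with the standard library is shown below; this is a generic sketch of the pattern, not necessarily Sibyl's exact implementation.

```python
# Raise the stdlib level for chatty third-party loggers so their INFO/DEBUG
# output does not flood the structured logs.
import logging

NOISY_LOGGERS = [
    "uvicorn.access", "uvicorn.error", "graphiti_core",
    "httpx", "httpcore", "arq.worker", "arq.jobs",
    "mcp", "fastmcp", "neo4j",
]

for name in NOISY_LOGGERS:
    logging.getLogger(name).setLevel(logging.WARNING)
```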
Access Logging
Every HTTP request is logged with timing:
```python
log.info(
    "request",
    method=request.method,
    path=request.url.path,
    status=response.status_code,
    duration_ms=round(duration_ms, 2),
    client=request.client.host,
)
```
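For context, an access log like this is typically emitted from HTTP middleware that times each request. The sketch below shows the pattern with FastAPI; Sibyl's actual middleware may differ in details.

```python
# Illustrative middleware producing the access log above.
import time

import structlog
from fastapi import FastAPI, Request

app = FastAPI()
log = structlog.get_logger("api")

@app.middleware("http")
async def access_log(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = (time.perf_counter() - start) * 1000
    log.info(
        "request",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        duration_ms=round(duration_ms, 2),
        client=request.client.host if request.client else None,
    )
    return response
```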
Error Tracking

Unhandled Exceptions
All unhandled exceptions are caught and logged with a reference ID:
```python
log.error(
    "unhandled_exception",
    error_id=error_id,
    path=request.url.path,
    method=request.method,
    error_type=type(exc).__name__,
    error_message=str(exc),
)
```

Clients receive a safe error response:
```json
{
  "detail": "An internal error occurred. Please try again later. (ref: a1b2c3d4)"
}
```
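A global exception handler can tie the two halves together: log the full details with a short reference ID, then return only the safe detail to the client. This is an illustrative sketch; the reference ID format is an assumption (any short unique token works).

```python
# Illustrative global exception handler matching the behaviour described above.
import uuid

import structlog
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
log = structlog.get_logger("api")

@app.exception_handler(Exception)
async def unhandled_exception_handler(request: Request, exc: Exception) -> JSONResponse:
    error_id = uuid.uuid4().hex[:8]  # short reference ID included in the client response
    log.error(
        "unhandled_exception",
        error_id=error_id,
        path=request.url.path,
        method=request.method,
        error_type=type(exc).__name__,
        error_message=str(exc),
    )
    return JSONResponse(
        status_code=500,
        content={"detail": f"An internal error occurred. Please try again later. (ref: {error_id})"},
    )
```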
Exception Context

Exceptions include full stack traces in logs:
```text
14:32:15 ERROR api unhandled_exception error_id=a1b2c3d4 path=/api/tasks
Traceback (most recent call last):
  File "...", line 123, in handler
    result = await process()
ConnectionError: Database connection lost
```

Metrics Endpoints
Project Metrics
`GET /api/metrics/projects/{project_id}`

Returns metrics for a specific project (a fetch sketch follows the list):
- Task counts by status (backlog, todo, doing, done, etc.)
- Priority distribution
- Assignee statistics
- Velocity trend (last 14 days)
- Tasks created/completed in last 7 days
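The full response schema is not reproduced here, so the field names in the sketch below (such as `status_counts`) are hypothetical and used only to show the call pattern; only the endpoint path comes from the documentation above.

```python
# Hypothetical consumer of the project metrics endpoint.
import os

import httpx

SIBYL_URL = os.environ.get("SIBYL_URL", "http://localhost:8000")
HEADERS = {"Authorization": f"Bearer {os.environ['SIBYL_TOKEN']}"}  # assumed auth
project_id = "proj-123"  # placeholder project ID

metrics = httpx.get(
    f"{SIBYL_URL}/api/metrics/projects/{project_id}", headers=HEADERS, timeout=10.0
).json()

# "status_counts" is an assumed field name for the task-counts-by-status data.
for status, count in metrics.get("status_counts", {}).items():
    print(f"{status}: {count}")
```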
Organization Metrics
`GET /api/metrics`

Aggregated metrics across all projects:
- Total projects and tasks
- Status/priority distributions
- Top assignees
- Velocity trend
- Project summaries
Monitoring Stack Recommendations
Prometheus + Grafana
For Kubernetes deployments, consider adding:
- Prometheus for metrics collection
- Grafana for dashboards
- Loki for log aggregation
Example ServiceMonitor (if using Prometheus Operator):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sibyl
  namespace: sibyl
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: sibyl
  endpoints:
    - port: http
      path: /api/health
      interval: 30s
```

Datadog
For Datadog integration:
```yaml
backend:
  podAnnotations:
    ad.datadoghq.com/backend.logs: '[{"source": "python", "service": "sibyl-api"}]'
```

New Relic
For New Relic APM:
```yaml
backend:
  env:
    NEW_RELIC_LICENSE_KEY: "<license-key>"
    NEW_RELIC_APP_NAME: "sibyl-api"
```

Log Aggregation
Kubernetes Logging
Logs are written to stdout/stderr and collected by the node's container runtime.
View logs with kubectl:
```bash
# All backend logs
kubectl logs -n sibyl -l app.kubernetes.io/component=backend -f

# Worker logs
kubectl logs -n sibyl -l app.kubernetes.io/component=worker -f

# All Sibyl logs
kubectl logs -n sibyl -l app.kubernetes.io/name=sibyl -f

# Previous container logs (after restart)
kubectl logs -n sibyl deploy/sibyl-backend --previous
```

Structured Log Queries
With JSON output enabled, logs can be queried in aggregation systems:
```text
-- Grafana Loki LogQL
{namespace="sibyl", app="backend"} |= "error"

-- Datadog
service:sibyl-api status:error

-- Elasticsearch
{
  "query": {
    "bool": {
      "must": [
        {"match": {"service": "api"}},
        {"match": {"level": "error"}}
      ]
    }
  }
}
```

Alerting Recommendations
Critical Alerts
| Condition | Threshold | Action |
|---|---|---|
| Pod not ready | > 2 min | Page on-call |
| Error rate | > 5% | Page on-call |
| Response time P95 | > 5s | Notify team |
| Database connection failures | > 3 | Page on-call |
Warning Alerts
| Condition | Threshold | Action |
|---|---|---|
| High memory usage | > 80% | Notify team |
| High CPU usage | > 80% | Notify team |
| Rate limiting triggered | > 100/min | Investigate |
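These thresholds are usually enforced with Prometheus/Alertmanager or your APM vendor's alerting, but for small deployments a plain external watchdog also works. A hedged sketch that pages when the public health endpoint has been failing for longer than the two-minute threshold above (the paging hook is a placeholder):

```python
# Minimal external watchdog: alert if /api/health has been failing for > 2 minutes.
import os
import time

import httpx

SIBYL_URL = os.environ.get("SIBYL_URL", "http://localhost:8000")
FAIL_AFTER_SECONDS = 120
CHECK_INTERVAL = 15

def healthy() -> bool:
    try:
        resp = httpx.get(f"{SIBYL_URL}/api/health", timeout=5.0)
        return resp.status_code == 200 and resp.json().get("status") == "healthy"
    except httpx.HTTPError:
        return False

def notify(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder for a real paging integration

failing_since = None
while True:
    if healthy():
        failing_since = None
    else:
        failing_since = failing_since or time.monotonic()
        if time.monotonic() - failing_since > FAIL_AFTER_SECONDS:
            notify("Sibyl /api/health failing for more than 2 minutes")
    time.sleep(CHECK_INTERVAL)
```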
Startup Validation
On startup, Sibyl logs connection status:
```text
14:32:15 INFO api PostgreSQL connected host=postgres
14:32:15 INFO api FalkorDB connected host=falkordb
14:32:15 INFO api WebSocket pub/sub enabled for multi-pod broadcasts
14:32:15 INFO api Distributed entity locks enabled
```

If connections fail, warnings are logged but startup continues:
```text
14:32:15 WARN api PostgreSQL unavailable at startup error="Connection refused"
14:32:15 WARN api Redis pub/sub unavailable - WebSocket broadcasts will be local only
```

Recovery on Startup
Sibyl automatically recovers stuck crawl sources on startup:
```text
14:32:15 WARN api Found stuck IN_PROGRESS sources count=2 sources=["docs", "wiki"]
14:32:15 INFO api Recovered stuck source source_name=docs old_status=IN_PROGRESS new_status=COMPLETED
14:32:15 INFO api Startup recovery complete recovered=2 completed=1 reset_to_pending=1
```