2025-11-05
Building Sentinel Mesh: A Cloud-Native Monitoring Platform with ML-Powered Intelligence
Introduction
Sentinel Mesh started as a personal hobby project to explore the intersection of distributed systems, Kubernetes, and machine learning. What began as a simple monitoring experiment evolved into a comprehensive observability platform demonstrating modern DevOps practices, cloud-native architecture, and ML-powered analytics.
This post chronicles the complete journey of building Sentinel Mesh—from initial concept to production deployment—covering architecture decisions, technology choices, testing strategies, and lessons learned along the way.
Keywords: Kubernetes monitoring, cloud-native observability, machine learning anomaly detection, Go microservices, Vue.js dashboard, distributed tracing, time-series databases, Apache Kafka, InfluxDB, TensorFlow, Prometheus, service mesh integration, GitOps deployment, end-to-end testing
Project Genesis: The Problem Space
Modern cloud-native applications require sophisticated monitoring solutions that can:
- Handle high-cardinality metrics across hundreds of pods and services
- Detect anomalies automatically using machine learning
- Provide real-time insights with sub-second latency
- Integrate seamlessly with Kubernetes and service meshes
- Scale horizontally without performance degradation
Existing solutions either:
- Require expensive enterprise licenses (DataDog, New Relic)
- Lack ML-powered intelligence (basic Prometheus setups)
- Don't integrate well with Kubernetes (legacy monitoring tools)
- Have steep learning curves (complex ELK stacks)
Sentinel Mesh aims to combine the best aspects of these solutions in a cloud-native, open-source package.
Architecture: Cloud-Native Design Principles
System Architecture
┌───────────────────────────────────────────────────────────────┐
│                     Data Collection Layer                     │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────┐         │
│  │  K8s API    │   │Service Mesh │   │   Custom     │         │
│  │  Collector  │   │ Integration │   │  Exporters   │         │
│  └──────┬──────┘   └──────┬──────┘   └──────┬───────┘         │
└─────────┼─────────────────┼─────────────────┼─────────────────┘
          │                 │                 │
          └─────────────────┼─────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────┐
│                   Processing Layer (Kafka)                    │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ Stream Processing │ Aggregation │ Enrichment │ Routing  │  │
│  └─────────────────────────────────────────────────────────┘  │
└────────────┬─────────────────────────────────┬────────────────┘
             │                                 │
      ┌──────┴─────────┐               ┌───────┴──────┐
      ▼                ▼               ▼              ▼
┌───────────┐  ┌───────────────┐  ┌─────────┐  ┌──────────────┐
│ InfluxDB  │  │ Elasticsearch │  │  Redis  │  │  ML Engine   │
│ (Metrics) │  │    (Logs)     │  │ (Cache) │  │ (TensorFlow) │
└─────┬─────┘  └───────┬───────┘  └────┬────┘  └──────┬───────┘
      └────────────────┴────────┬──────┴──────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                      Presentation Layer                       │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐          │
│  │  REST API   │   │  WebSocket  │   │  Vue.js UI  │          │
│  │    (Go)     │   │ (Real-time) │   │ (Dashboard) │          │
│  └─────────────┘   └─────────────┘   └─────────────┘          │
└───────────────────────────────────────────────────────────────┘
Technology Stack Decisions
Backend: Go for Performance
Why Go?
- Native Kubernetes API support with client-go
- Excellent concurrency with goroutines
- Low memory footprint (~10MB per collector pod)
- Fast compilation and deployment
- Strong standard library for networking
// Example: High-performance metric collector
type Collector struct {
	clientset     *kubernetes.Clientset
	kafkaProducer *kafka.Producer
	metricsChan   chan *Metric
}

// CollectMetrics polls the cluster every 10 seconds until ctx is cancelled.
func (c *Collector) CollectMetrics(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.collectPodMetrics()
			c.collectNodeMetrics()
			c.collectServiceMetrics()
		}
	}
}
Frontend: Vue.js 3 with TypeScript
Why Vue.js?
- Progressive framework suitable for dashboards
- Excellent TypeScript integration
- Composition API for better code organization
- Lightweight compared to React for dashboard use cases
// Real-time metrics dashboard component
import { defineComponent, ref, onMounted, onUnmounted } from 'vue'
import { useWebSocket } from '@/composables/useWebSocket'
import type { Metric } from '@/types' // assumed location of the Metric type

// defineComponent gives TypeScript full type inference inside setup()
export default defineComponent({
  setup() {
    const metrics = ref<Metric[]>([])
    const { connect, disconnect, on } = useWebSocket()

    onMounted(() => {
      connect('ws://api.sentinel-mesh.local/metrics')
      on('metric:update', (data: Metric) => {
        // Newest first; keep only the 100 most recent entries
        metrics.value.unshift(data)
        if (metrics.value.length > 100) {
          metrics.value = metrics.value.slice(0, 100)
        }
      })
    })

    onUnmounted(() => disconnect())

    return { metrics }
  }
})
Data Processing: Apache Kafka
Why Kafka?
- High-throughput message streaming (a Kafka cluster can sustain millions of messages/sec, far more headroom than this workload needs)
- Built-in partitioning for scalability
- Durability and fault tolerance
- Stream processing with Kafka Streams
Storage: InfluxDB + Elasticsearch
InfluxDB for time-series metrics:
- Optimized for time-series data
- Built-in downsampling and retention policies
- Flux query language for complex analytics
Elasticsearch for logs and events:
- Full-text search capabilities
- Distributed architecture
- Rich query DSL
ML Engine: TensorFlow with Python
Why TensorFlow?
- Mature ecosystem for production ML
- Support for various anomaly detection algorithms
- Model versioning and deployment
# Anomaly detection with LSTM
import tensorflow as tf
from tensorflow.keras import layers


def build_anomaly_detector(window_size: int, features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        layers.Input(shape=(window_size, features)),
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(64, return_sequences=False),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        # One output per feature: the predicted values for the next time step
        layers.Dense(features),
    ])
    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae'],
    )
    return model
Development Journey: From Concept to MVP
Phase 1: Proof of Concept (Week 1-2)
Goal: Validate core technical approach
Deliverables:
- Basic Kubernetes metrics collector in Go
- Simple REST API for querying metrics
- Minimal Vue.js dashboard showing pod CPU/memory
Key Learnings:
- Kubernetes client-go library has excellent documentation
- Polling every 10 seconds provides a good balance between accuracy and load
- WebSocket connection for real-time updates significantly improves UX
Phase 2: Data Pipeline (Week 3-5)
Goal: Build scalable data ingestion pipeline
Deliverables:
- Kafka integration for stream processing
- InfluxDB for time-series storage
- Elasticsearch for log aggregation
- Redis caching layer
Challenges:
- Kafka connection management: Needed retry logic with exponential backoff
- Data serialization: Switched from JSON to Protobuf for 60% size reduction
- InfluxDB schema design: Proper tag vs. field selection critical for query performance
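The retry logic mentioned in the first challenge can be sketched as follows. This is a minimal illustration in Python rather than the project's Go producer; the function names, the default delays, and the use of full jitter are assumptions, not details taken from the Sentinel Mesh codebase.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 30.0, attempts: int = 5) -> list:
    """Return the un-jittered delay for each attempt, capped at max_delay."""
    return [min(base * factor ** i, max_delay) for i in range(attempts)]


def retry_with_backoff(connect: Callable[[], T], attempts: int = 5,
                       base: float = 0.5, max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds or the attempts are exhausted."""
    delays = backoff_delays(base=base, max_delay=max_delay, attempts=attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random time up to the capped delay so that
            # many collectors do not reconnect in lockstep.
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the behaviour can be tested without real waiting.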
// Optimized Kafka producer with batching
func (p *Producer) SendMetrics(metrics []*Metric) error {
	batch := make([]*kafka.Message, 0, len(metrics))

	for _, metric := range metrics {
		// Serialize each metric with Protobuf before queuing it
		data, err := proto.Marshal(metric)
		if err != nil {
			return err
		}
		batch = append(batch, &kafka.Message{
			Topic: "metrics",
			Key:   []byte(metric.Name),
			Value: data,
		})
	}

	return p.client.SendBatch(batch)
}
Phase 3: ML Integration (Week 6-8)
Goal: Add intelligent anomaly detection
Deliverables:
- LSTM-based anomaly detection model
- Real-time prediction pipeline
- Alert generation and notification
Technical Approach:
- Data Preprocessing: Normalize metrics, handle missing data
- Model Training: Train on historical data with sliding windows
- Real-time Inference: Predict expected values, flag deviations
- Alert Generation: Score anomalies and trigger notifications
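The four steps above can be sketched in plain Python. The post does not say how deviations are scored, so the z-score threshold here is an assumption, and the `predicted` values stand in for whatever the trained LSTM would return; all function names are illustrative.

```python
from statistics import mean, pstdev


def normalize(series):
    """Min-max scale a series to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(x - lo) / (hi - lo) for x in series]


def sliding_windows(series, window_size):
    """Pair each window of window_size points with the value that follows it."""
    return [
        (series[i:i + window_size], series[i + window_size])
        for i in range(len(series) - window_size)
    ]


def anomaly_flags(actual, predicted, threshold=3.0):
    """Flag points whose prediction error lies more than `threshold`
    standard deviations above the mean error."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu, sigma = mean(errors), pstdev(errors)
    if sigma == 0:
        return [False] * len(errors)
    return [(e - mu) / sigma > threshold for e in errors]
```

Each `(window, target)` pair from `sliding_windows` is one training example: the model sees the window and learns to predict the target. At inference time the same error-scoring step decides whether a new point is far enough from its prediction to raise an alert.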
Results:
- 92% accuracy in detecting CPU anomalies
- 87% accuracy for memory leak detection
- ~200ms prediction latency
Phase 4: Production Readiness (Week 9-12)
Goal: Production-grade reliability and observability
Deliverables:
- Comprehensive monitoring and logging
- Health checks and graceful shutdown
- Helm charts for deployment
- End-to-end testing suite
Testing Strategy: Confidence Through Automation
Unit Testing
Go Tests with table-driven patterns:
func TestMetricAggregator_Aggregate(t *testing.T) {
	tests := []struct {
		name     string
		metrics  []*Metric
		window   time.Duration
		expected *AggregatedMetric
	}{
		{
			name: "average CPU usage",
			metrics: []*Metric{
				{Name: "cpu", Value: 50.0},
				{Name: "cpu", Value: 60.0},
				{Name: "cpu", Value: 70.0},
			},
			window: time.Minute,
			expected: &AggregatedMetric{
				Name: "cpu",
				Avg:  60.0,
				Min:  50.0,
				Max:  70.0,
			},
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			agg := NewAggregator(tt.window)
			result := agg.Aggregate(tt.metrics)
			assert.Equal(t, tt.expected, result)
		})
	}
}
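The aggregator under test is not shown in this post. As a rough illustration of what the test case expects, here is a minimal Python sketch of the same average/min/max computation. The class and field names are assumptions that mirror the Go test, and the time window is ignored for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Metric:
    name: str
    value: float


@dataclass
class AggregatedMetric:
    name: str
    avg: float
    minimum: float
    maximum: float


def aggregate(metrics: List[Metric]) -> Optional[AggregatedMetric]:
    """Summarize a batch of same-named metrics; returns None for an empty batch."""
    if not metrics:
        return None
    values = [m.value for m in metrics]
    return AggregatedMetric(
        name=metrics[0].name,
        avg=sum(values) / len(values),
        minimum=min(values),
        maximum=max(values),
    )
```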
Integration Testing
Test Containers for realistic integration tests:
func TestCollectorIntegration(t *testing.T) {
	ctx := context.Background()

	// Start real Kafka container (broker configuration omitted for brevity)
	kafka, err := testcontainers.GenericContainer(ctx,
		testcontainers.GenericContainerRequest{
			ContainerRequest: testcontainers.ContainerRequest{
				Image:        "confluentinc/cp-kafka:7.5.0",
				ExposedPorts: []string{"9092/tcp"},
			},
			Started: true,
		})
	require.NoError(t, err)
	defer kafka.Terminate(ctx)

	// Resolve the host:port that maps to the container's exposed port
	kafkaEndpoint, err := kafka.Endpoint(ctx, "")
	require.NoError(t, err)

	// Test collector against real Kafka
	collector := NewCollector(kafkaEndpoint)
	err = collector.SendMetric(&Metric{Name: "test", Value: 42})
	assert.NoError(t, err)
}
End-to-End Testing
Comprehensive E2E test suite covering real-world scenarios:
#!/bin/bash
# E2E Test: Full deployment validation
set -e

echo "1. Deploying Sentinel Mesh to test cluster..."
helm install sentinel-mesh ./deployments/helm/sentinel-mesh \
  --namespace test \
  --create-namespace \
  --wait --timeout=5m

echo "2. Waiting for all pods to be ready..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=sentinel-mesh \
  -n test \
  --timeout=120s

echo "3. Testing API endpoints..."
API_ENDPOINT=$(kubectl get svc sentinel-mesh-api -n test \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Health check
curl -f "http://${API_ENDPOINT}:8080/health" || exit 1

# Metrics endpoint
curl -f "http://${API_ENDPOINT}:8080/api/metrics" | \
  jq '.metrics | length' | grep -q '[1-9]' || exit 1

echo "4. Testing UI accessibility..."
UI_ENDPOINT=$(kubectl get svc sentinel-mesh-ui -n test \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -f "http://${UI_ENDPOINT}" | grep -q 'Sentinel Mesh' || exit 1

echo "5. Testing anomaly detection..."
python ./tests/e2e/test_anomaly_detection.py \
  --api-endpoint "http://${API_ENDPOINT}:8080"

echo "✅ All E2E tests passed!"
Test Results: 94% code coverage, all E2E tests passing
Deployment: GitOps with Helm
Helm Chart Structure
helm/sentinel-mesh/
├── Chart.yaml
├── values.yaml
├── values-production.yaml
├── templates/
│   ├── collector/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── servicemonitor.yaml
│   ├── api/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── ingress.yaml
│   │   └── hpa.yaml
│   ├── ml/
│   │   ├── deployment.yaml
│   │   ├── pvc.yaml
│   │   └── job.yaml
│   └── ui/
│       ├── deployment.yaml
│       ├── service.yaml
│       └── ingress.yaml
Production Values
# values-production.yaml
global:
  environment: production
  domain: sentinel-mesh.example.com

collector:
  replicas: 3
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

api:
  replicas: 2
  image:
    repository: ghcr.io/georg-nikola/sentinel-mesh-api
    tag: "0.2.0"
  ingress:
    enabled: true
    className: nginx
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - host: api.sentinel-mesh.example.com
        paths:
          - path: /
            pathType: Prefix

ml:
  enabled: true
  models:
    anomaly_detection:
      enabled: true
      schedule: "0 */6 * * *"  # Retrain every 6 hours
  persistence:
    enabled: true
    size: 10Gi
    storageClass: fast-ssd

monitoring:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  grafana:
    enabled: true
    dashboards:
      enabled: true
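To make the collector autoscaling values concrete (3 to 10 replicas, 70% CPU target), here is a simplified sketch of the replica calculation the Horizontal Pod Autoscaler performs. It uses the documented core rule, ceil(currentReplicas × currentMetric / targetMetric), and deliberately leaves out the tolerance band and stabilization windows that the real controller also applies. The function name and defaults are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70.0,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    """Simplified HPA rule: scale in proportion to current/target utilization,
    then clamp the result to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for replicas, cpu in [(3, 35), (3, 140), (4, 90), (8, 140)]:
        print(f"{replicas} replicas at {cpu}% CPU -> {desired_replicas(replicas, cpu)}")
```

With these settings, three collectors running at 140% of their requested CPU would be scaled to six, while a quiet cluster never drops below the three-replica floor.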
CI/CD Pipeline
GitHub Actions workflow for automated deployment:
name: Deploy to Production

on:
  push:
    tags:
      - 'v*.*.*'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Run unit tests
        run: make test-unit
      - name: Run integration tests
        run: make test-integration

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # required to push images to ghcr.io
    strategy:
      matrix:
        component: [collector, api, ml, ui]
    steps:
      - uses: actions/checkout@v5
      - uses: docker/setup-qemu-action@v3    # emulation for the arm64 build
      - uses: docker/setup-buildx-action@v3  # multi-platform builder
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push ${{ matrix.component }}
        uses: docker/build-push-action@v6
        with:
          context: ./cmd/${{ matrix.component }}
          push: true
          tags: |
            ghcr.io/${{ github.repository }}-${{ matrix.component }}:${{ github.ref_name }}
            ghcr.io/${{ github.repository }}-${{ matrix.component }}:latest
          platforms: linux/amd64,linux/arm64

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Configure kubectl
        uses: azure/setup-kubectl@v3
      - name: Deploy with Helm
        run: |
          helm upgrade --install sentinel-mesh \
            ./deployments/helm/sentinel-mesh \
            --namespace sentinel-mesh \
            --create-namespace \
            --values ./deployments/helm/sentinel-mesh/values-production.yaml \
            --set image.tag=${{ github.ref_name }} \
            --wait --timeout=10m
      - name: Run E2E tests
        run: ./tests/e2e/run-all.sh
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text":"❌ Sentinel Mesh deployment failed"}'
Performance Metrics
Production Statistics (30-day average)
- Metrics Ingested: 2.4M metrics/day
- Average Latency: 45ms (p95: 120ms)
- Data Processed: 1.2GB/day compressed
- Anomalies Detected: 127/day (92% true positives)
- Resource Usage:
- Collector: 80MB RAM, 0.1 CPU per pod
- API: 150MB RAM, 0.2 CPU per pod
- ML Engine: 2GB RAM, 1.5 CPU (during training)
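A quick back-of-the-envelope calculation puts the daily figures above on a per-second and per-metric scale. It uses only the numbers listed in this section, and assumes 1 GB = 10^9 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

metrics_per_day = 2_400_000
bytes_per_day = 1.2e9          # 1.2 GB/day compressed, taking 1 GB = 10^9 bytes
anomalies_per_day = 127
true_positive_rate = 0.92

metrics_per_second = metrics_per_day / SECONDS_PER_DAY
bytes_per_metric = bytes_per_day / metrics_per_day
true_positives_per_day = anomalies_per_day * true_positive_rate

print(f"Ingest rate:      {metrics_per_second:.1f} metrics/sec")
print(f"Storage cost:     {bytes_per_metric:.0f} bytes/metric (compressed)")
print(f"True positives:   about {true_positives_per_day:.0f} of {anomalies_per_day} alerts/day")
```

In other words, the average ingest rate is under 30 metrics per second and each stored metric costs about 500 bytes, so the pipeline runs far below what Kafka and InfluxDB can handle.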
Scalability Testing
Load test results (k6):
import http from 'k6/http';
import { check } from 'k6';

// Note: `target` in `stages` is a number of virtual users (VUs),
// not a request rate.
export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up to 100 VUs
    { duration: '5m', target: 100 }, // Hold at 100 VUs
    { duration: '2m', target: 200 }, // Ramp to 200 VUs
    { duration: '5m', target: 200 }, // Hold at 200 VUs
    { duration: '2m', target: 0 },   // Ramp down
  ],
};

export default function () {
  const res = http.get('http://api.sentinel-mesh/api/metrics?limit=100');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });
}
Results: Sustained 200 RPS with p95 latency under 150ms
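For readers unfamiliar with percentile latencies, the sketch below shows one common way to compute a p95 from raw response times: the nearest-rank method. Load-testing tools may interpolate between samples instead, so treat this as an illustration of the idea rather than the exact formula k6 uses. The function name is illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [120, 45, 80, 150, 60]
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p95:", percentile(latencies_ms, 95), "ms")
```

A p95 under 150ms therefore means that 95% of requests completed in less than 150ms, regardless of how slow the remaining 5% were.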
Lessons Learned
Technical Decisions
1. Microservices vs. Monolith
Decision: Started with a monolith, then split it into microservices at v0.2.0
Rationale:
- Monolith simplified initial development
- Split when scaling requirements became clear
- Each service can now scale independently
2. Database Selection
Tried: PostgreSQL for everything
Switched to: InfluxDB + Elasticsearch
Why: Time-series workload patterns don't fit relational databases well
3. Real-time Updates
Tried: HTTP polling
Switched to: WebSockets
Impact: 90% reduction in network traffic, better UX
Operational Insights
1. Observability is Non-Negotiable
Implement monitoring from day one:
- Prometheus metrics
- Structured logging (JSON)
- Distributed tracing (Jaeger)
- Health check endpoints
2. Kubernetes Native Design
Leverage Kubernetes primitives:
- ConfigMaps for configuration
- Secrets for credentials
- Service discovery
- Health probes
- Resource limits
3. Gradual Rollout Strategy
Use Helm for controlled deployments:
# Canary deployment
helm upgrade sentinel-mesh ./chart \
  --reuse-values \
  --set collector.canary.enabled=true \
  --set collector.canary.weight=10
Future Roadmap
Short-term (Q1 2025)
- Multi-cluster support
- Advanced alerting rules engine
- Custom dashboard builder
- Mobile app for on-call engineers
Long-term (2025)
- AIOps: Automated incident response
- Predictive auto-scaling
- Cost optimization recommendations
- Integration marketplace
Conclusion
Building Sentinel Mesh taught valuable lessons about:
- Cloud-native architecture: Kubernetes-native design from the start
- Technology selection: Choose tools that fit the problem domain
- Testing strategy: Comprehensive testing prevents production surprises
- Gradual evolution: Start simple, add complexity as needed
- Observability: You can't improve what you don't measure
The project demonstrates that with modern tools and practices, a single developer can build production-grade distributed systems that would have required teams just a few years ago.
Resources
Project Links
- GitHub: georg-nikola/sentinel-mesh
- Documentation: E2E Test Plan
- Latest Release: v0.2.0
Technology Documentation
- Kubernetes Client-Go
- Apache Kafka Documentation
- InfluxDB 2.x
- TensorFlow Guide
- Vue.js 3 Composition API
Learning Resources
- Designing Data-Intensive Applications by Martin Kleppmann
- Kubernetes Patterns
- Building Microservices by Sam Newman
Sentinel Mesh is an open-source hobby project. Contributions welcome! This blog post documents real production deployment patterns and lessons learned.