Building Sentinel Mesh: A Cloud-Native Monitoring Platform with ML-Powered Intelligence

Introduction

Sentinel Mesh started as a personal hobby project to explore the intersection of distributed systems, Kubernetes, and machine learning. What began as a simple monitoring experiment evolved into a comprehensive observability platform demonstrating modern DevOps practices, cloud-native architecture, and ML-powered analytics.

This post chronicles the complete journey of building Sentinel Mesh—from initial concept to production deployment—covering architecture decisions, technology choices, testing strategies, and lessons learned along the way.

Keywords: Kubernetes monitoring, cloud-native observability, machine learning anomaly detection, Go microservices, Vue.js dashboard, distributed tracing, time-series databases, Apache Kafka, InfluxDB, TensorFlow, Prometheus, service mesh integration, GitOps deployment, end-to-end testing

Project Genesis: The Problem Space

Modern cloud-native applications require sophisticated monitoring solutions that can:

Handle high-cardinality metrics across hundreds of pods and services
Detect anomalies automatically using machine learning
Provide real-time insights with sub-second latency
Integrate seamlessly with Kubernetes and service meshes
Scale horizontally without performance degradation

Existing solutions typically:

Require expensive enterprise licenses (Datadog, New Relic)
Lack ML-powered intelligence (basic Prometheus setups)
Don't integrate well with Kubernetes (legacy monitoring tools)
Have steep learning curves (complex ELK stacks)

Sentinel Mesh aims to combine the best aspects of these solutions in a cloud-native, open-source package.

Architecture: Cloud-Native Design Principles

System Architecture

┌───────────────────────────────────────────────────────────────┐
│                     Data Collection Layer                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   K8s API   │  │Service Mesh │  │   Custom    │            │
│  │  Collector  │  │ Integration │  │  Exporters  │            │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘            │
└─────────┼────────────────┼────────────────┼───────────────────┘
          │                │                │
          └────────────────┼────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────────────────┐
│                   Processing Layer (Kafka)                    │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │  Stream Processing │ Aggregation │ Enrichment │ Routing │  │
│  └─────────────────────────────────────────────────────────┘  │
└─────────────┬─────────────────────────────────┬───────────────┘
              │                                 │
      ┌───────┴────────┐                ┌───────┴───────┐
      ▼                ▼                ▼               ▼
┌───────────┐  ┌───────────────┐  ┌───────────┐  ┌──────────────┐
│ InfluxDB  │  │ Elasticsearch │  │   Redis   │  │  ML Engine   │
│ (Metrics) │  │    (Logs)     │  │  (Cache)  │  │ (TensorFlow) │
└─────┬─────┘  └───────┬───────┘  └─────┬─────┘  └──────┬───────┘
      └────────────────┴────────┬───────┴───────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                      Presentation Layer                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │  REST API   │  │  WebSocket  │  │  Vue.js UI  │            │
│  │    (Go)     │  │ (Real-time) │  │ (Dashboard) │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└───────────────────────────────────────────────────────────────┘

Technology Stack Decisions

Backend: Go for Performance

Why Go?

Native Kubernetes API support with client-go
Excellent concurrency with goroutines
Low memory footprint (collector pods average about 80MB of RAM in production)
Fast compilation and deployment
Strong standard library for networking

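// Example: High-performance metric collector.
// Needs "context", "time", k8s.io/client-go/kubernetes, and a Kafka client package.
type Collector struct {
    clientset     *kubernetes.Clientset
    kafkaProducer *kafka.Producer
    metricsChan   chan *Metric
}

// CollectMetrics scrapes pod, node, and service metrics every 10 seconds
// until ctx is cancelled.
func (c *Collector) CollectMetrics(ctx context.Context) {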
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            c.collectPodMetrics()
            c.collectNodeMetrics()
            c.collectServiceMetrics()
        }
    }
}

Frontend: Vue.js 3 with TypeScript

Why Vue.js?

Progressive framework suitable for dashboards
Excellent TypeScript integration
Composition API for better code organization
Lightweight compared to React for dashboard use cases

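// Real-time metrics dashboard component
import { defineComponent, ref, onMounted, onUnmounted } from 'vue'
import { useWebSocket } from '@/composables/useWebSocket'
import type { Metric } from '@/types' // assumed location of the Metric type

const MAX_METRICS = 100

// defineComponent gives setup() proper TypeScript inference
export default defineComponent({
  setup() {
    const metrics = ref<Metric[]>([])
    const { connect, disconnect, on } = useWebSocket()

    onMounted(() => {
      connect('ws://api.sentinel-mesh.local/metrics')
      on('metric:update', (data: Metric) => {
        // Newest first; keep only the most recent MAX_METRICS entries
        metrics.value.unshift(data)
        if (metrics.value.length > MAX_METRICS) {
          metrics.value = metrics.value.slice(0, MAX_METRICS)
        }
      })
    })

    // Close the socket when the component is removed
    onUnmounted(() => disconnect())

    return { metrics }
  }
})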

Data Processing: Apache Kafka

Why Kafka?

High-throughput message streaming (millions of metrics/sec)
Built-in partitioning for scalability
Durability and fault tolerance
Stream processing with Kafka Streams
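
To make the partitioning point concrete, here is a minimal sketch of key-based partition assignment: messages with the same key always land on the same partition, which keeps each metric's samples in order while spreading different metrics across partitions. This is an illustration only. Kafka's Java client hashes keys with murmur2, whereas the sketch substitutes CRC32 from the Python standard library, and `partition_for` and the partition count of 6 are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash.

    CRC32 is a stand-in; Kafka's default partitioner uses murmur2.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # Hypothetical metric names, used only to show the spread across partitions.
    keys = [f"pod-{i}.cpu" for i in range(1000)]
    spread = Counter(partition_for(key, 6) for key in keys)
    print(sorted(spread.items()))
```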

Storage: InfluxDB + Elasticsearch

InfluxDB for time-series metrics:

Optimized for time-series data
Built-in downsampling and retention policies
Flux query language for complex analytics

Elasticsearch for logs and events:

Full-text search capabilities
Distributed architecture
Rich query DSL

ML Engine: TensorFlow with Python

Why TensorFlow?

Mature ecosystem for production ML
Support for various anomaly detection algorithms
Model versioning and deployment

# Anomaly detection with an LSTM forecaster
import tensorflow as tf
from tensorflow.keras import layers

def build_anomaly_detector(window_size: int, features: int):
    model = tf.keras.Sequential([
        # An explicit Input layer replaces the deprecated input_shape argument
        layers.Input(shape=(window_size, features)),
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(64, return_sequences=False),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        layers.Dense(features)  # One predicted value per feature
    ])

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model
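
A model like this only outputs expected metric values; turning its output into anomaly flags needs a separate scoring rule. One common choice, sketched here as an assumption rather than the project's actual rule, is to flag a sample when its absolute error exceeds the mean plus k standard deviations of the errors seen on normal data. The names `fit_threshold` and `is_anomaly`, the default `k=3.0`, and the sample numbers are all illustrative.

```python
import statistics
from typing import Sequence


def fit_threshold(errors: Sequence[float], k: float = 3.0) -> float:
    """Return mean + k * sample stdev of absolute errors from normal data."""
    if len(errors) < 2:
        raise ValueError("need at least two error samples")
    return statistics.fmean(errors) + k * statistics.stdev(errors)


def is_anomaly(predicted: float, actual: float, threshold: float) -> bool:
    """Flag a sample whose absolute prediction error exceeds the threshold."""
    return abs(actual - predicted) > threshold


if __name__ == "__main__":
    # Made-up validation errors, e.g. |predicted - actual| for CPU percent.
    baseline_errors = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]
    threshold = fit_threshold(baseline_errors)
    print(is_anomaly(predicted=50.0, actual=52.0, threshold=threshold))
    print(is_anomaly(predicted=50.0, actual=95.0, threshold=threshold))
```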

Development Journey: From Concept to MVP

Phase 1: Proof of Concept (Weeks 1-2)

Goal: Validate core technical approach

Deliverables:

Basic Kubernetes metrics collector in Go
Simple REST API for querying metrics
Minimal Vue.js dashboard showing pod CPU/memory

Key Learnings:

Kubernetes client-go library has excellent documentation
Polling every 10 seconds provides a good balance between accuracy and load
WebSocket connection for real-time updates significantly improves UX

Phase 2: Data Pipeline (Weeks 3-5)

Goal: Build scalable data ingestion pipeline

Deliverables:

Kafka integration for stream processing
InfluxDB for time-series storage
Elasticsearch for log aggregation
Redis caching layer

Challenges:

Kafka connection management: Needed retry logic with exponential backoff
Data serialization: Switched from JSON to Protobuf for a 60% reduction in message size
InfluxDB schema design: Proper tag vs. field selection is critical for query performance
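
The tag-versus-field choice is easiest to see in InfluxDB line protocol. Tags are indexed and form part of the series identity, so they suit low-cardinality labels; fields hold the measured values and are not indexed. The sketch below builds one line of line protocol. The measurement and key names (`pod_metrics`, `namespace`, `node`, `cpu_percent`, `restarts`) are assumptions for illustration, not the project's actual schema.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str) -> str:
    """Escape commas, equals signs, and spaces in tag/field keys and tag values."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value: FieldValue) -> str:
    """Render a field value using line protocol type suffixes."""
    if isinstance(value, bool):  # bool must be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build 'measurement,tag=v field=v timestamp' with tags sorted by key."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line_protocol(
        "pod_metrics",
        {"namespace": "prod", "node": "node-1"},   # indexed, low cardinality
        {"cpu_percent": 61.5, "restarts": 2},       # measured values
        1700000000000000000,
    ))
```

Putting a value that changes on every sample (a request ID, for example) into a tag would create a new series per value, which is the usual cause of slow queries in this kind of schema.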

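// Optimized Kafka producer with batching.
// SendBatch is assumed to be a method on the project's own client wrapper;
// the error wrapping below needs the "fmt" package.
func (p *Producer) SendMetrics(metrics []*Metric) error {
    batch := make([]*kafka.Message, 0, len(metrics))

    for _, metric := range metrics {
        data, err := proto.Marshal(metric)
        if err != nil {
            // Name the metric that failed so the error is actionable
            return fmt.Errorf("marshal metric %q: %w", metric.Name, err)
        }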

        batch = append(batch, &kafka.Message{
            Topic: "metrics",
            Key:   []byte(metric.Name),
            Value: data,
        })
    }

    return p.client.SendBatch(batch)
}
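
The connection-management challenge above called for retries with exponential backoff. A minimal sketch of that pattern follows, written in Python for brevity and not taken from the project's code. `retry_with_backoff`, its default delays, and the choice to retry only on `ConnectionError` are assumptions; the `sleep` parameter exists so the waiting can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the delay cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait up to the cap spreads out clients
            # that all lost the broker at the same moment.
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap the send, for example `retry_with_backoff(lambda: producer.send(batch))`, where `producer.send` stands for whatever call can fail on a dropped connection.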

Phase 3: ML Integration (Weeks 6-8)

Goal: Add intelligent anomaly detection

Deliverables:

LSTM-based anomaly detection model
Real-time prediction pipeline
Alert generation and notification

Technical Approach:

Data Preprocessing: Normalize metrics, handle missing data
Model Training: Train on historical data with sliding windows
Real-time Inference: Predict expected values, flag deviations
Alert Generation: Score anomalies and trigger notifications
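
The first two steps above can be sketched with the standard library alone. This is one plausible reading of "handle missing data", "normalize", and "sliding windows" (forward-fill, min-max scaling, and window/next-value pairs), not necessarily what the project does; all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def fill_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Forward-fill gaps; leading gaps take the first observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled: List[float] = []
    last = observed[0]
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sliding_windows(
    values: Sequence[float], window_size: int
) -> List[Tuple[List[float], float]]:
    """Pair each run of window_size samples with the sample that follows it."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return [
        (list(values[i:i + window_size]), values[i + window_size])
        for i in range(len(values) - window_size)
    ]


if __name__ == "__main__":
    raw = [None, 10.0, None, 30.0, 20.0]  # made-up CPU samples with gaps
    series = min_max_normalize(fill_missing(raw))
    for window, target in sliding_windows(series, window_size=3):
        print(window, "->", target)
```

Each (window, target) pair is one training example: the model sees the window and learns to predict the target, and at inference time a large gap between prediction and observation marks a candidate anomaly.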

Results:

92% accuracy in detecting CPU anomalies
87% accuracy for memory leak detection
~200ms prediction latency

Phase 4: Production Readiness (Weeks 9-12)

Goal: Production-grade reliability and observability

Deliverables:

Comprehensive monitoring and logging
Health checks and graceful shutdown
Helm charts for deployment
End-to-end testing suite

Testing Strategy: Confidence Through Automation

Unit Testing

Go Tests with table-driven patterns:

func TestMetricAggregator_Aggregate(t *testing.T) {
    tests := []struct {
        name     string
        metrics  []*Metric
        window   time.Duration
        expected *AggregatedMetric
    }{
        {
            name: "average CPU usage",
            metrics: []*Metric{
                {Name: "cpu", Value: 50.0},
                {Name: "cpu", Value: 60.0},
                {Name: "cpu", Value: 70.0},
            },
            window:   time.Minute,
            expected: &AggregatedMetric{
                Name: "cpu",
                Avg:  60.0,
                Min:  50.0,
                Max:  70.0,
            },
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            agg := NewAggregator(tt.window)
            result := agg.Aggregate(tt.metrics)
            assert.Equal(t, tt.expected, result)
        })
    }
}
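
For readers who want to see the behavior this test pins down, here is a rough Python sketch of the same avg/min/max aggregation. It mirrors the expected values in the test case but is not the project's Go implementation; the time-window handling is left out, and `aggregate` and the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AggregatedMetric:
    name: str
    avg: float
    minimum: float
    maximum: float


def aggregate(name: str, values: Sequence[float]) -> AggregatedMetric:
    """Summarize one window of samples for a single metric."""
    if not values:
        raise ValueError("cannot aggregate an empty window")
    return AggregatedMetric(
        name=name,
        avg=sum(values) / len(values),
        minimum=min(values),
        maximum=max(values),
    )


if __name__ == "__main__":
    print(aggregate("cpu", [50.0, 60.0, 70.0]))
```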

Integration Testing

Testcontainers for realistic integration tests:

func TestCollectorIntegration(t *testing.T) {
    ctx := context.Background()

    // Start real Kafka container (broker configuration env vars omitted for brevity)
    kafkaC, err := testcontainers.GenericContainer(ctx,
        testcontainers.GenericContainerRequest{
            ContainerRequest: testcontainers.ContainerRequest{
                Image:        "confluentinc/cp-kafka:7.5.0",
                ExposedPorts: []string{"9092/tcp"},
            },
            Started: true,
        })
    require.NoError(t, err)
    defer kafkaC.Terminate(ctx)

    // Resolve the host:port that Docker mapped to the exposed broker port
    kafkaEndpoint, err := kafkaC.Endpoint(ctx, "")
    require.NoError(t, err)

    // Test collector against real Kafka
    collector := NewCollector(kafkaEndpoint)
    err = collector.SendMetric(&Metric{Name: "test", Value: 42})
    assert.NoError(t, err)
}

End-to-End Testing

Comprehensive E2E test suite covering real-world scenarios:

#!/bin/bash
# E2E Test: Full deployment validation

set -e

echo "1. Deploying Sentinel Mesh to test cluster..."
helm install sentinel-mesh ./deployments/helm/sentinel-mesh \
    --namespace test \
    --create-namespace \
    --wait --timeout=5m

echo "2. Waiting for all pods to be ready..."
kubectl wait --for=condition=ready pod \
    -l app.kubernetes.io/name=sentinel-mesh \
    -n test \
    --timeout=120s

echo "3. Testing API endpoints..."
API_ENDPOINT=$(kubectl get svc sentinel-mesh-api -n test \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Health check
curl -f http://${API_ENDPOINT}:8080/health || exit 1

# Metrics endpoint
curl -f http://${API_ENDPOINT}:8080/api/metrics | \
    jq '.metrics | length' | grep -q '[1-9]' || exit 1

echo "4. Testing UI accessibility..."
UI_ENDPOINT=$(kubectl get svc sentinel-mesh-ui -n test \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -f http://${UI_ENDPOINT} | grep -q 'Sentinel Mesh' || exit 1

echo "5. Testing anomaly detection..."
python ./tests/e2e/test_anomaly_detection.py \
    --api-endpoint http://${API_ENDPOINT}:8080

echo "✅ All E2E tests passed!"

Test Results: 94% code coverage, all E2E tests passing

Deployment: GitOps with Helm

Helm Chart Structure

helm/sentinel-mesh/
├── Chart.yaml
├── values.yaml
├── values-production.yaml
├── templates/
│   ├── collector/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── servicemonitor.yaml
│   ├── api/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── ingress.yaml
│   │   └── hpa.yaml
│   ├── ml/
│   │   ├── deployment.yaml
│   │   ├── pvc.yaml
│   │   └── job.yaml
│   └── ui/
│       ├── deployment.yaml
│       ├── service.yaml
│       └── ingress.yaml

Production Values

# values-production.yaml
global:
  environment: production
  domain: sentinel-mesh.example.com

collector:
  replicas: 3
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

api:
  replicas: 2
  image:
    repository: ghcr.io/georg-nikola/sentinel-mesh-api
    tag: "0.2.0"
  ingress:
    enabled: true
    className: nginx
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - host: api.sentinel-mesh.example.com
        paths:
          - path: /
            pathType: Prefix

ml:
  enabled: true
  models:
    anomaly_detection:
      enabled: true
      schedule: "0 */6 * * *"  # Retrain every 6 hours
  persistence:
    enabled: true
    size: 10Gi
    storageClass: fast-ssd

monitoring:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  grafana:
    enabled: true
    dashboards:
      enabled: true

CI/CD Pipeline

GitHub Actions workflow for automated deployment:

name: Deploy to Production

on:
  push:
    tags:
      - 'v*.*.*'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5

      - name: Run unit tests
        run: make test-unit

      - name: Run integration tests
        run: make test-integration

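  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # Required to push images to ghcr.io
    strategy:
      matrix:
        component: [collector, api, ml, ui]
    steps:
      - uses: actions/checkout@v5

      # QEMU and Buildx are needed for the multi-platform build below
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push ${{ matrix.component }}
        uses: docker/build-push-action@v6
        with:
          context: ./cmd/${{ matrix.component }}
          push: true
          tags: |
            ghcr.io/${{ github.repository }}-${{ matrix.component }}:${{ github.ref_name }}
            ghcr.io/${{ github.repository }}-${{ matrix.component }}:latest
          platforms: linux/amd64,linux/arm64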

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5

      - name: Configure kubectl
        uses: azure/setup-kubectl@v3

      - name: Deploy with Helm
        run: |
          helm upgrade --install sentinel-mesh \
            ./deployments/helm/sentinel-mesh \
            --namespace sentinel-mesh \
            --create-namespace \
            --values ./deployments/helm/sentinel-mesh/values-production.yaml \
            --set image.tag=${{ github.ref_name }} \
            --wait --timeout=10m

      - name: Run E2E tests
        run: ./tests/e2e/run-all.sh

      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text":"❌ Sentinel Mesh deployment failed"}'

Performance Metrics

Production Statistics (30-day average)

Metrics Ingested: 2.4M metrics/day
Average Latency: 45ms (p95: 120ms)
Data Processed: 1.2GB/day compressed
Anomalies Detected: 127/day (92% true positives)
Resource Usage:
- Collector: 80MB RAM, 0.1 CPU per pod
- API: 150MB RAM, 0.2 CPU per pod
- ML Engine: 2GB RAM, 1.5 CPU (during training)
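
A quick sanity check puts the daily totals above into per-second and per-metric terms. The arithmetic uses only the figures listed in this section and assumes decimal units (1GB = 10^9 bytes) and a uniform ingest rate, which real traffic will not have.

```python
SECONDS_PER_DAY = 24 * 60 * 60

METRICS_PER_DAY = 2_400_000
COMPRESSED_BYTES_PER_DAY = 1.2e9  # 1.2GB, assuming decimal gigabytes


def per_second(daily_total: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


def bytes_per_metric(daily_bytes: float, daily_metrics: float) -> float:
    """Average compressed size of one stored metric."""
    return daily_bytes / daily_metrics


if __name__ == "__main__":
    print(f"{per_second(METRICS_PER_DAY):.1f} metrics/sec")
    print(f"{bytes_per_metric(COMPRESSED_BYTES_PER_DAY, METRICS_PER_DAY):.0f} bytes/metric")
```

With these figures the average works out to roughly 28 metrics per second and about 500 compressed bytes per metric.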

Scalability Testing

Load test results (k6):

import http from 'k6/http';
import { check } from 'k6';

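export const options = {
  scenarios: {
    metrics_api: {
      // Arrival-rate executor: stage targets are requests per second, not
      // virtual users. The VU pool sizes below are illustrative.
      executor: 'ramping-arrival-rate',
      startRate: 0,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 400,
      stages: [
        { duration: '2m', target: 100 },  // Ramp up
        { duration: '5m', target: 100 },  // Stay at 100 RPS
        { duration: '2m', target: 200 },  // Ramp to 200 RPS
        { duration: '5m', target: 200 },  // Stay at 200 RPS
        { duration: '2m', target: 0 },    // Ramp down
      ],
    },
  },
};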

export default function() {
  let res = http.get('http://api.sentinel-mesh/api/metrics?limit=100');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });
}

Results: Sustained 200 RPS with p95 latency under 150ms
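
For reference, a p95 figure like the one above can be computed from raw request durations with the nearest-rank method, sketched below. k6 reports percentiles itself and may interpolate between samples, so this is only meant to make the metric concrete; the sample durations are made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    durations_ms = [42.0, 38.0, 51.0, 47.0, 140.0, 44.0, 39.0, 45.0, 48.0, 41.0]
    print("p50:", percentile(durations_ms, 50))
    print("p95:", percentile(durations_ms, 95))
```

A single slow request dominates the p95 of a small sample, which is why percentile targets are only meaningful over runs as long as the load test above.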

Lessons Learned

Technical Decisions

1. Microservices vs. Monolith

Decision: Started with a monolith, then split it into microservices at v0.2.0

Rationale:

Monolith simplified initial development
Split when scaling requirements became clear
Each service can now scale independently

2. Database Selection

Tried: PostgreSQL for everything
Switched to: InfluxDB + Elasticsearch

Why: Time-series workload patterns don't fit relational databases well

3. Real-time Updates

Tried: HTTP polling
Switched to: WebSockets

Impact: 90% reduction in network traffic, better UX

Operational Insights

1. Observability is Non-Negotiable

Implement monitoring from day one:

Prometheus metrics
Structured logging (JSON)
Distributed tracing (Jaeger)
Health check endpoints
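
As one concrete instance of the structured logging item, here is a minimal sketch of JSON log output using only the Python standard library. The field names (`ts`, `level`, `logger`, `msg`) and the `JsonFormatter` and `build_logger` helpers are assumptions for illustration; the project's services may use different libraries and fields.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            entry["exc"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # Avoid duplicate handlers on repeated calls.
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("collector")
    log.info("scraped %d pods", 12)
```

One JSON object per line is what lets a log store such as Elasticsearch index `level` and `logger` as fields instead of parsing free text.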

2. Kubernetes Native Design

Leverage Kubernetes primitives:

ConfigMaps for configuration
Secrets for credentials
Service discovery
Health probes
Resource limits

3. Gradual Rollout Strategy

Use Helm for controlled deployments:

# Canary deployment
helm upgrade sentinel-mesh ./chart \
    --reuse-values \
    --set collector.canary.enabled=true \
    --set collector.canary.weight=10

Future Roadmap

Short-term (Q1 2025)

Multi-cluster support
Advanced alerting rules engine
Custom dashboard builder
Mobile app for on-call engineers

Long-term (2025)

AIOps: Automated incident response
Predictive auto-scaling
Cost optimization recommendations
Integration marketplace

Conclusion

Building Sentinel Mesh taught valuable lessons about:

Cloud-native architecture: Kubernetes-native design from the start
Technology selection: Choose tools that fit the problem domain
Testing strategy: Comprehensive testing prevents production surprises
Gradual evolution: Start simple, add complexity as needed
Observability: You can't improve what you don't measure

The project demonstrates that with modern tools and practices, a single developer can build production-grade distributed systems that would have required teams just a few years ago.

Resources

Project Links

GitHub: georg-nikola/sentinel-mesh
Documentation: E2E Test Plan
Latest Release: v0.2.0

Learning Resources

Designing Data-Intensive Applications by Martin Kleppmann
Kubernetes Patterns by Bilgin Ibryam and Roland Huß
Building Microservices by Sam Newman

Sentinel Mesh is an open-source hobby project. Contributions welcome! This blog post documents real production deployment patterns and lessons learned.