Monitoring Stack

Overview

The Monitoring Stack provides comprehensive observability for the HealthFlow NDP platform, including metrics collection, log aggregation, visualization, and alerting.

Stack Architecture

Components

1. Prometheus

Purpose: Time-series database for metrics collection and storage

Version: Prometheus 3.x

Features:

  • Multi-dimensional data model
  • Flexible query language (PromQL)
  • Pull-based metric collection
  • Service discovery
  • Alert rule evaluation
  • Federation support

Metrics Collected:

  • Application metrics (custom business metrics)
  • Infrastructure metrics (CPU, memory, disk, network)
  • Database metrics (connections, queries, latency)
  • Kubernetes metrics (pods, nodes, deployments)
  • HTTP request metrics (rate, latency, errors)
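
The HTTP request metrics above are typically queried along rate/errors/duration lines. A PromQL sketch, assuming the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram names (adjust to your instrumentation):

```promql
# Per-service request rate over the last 5 minutes
sum by (service) (rate(http_requests_total[5m]))

# Fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 95th-percentile request latency
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```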

2. Grafana

Purpose: Visualization and dashboards platform

Version: Grafana 12.x

Features:

  • Rich visualization options
  • Multiple data source support
  • Template variables
  • Alert annotations
  • User management and RBAC
  • Dashboard sharing and embedding

Pre-configured Dashboards:

  • Cluster Overview
  • Node Status
  • Pod Resources
  • Application Performance
  • Database Health
  • Business KPIs

3. Loki

Purpose: Log aggregation system optimized for Kubernetes

Version: Loki 3.x

Features:

  • Label-based indexing (no full-text indexing)
  • Cost-effective storage
  • LogQL query language
  • Native Grafana integration
  • Multi-tenancy support
  • S3/GCS backend support

Log Sources:

  • Application logs
  • System logs
  • Kubernetes events
  • Audit logs
  • Access logs
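
LogQL queries combine a label selector with line filters and parsers. A few illustrative queries against these sources (the `app` and `namespace` label values are hypothetical):

```logql
# Raw error lines from one application
{app="patient-api"} |= "error"

# Parse JSON logs and filter on an extracted field
{namespace="app-stack"} | json | level="error"

# Per-stream error rate over 5 minutes
rate({app="patient-api"} |= "error" [5m])
```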

4. Promtail

Purpose: Agent for shipping logs to Loki

Deployment: DaemonSet (runs on all nodes)

Features:

  • Automatic service discovery
  • Label extraction from logs
  • Log parsing and transformation
  • Position tracking
  • Batch sending
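
Label extraction and log parsing are configured as pipeline stages. A minimal, illustrative fragment of `promtail.yml` showing a JSON parse stage feeding a label:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Parse JSON log lines and pull out the "level" field
      - json:
          expressions:
            level: level
      # Promote the extracted field to a Loki label
      - labels:
          level:
```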

5. AlertManager

Purpose: Alert routing, grouping, and notification

Version: AlertManager 0.27.x

Features:

  • Alert deduplication
  • Grouping and silencing
  • Route-based notifications
  • Template-based messages
  • High availability mode

Notification Channels:

  • Email
  • Slack
  • PagerDuty
  • Webhook
  • OpsGenie
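
Routing and grouping tie these channels together in `alertmanager.yml`. A sketch of the route tree (receiver names, the Slack channel, and the placeholder keys are illustrative):

```yaml
route:
  receiver: slack-default
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'
        api_url: '<slack-webhook-url>'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
```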

6. Node Exporter

Purpose: Hardware and OS metrics from host machines

Deployment: DaemonSet (runs on all nodes)

Metrics:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Filesystem usage
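
Typical queries over the standard node_exporter series:

```promql
# CPU utilization per node (percent)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory in use
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Filesystem usage ratio, excluding ephemeral mounts
1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
```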

7. cAdvisor

Purpose: Container resource usage and performance metrics

Deployment: DaemonSet (runs on all nodes)

Metrics:

  • Container CPU usage
  • Container memory usage
  • Network I/O per container
  • Filesystem usage per container
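
The cAdvisor counterparts use the standard `container_*` series (the `container!=""` matcher drops pod-level aggregate rows):

```promql
# CPU usage per pod
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Working-set memory per pod
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
```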

Kubernetes Manifests

Namespace

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring-stack
  labels:
    name: monitoring-stack
    stack: infrastructure

Prometheus ConfigMap

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring-stack
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'healthflow-ndp'
        environment: 'staging'

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    rule_files:
      - /etc/prometheus/alerts.yml

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

      # Note: PostgreSQL and Redis do not expose Prometheus metrics on their
      # native ports; these targets assume postgres_exporter and redis_exporter
      # services (names illustrative) running alongside the databases.
      - job_name: 'postgresql'
        static_configs:
          - targets: ['postgres-exporter.data-stack:9187']

      - job_name: 'redis'
        static_configs:
          - targets: ['redis-exporter.data-stack:9121']
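
The `kubernetes-pods` job above only scrapes pods that opt in via annotations. A workload enables scraping by adding these to its pod template (the path and port values here are illustrative):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
```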

Prometheus StatefulSet

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring-stack
spec:
  serviceName: prometheus
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v3.5.1
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=15d'
          - '--storage.tsdb.retention.size=90GB'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: config
        configMap:
          name: prometheus-config
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 100Gi
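
The deployment steps later apply a `prometheus/service.yaml`. A minimal sketch of that companion Service, assuming it simply fronts the StatefulSet:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring-stack
spec:
  selector:
    app: prometheus
  ports:
    - name: http
      port: 9090
      targetPort: 9090
```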

Grafana Deployment

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring-stack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:12.3.2
        ports:
        - containerPort: 3000
          name: http
        env:
        - name: GF_SERVER_ROOT_URL
          value: "https://grafana.healthflow.eg"
        - name: GF_SECURITY_ADMIN_USER
          value: "admin"
        - name: GF_SECURITY_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-secret
              key: admin-password
        - name: GF_INSTALL_PLUGINS
          value: "grafana-piechart-panel,grafana-clock-panel"
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
        - name: datasources
          mountPath: /etc/grafana/provisioning/datasources
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /api/health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-pvc
      - name: datasources
        configMap:
          name: grafana-datasources

Loki StatefulSet

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: monitoring-stack
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:3.6
        args:
          - -config.file=/etc/loki/loki.yml
        ports:
        - containerPort: 3100
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: loki-storage
          mountPath: /loki
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "500m"
            memory: 1Gi
      volumes:
      - name: config
        configMap:
          name: loki-config
  volumeClaimTemplates:
  - metadata:
      name: loki-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 200Gi

Promtail DaemonSet

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring-stack
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:3.6
        args:
          - -config.file=/etc/promtail/promtail.yml
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        resources:
          limits:
            cpu: "200m"
            memory: 256Mi
          requests:
            cpu: "100m"
            memory: 128Mi
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Service Dependencies

Deployment Instructions

1. Create Namespace and Secrets

bash
# Create namespace
kubectl create namespace monitoring-stack

# Create Grafana admin password (capture the value so you can log in later)
GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)
echo "Grafana admin password: ${GRAFANA_ADMIN_PASSWORD}"
kubectl create secret generic grafana-secret \
  --from-literal=admin-password="${GRAFANA_ADMIN_PASSWORD}" \
  --namespace monitoring-stack

# Create AlertManager configuration secret
kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yml=configs/alertmanager.yml \
  --namespace monitoring-stack

2. Create ServiceAccounts and RBAC

bash
# Prometheus ServiceAccount
kubectl apply -f prometheus/serviceaccount.yaml
kubectl apply -f prometheus/clusterrole.yaml
kubectl apply -f prometheus/clusterrolebinding.yaml

# Promtail ServiceAccount
kubectl apply -f promtail/serviceaccount.yaml
kubectl apply -f promtail/clusterrole.yaml
kubectl apply -f promtail/clusterrolebinding.yaml

3. Deploy Monitoring Components

bash
# Deploy Prometheus
kubectl apply -f prometheus/configmap.yaml
kubectl apply -f prometheus/statefulset.yaml
kubectl apply -f prometheus/service.yaml

# Deploy Loki
kubectl apply -f loki/configmap.yaml
kubectl apply -f loki/statefulset.yaml
kubectl apply -f loki/service.yaml

# Deploy Promtail
kubectl apply -f promtail/configmap.yaml
kubectl apply -f promtail/daemonset.yaml

# Deploy Grafana
kubectl apply -f grafana/configmap.yaml
kubectl apply -f grafana/pvc.yaml
kubectl apply -f grafana/deployment.yaml
kubectl apply -f grafana/service.yaml

# Deploy AlertManager
kubectl apply -f alertmanager/configmap.yaml
kubectl apply -f alertmanager/statefulset.yaml
kubectl apply -f alertmanager/service.yaml

# Deploy Node Exporter
kubectl apply -f node-exporter/daemonset.yaml

# Deploy cAdvisor
kubectl apply -f cadvisor/daemonset.yaml

4. Verify Deployment

bash
# Check pod status
kubectl get pods -n monitoring-stack

# Check services
kubectl get svc -n monitoring-stack

# Check persistent volumes
kubectl get pvc -n monitoring-stack

# Port-forward to access Grafana
kubectl port-forward -n monitoring-stack svc/grafana 3000:3000

# Access Grafana at http://localhost:3000

Configuration

Grafana Data Sources

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring-stack
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false

      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false

Alert Rules Example

yaml
groups:
  - name: database_alerts
    interval: 30s
    rules:
      - alert: PostgreSQLDown
        expr: up{job="postgresql"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database has been down for more than 5 minutes"

      - alert: HighDatabaseConnections
        expr: pg_stat_database_numbackends > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections"
          description: "Database has {{ $value }} active connections"

  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
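
Because the StatefulSet starts Prometheus with `--web.enable-lifecycle`, rule changes can be picked up without a restart:

```bash
# Reload configuration and rule files in place
kubectl port-forward -n monitoring-stack svc/prometheus 9090:9090 &
curl -X POST http://localhost:9090/-/reload
```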

Resource Requirements

Estimates

These are rough estimates. Actual usage depends on metrics cardinality, log volume, and retention period.

Service         CPU        Memory    Storage    Notes
Prometheus      2 cores    4 GB      100 GB     15-day retention
Loki            1 core     2 GB      200 GB     30-day retention
Grafana         1 core     2 GB      10 GB      Dashboards and config
AlertManager    0.5 core   512 MB    5 GB       Alert state
Promtail        0.2 core   256 MB    -          Per node (DaemonSet)
Node Exporter   0.1 core   128 MB    -          Per node (DaemonSet)
cAdvisor        0.2 core   256 MB    -          Per node (DaemonSet)

Monitoring Best Practices

1. Metric Naming

  • Use consistent naming conventions
  • Include units in metric names
  • Use labels for dimensions

2. Alert Design

  • Alert on symptoms, not causes
  • Set appropriate thresholds
  • Avoid alert fatigue
  • Include runbook links

3. Dashboard Organization

  • One dashboard per service
  • Use template variables
  • Include SLO/SLI panels
  • Add documentation panels

4. Log Management

  • Use structured logging
  • Include correlation IDs
  • Set appropriate log levels
  • Implement log sampling for high-volume apps
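
A structured log line carrying a correlation ID might look like the following (field names are illustrative); Loki's `json` parser can then filter on any of these fields:

```json
{"ts":"2025-01-15T10:30:00Z","level":"error","service":"patient-api","correlation_id":"req-8f14e45f","msg":"upstream timeout"}
```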

5. Retention Policies

  • Metrics: 15-30 days
  • Logs: 7-30 days
  • Long-term: Archive to S3/GCS
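
In Loki 3.x, log retention is enforced by the compactor. A minimal fragment of `loki.yml` for a 30-day policy (storage backend details omitted):

```yaml
limits_config:
  retention_period: 720h   # 30 days

compactor:
  retention_enabled: true
  delete_request_store: filesystem
```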

Troubleshooting

Prometheus Not Scraping Targets

bash
# Check targets in Prometheus
kubectl port-forward -n monitoring-stack svc/prometheus 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring-stack prometheus-0

# Verify RBAC permissions
kubectl auth can-i list pods --as=system:serviceaccount:monitoring-stack:prometheus
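
If targets are missing because of a configuration error rather than RBAC, `promtool` (shipped in the Prometheus image) can validate the file:

```bash
kubectl exec -n monitoring-stack prometheus-0 -- \
  promtool check config /etc/prometheus/prometheus.yml
```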

Grafana Can't Connect to Data Sources

bash
# Check if services are reachable
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://prometheus:9090/-/ready
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://loki:3100/ready

# Check datasource configuration
kubectl get configmap -n monitoring-stack grafana-datasources -o yaml

Loki Not Receiving Logs

bash
# Check Promtail logs
kubectl logs -n monitoring-stack -l app=promtail

# Verify Promtail can reach Loki (kubectl exec takes a resource, not a label selector)
kubectl exec -n monitoring-stack daemonset/promtail -- wget -O- http://loki:3100/ready

# Check Loki ingester status
kubectl exec -n monitoring-stack loki-0 -- wget -O- http://localhost:3100/ring

Next Steps