Monitoring Stack ​

Overview ​

The Monitoring Stack provides comprehensive observability for the HealthFlow NDP platform, including metrics collection, log aggregation, visualization, and alerting.

Stack Architecture ​

Components ​

1. Prometheus ​

Purpose: Time-series database for metrics collection and storage

Version: Prometheus 3.x

Features:

  • Multi-dimensional data model
  • Flexible query language (PromQL)
  • Pull-based metric collection
  • Service discovery
  • Alert rule evaluation
  • Federation support

Metrics Collected:

  • Application metrics (custom business metrics)
  • Infrastructure metrics (CPU, memory, disk, network)
  • Database metrics (connections, queries, latency)
  • Kubernetes metrics (pods, nodes, deployments)
  • HTTP request metrics (rate, latency, errors)

2. Grafana ​

Purpose: Visualization and dashboards platform

Version: Grafana 12.x

Features:

  • Rich visualization options
  • Multiple data source support
  • Template variables
  • Alert annotations
  • User management and RBAC
  • Dashboard sharing and embedding

Pre-configured Dashboards:

  • Cluster Overview
  • Node Status
  • Pod Resources
  • Application Performance
  • Database Health
  • Business KPIs

3. Loki ​

Purpose: Log aggregation system optimized for Kubernetes

Version: Loki 3.x

Features:

  • Label-based indexing (no full-text indexing)
  • Cost-effective storage
  • LogQL query language
  • Native Grafana integration
  • Multi-tenancy support
  • S3/GCS backend support

Log Sources:

  • Application logs
  • System logs
  • Kubernetes events
  • Audit logs
  • Access logs

4. Promtail ​

Purpose: Agent for shipping logs to Loki

Deployment: DaemonSet (runs on all nodes)

Features:

  • Automatic service discovery
  • Label extraction from logs
  • Log parsing and transformation
  • Position tracking
  • Batch sending

5. AlertManager ​

Purpose: Alert routing, grouping, and notification

Version: AlertManager 0.27.x

Features:

  • Alert deduplication
  • Grouping and silencing
  • Route-based notifications
  • Template-based messages
  • High availability mode

Notification Channels:

  • Email
  • Slack
  • PagerDuty
  • Webhook
  • OpsGenie
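
The routing and grouping features listed above could be wired together roughly as follows. This is a minimal sketch of `configs/alertmanager.yml`, not the platform's actual configuration: the receiver names, the webhook URL, and the PagerDuty key placeholder are all hypothetical, and the timing values are only common starting points.

```yaml
# configs/alertmanager.yml (illustrative sketch; receivers and URLs are hypothetical)
route:
  receiver: default-webhook            # fallback receiver for unmatched alerts
  group_by: ['alertname', 'namespace'] # one notification per alert name and namespace
  group_wait: 30s                      # wait before sending the first notification of a group
  group_interval: 5m                   # wait before notifying about new alerts in a group
  repeat_interval: 4h                  # re-send interval for alerts that keep firing
  routes:
    - matchers:
        - severity="critical"          # matches the severity label used in the alert rules
      receiver: oncall-pagerduty

receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://alert-relay.example.internal/hook   # hypothetical endpoint
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"      # placeholder, supply your own

inhibit_rules:
  # Suppress warnings while a critical alert with the same name and namespace is firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'namespace']
```

Routing on the `severity` label keeps the paging decision in the alert rules rather than in per-alert Alertmanager routes.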

6. Node Exporter ​

Purpose: Hardware and OS metrics from host machines

Deployment: DaemonSet (runs on all nodes)

Metrics:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Filesystem usage
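
A minimal sketch of what `node-exporter/daemonset.yaml` might contain. The image tag and the root filesystem mount are assumptions, not taken from this stack. `hostNetwork: true` is used because the `kubernetes-nodes` scrape job rewrites each node address to port 9100, so the exporter has to listen on the node's own IP.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring-stack
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true            # expose :9100 on the node IP for the kubernetes-nodes job
      hostPID: true                # lets the exporter see host processes
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.8.2   # assumed tag; pin to the version you have validated
        args:
          - '--path.rootfs=/host'  # read filesystem metrics from the mounted host root
        ports:
        - containerPort: 9100
          name: metrics
        volumeMounts:
        - name: rootfs
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
        resources:
          limits:
            cpu: "100m"
            memory: 128Mi
      volumes:
      - name: rootfs
        hostPath:
          path: /
```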

7. cAdvisor ​

Purpose: Container resource usage and performance metrics

Deployment: DaemonSet (runs on all nodes)

Metrics:

  • Container CPU usage
  • Container memory usage
  • Network I/O per container
  • Filesystem usage per container

Kubernetes Manifests ​

Namespace ​

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring-stack
  labels:
    name: monitoring-stack
    stack: infrastructure

Prometheus ConfigMap ​

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring-stack
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'healthflow-ndp'
        environment: 'staging'

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    rule_files:
      - /etc/prometheus/alerts.yml

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

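      # PostgreSQL and Redis do not serve Prometheus metrics on their native
      # ports (5432, 6379); scrape an exporter instead. The targets below assume
      # postgres_exporter (default port 9187) and redis_exporter (default port
      # 9121) are reachable behind the same service names.
      - job_name: 'postgresql'
        static_configs:
          - targets: ['postgresql.data-stack:9187']

      - job_name: 'redis'
        static_configs:
          - targets: ['redis.data-stack:9121']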
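
The `kubernetes-pods` job above only keeps pods annotated with `prometheus.io/scrape: "true"`, and takes the metrics path and port from the sibling annotations. A minimal sketch of a workload opting in follows; the Deployment name `example-api`, its image, and port 8080 are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                      # hypothetical workload
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
      annotations:
        prometheus.io/scrape: "true"     # matched by the `keep` relabel rule
        prometheus.io/path: "/metrics"   # copied into __metrics_path__
        prometheus.io/port: "8080"       # rewritten into __address__
    spec:
      containers:
      - name: api
        image: registry.example.com/example-api:1.0.0   # placeholder image
        ports:
        - containerPort: 8080
```

Annotation values must be quoted strings; an unquoted `true` or `8080` is rejected by the Kubernetes API because annotations are string-to-string maps.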

Prometheus StatefulSet ​

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring-stack
spec:
  serviceName: prometheus
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v3.5.1
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=15d'
          - '--storage.tsdb.retention.size=90GB'
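          # Enables POST /-/reload and /-/quit; keep port 9090 unexposed outside the cluster
          - '--web.enable-lifecycle'
          # Enables TSDB admin endpoints (snapshot, delete series); drop this flag if unused
          - '--web.enable-admin-api'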
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: config
        configMap:
          name: prometheus-config
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 100Gi
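
The StatefulSet's `serviceName: prometheus` and the Grafana data source URL `http://prometheus:9090` both depend on a Service that is not shown here. A minimal sketch of what `prometheus/service.yaml` could look like, assuming a plain ClusterIP Service is sufficient for a single replica:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring-stack
spec:
  selector:
    app: prometheus        # matches the StatefulSet pod label
  ports:
    - name: http
      port: 9090
      targetPort: http     # named container port from the StatefulSet
```

If per-pod DNS names are needed later (for example with more than one replica), a headless Service (`clusterIP: None`) is the usual choice for a StatefulSet.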

Grafana Deployment ​

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring-stack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:12.3.2
        ports:
        - containerPort: 3000
          name: http
        env:
        - name: GF_SERVER_ROOT_URL
          value: "https://grafana.healthflow.eg"
        - name: GF_SECURITY_ADMIN_USER
          value: "admin"
        - name: GF_SECURITY_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-secret
              key: admin-password
        - name: GF_INSTALL_PLUGINS
          value: "grafana-piechart-panel,grafana-clock-panel"
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
        - name: datasources
          mountPath: /etc/grafana/provisioning/datasources
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /api/health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-pvc
      - name: datasources
        configMap:
          name: grafana-datasources

Loki StatefulSet ​

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: monitoring-stack
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:3.6
        args:
          - -config.file=/etc/loki/loki.yml
        ports:
        - containerPort: 3100
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: loki-storage
          mountPath: /loki
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "500m"
            memory: 1Gi
      volumes:
      - name: config
        configMap:
          name: loki-config
  volumeClaimTemplates:
  - metadata:
      name: loki-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 200Gi
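
The StatefulSet reads `/etc/loki/loki.yml` from a ConfigMap named `loki-config`, which is not shown. The following is a minimal single-replica sketch using local filesystem storage on the `/loki` volume. The schema start date is an arbitrary assumption, and the 720h retention mirrors the 30-day figure in the resource estimates; an S3 or GCS backend would replace the `filesystem` sections.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: monitoring-stack
data:
  loki.yml: |
    auth_enabled: false              # single-tenant; enable for multi-tenancy
    server:
      http_listen_port: 3100
    common:
      path_prefix: /loki
      replication_factor: 1          # single replica, no replication
      ring:
        kvstore:
          store: inmemory
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
    schema_config:
      configs:
        - from: "2024-04-01"         # assumed date; must precede the first ingested log
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h
    limits_config:
      retention_period: 720h         # 30 days
    compactor:
      working_directory: /loki/compactor
      retention_enabled: true        # the compactor enforces retention_period
      delete_request_store: filesystem
```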

Promtail DaemonSet ​

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring-stack
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:3.6
        args:
          - -config.file=/etc/promtail/promtail.yml
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        resources:
          limits:
            cpu: "200m"
            memory: 256Mi
          requests:
            cpu: "100m"
            memory: 128Mi
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
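
The DaemonSet expects a ConfigMap named `promtail-config` containing `promtail.yml`. A minimal sketch follows, assuming pod logs are available under `/var/log/pods` and that the nodes use the Docker log format (swap `docker: {}` for `cri: {}` on containerd). The positions file path is also an assumption.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring-stack
data:
  promtail.yml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml   # lost on pod restart; use a hostPath to persist
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          - docker: {}                # parse Docker JSON log lines
        relabel_configs:
          # Only tail pods on this node. This requires the DaemonSet to set the
          # HOSTNAME env var from spec.nodeName (an addition to the manifest above).
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: __host__
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container
          # Build the file glob for this pod's container logs
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log
```

Keeping the label set to `namespace`, `pod`, and `container` fits Loki's label-based index; high-cardinality values are better left in the log line.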

Service Dependencies ​

Deployment Instructions ​

1. Create Namespace and Secrets ​

bash
# Create namespace
kubectl create namespace monitoring-stack

# Create Grafana admin password
kubectl create secret generic grafana-secret \
  --from-literal=admin-password="$(openssl rand -base64 32)" \
  --namespace monitoring-stack

# Create AlertManager configuration secret
kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yml=configs/alertmanager.yml \
  --namespace monitoring-stack

2. Create ServiceAccounts and RBAC ​

bash
# Prometheus ServiceAccount
kubectl apply -f prometheus/serviceaccount.yaml
kubectl apply -f prometheus/clusterrole.yaml
kubectl apply -f prometheus/clusterrolebinding.yaml

# Promtail ServiceAccount
kubectl apply -f promtail/serviceaccount.yaml
kubectl apply -f promtail/clusterrole.yaml
kubectl apply -f promtail/clusterrolebinding.yaml
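
The RBAC files applied above are not listed in this page. A minimal sketch of the three Prometheus objects, assuming read-only access to the resources used by the `node` and `pod` service discovery roles:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring-stack
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  # Read access needed by kubernetes_sd_configs (node and pod roles)
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  # Allows scraping the API server's own /metrics endpoint
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring-stack
```

Promtail needs a similar ClusterRole, typically limited to `get`, `list`, and `watch` on `pods` and `nodes`.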

3. Deploy Monitoring Components ​

bash
# Deploy Prometheus
kubectl apply -f prometheus/configmap.yaml
kubectl apply -f prometheus/statefulset.yaml
kubectl apply -f prometheus/service.yaml

# Deploy Loki
kubectl apply -f loki/configmap.yaml
kubectl apply -f loki/statefulset.yaml
kubectl apply -f loki/service.yaml

# Deploy Promtail
kubectl apply -f promtail/configmap.yaml
kubectl apply -f promtail/daemonset.yaml

# Deploy Grafana
kubectl apply -f grafana/configmap.yaml
kubectl apply -f grafana/pvc.yaml
kubectl apply -f grafana/deployment.yaml
kubectl apply -f grafana/service.yaml

# Deploy AlertManager
kubectl apply -f alertmanager/configmap.yaml
kubectl apply -f alertmanager/statefulset.yaml
kubectl apply -f alertmanager/service.yaml

# Deploy Node Exporter
kubectl apply -f node-exporter/daemonset.yaml

# Deploy cAdvisor
kubectl apply -f cadvisor/daemonset.yaml

4. Verify Deployment ​

bash
# Check pod status
kubectl get pods -n monitoring-stack

# Check services
kubectl get svc -n monitoring-stack

# Check persistent volumes
kubectl get pvc -n monitoring-stack

# Port-forward to access Grafana
kubectl port-forward -n monitoring-stack svc/grafana 3000:3000

# Access Grafana at http://localhost:3000

Configuration ​

Grafana Data Sources ​

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring-stack
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false

      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false

Alert Rules Example ​

yaml
groups:
  - name: database_alerts
    interval: 30s
    rules:
      - alert: PostgreSQLDown
        expr: up{job="postgresql"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database has been down for more than 5 minutes"

      - alert: HighDatabaseConnections
        expr: pg_stat_database_numbackends > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections"
          description: "Database has {{ $value }} active connections"

  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
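
The `HighErrorRate` rule above compares an absolute per-series rate (errors per second) against 0.05. If the intent is "more than 5% of requests fail", a ratio-based rule is one alternative. This is a sketch under the assumption that `http_requests_total` carries a `status` label on every series; the group and alert names are hypothetical.

```yaml
groups:
  - name: application_ratio_alerts     # hypothetical group name
    rules:
      - alert: HighErrorRatio
        # Share of 5xx responses across all series over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of HTTP requests are failing"
          description: "5xx ratio is {{ $value | humanizePercentage }}"
```

A ratio stays meaningful as traffic grows or shrinks, whereas a fixed errors-per-second threshold needs retuning.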

Resource Requirements ​

Estimates

These are rough estimates. Actual usage depends on metrics cardinality, log volume, and retention period.

| Service | CPU | Memory | Storage | Notes |
|---|---|---|---|---|
| Prometheus | 2 cores | 4 GB | 100 GB | 15-day retention |
| Loki | 1 core | 2 GB | 200 GB | 30-day retention |
| Grafana | 1 core | 2 GB | 10 GB | Dashboards and config |
| AlertManager | 0.5 core | 512 MB | 5 GB | Alert state |
| Promtail | 0.2 core | 256 MB | - | Per node (DaemonSet) |
| Node Exporter | 0.1 core | 128 MB | - | Per node (DaemonSet) |
| cAdvisor | 0.2 core | 256 MB | - | Per node (DaemonSet) |

Monitoring Best Practices ​

1. Metric Naming ​

  • Use consistent naming conventions
  • Include units in metric names
  • Use labels for dimensions

2. Alert Design ​

  • Alert on symptoms, not causes
  • Set appropriate thresholds
  • Avoid alert fatigue
  • Include runbook links
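
As an illustration of these points, the sketch below shows a rule with an explicit threshold, a `for` window to reduce flapping, and a runbook link carried as an annotation. The metric names are standard Node Exporter metrics; the group name, threshold, and runbook URL are hypothetical.

```yaml
groups:
  - name: node_alerts                  # hypothetical group name
    rules:
      - alert: NodeFilesystemAlmostFull
        # Fraction of free space, ignoring ephemeral filesystems
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
          node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m                       # must hold for 15 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has under 10% free"
          runbook_url: "https://runbooks.example.com/node-filesystem-full"   # hypothetical URL
```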

3. Dashboard Organization ​

  • One dashboard per service
  • Use template variables
  • Include SLO/SLI panels
  • Add documentation panels

4. Log Management ​

  • Use structured logging
  • Include correlation IDs
  • Set appropriate log levels
  • Implement log sampling for high-volume apps
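
With structured (JSON) logs, Promtail can extract fields and act on them before shipping. A minimal sketch of a `pipeline_stages` fragment for a Promtail scrape config, assuming each log line is a JSON object with a `level` field (the field name is an assumption about the application's log format):

```yaml
pipeline_stages:
  # Parse the JSON log line and extract the log level
  - json:
      expressions:
        level: level
  # Promote the low-cardinality level to a Loki label
  - labels:
      level:
  # Drop debug lines to reduce volume for chatty applications
  - drop:
      source: level
      value: debug
```

Correlation IDs are deliberately not turned into labels here: they are high-cardinality and are better filtered at query time from the log line itself.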

5. Retention Policies ​

  • Metrics: 15-30 days
  • Logs: 7-30 days
  • Long-term: Archive to S3/GCS

Troubleshooting ​

Prometheus Not Scraping Targets ​

bash
# Check targets in Prometheus
kubectl port-forward -n monitoring-stack svc/prometheus 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring-stack prometheus-0

# Verify RBAC permissions
kubectl auth can-i list pods --as=system:serviceaccount:monitoring-stack:prometheus

Grafana Can't Connect to Data Sources ​

bash
# Check if services are reachable
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://prometheus:9090/-/ready
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://loki:3100/ready

# Check datasource configuration
kubectl get configmap -n monitoring-stack grafana-datasources -o yaml

Loki Not Receiving Logs ​

bash
# Check Promtail logs
kubectl logs -n monitoring-stack -l app=promtail

# Verify Promtail can reach Loki (kubectl exec takes a pod or TYPE/NAME, not a label selector)
kubectl exec -n monitoring-stack daemonset/promtail -- wget -O- http://loki:3100/ready

# Check Loki ingester status
kubectl exec -n monitoring-stack loki-0 -- wget -O- http://localhost:3100/ring

Next Steps ​