Monitoring Stack
Overview
The Monitoring Stack provides comprehensive observability for the HealthFlow NDP platform, including metrics collection, log aggregation, visualization, and alerting.
Stack Architecture
Components
1. Prometheus
Purpose: Time-series database for metrics collection and storage
Version: Prometheus 3.x
Features:
- Multi-dimensional data model
- Flexible query language (PromQL)
- Pull-based metric collection
- Service discovery
- Alert rule evaluation
- Federation support
Metrics Collected:
- Application metrics (custom business metrics)
- Infrastructure metrics (CPU, memory, disk, network)
- Database metrics (connections, queries, latency)
- Kubernetes metrics (pods, nodes, deployments)
- HTTP request metrics (rate, latency, errors)
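The `kubernetes-pods` scrape job in the Prometheus ConfigMap below discovers targets through pod annotations. A minimal sketch of how an application pod opts in to scraping (the workload name, image, and port are illustrative, not part of this deployment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders-api                   # illustrative workload name
  annotations:
    prometheus.io/scrape: "true"     # matched by the keep rule in the scrape config
    prometheus.io/path: "/metrics"   # rewritten into __metrics_path__
    prometheus.io/port: "8080"       # rewritten into __address__
spec:
  containers:
    - name: orders-api
      image: registry.example.com/orders-api:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
```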
2. Grafana
Purpose: Visualization and dashboarding platform
Version: Grafana 12.x
Features:
- Rich visualization options
- Multiple data source support
- Template variables
- Alert annotations
- User management and RBAC
- Dashboard sharing and embedding
Pre-configured Dashboards:
- Cluster Overview
- Node Status
- Pod Resources
- Application Performance
- Database Health
- Business KPIs
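Pre-configured dashboards are typically loaded through Grafana's file-based provisioning. A sketch of a dashboard provider definition (the provider name, folder, and mount path are assumptions; the ConfigMap holding the dashboard JSON files is not shown in this document):

```yaml
apiVersion: 1
providers:
  - name: healthflow-dashboards          # illustrative provider name
    folder: HealthFlow                   # Grafana folder the dashboards appear under
    type: file
    options:
      path: /var/lib/grafana/dashboards  # JSON dashboard files mounted here
```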
3. Loki
Purpose: Log aggregation system optimized for Kubernetes
Version: Loki 3.x
Features:
- Label-based indexing (no full-text indexing)
- Cost-effective storage
- LogQL query language
- Native Grafana integration
- Multi-tenancy support
- S3/GCS backend support
Log Sources:
- Application logs
- System logs
- Kubernetes events
- Audit logs
- Access logs
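The Loki StatefulSet below mounts a `loki-config` ConfigMap whose contents are not included in this document. A minimal single-binary sketch of what `loki.yml` might contain (the schema date and the filesystem backend are assumptions for a staging setup; production would point `object_store` at S3/GCS):

```yaml
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01        # illustrative schema start date
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  filesystem:
    directory: /loki/chunks
```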
4. Promtail
Purpose: Agent for shipping logs to Loki
Deployment: DaemonSet (runs on all nodes)
Features:
- Automatic service discovery
- Label extraction from logs
- Log parsing and transformation
- Position tracking
- Batch sending
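The `promtail.yml` referenced by the DaemonSet below is not included in this document. A sketch covering the features above (the extracted label names and the JSON `level` field are illustrative assumptions):

```yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml      # position tracking across restarts
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                    # automatic service discovery
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace      # label extraction from pod metadata
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    pipeline_stages:
      - cri: {}                      # parse the container runtime log format
      - json:
          expressions:
            level: level             # pull a level field out of structured logs
```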
5. AlertManager
Purpose: Alert routing, grouping, and notification
Version: AlertManager 0.27.x
Features:
- Alert deduplication
- Grouping and silencing
- Route-based notifications
- Template-based messages
- High availability mode
Notification Channels:
- Slack
- PagerDuty
- Webhook
- OpsGenie
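The deployment steps later create an `alertmanager-config` secret from `configs/alertmanager.yml`, which is not shown in this document. A hedged sketch that routes critical alerts to PagerDuty and everything else to Slack (the webhook URL and integration key are placeholders):

```yaml
route:
  receiver: slack-default            # default notification channel
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical        # escalate critical alerts
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/PLACEHOLDER  # placeholder webhook
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: PLACEHOLDER-INTEGRATION-KEY               # placeholder key
```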
6. Node Exporter
Purpose: Hardware and OS metrics from host machines
Deployment: DaemonSet (runs on all nodes)
Metrics:
- CPU usage
- Memory utilization
- Disk I/O
- Network statistics
- Filesystem usage
7. cAdvisor
Purpose: Container resource usage and performance metrics
Deployment: DaemonSet (runs on all nodes)
Metrics:
- Container CPU usage
- Container memory usage
- Network I/O per container
- Filesystem usage per container
Kubernetes Manifests
Namespace
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring-stack
  labels:
    name: monitoring-stack
    stack: infrastructure
```
Prometheus ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring-stack
data:
  # Note: the rule file referenced below (alerts.yml) must also be added to
  # this ConfigMap, or Prometheus will fail to start.
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'healthflow-ndp'
        environment: 'staging'
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    rule_files:
      - /etc/prometheus/alerts.yml
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      # PostgreSQL and Redis do not expose Prometheus metrics on their native
      # ports; these jobs assume postgres_exporter (9187) and redis_exporter
      # (9121) are deployed alongside them. The service names are illustrative.
      - job_name: 'postgresql'
        static_configs:
          - targets: ['postgres-exporter.data-stack:9187']
      - job_name: 'redis'
        static_configs:
          - targets: ['redis-exporter.data-stack:9121']
```
Prometheus StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring-stack
spec:
  serviceName: prometheus
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v3.5.1
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--storage.tsdb.retention.size=90GB'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          ports:
            - containerPort: 9090
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
            requests:
              cpu: "1"
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: prometheus-config
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
```
Grafana Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring-stack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:12.3.2
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SERVER_ROOT_URL
              value: "https://grafana.healthflow.eg"
            - name: GF_SECURITY_ADMIN_USER
              value: "admin"
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: "grafana-piechart-panel,grafana-clock-panel"
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: datasources
          configMap:
            name: grafana-datasources
```
Loki StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: monitoring-stack
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:3.6
          args:
            - -config.file=/etc/loki/loki.yml
          ports:
            - containerPort: 3100
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: loki-storage
              mountPath: /loki
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: "500m"
              memory: 1Gi
      volumes:
        - name: config
          configMap:
            name: loki-config
  volumeClaimTemplates:
    - metadata:
        name: loki-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 200Gi
```
Promtail DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring-stack
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:3.6
          args:
            - -config.file=/etc/promtail/promtail.yml
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
          resources:
            limits:
              cpu: "200m"
              memory: 256Mi
            requests:
              cpu: "100m"
              memory: 128Mi
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
```
Service Dependencies
Deployment Instructions
1. Create Namespace and Secrets
```bash
# Create namespace
kubectl create namespace monitoring-stack

# Create Grafana admin password
kubectl create secret generic grafana-secret \
  --from-literal=admin-password=$(openssl rand -base64 32) \
  --namespace monitoring-stack

# Create AlertManager configuration secret
kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yml=configs/alertmanager.yml \
  --namespace monitoring-stack
```
2. Create ServiceAccounts and RBAC
```bash
# Prometheus ServiceAccount
kubectl apply -f prometheus/serviceaccount.yaml
kubectl apply -f prometheus/clusterrole.yaml
kubectl apply -f prometheus/clusterrolebinding.yaml

# Promtail ServiceAccount
kubectl apply -f promtail/serviceaccount.yaml
kubectl apply -f promtail/clusterrole.yaml
kubectl apply -f promtail/clusterrolebinding.yaml
```
3. Deploy Monitoring Components
```bash
# Deploy Prometheus
kubectl apply -f prometheus/configmap.yaml
kubectl apply -f prometheus/statefulset.yaml
kubectl apply -f prometheus/service.yaml

# Deploy Loki
kubectl apply -f loki/configmap.yaml
kubectl apply -f loki/statefulset.yaml
kubectl apply -f loki/service.yaml

# Deploy Promtail
kubectl apply -f promtail/configmap.yaml
kubectl apply -f promtail/daemonset.yaml

# Deploy Grafana
kubectl apply -f grafana/configmap.yaml
kubectl apply -f grafana/pvc.yaml
kubectl apply -f grafana/deployment.yaml
kubectl apply -f grafana/service.yaml

# Deploy AlertManager
kubectl apply -f alertmanager/configmap.yaml
kubectl apply -f alertmanager/statefulset.yaml
kubectl apply -f alertmanager/service.yaml

# Deploy Node Exporter
kubectl apply -f node-exporter/daemonset.yaml

# Deploy cAdvisor
kubectl apply -f cadvisor/daemonset.yaml
```
4. Verify Deployment
```bash
# Check pod status
kubectl get pods -n monitoring-stack

# Check services
kubectl get svc -n monitoring-stack

# Check persistent volumes
kubectl get pvc -n monitoring-stack

# Port-forward to access Grafana
kubectl port-forward -n monitoring-stack svc/grafana 3000:3000
# Access Grafana at http://localhost:3000
```
Configuration
Grafana Data Sources
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring-stack
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false
```
Alert Rules Example
```yaml
groups:
  - name: database_alerts
    interval: 30s
    rules:
      - alert: PostgreSQLDown
        expr: up{job="postgresql"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database has been down for more than 5 minutes"
      - alert: HighDatabaseConnections
        expr: pg_stat_database_numbackends > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections"
          description: "Database has {{ $value }} active connections"
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value }} errors/sec"
      # Note: kube_* metrics require kube-state-metrics to be deployed.
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
```
Resource Requirements
Estimates
These are rough estimates. Actual usage depends on metrics cardinality, log volume, and retention period.
| Service | CPU | Memory | Storage | Notes |
|---|---|---|---|---|
| Prometheus | 2 cores | 4 GB | 100 GB | 15-day retention |
| Loki | 1 core | 2 GB | 200 GB | 30-day retention |
| Grafana | 1 core | 2 GB | 10 GB | Dashboards and config |
| AlertManager | 0.5 core | 512 MB | 5 GB | Alert state |
| Promtail | 0.2 core | 256 MB | - | Per node (DaemonSet) |
| Node Exporter | 0.1 core | 128 MB | - | Per node (DaemonSet) |
| cAdvisor | 0.2 core | 256 MB | - | Per node (DaemonSet) |
Monitoring Best Practices
1. Metric Naming
- Use consistent naming conventions
- Include units in metric names
- Use labels for dimensions
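As a sketch of these conventions, recording rules can bake the naming scheme in directly, using the common `level:metric:operation` pattern with units in the metric name (the rule and metric names here are illustrative, not part of this deployment):

```yaml
groups:
  - name: naming_examples
    rules:
      # aggregation level : base metric (with unit) : operation/window
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95,
          sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```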
2. Alert Design
- Alert on symptoms, not causes
- Set appropriate thresholds
- Avoid alert fatigue
- Include runbook links
3. Dashboard Organization
- One dashboard per service
- Use template variables
- Include SLO/SLI panels
- Add documentation panels
4. Log Management
- Use structured logging
- Include correlation IDs
- Set appropriate log levels
- Implement log sampling for high-volume apps
5. Retention Policies
- Metrics: 15-30 days
- Logs: 7-30 days
- Long-term: Archive to S3/GCS
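Prometheus retention is already set via the `--storage.tsdb.retention.*` flags above. For Loki, retention is enforced by the compactor; a hedged fragment for a 30-day policy (assuming the filesystem backend used elsewhere in this sketch):

```yaml
limits_config:
  retention_period: 720h              # 30 days
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem    # required when retention is enabled
```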
Troubleshooting
Prometheus Not Scraping Targets
```bash
# Check targets in Prometheus
kubectl port-forward -n monitoring-stack svc/prometheus 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring-stack prometheus-0

# Verify RBAC permissions
kubectl auth can-i list pods --as=system:serviceaccount:monitoring-stack:prometheus
```
Grafana Can't Connect to Data Sources
```bash
# Check if services are reachable
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://prometheus:9090/-/ready
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://loki:3100/ready

# Check datasource configuration
kubectl get configmap -n monitoring-stack grafana-datasources -o yaml
```
Loki Not Receiving Logs
```bash
# Check Promtail logs
kubectl logs -n monitoring-stack -l app=promtail

# Verify Promtail can reach Loki (kubectl exec does not accept -l;
# targeting the DaemonSet picks one of its pods)
kubectl exec -n monitoring-stack daemonset/promtail -- wget -O- http://loki:3100/ready

# Check Loki ingester status
kubectl exec -n monitoring-stack loki-0 -- wget -O- http://localhost:3100/ring
```
Next Steps
- Discovery Stack - Deploy Consul & Vault
- Application Stack - Deploy NDP services
- Gateway Stack - Configure ingress for monitoring services