Monitoring Stack
Overview
The Monitoring Stack provides comprehensive observability for the HealthFlow NDP platform, including metrics collection, log aggregation, visualization, and alerting.
Stack Architecture
Components
1. Prometheus
Purpose: Time-series database for metrics collection and storage
Version: Prometheus 3.x
Features:
- Multi-dimensional data model
- Flexible query language (PromQL)
- Pull-based metric collection
- Service discovery
- Alert rule evaluation
- Federation support
Metrics Collected:
- Application metrics (custom business metrics)
- Infrastructure metrics (CPU, memory, disk, network)
- Database metrics (connections, queries, latency)
- Kubernetes metrics (pods, nodes, deployments)
- HTTP request metrics (rate, latency, errors)
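The `kubernetes-pods` scrape job in the Prometheus ConfigMap below discovers targets through pod annotations. A minimal sketch of how an application pod opts in to scraping (the workload name, image, and port are illustrative, not part of this deployment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders-api                   # illustrative workload name
  annotations:
    prometheus.io/scrape: "true"     # matched by the keep rule in the scrape config
    prometheus.io/path: "/metrics"   # rewritten into __metrics_path__
    prometheus.io/port: "8080"       # rewritten into __address__
spec:
  containers:
    - name: orders-api
      image: registry.example.com/orders-api:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
```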
2. Grafana
Purpose: Visualization and dashboarding platform
Version: Grafana 12.x
Features:
- Rich visualization options
- Multiple data source support
- Template variables
- Alert annotations
- User management and RBAC
- Dashboard sharing and embedding
Pre-configured Dashboards:
- Cluster Overview
- Node Status
- Pod Resources
- Application Performance
- Database Health
- Business KPIs
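Pre-configured dashboards are typically loaded through Grafana's file-based provisioning. A sketch of a dashboard provider definition (the provider name, folder, and mount path are assumptions; the ConfigMap holding the dashboard JSON files is not shown in this document):

```yaml
apiVersion: 1
providers:
  - name: healthflow-dashboards          # illustrative provider name
    folder: HealthFlow                   # Grafana folder the dashboards appear under
    type: file
    options:
      path: /var/lib/grafana/dashboards  # JSON dashboard files mounted here
```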
3. Loki
Purpose: Log aggregation system optimized for Kubernetes
Version: Loki 3.x
Features:
- Label-based indexing (no full-text indexing)
- Cost-effective storage
- LogQL query language
- Native Grafana integration
- Multi-tenancy support
- S3/GCS backend support
Log Sources:
- Application logs
- System logs
- Kubernetes events
- Audit logs
- Access logs
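The Loki StatefulSet below mounts a `loki-config` ConfigMap whose contents are not included in this document. A minimal single-binary sketch of what `loki.yml` might contain (the schema date and the filesystem backend are assumptions for a staging setup; production would point `object_store` at S3/GCS):

```yaml
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01        # illustrative schema start date
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  filesystem:
    directory: /loki/chunks
```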
4. Promtail
Purpose: Agent for shipping logs to Loki
Deployment: DaemonSet (runs on all nodes)
Features:
- Automatic service discovery
- Label extraction from logs
- Log parsing and transformation
- Position tracking
- Batch sending
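The `promtail.yml` referenced by the DaemonSet below is not included in this document. A sketch covering the features above (the extracted label names and the JSON `level` field are illustrative assumptions):

```yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml      # position tracking across restarts
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                    # automatic service discovery
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace      # label extraction from pod metadata
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    pipeline_stages:
      - cri: {}                      # parse the container runtime log format
      - json:
          expressions:
            level: level             # pull a level field out of structured logs
```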
5. AlertManager
Purpose: Alert routing, grouping, and notification
Version: AlertManager 0.27.x
Features:
- Alert deduplication
- Grouping and silencing
- Route-based notifications
- Template-based messages
- High availability mode
Notification Channels:
- Slack
- PagerDuty
- Webhook
- OpsGenie
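The deployment steps later create an `alertmanager-config` secret from `configs/alertmanager.yml`, which is not shown in this document. A hedged sketch that routes critical alerts to PagerDuty and everything else to Slack (the webhook URL and integration key are placeholders):

```yaml
route:
  receiver: slack-default            # default notification channel
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical        # escalate critical alerts
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/PLACEHOLDER  # placeholder webhook
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: PLACEHOLDER-INTEGRATION-KEY               # placeholder key
```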
6. Node Exporter
Purpose: Hardware and OS metrics from host machines
Deployment: DaemonSet (runs on all nodes)
Metrics:
- CPU usage
- Memory utilization
- Disk I/O
- Network statistics
- Filesystem usage
7. cAdvisor
Purpose: Container resource usage and performance metrics
Deployment: DaemonSet (runs on all nodes)
Metrics:
- Container CPU usage
- Container memory usage
- Network I/O per container
- Filesystem usage per container
Kubernetes Manifests
Namespace
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring-stack
  labels:
    name: monitoring-stack
    stack: infrastructure
```
Prometheus ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring-stack
data:
  # Note: the rule file referenced below (alerts.yml) must also be added to
  # this ConfigMap, or Prometheus will fail to start.
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'healthflow-ndp'
        environment: 'staging'
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    rule_files:
      - /etc/prometheus/alerts.yml
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      # PostgreSQL and Redis do not expose Prometheus metrics on their native
      # ports; these jobs assume postgres_exporter (9187) and redis_exporter
      # (9121) are deployed alongside them. The service names are illustrative.
      - job_name: 'postgresql'
        static_configs:
          - targets: ['postgres-exporter.data-stack:9187']
      - job_name: 'redis'
        static_configs:
          - targets: ['redis-exporter.data-stack:9121']
```
Prometheus StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring-stack
spec:
  serviceName: prometheus
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v3.5.1
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--storage.tsdb.retention.size=90GB'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          ports:
            - containerPort: 9090
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
            requests:
              cpu: "1"
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: prometheus-config
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
```
Grafana Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring-stack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:12.3.2
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SERVER_ROOT_URL
              value: "https://grafana.healthflow.eg"
            - name: GF_SECURITY_ADMIN_USER
              value: "admin"
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: "grafana-piechart-panel,grafana-clock-panel"
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: datasources
          configMap:
            name: grafana-datasources
```
Loki StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: monitoring-stack
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:3.6
          args:
            - -config.file=/etc/loki/loki.yml
          ports:
            - containerPort: 3100
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: loki-storage
              mountPath: /loki
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: "500m"
              memory: 1Gi
      volumes:
        - name: config
          configMap:
            name: loki-config
  volumeClaimTemplates:
    - metadata:
        name: loki-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 200Gi
```
Promtail DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring-stack
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:3.6
          args:
            - -config.file=/etc/promtail/promtail.yml
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
          resources:
            limits:
              cpu: "200m"
              memory: 256Mi
            requests:
              cpu: "100m"
              memory: 128Mi
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
```
Service Dependencies
Deployment Instructions
1. Create Namespace and Secrets
```bash
# Create namespace
kubectl create namespace monitoring-stack

# Create Grafana admin password
kubectl create secret generic grafana-secret \
  --from-literal=admin-password=$(openssl rand -base64 32) \
  --namespace monitoring-stack

# Create AlertManager configuration secret
kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yml=configs/alertmanager.yml \
  --namespace monitoring-stack
```
2. Create ServiceAccounts and RBAC
```bash
# Prometheus ServiceAccount
kubectl apply -f prometheus/serviceaccount.yaml
kubectl apply -f prometheus/clusterrole.yaml
kubectl apply -f prometheus/clusterrolebinding.yaml

# Promtail ServiceAccount
kubectl apply -f promtail/serviceaccount.yaml
kubectl apply -f promtail/clusterrole.yaml
kubectl apply -f promtail/clusterrolebinding.yaml
```
3. Deploy Monitoring Components
```bash
# Deploy Prometheus
kubectl apply -f prometheus/configmap.yaml
kubectl apply -f prometheus/statefulset.yaml
kubectl apply -f prometheus/service.yaml

# Deploy Loki
kubectl apply -f loki/configmap.yaml
kubectl apply -f loki/statefulset.yaml
kubectl apply -f loki/service.yaml

# Deploy Promtail
kubectl apply -f promtail/configmap.yaml
kubectl apply -f promtail/daemonset.yaml

# Deploy Grafana
kubectl apply -f grafana/configmap.yaml
kubectl apply -f grafana/pvc.yaml
kubectl apply -f grafana/deployment.yaml
kubectl apply -f grafana/service.yaml

# Deploy AlertManager
kubectl apply -f alertmanager/configmap.yaml
kubectl apply -f alertmanager/statefulset.yaml
kubectl apply -f alertmanager/service.yaml

# Deploy Node Exporter
kubectl apply -f node-exporter/daemonset.yaml

# Deploy cAdvisor
kubectl apply -f cadvisor/daemonset.yaml
```
4. Verify Deployment
```bash
# Check pod status
kubectl get pods -n monitoring-stack

# Check services
kubectl get svc -n monitoring-stack

# Check persistent volumes
kubectl get pvc -n monitoring-stack

# Port-forward to access Grafana
kubectl port-forward -n monitoring-stack svc/grafana 3000:3000
# Access Grafana at http://localhost:3000
```
Configuration
Grafana Data Sources
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring-stack
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false
```
Alert Rules Example
```yaml
groups:
  - name: database_alerts
    interval: 30s
    rules:
      - alert: PostgreSQLDown
        expr: up{job="postgresql"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database has been down for more than 5 minutes"
      - alert: HighDatabaseConnections
        expr: pg_stat_database_numbackends > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections"
          description: "Database has {{ $value }} active connections"
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value }} errors/sec"
      # Note: kube_* metrics require kube-state-metrics to be deployed.
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
```
Resource Requirements
Estimates
These are rough estimates. Actual usage depends on metrics cardinality, log volume, and retention period.
| Service | CPU | Memory | Storage | Notes |
|---|---|---|---|---|
| Prometheus | 2 cores | 4 GB | 100 GB | 15-day retention |
| Loki | 1 core | 2 GB | 200 GB | 30-day retention |
| Grafana | 1 core | 2 GB | 10 GB | Dashboards and config |
| AlertManager | 0.5 core | 512 MB | 5 GB | Alert state |
| Promtail | 0.2 core | 256 MB | - | Per node (DaemonSet) |
| Node Exporter | 0.1 core | 128 MB | - | Per node (DaemonSet) |
| cAdvisor | 0.2 core | 256 MB | - | Per node (DaemonSet) |
Monitoring Best Practices
1. Metric Naming
- Use consistent naming conventions
- Include units in metric names
- Use labels for dimensions
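As a sketch of these conventions, recording rules can bake the naming scheme in directly, using the common `level:metric:operation` pattern with units in the metric name (the rule and metric names here are illustrative, not part of this deployment):

```yaml
groups:
  - name: naming_examples
    rules:
      # aggregation level : base metric (with unit) : operation/window
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95,
          sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```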
2. Alert Design
- Alert on symptoms, not causes
- Set appropriate thresholds
- Avoid alert fatigue
- Include runbook links
3. Dashboard Organization
- One dashboard per service
- Use template variables
- Include SLO/SLI panels
- Add documentation panels
4. Log Management
- Use structured logging
- Include correlation IDs
- Set appropriate log levels
- Implement log sampling for high-volume apps
5. Retention Policies
- Metrics: 15-30 days
- Logs: 7-30 days
- Long-term: Archive to S3/GCS
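Prometheus retention is already set via the `--storage.tsdb.retention.*` flags above. For Loki, retention is enforced by the compactor; a hedged fragment for a 30-day policy (assuming the filesystem backend used elsewhere in this sketch):

```yaml
limits_config:
  retention_period: 720h              # 30 days
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem    # required when retention is enabled
```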
Troubleshooting
Prometheus Not Scraping Targets
```bash
# Check targets in Prometheus
kubectl port-forward -n monitoring-stack svc/prometheus 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring-stack prometheus-0

# Verify RBAC permissions
kubectl auth can-i list pods --as=system:serviceaccount:monitoring-stack:prometheus
```
Grafana Can't Connect to Data Sources
```bash
# Check if services are reachable
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://prometheus:9090/-/ready
kubectl exec -n monitoring-stack deployment/grafana -- wget -O- http://loki:3100/ready

# Check datasource configuration
kubectl get configmap -n monitoring-stack grafana-datasources -o yaml
```
Loki Not Receiving Logs
```bash
# Check Promtail logs
kubectl logs -n monitoring-stack -l app=promtail

# Verify Promtail can reach Loki (kubectl exec does not accept -l;
# targeting the DaemonSet picks one of its pods)
kubectl exec -n monitoring-stack daemonset/promtail -- wget -O- http://loki:3100/ready

# Check Loki ingester status
kubectl exec -n monitoring-stack loki-0 -- wget -O- http://localhost:3100/ring
```
Next Steps
- Discovery Stack - Deploy Consul & Vault
- Application Stack - Deploy NDP services
- Gateway Stack - Configure ingress for monitoring services