Loki, Alloy, and Prometheus Monitoring for Kubernetes

This guide walks through installing Loki, Alloy, and Prometheus on a Kubernetes cluster to provide centralized logging and metrics monitoring, both for the cluster itself and for external Linux hosts.

Prerequisites

  • A running Kubernetes cluster.
  • Helm v3 or later.
  • kubectl configured to interact with your cluster.
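
If the chart repositories aren't set up yet, add them and create the namespace used throughout this guide:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
kubectl create namespace logging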

1. Prometheus Installation

Install Prometheus using Helm, adding web.enable-remote-write-receiver to server.extraFlags so Alloy can push metrics to Prometheus via remote write (web.enable-lifecycle is kept from the chart defaults):

helm install prometheus-main prometheus-community/prometheus \
  --version 28.9.0 \
  --namespace logging \
  --set kube-state-metrics.enabled=false \
  --set prometheus-node-exporter.enabled=false \
  --set prometheus-pushgateway.enabled=false \
  --set "server.extraFlags={web.enable-lifecycle,web.enable-remote-write-receiver}"

2. Adding Custom Scrape Targets (Optional)

To scrape targets outside the Kubernetes cluster, add extra jobs under scrape_configs in the Prometheus server ConfigMap.

scrape_configs:
  - job_name: 'pve'
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.2.250:9100']
        labels:
          instance: 'pve'
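
With the default chart naming, the server config lives in a ConfigMap called prometheus-main-server, and the config-reloader sidecar applies edits automatically after a short delay:

kubectl -n logging edit configmap prometheus-main-server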

3. Loki Installation

Install Loki using Helm.

helm install loki-main grafana/loki -f loki-values.yaml --namespace logging

The values file is fairly involved: it configures the local filesystem for storage instead of S3, adds a 7-day retention period for logs, and lowers the resource requests, since the default install asks for 8 GB of memory per pod!

#loki-values.yaml
loki:
  auth_enabled: false
  storage:
    type: 'filesystem'
  commonConfig:
    replication_factor: 1
    # This explicitly routes all storage requests to the local disk
    path_prefix: /var/loki
    storage:
      filesystem:
        chunks_directory: /var/loki/chunks
        rules_directory: /var/loki/rules
  limits_config:
    retention_period: 168h  # <--- THIS deletes logs older than 7 days
    allow_deletes: true
    allow_structured_metadata: true
    volume_enabled: true
  schemaConfig:
    configs:
      - from: "2024-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  # Ensures the pod has a disk to write to
  storageConfig:
    filesystem:
      directory: /var/loki/chunks

  pattern_ingester:
    enabled: true
  ruler:
    enable_api: true

  # Compactor config belongs under the loki key; the top-level compactor key
  # further down only zeroes out the unused compactor deployment
  compactor:
    retention_enabled: true
    working_directory: /var/loki/retention
    delete_request_store: filesystem # <--- MUST match your storage type
    retention_delete_delay: 2h
    compaction_interval: 10m

minio:
  enabled: false

deploymentMode: SingleBinary

chunksCache:
  allocatedMemory: 512
  resources:
    requests:
      cpu: "10m"
      memory: "50Mi"
    limits:
      cpu: "50m"
      memory: "100Mi"

resultsCache:
  enabled: true
  allocatedMemory: 50
  resources:
    requests:
      cpu: "10m"
      memory: "50Mi"
    limits:
      cpu: "50m"
      memory: "100Mi"

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 10Gi
  resources:
    requests:
      cpu: "50m"      # Increase if logs are delayed
      memory: "50Mi"  # Increase if pod is OOMKilled

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
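
Once the pod is running, you can check readiness through a port-forward (the pod and label names assume the single-binary StatefulSet created by the release above):

kubectl -n logging get pods -l app.kubernetes.io/instance=loki-main
kubectl -n logging port-forward pod/loki-main-0 3100:3100 &
curl -s http://localhost:3100/ready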

4. Alloy Installation

Install Alloy using Helm; the helm upgrade --install form works for both a first-time install and later upgrades.

helm upgrade --install alloy grafana/alloy -n logging
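
Note that the chart normally manages its own ConfigMap, so a later helm upgrade can overwrite the manually applied ConfigMap below. If your chart version supports it, you can instead point the chart at the existing ConfigMap (the alloy.configMap.* values here are an assumption; check your chart's values):

helm upgrade --install alloy grafana/alloy -n logging \
  --set alloy.configMap.create=false \
  --set alloy.configMap.name=alloy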

Next, recreate the ConfigMap so pods and containers are labeled correctly, making logs easy to search in Grafana.

  1. Delete the old ConfigMap: kubectl delete cm -n logging alloy
  2. Apply the new ConfigMap: kubectl apply -f alloy-cm.yaml
  3. Roll out the new DaemonSet: kubectl rollout restart -n logging daemonset alloy

#alloy-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy
  namespace: logging
data:
  config.alloy: |
    // Discover Kubernetes pods
    discovery.kubernetes "pods" {
      role = "pod"
    }

    // Relabel to add pod metadata as labels
    discovery.relabel "pod_logs" {
      targets = discovery.kubernetes.pods.targets

      // Add namespace
      rule {
        source_labels = ["__meta_kubernetes_namespace"]
        target_label  = "namespace"
      }

      // Add pod name
      rule {
        source_labels = ["__meta_kubernetes_pod_name"]
        target_label  = "pod"
      }

      // Add container name
      rule {
        source_labels = ["__meta_kubernetes_pod_container_name"]
        target_label  = "container"
      }

      // Add node name
      rule {
        source_labels = ["__meta_kubernetes_pod_node_name"]
        target_label  = "node"
      }

      // Add app label if it exists
      rule {
        source_labels = ["__meta_kubernetes_pod_label_app"]
        target_label  = "app"
      }

      // Add job from controller name
      rule {
        source_labels = ["__meta_kubernetes_pod_controller_name"]
        target_label  = "job"
      }

      // Map all pod labels with prefix
      rule {
        action      = "labelmap"
        regex       = "__meta_kubernetes_pod_label_(.+)"
        replacement = "pod_label_$1"
      }

      // Add pod UID
      rule {
        source_labels = ["__meta_kubernetes_pod_uid"]
        target_label  = "pod_uid"
      }
    }

    // Tail logs using Kubernetes API (no filesystem access needed)
    loki.source.kubernetes "pods" {
      targets    = discovery.relabel.pod_logs.output
      forward_to = [loki.process.pods.receiver]
    }

    // Process logs
    loki.process "pods" {
      forward_to = [loki.write.default.receiver]

      // Parse CRI format logs (containerd)
      stage.cri {}

      // Try to extract JSON fields if present
      stage.json {
        expressions = {
          level = "level",
          msg   = "msg",
        }
      }

      // Add level as label if extracted
      stage.labels {
        values = {
          level = "",
        }
      }
    }

    // Write logs to Loki
    loki.write "default" {
      endpoint {
        url = "https://loki-main-prd.local.koryalbert.net/loki/api/v1/push"
        
        // Optional: Add basic auth if needed
        // basic_auth {
        //   username = "username"
        //   password = "password"
        // }
      }

      // External labels applied to all logs
      external_labels = {
        cluster = "kubernetes",
      }
    }
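
After the rollout, confirm the DaemonSet is healthy and that its logs show no push errors:

kubectl -n logging rollout status daemonset/alloy
kubectl -n logging logs daemonset/alloy --tail=20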

5. Alloy Installation on Regular Linux Hosts (Optional)

Install Alloy on non-Kubernetes hosts to collect logs and metrics.

wget -q -O gpg.key https://rpm.grafana.com/gpg.key
sudo rpm --import gpg.key
echo -e '[grafana]\nname=grafana\nbaseurl=https://rpm.grafana.com\nrepo_gpgcheck=1\nenabled=1\ngpgcheck=1\ngpgkey=https://rpm.grafana.com/gpg.key\nsslverify=1\nsslcacert=/etc/pki/tls/certs/ca-bundle.crt' | sudo tee /etc/yum.repos.d/grafana.repo
sudo dnf update
sudo dnf install alloy

Next, create the Alloy configuration at /etc/alloy/config.alloy:

// This block relabels metrics coming from node_exporter to add standard labels
discovery.relabel "integrations_node_exporter" {
  targets = prometheus.exporter.unix.integrations_node_exporter.targets

  rule {
    // Set the instance label to the hostname of the machine
    target_label = "instance"
    replacement  = constants.hostname
  }

  rule {
    // Set a standard job name for all node_exporter metrics
    target_label = "job"
    replacement = "integrations/node_exporter"
  }
}

// Configure the node_exporter integration to collect system metrics
prometheus.exporter.unix "integrations_node_exporter" {
  // Disable unnecessary collectors to reduce overhead
  disable_collectors = ["ipvs", "btrfs", "infiniband", "xfs", "zfs"]
  enable_collectors = ["meminfo"]

  filesystem {
    // Exclude filesystem types that aren't relevant for monitoring
    fs_types_exclude     = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
    // Exclude mount points that aren't relevant for monitoring
    mount_points_exclude = "^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)"
    // Timeout for filesystem operations
    mount_timeout        = "5s"
  }

  netclass {
    // Ignore virtual and container network interfaces
    ignored_devices = "^(veth.*|cali.*|[a-f0-9]{15})$"
  }

  netdev {
    // Exclude virtual and container network interfaces from device metrics
    device_exclude = "^(veth.*|cali.*|[a-f0-9]{15})$"
  }
}

// Define how to scrape metrics from the node_exporter
prometheus.scrape "integrations_node_exporter" {
  scrape_interval = "15s"
  // Use the targets with labels from the discovery.relabel component
  targets    = discovery.relabel.integrations_node_exporter.output
  // Forward the scraped metrics to the remote_write component
  forward_to = [prometheus.remote_write.local.receiver]
}

// Define where to send the metrics for storage
prometheus.remote_write "local" {
  endpoint {
    // Send metrics to the Prometheus instance running in the cluster
    url = "https://prometheus-main-prd.local.koryalbert.net/api/v1/write"
  }
}

// Define relabeling rules for systemd journal logs
discovery.relabel "logs_integrations_integrations_node_exporter_journal_scrape" {
  // Create a dummy target to apply rules to journal logs
  targets = [{
    __address__ = "journal",
  }]

  rule {
    // Set the instance label to the hostname
    target_label = "instance"
    replacement  = constants.hostname
  }

  rule {
    // Set a standard job name for journal logs
    target_label = "job"
    replacement = "integrations/node_exporter"
  }

  rule {
    // Extract systemd unit information into a label
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  rule {
    // Extract boot ID information into a label
    source_labels = ["__journal__boot_id"]
    target_label  = "boot_id"
  }

  rule {
    // Extract transport information into a label
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }

  rule {
    // Extract log priority into a level label
    source_labels = ["__journal_priority_keyword"]
    target_label  = "level"
  }
}

// Collect logs from systemd journal for node_exporter integration
loki.source.journal "logs_integrations_integrations_node_exporter_journal_scrape" {
  // Only collect logs from the last 24 hours
  max_age       = "24h0m0s"
  // Apply relabeling rules to the logs
  relabel_rules = discovery.relabel.logs_integrations_integrations_node_exporter_journal_scrape.rules
  // Send logs to the local Loki instance
  forward_to    = [loki.write.local.receiver]
}

// Define which log files to collect for node_exporter, including dmesg
local.file_match "logs_integrations_integrations_node_exporter_direct_scrape" {
  path_targets = [{
    // Target localhost for log collection
    __address__ = "localhost",
    // Collect standard system logs including dmesg
    __path__    = "/var/log/{syslog,messages,*.log,dmesg}",
    // Add instance label with hostname
    instance    = constants.hostname,
    // Add job label for logs
    job         = "integrations/node_exporter",
  }]
}

// Collect logs from files for node_exporter
loki.source.file "logs_integrations_integrations_node_exporter_direct_scrape" {
  // Use targets defined in local.file_match
  targets    = local.file_match.logs_integrations_integrations_node_exporter_direct_scrape.targets
  // Send logs to the local Loki instance
  forward_to = [loki.write.local.receiver]
}

// Define where to send logs for storage
loki.write "local" {
  endpoint {
    // Send logs to the Loki instance running in the cluster
    url = "https://loki-main-prd.local.koryalbert.net/loki/api/v1/push"
  }
}

// Enable live debugging features (empty config means use defaults)
livedebugging {}


Set the following permissions so the alloy user can read the journal and log files:

sudo usermod -aG adm alloy
sudo usermod -aG systemd-journal alloy
sudo usermod -aG docker alloy

sudo setfacl -m u:alloy:r /var/log/messages
sudo setfacl -m u:alloy:r /var/log/kdump.log
sudo setfacl -m u:alloy:r /var/log/boot.log

sudo systemctl restart alloy.service
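
Finally, verify the service is enabled and shipping data, watching the journal for scrape or push errors:

sudo systemctl enable --now alloy.service
journalctl -u alloy.service -f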