Monitoring

Introduction

Why bother doing all this in the traditional, native sense? There are use cases, particularly for smaller environments. In the conclusion to this guide, I weigh the value of managing such a stack manually against deploying it on a container platform. However, for getting to grips with all the moving parts of this monitoring and alerting stack, doing it all by hand is the best way to learn and understand the pros and cons.

Native Overview

At a high level, this is the final objective. It might look complex at first glance, but the target is a pair of each core component for high availability, along with their connectivity relationships.

Overview

The rest of this guide uses the following Virtual Machines:

utilities - 192.168.0.70
mon1      - 192.168.0.71
mon2      - 192.168.0.72

NFS File shares

This NFS share example uses a Linux host at 192.168.0.70, attached to the same network as the nodes.

dnf install nfs-utils -y
systemctl enable --now rpcbind
systemctl enable --now nfs-server

Create a directory to share:

mkdir -p /nfs/rules /nfs/targets /nfs/alertmanagers /nfs/thanos

Add the share to configuration:

vi /etc/exports
/nfs/rules               192.168.0.1/24(rw,sync,no_wdelay,no_root_squash,insecure)
/nfs/targets             192.168.0.1/24(rw,sync,no_wdelay,no_root_squash,insecure)
/nfs/alertmanagers       192.168.0.1/24(rw,sync,no_wdelay,no_root_squash,insecure)
/nfs/thanos              192.168.0.1/24(rw,sync,no_wdelay,no_root_squash,insecure)
NFS is typically NOT recommended for real environments, See https://thanos.io/tip/thanos/storage.md/ for configuring access to object storage and the supported clients.

Export the new share with:

exportfs -arv

And confirm the share is visible locally:

exportfs  -s
showmount -e 127.0.0.1

And from another host:

showmount -e 192.168.0.70

If required, open up the firewall ports needed:

firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --permanent --add-service=mountd
firewall-cmd --reload

PostgreSQL

PostgreSQL is a powerful, open-source object-relational database system that has earned a strong reputation for reliability, feature robustness, and performance.

In many cases, databases are external entities. Many cloud providers now provide PostgreSQL-as-a-service, or a project may opt for standing up a dedicated instance or instances in a highly available configuration. It may therefore be preferable to host PostgreSQL in a local virtual machine to replicate such environments.

These steps detail deploying PostgreSQL using CentOS Stream 8.

Installation

Install PostgreSQL 12 from the module stream:

dnf install @postgresql:12 -y

Database configuration

The first step is to initialise PostgreSQL:

/usr/bin/postgresql-setup --initdb

Enable and start service

systemctl enable postgresql.service --now

Open access

If firewalld is being used, add the service:

firewall-cmd --add-service=postgresql --permanent
firewall-cmd --reload

Access PostgreSQL:

su - postgres
psql

Create a user and database:

CREATE USER grafana WITH PASSWORD 'changeme';
ALTER ROLE grafana SET client_encoding TO 'utf8';
ALTER ROLE grafana SET default_transaction_isolation TO 'read committed';
ALTER ROLE grafana SET timezone TO 'UTC';
CREATE DATABASE grafana_db;
GRANT ALL PRIVILEGES ON DATABASE grafana_db TO grafana;

List and quit:

\l
\q
exit

Ensure it is configured to listen on the IP Address of the host.

Edit /var/lib/pgsql/data/postgresql.conf:

listen_addresses = '192.168.0.70'

And update the configuration to allow any host on the same subnet to access the database.

Edit /var/lib/pgsql/data/pg_hba.conf:

Original:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     peer
host    all             all             127.0.0.1/32            ident
host    all             all             ::1/128                 ident
local   replication     all                                     peer
host    replication     all             127.0.0.1/32            ident
host    replication     all             ::1/128                 ident

Updated:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     peer
host    all             all             127.0.0.1/32            md5
host    all             all             192.168.0.1/24          md5
host    all             all             ::1/128                 md5
local   replication     all                                     peer
host    replication     all             127.0.0.1/32            ident
host    replication     all             ::1/128                 ident

Restart PostgreSQL service:

systemctl restart postgresql
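
As a quick sanity check from another host on the network (assuming the postgresql client package is installed there), confirm the grafana user can connect remotely:

dnf install postgresql -y
psql -h 192.168.0.70 -U grafana -d grafana_db -c '\conninfo'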

HAProxy

Install HAProxy:

dnf install haproxy -y

Back up the original configuration file:

mv /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.bak

And add the following configuration (changing IPs for your environment):

vi /etc/haproxy/haproxy.cfg
global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    stats socket /var/lib/haproxy/stats

defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    30s
    timeout queue           1m
    timeout connect         30s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 30s
    timeout check           30s
    maxconn                 4000

listen stats
    bind 0.0.0.0:9000
    mode http
    balance
    timeout client 5000
    timeout connect 4000
    timeout server 30000
    stats uri /stats
    stats refresh 5s
    stats realm HAProxy\ Statistics
    stats auth admin:changeme
    stats admin if TRUE

# Add load balancers next

This haproxy.cfg example is the minimal configuration, ready for load balancers to be added later.

Set the SELinux boolean to allow haproxy to connect to any port:

setsebool -P haproxy_connect_any=1

Open firewall:

firewall-cmd --permanent --add-port=9000/tcp --zone=public
firewall-cmd --reload

Enable and start HAProxy:

systemctl enable haproxy.service --now

View the graphical statistics report at http://192.168.0.70:9000/stats. In this example the username is admin and password is changeme.

This should be all set for adding load balancers.
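
Whenever haproxy.cfg is edited later in this guide, the syntax can be validated before restarting the service:

haproxy -c -f /etc/haproxy/haproxy.cfg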

Prometheus

Prometheus is a free software application used for event monitoring and alerting. It records real-time metrics in a time series database (allowing for high dimensionality) built using a HTTP pull model, with flexible queries and real-time alerting. The project is written in Go and licensed under the Apache 2 License.

The good news is that Prometheus is at the heart of this whole micro-service architecture. At its most basic, it could be all that is needed: every target to be scraped for metrics and every alerting rule is defined here in Prometheus. Every other component is peripheral, either extending Prometheus, taking over a responsibility, or consuming its data for visualisation and storage.

This section deals with deploying Prometheus on two Virtual Machines, mounting NFS shares for the target and rules configuration, and load balancing the two Prometheus instances. Think of each instance of Prometheus as a replica.

Prometheus

Add a service user account:

useradd -m -s /bin/false prometheus

Create two directories:

mkdir -p /etc/prometheus /var/lib/prometheus

Change ownership of directories:

chown prometheus:prometheus /etc/prometheus /var/lib/prometheus/

Get the latest download link from https://prometheus.io/download/:

dnf install wget -y
wget https://github.com/prometheus/prometheus/releases/download/v2.24.1/prometheus-2.24.1.linux-amd64.tar.gz

Extract the archive and copy binaries into place:

dnf install tar -y
tar -xvf prometheus-2.24.1.linux-amd64.tar.gz
cd prometheus-2.24.1.linux-amd64
cp prometheus promtool /usr/local/bin/

Check the path is correct and versions:

prometheus --version
promtool --version

Always use the IP Address or, preferably, a DNS name, not localhost, for scrape targets.

Global external_labels are added either to identify each Prometheus instance in an HA configuration or, if the labels are identical on each instance, to identify the Prometheus cluster as a whole.

Use the same configuration on both nodes:

vi /etc/prometheus/prometheus.yml
# Global config
global:
  scrape_interval:     15s
  evaluation_interval: 15s
  scrape_timeout: 15s
  external_labels:
    cluster: prometheus-cluster
    region: europe
    environment: dev

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['0.0.0.0:9090']

Note that scrape_configs includes only this Prometheus target at this stage. In other words, a running instance of Prometheus exposes metrics about itself; when further instances are added they need to be included too, for example:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['0.0.0.0:9090','192.168.0.72:9090']
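
Either way, the configuration can be validated with promtool before restarting Prometheus, for example:

promtool check config /etc/prometheus/prometheus.yml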

Create a service using systemd, adding --web.listen-address=:9090.

To reduce the time before data begins to be archived, the block durations can be set in minutes, for example:

    --storage.tsdb.max-block-duration=30m \
    --storage.tsdb.min-block-duration=30m \
vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Service
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
LimitNOFILE=65536
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --storage.tsdb.max-block-duration=2h \
    --storage.tsdb.min-block-duration=2h \
    --web.listen-address=:9090

[Install]
WantedBy=multi-user.target

Start and enable the Prometheus:

systemctl daemon-reload
systemctl enable prometheus --now
systemctl status prometheus

Prometheus stores its data under /var/lib/prometheus by default.

Open firewall port:

firewall-cmd --add-port=9090/tcp --permanent
firewall-cmd --reload

A single Prometheus instance can then be accessed using a browser, for example http://192.168.0.71:9090/. Assuming all these steps have been repeated on a second node (192.168.0.72), add a load balancer for the two Prometheus instances using HAProxy.

On the host serving HAProxy:

vi /etc/haproxy/haproxy.cfg
# Prometheus LB
frontend prometheus-lb-frontend
    bind 192.168.0.70:9090
    default_backend prometheus-lb-backend

backend prometheus-lb-backend
    balance roundrobin
    server prometheus1 192.168.0.71:9090 check
    server prometheus2 192.168.0.72:9090 check

And restart HAProxy plus checking the status:

systemctl restart haproxy
systemctl status haproxy

Open firewall on HAProxy host too:

firewall-cmd --add-port=9090/tcp --permanent
firewall-cmd --reload

View the state of the load balancer using a browser at http://192.168.0.70:9000/stats.

View Prometheus via the load balancer using http://192.168.0.70:9090/.

Basics

A Prometheus instance exposes metrics about itself, for example at http://192.168.0.71:9090/metrics, and the only target configured (at this stage) is itself.

Look at Targets in a browser:

Prometheus

Execute a query:

promhttp_metric_handler_requests_total{code="200"}
Prometheus
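
The same query can also be executed against the Prometheus HTTP API, which is handy for scripting; a simple example using the built-in up metric:

curl 'http://192.168.0.71:9090/api/v1/query?query=up'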

And observe there are no alerts configured yet:

Prometheus

Decouple config

Remember to think of each instance of Prometheus as a replica behind the load balancer; this means every instance of Prometheus needs the same configuration. When deploying this stack natively on VMs or cloud instances (as opposed to using containers), the config directories might as well be shared, mounted file systems.

Make two directories for the target config and rules:

mkdir -p /etc/prometheus/targets /etc/prometheus/rules

Add the following to fstab:

vi /etc/fstab
192.168.0.70:/nfs/targets /etc/prometheus/targets nfs rw,sync,hard,intr 0 0
192.168.0.70:/nfs/rules /etc/prometheus/rules nfs rw,sync,hard,intr 0 0

Ensure nfs-utils is installed:

dnf install nfs-utils -y

And mount the NFS shares (created at the start of this page):

mount -a

Now update the Prometheus configuration to read files from those directories for both targets and rules:

vi /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'targets'
    file_sd_configs:
    - files:
      - /etc/prometheus/targets/*.yml

rule_files:
  - /etc/prometheus/rules/*.yml

And add the Prometheus target/s:

vi /etc/prometheus/targets/prometheus_targets.yml
---
- labels:
    service: prometheus
    env: staging
  targets:
  - 192.168.0.71:9090

Restart Prometheus:

systemctl restart prometheus

Everything should be the same except now the configuration is decoupled from any instance of Prometheus. When the second instance is added in this example prometheus_targets.yml should include both instances:

---
- labels:
    service: prometheus
    env: staging
  targets:
  - 192.168.0.71:9090
  - 192.168.0.72:9090
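
To confirm both replicas have picked up the file-based targets without opening the UI, the targets API can be queried on each instance, for example:

curl -s http://192.168.0.71:9090/api/v1/targets
curl -s http://192.168.0.72:9090/api/v1/targets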

Chronyd

It’s a good idea to make sure the clocks on all servers and clients are in sync. For reference:

dnf install chrony
systemctl start chronyd
systemctl enable chronyd
chronyc tracking

Recap

Everything from this point on involves adding target scrape configurations and rules, specifically alert rules for Prometheus. All the other components are peripheral to Prometheus, either extending or handing off services, or consuming data for other purposes, as in the case of Grafana, which uses Prometheus as a data source for displaying information in a graphical way.

node_exporter

node_exporter is a Prometheus exporter for hardware and OS metrics exposed by UNIX and Linux kernels. Think of it as a machine agent that exposes metrics at the host level, for things such as CPU, disk usage and memory.

In this guide there are initially three VMs: utilities, mon1 and mon2. These steps for installing node_exporter are repeated on any node that needs to be monitored.

This high level diagram summarises the architecture. node_exporter is deployed on nodes and then included in the Prometheus scrape targets config. Any number of these node_exporter endpoints can be added to monitor infrastructure hosts. In this case, the nodes hosting Prometheus are included. The second Prometheus instance and other nodes are greyed out in the diagram; remember the second instance is a replica of the first.

node_exporter

Deploy node_exporter

Add a service user account:

useradd -m -s /bin/false node_exporter

Get the latest download link from https://prometheus.io/download/.

wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz

Extract the archive:

tar -xvf node_exporter-1.0.1.linux-amd64.tar.gz

Move into the extracted directory:

cd node_exporter-1.0.1.linux-amd64

Copy the node_exporter binary to a suitable path:

cp node_exporter /usr/local/bin/

Create a service for node_exporter using systemd; this example uses the default port 9100:

vi /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=:9100

[Install]
WantedBy=multi-user.target

Open firewall port:

firewall-cmd --add-port=9100/tcp --permanent
firewall-cmd --reload

Start and enable the Node Exporter:

systemctl daemon-reload
systemctl enable node_exporter --now
systemctl status node_exporter
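
A quick check that the exporter is serving metrics locally:

curl -s http://localhost:9100/metrics | head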

Update Targets

Add node_exporter to the Prometheus scrape targets, including labels to identify the service. In this example the nodes mon1 (192.168.0.71), mon2 (192.168.0.72) and utilities (192.168.0.70) are added. Because this config is mounted by every Prometheus instance, the config is the same everywhere. This example includes three nodes:

vi /etc/prometheus/targets/node_exporter.yml
---
- labels:
    service: node_exporter
    env: staging
  targets:
  - 192.168.0.70:9100
  - 192.168.0.71:9100
  - 192.168.0.72:9100

Prometheus needs to be restarted on both Prometheus instances:

systemctl restart prometheus

At this stage, there are two prometheus targets and three node_exporter targets:

Targets

Alertmanager

Before deploying Alert Manager, alert rules can be added to Prometheus and fully tested; alerts will work and fire in Prometheus on their own. All Alert Manager does is hook into Prometheus and handle the actual sending of alert messages to whatever providers are configured, such as email, and it takes care of de-duplication. For example, where two Alert Managers are in the equation, you don’t want both sending out an email for the same alert.

Rules

Consider the query node_filesystem_size_bytes{mountpoint="/boot"}; executing this in Prometheus should return the /boot file system for each of the nodes whose metrics are scraped using node_exporter.

An alert is nothing more than such a query with an added condition.

node_filesystem_size_bytes{mountpoint="/boot"} > 1000000000
Alert Rules

In this case, increasing the threshold so that no series matches the condition returns no results.

node_filesystem_size_bytes{mountpoint="/boot"} > 2000000000
Alert Rules

Working with Prometheus directly and tuning expressions and conditions is the best way of deriving alert expressions.

To add an alert, create rules files under /etc/prometheus/rules/. These files can contain multiple alerts.

Example:

vi /etc/prometheus/rules/boot_fs.yml
groups:
- name: node_exporter
  rules:
    - alert: boot_partition
      expr: node_filesystem_size_bytes{mountpoint="/boot"} > 1000000000
      for: 1m
      labels:
        severity: warning
      annotations:
        title: Disk space filling up
        description: /boot is filling up
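
Before restarting, the rule file can be validated with promtool from any node that has the rules share mounted, for example:

promtool check rules /etc/prometheus/rules/boot_fs.yml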

Restart each Prometheus instance with systemctl restart prometheus.

That is fundamentally all there is to alerts; in this example all three nodes will start firing:

Alert Rules

Edit the alert and tweak the threshold to expr: node_filesystem_size_bytes{mountpoint="/boot"} > 2000000000 and restart Prometheus again; the alert will turn green:

Alert Rules

However, while this is functionally working, it’s not all that useful if you have to manually check the alerts in Prometheus. This is where Alert Manager comes into the equation.

Deploy Alert Manager

Two instances of Alert Manager are deployed, one on each node. Each instance needs to know about the other so they can "gossip", know who has sent what, and avoid duplicate alerts being sent out.

The Prometheus configuration is also updated to include the Alert Manager instances so it can offload the responsibility of deciding what to do with alerts.

Alert Rules

Add a service user account:

useradd -m -s /bin/false alertmanager

Get the latest download link from https://prometheus.io/download/ and download it:

wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz

Extract the archive:

tar -xvf alertmanager-0.21.0.linux-amd64.tar.gz

Move into the extracted directory:

cd alertmanager-0.21.0.linux-amd64

Copy the alertmanager binary to a suitable path:

cp alertmanager /usr/local/bin/

Add the following configuration. You can use a regular Gmail account for SMTP, although it might be necessary to create app credentials, and set whatever receiver email address is desired.

mkdir /etc/alertmanager
vi /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'AlertManager <[email protected]>'
  smtp_require_tls: true
  smtp_hello: 'alertmanager'
  smtp_auth_username: 'username'
  smtp_auth_password: 'changeme'

route:
  group_by: ['instance', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: staging

receivers:
  - name: 'staging'
    email_configs:
      - to: '[email protected]'
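
The tarball also ships an amtool binary; copied into place from the extracted directory, it can validate this configuration before the service is started:

cp amtool /usr/local/bin/
amtool check-config /etc/alertmanager/alertmanager.yml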

Change permissions:

chown -R alertmanager:alertmanager /etc/alertmanager

Create a service for alertmanager using systemd; this example includes the cluster.peer option:

vi /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
LimitNOFILE=65536
ExecStart=/usr/local/bin/alertmanager \
    --cluster.listen-address=0.0.0.0:9004 \
    --cluster.peer=192.168.0.72:9004 \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --web.external-url=http://192.168.0.71:9093

WorkingDirectory=/etc/alertmanager

[Install]
WantedBy=multi-user.target

The second Alert Manager instance needs to point to the other peer, --cluster.peer=192.168.0.71:9004, and use its own IP for --web.external-url=http://192.168.0.72:9093.

Make a directory to mount the alertmanagers.yml config file:

mkdir /etc/prometheus/alertmanagers

Add the NFS mount point:

vi /etc/fstab
192.168.0.70:/nfs/alertmanagers /etc/prometheus/alertmanagers nfs rw,sync,hard,intr 0 0
mount -a

Add alertmanagers.yml:

vi /etc/prometheus/alertmanagers/alertmanagers.yml
---
- targets:
  - 192.168.0.71:9093
  - 192.168.0.72:9093

Open firewall:

firewall-cmd --add-port=9093/tcp --permanent
firewall-cmd --add-port=9004/tcp --permanent
firewall-cmd --reload

Start and enable the Alert Manager:

systemctl daemon-reload
systemctl enable alertmanager.service --now

Add the following to the Prometheus configuration:

vi /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - 'alertmanagers/alertmanagers.yml'

And restart Prometheus:

systemctl restart prometheus.service

With this configured, go back to Prometheus to configure some alerts. Alerts will only appear in Alert Manager if they fire.

Check the status of each Alert Manager, for example http://192.168.0.71:9093 and http://192.168.0.72:9093:

Alert Rules
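
To confirm the two instances have formed a gossip cluster, the status API on either instance should list both peers (a quick check using the v2 API):

curl -s http://192.168.0.71:9093/api/v2/status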

These two Alert Manager instances can be placed behind a load balancer. On the host serving HAProxy:

vi /etc/haproxy/haproxy.cfg
# Alert Manager LB
frontend alertmanager-lb-frontend
    bind 192.168.0.70:9093
    default_backend alertmanager-lb-backend

backend alertmanager-lb-backend
    balance roundrobin
    server alertmanager1 192.168.0.71:9093 check
    server alertmanager2 192.168.0.72:9093 check

And restart HAProxy plus checking the status:

systemctl restart haproxy
systemctl status haproxy

Open firewall on HAProxy host too:

firewall-cmd --add-port=9093/tcp --permanent
firewall-cmd --reload

Experiment by changing the condition in /etc/prometheus/rules/boot_fs.yml and causing an alert to fire (remember to restart Prometheus on both nodes):

Alert Rules

Thanos

Thanos includes quite a few components; in this section three core ones are covered to achieve high availability and, most importantly, long-term storage for historic Prometheus metrics.

For lab work and testing, NFS storage is used.

NFS is typically NOT recommended for real environments, See https://thanos.io/tip/thanos/storage.md/ for configuring access to object storage and the supported clients.
Thanos

Thanos Binary

The same Thanos binary is used for launching the Sidecar, Store and Query components.

Get the latest release link from https://github.com/thanos-io/thanos/releases/ and download it:

wget https://github.com/thanos-io/thanos/releases/download/v0.18.0/thanos-0.18.0.linux-amd64.tar.gz

Extract the archive:

tar -xvf thanos-0.18.0.linux-amd64.tar.gz

Move into the extracted directory and copy the thanos binary to a suitable path:

cd thanos-0.18.0.linux-amd64
cp thanos /usr/local/bin/

Confirm version:

thanos --version
thanos, version 0.18.0 (branch: HEAD, revision: 60d45a02d46858a38013283b578017a171cf7b82)
  build user:       [email protected]
  build date:       20210127-12:18:59
  go version:       go1.15.7
  platform:         linux/amd64

Thanos Sidecar

Starting with the Thanos Sidecar, create a configuration directory for Thanos and a directory to use for mounting the NFS share for storing Prometheus data.

Note: both the Sidecar and Store components use the objstore.config-file which references the mount point.

mkdir -p /thanos /etc/thanos/
chown prometheus:prometheus /thanos /etc/thanos/

Check the NFS share is visible from the host:

dnf install nfs-utils -y

showmount -e 192.168.0.70

Mount the NFS share:

vi /etc/fstab
192.168.0.70:/nfs/thanos /thanos               nfs     defaults        0 0
mount -a
df -h

Add the file system config:

vi /etc/thanos/file_system.yaml
type: FILESYSTEM
config:
  directory: "/thanos"
chown prometheus:prometheus /etc/thanos/file_system.yaml

Create a service for Thanos Sidecar using systemd, including options for the existing Prometheus data directory, Prometheus endpoint and the object store configuration file:

vi /etc/systemd/system/thanos_sidecar.service
[Unit]
Description=Thanos Sidecar
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/thanos sidecar \
    --tsdb.path=/var/lib/prometheus \
    --prometheus.url=http://0.0.0.0:9090 \
    --objstore.config-file=/etc/thanos/file_system.yaml \
    --http-address=0.0.0.0:10902 \
    --grpc-address=0.0.0.0:10901

[Install]
WantedBy=multi-user.target

Start and enable the Thanos Sidecar:

systemctl daemon-reload
systemctl enable thanos_sidecar --now
systemctl status thanos_sidecar
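
The Sidecar exposes health endpoints on its HTTP port, so a quick local check is:

curl http://localhost:10902/-/healthy
curl http://localhost:10902/-/ready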

Open firewall:

firewall-cmd --add-port=10901/tcp --permanent
firewall-cmd --add-port=10902/tcp --permanent
firewall-cmd --reload

Thanos Store

Create a service for Thanos Store using systemd, again referencing the object store configuration file:

vi /etc/systemd/system/thanos_store.service
[Unit]
Description=Thanos Store
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/thanos store \
    --objstore.config-file=/etc/thanos/file_system.yaml \
    --http-address=0.0.0.0:10906 \
    --grpc-address=0.0.0.0:10905 \
    --data-dir=/etc/thanos \
    --log.level=debug

[Install]
WantedBy=multi-user.target

Start and enable the Thanos Store:

systemctl daemon-reload
systemctl enable thanos_store --now
systemctl status thanos_store

Open firewall:

firewall-cmd --add-port=10905/tcp --permanent
firewall-cmd --add-port=10906/tcp --permanent
firewall-cmd --reload

Thanos Query

Create a service for Thanos Query using systemd. Note the --store arguments: port 10905 is the Thanos Store and port 10901 is the Thanos Sidecar, on both instances.

vi /etc/systemd/system/thanos_query.service
[Unit]
Description=Thanos Query
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
LimitNOFILE=65536
ExecStart=/usr/local/bin/thanos query \
    --store=192.168.0.71:10905 \
    --store=192.168.0.72:10905 \
    --store=192.168.0.71:10901 \
    --store=192.168.0.72:10901 \
    --http-address=0.0.0.0:10904 \
    --grpc-address=0.0.0.0:10903

[Install]
WantedBy=multi-user.target

Start and enable the Thanos Query:

systemctl daemon-reload
systemctl enable thanos_query --now
systemctl status thanos_query

Open the firewall:

firewall-cmd --add-port=10904/tcp --permanent
firewall-cmd --reload

You should now be able to hit single instances directly, for example http://192.168.0.71:10904, and look at the stores. The following is with one node configured with the three Thanos components:

Thanos

And now with the second node and second instances of the Thanos components:

Thanos

You can use Thanos Query to execute queries just like in Prometheus; the metrics are fed in directly from Prometheus via the Thanos Sidecar and from the Thanos Store.

Thanos

Do a directory listing ls -al /thanos to see Prometheus data being written.

These two Thanos Query instances can be placed behind a load balancer. On the host serving HAProxy:

vi /etc/haproxy/haproxy.cfg
# Thanos Query LB
frontend thanos-query-lb-frontend
    bind 192.168.0.70:10904
    default_backend thanos-query-lb-backend

backend thanos-query-lb-backend
    balance roundrobin
    server thanos-query1 192.168.0.71:10904 check
    server thanos-query2 192.168.0.72:10904 check

And restart HAProxy plus checking the status:

systemctl restart haproxy
systemctl status haproxy

Open firewall on HAProxy host too:

firewall-cmd --add-port=10904/tcp --permanent
firewall-cmd --reload

Grafana

Grafana is a popular technology used to compose observability dashboards, in this case using Prometheus metrics and, later, logs using Loki and Promtail.

It’s very simple to deploy and by default it uses a local SQLite database for single instances. The only step necessary to achieve high availability is to point the database settings of any number of Grafana instances at a shared database such as PostgreSQL.

Grafana

Add the Grafana repository:

vi /etc/yum.repos.d/grafana.repo

OSS release:

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

And install Grafana:

dnf install grafana -y

For a single instance, just start the service and go. In this case, for an HA pair, update grafana.ini with the PostgreSQL database settings as defined at the start of this page. Search for the [database] section in the ini file and add the following settings:

vi /etc/grafana/grafana.ini
type = postgres
host = 192.168.0.70:5432
name = grafana_db
user = grafana
password = changeme

Optionally, you can change the http port the service listens on:

# The http port  to use
http_port = 3000

Open firewall:

firewall-cmd --add-port=3000/tcp --permanent
firewall-cmd --reload

Start and enable Grafana:

systemctl daemon-reload
systemctl enable  grafana-server --now
systemctl status grafana-server
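
Grafana's health endpoint reports the state of its database connection, so a quick check that the PostgreSQL settings are working:

curl http://localhost:3000/api/health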

Grafana will be up and running, for example http://192.168.0.71:3000/. Log in with username admin and password admin. A password change is mandated on first login.

Repeating the steps on the second node will result in it behaving like the first, since they're attached to the same database. Log in at http://192.168.0.72:3000/ with admin and the new password you set during the first install.

Add the HAProxy load balancer:

vi /etc/haproxy/haproxy.cfg
# Grafana LB
frontend grafana-lb-frontend
    bind 192.168.0.70:3000
    default_backend grafana-lb-backend

backend grafana-lb-backend
    balance roundrobin
    server grafana1 192.168.0.71:3000 check
    server grafana2 192.168.0.72:3000 check

And restart HAProxy plus checking the status:

systemctl restart haproxy
systemctl status haproxy

Open firewall on HAProxy host too:

firewall-cmd --add-port=3000/tcp --permanent
firewall-cmd --reload

Load balancers

At this stage there should be four load balancers, all configured using round robin.

HAProxy

The Grafana LB is used for accessing Grafana itself. The Prometheus and Thanos Query LBs can now be used as data sources in Grafana.

For reference, in this example there are:

Prometheus LB     - http://192.168.0.70:9090
Alert Manager LB  - http://192.168.0.70:9093
Thanos Query LB   - http://192.168.0.70:10904
Grafana LB        - http://192.168.0.70:3000

Data Sources

In Grafana go to Configuration → Data Sources, click "Add data source", select "Prometheus", and under HTTP → URL add the Thanos Query LB URL http://192.168.0.70:10904:

Grafana
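
As an alternative to clicking through the UI, a data source can also be provisioned from a file on each Grafana node; a minimal sketch (the file name thanos.yaml is arbitrary):

vi /etc/grafana/provisioning/datasources/thanos.yaml
apiVersion: 1
datasources:
  - name: Thanos Query
    type: prometheus
    access: proxy
    url: http://192.168.0.70:10904
    isDefault: true

Restart grafana-server on each node for the file to be picked up.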

Example Dashboard

We have node_exporter running on some hosts and have included them as scrape targets in Prometheus. This is all out-of-the-box configuration, and dashboards for the most common metrics have already been solved.

To provide a flavour of the power here, take a look at this dashboard: https://grafana.com/grafana/dashboards/1860. (There are lots of others to search for.)

In Grafana, use the "+" sign in the left hand menu and select "Import", enter the code, in this case "1860", and select "Load", then "Import":

Grafana

On the next screen, select the Prometheus data source in the drop down menu, and select "Import":

Grafana

You should now see some magic happen: the data source provides all the metrics from the node_exporters, picking out the hosts, which are available to select in a drop down menu, and visualises the data using Prometheus queries similar to those shown earlier. Quite impressive!

Grafana

Examining pre-made dashboards such as this provides the inspiration and know-how for building custom dashboards.

Using Ansible

It's obvious that all this manual implementation is not very practical, and sure, there are Ansible Roles available on Ansible Galaxy to deploy most of this stuff. That said, understanding the requirements up close paves the way to automating this yourself. I tend to use Ansible in this way from my Linux laptop; it might be an idea to do it on the utilities host in this scenario.

Remote Hosts

These steps need to be repeated on any host intended to be managed by Ansible. To use a user account called "ansible" and become root, it helps to allow sudo with no password in /etc/sudoers on each managed node. For example:

visudo
## Allows people in group wheel to run all commands
#%wheel        ALL=(ALL)        ALL

## Same thing without a password
%wheel        ALL=(ALL)        NOPASSWD: ALL

Add that user on the remote hosts and make them a member of the wheel group:

useradd ansible
passwd ansible
usermod -aG wheel ansible

Quick Install

Create a new working directory:

mkdir ansible && cd ansible

Create a new Python 3 Virtual Environment:

python3 -m venv venv

Activate the Python environment:

source venv/bin/activate

Make sure pip is updated:

pip install --upgrade pip

Install Ansible:

pip install ansible

Get Started

Make the directory skeleton:

mkdir -p inventories/lab playbooks/lab roles

Add the IP Address of any host that needs to be managed by Ansible, in this case in the lab inventory:

vi inventories/lab/hosts
192.168.0.70
192.168.0.71
192.168.0.72

[monitoring_servers]
192.168.0.71
192.168.0.72

Add an ansible.cfg file in the playbooks/lab directory:

vi playbooks/lab/ansible.cfg
[defaults]
inventory = ../../inventories/lab
roles_path = ../../roles
host_key_checking = False
retry_files_enabled = False
command_warnings = False
remote_user = ansible

Create a basic smoke test role:

mkdir -p roles/smoke_test/tasks roles/smoke_test/defaults

Add a default variable:

vi roles/smoke_test/defaults/main.yml
---
message: 'Hello World!'

Add a basic task to the role:

vi roles/smoke_test/tasks/main.yml
---
- name: Smoke test
  shell: echo "{{ message }}" > /tmp/smoke_test.yml

Add a playbook for the smoke test; this example is for a single server:

vi playbooks/lab/smoke_test.yml
---
- name: Smoke Test Playbook
  hosts: 192.168.0.71
  remote_user: ansible
  become: yes
  roles:
    - smoke_test

Or target the group of hosts:

---
- name: Smoke Test Playbook
  hosts: monitoring_servers
  remote_user: ansible
  become: yes
  roles:
    - smoke_test

Move into the playbooks/lab/ directory:

cd playbooks/lab/

Copy your SSH key to the target server(s), in this example for the ansible user:

ssh-copy-id [email protected]
ssh-copy-id [email protected]
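
Optionally, verify connectivity first with an ad-hoc ping (run from the playbooks/lab/ directory so the ansible.cfg inventory is used):

ansible all -m ping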

Run the smoke test playbook:

ansible-playbook smoke_test.yml

Working with Facts

It's useful to list facts for when defining conditions on certain tasks; here is an example:

ansible all -m setup -a "filter=ansible_distribution*"
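
A fact such as ansible_distribution can then be used in a when condition on a task; a hypothetical example:

- name: Install chrony only on CentOS hosts
  dnf:
    name: chrony
    state: present
  when: ansible_distribution == "CentOS"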

Group Vars

Create a directory in the lab inventory called group_vars, which will hold parameters that apply to the monitoring_servers group. This group includes 192.168.0.71 and 192.168.0.72, so the parameters will apply to both servers.

mkdir -p inventories/lab/group_vars
vi inventories/lab/group_vars/monitoring_servers.yml
---
message: 'Hello World from group_vars!'

Run the smoke test playbook again and see that the group variables override the defaults:

ansible-playbook smoke_test.yml

Add Roles

I create a role for each component from scratch to keep things as simple and minimalistic as possible. Mature roles are available from Ansible Galaxy, but I still prefer to create them from scratch while getting to grips with things before adopting community versions.

The only steps needed are to create the Python virtual environment with Ansible installed and activated, and to update the following to your needs:

vi inventories/lab/group_vars/monitoring_servers.yml
alertmanager_cluster_peers:
  - '192.168.0.71'
  - '192.168.0.72'

thanos_store_stores:
  - 192.168.0.71:10905
  - 192.168.0.72:10905

thanos_store_sidecars:
  - 192.168.0.71:10901
  - 192.168.0.72:10901

grafana_db_host     : '192.168.0.70:5432'
grafana_db_name     : 'grafana_db'
grafana_db_user     : 'grafana'
grafana_db_password : 'changeme'

Change into the playbooks/lab/ directory, ensure ansible.cfg is correct and run:

ansible-playbook deploy_all.yml

Logging

EFK

Elasticsearch, Fluentd and Kibana

Create a Virtual Machine; this example uses 4 CPU cores, 8GB of memory and 60GB of storage, with bridged networking so the IP Address of the EFK VM is on the same network as my OpenShift 4.6 home lab.

Assuming CentOS 8.2 is installed on the VM, make sure all is up-to-date:

dnf update -y
reboot

Install Java:

dnf install java-11-openjdk-devel -y

Add EPEL:

dnf install epel-release -y

To reduce the steps in this document and remove potential issues, disable both SELinux and firewalld:

vi /etc/sysconfig/selinux
SELINUX=disabled
systemctl stop firewalld
systemctl disable firewalld

Elasticsearch

Add the Elasticsearch repository:

vi /etc/yum.repos.d/elasticsearch.repo
[elasticsearch-7.x]
name=Elasticsearch repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

Import the key:

rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

Install Elasticsearch:

dnf install elasticsearch -y

Back up the original configuration:

cp /etc/elasticsearch/elasticsearch.yml /etc/elasticsearch/elasticsearch.yml.original

Strip out the noise:

grep -v -e '^#' -e '^$' /etc/elasticsearch/elasticsearch.yml.original > /etc/elasticsearch/elasticsearch.yml

Add the following settings to expose Elasticsearch to the network:

cluster.name: my-efk
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
transport.host: localhost
transport.tcp.port: 9300
http.port: 9200
network.host: 0.0.0.0
cluster.initial_master_nodes: node-1

Start and enable the service:

systemctl enable elasticsearch.service --now

Kibana

Install Kibana:

dnf install kibana -y

Back up the original configuration:

cp /etc/kibana/kibana.yml /etc/kibana/kibana.yml.original

Update the configuration for the Elasticsearch host:

vi /etc/kibana/kibana.yml
elasticsearch.hosts: ["http://localhost:9200"]

Start and enable Kibana:

systemctl enable kibana.service --now

NGINX

Install NGINX:

dnf install nginx -y

Create a user name and password for Kibana:

echo "kibana:`openssl passwd -apr1`" | tee -a /etc/nginx/htpasswd.kibana

Back up the original configuration:

cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.original

Add the following configuration:

vi /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
include /usr/share/nginx/modules/*.conf;
events {
    worker_connections 1024;
}
http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request"'
    '$status $body_bytes_sent "$http_referer"'
    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    include /etc/nginx/conf.d/*.conf;
    server {
        listen 80;
        server_name _;
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/htpasswd.kibana;
    location / {
        proxy_pass http://localhost:5601;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        }
    }
}

Start and enable NGINX:

systemctl enable nginx.service --now

Smoke Testing


With all that in place, test that Elasticsearch is up and running; the following should return a JSON response:

curl http://127.0.0.1:9200/_cluster/health?pretty

You should be able to access Kibana via a browser at the IP Address of your instance, in my case http://192.168.0.70.

Once in there, navigate to "Management" → "Stack Management", then under "Kibana" → "Index Patterns" click "Create Index Pattern". This is where you will see various sources to index.

From a command line, PUT some example data:

curl -X PUT "192.168.0.70:9200/characters/_doc/1?pretty" -H 'Content-Type: application/json' -d '{"name": "Mickey Mouse"}'
curl -X PUT "192.168.0.70:9200/characters/_doc/2?pretty" -H 'Content-Type: application/json' -d '{"name": "Daffy Duck"}'
curl -X PUT "192.168.0.70:9200/characters/_doc/3?pretty" -H 'Content-Type: application/json' -d '{"name": "Donald Duck"}'
curl -X PUT "192.168.0.70:9200/characters/_doc/4?pretty" -H 'Content-Type: application/json' -d '{"name": "Bugs Bunny"}'

In Kibana, when you go to "Create Index Pattern" as described before, you should now see that characters has appeared. Type characters*, click "Next step" and create the index pattern. Navigate to "Kibana" → "Discover" and, if you have more than one Index Pattern, select the characters* index from the drop-down menu (near top left); you should see the data you PUT into Elasticsearch.

This is the pattern I use to view and add indexes in Kibana when adding forwarders.

For reference you can return individual results using:

curl -X GET "localhost:9200/characters/_doc/1?pretty"
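
Or return everything in the index with a search:

curl -X GET "localhost:9200/characters/_search?pretty"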

Grafana Loki

Coming Soon

Loki

Coming Soon

Promtail

Coming Soon

OpenShift

Kubernetes

Monitoring

Time to deploy these stacks on OpenShift.

Coming Soon

Logging

Coming Soon

Conclusion