Verified Commit f9d8b69a authored by David Hoese

Add initial prometheus rules and alerts

parent 9e8ac016
@@ -61,9 +61,15 @@ https://github.com/kubernetes/ingress-nginx/tree/master/charts/ingress-nginx
```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install -n kube-system ingress-nginx ingress-nginx/ingress-nginx --set controller.metrics.enabled=true --set controller.metrics.serviceMonitor.enabled=true --set controller.metrics.serviceMonitor.namespace="monitoring" --set controller.metrics.serviceMonitor.additionalLabels.release="prometheus-operator"
```
Note the above includes enabling metric gathering for a Prometheus server.
We enable the metrics endpoint on the controller, then enable the
ServiceMonitor, which is a Prometheus resource that tells Prometheus about the
metrics. We also add an extra label for kubekorner's particular installation
of Prometheus so our ServiceMonitor can be found automatically.
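If everything worked, the new ServiceMonitor should show up in the "monitoring"
namespace with the extra label we added. A quick, optional sanity check
(assuming the Prometheus Operator CRDs are already installed; the exact
resource name depends on the release name):
```bash
# List ServiceMonitor objects in the monitoring namespace along with their labels;
# the ingress-nginx entry should carry release=prometheus-operator.
kubectl -n monitoring get servicemonitors --show-labels

# Or select only ServiceMonitors carrying the label we added via --set.
kubectl -n monitoring get servicemonitors -l release=prometheus-operator
```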
### Local Path Configuration
When running on a K3S-based (rancher) cluster like the one currently running
@@ -285,10 +291,10 @@ Prometheus Operator will install its own custom resource definitions (CRDs)
to allow other applications to create their own ways of interacting with
Prometheus.
To install this on the Kubekorner K3s cluster we will use the
prometheus-community kube-prometheus-stack helm chart maintained by the helm community:
https://github.com/prometheus-community/helm-charts
First we will create a namespace specifically for prometheus:
@@ -296,25 +302,173 @@ First we will create a namespace specifically for prometheus:
kubectl create namespace monitoring
```
If your helm installation doesn't already have the necessary chart
repositories, they can be added by doing:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add stable https://kubernetes-charts.storage.googleapis.com/
helm repo update
```
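As an optional sanity check, we can confirm helm can now see the chart we are
about to install:
```bash
# Search the newly added repositories for the kube-prometheus-stack chart.
helm search repo prometheus-community/kube-prometheus-stack
```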
Then we will install the helm chart in that namespace with the release name
"prometheus-operator".
```bash
helm install -n monitoring prometheus-operator prometheus-community/kube-prometheus-stack
```
Also note at the time of writing this installation results in some warnings:
```
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
```
This is described in a GitHub issue here: https://github.com/helm/charts/issues/17511
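As a rough check that the installation succeeded, we can make sure the chart's
pods start and that the operator's custom resource definitions were registered
(pod names will vary with chart version and release name):
```bash
# All pods created by the chart (operator, prometheus, alertmanager, grafana,
# exporters) should eventually reach the Running state.
kubectl -n monitoring get pods

# The CRDs used later in this document (Prometheus, Alertmanager, ServiceMonitor,
# PrometheusRule, ...) should now exist in the cluster.
kubectl get crds | grep monitoring.coreos.com
```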
### Customizing Prometheus rules
In order to get the most out of Prometheus, it is a good idea to set up rules
for alerts to send to the AlertManager servers created by Prometheus. We can
then configure AlertManager to notify our development team of different
conditions if needed.
First, we need to create a set of rules that we want to be notified about. To
configure these we create one or more `PrometheusRule` objects. Here is an
example:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    app: kube-prometheus-stack
    release: prometheus-operator
  name: prometheus-example-rules
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)
```
This creates an alert called "ExampleAlert" that fires when `expr` is true.
In this case `vector(1)` is effectively always true. The `expr` is
a PromQL query that has access to any metric recorded by Prometheus.
Normally these rules should be automatically picked up by the Prometheus
server(s) by matching `labels`. By default, the Prometheus Operator installed
above will use the name of the helm chart for `app` and the name of the helm
release for `release` to match against.
To check, run:
```bash
$ kubectl -n monitoring get prometheus/prometheus-operator-kube-p-prometheus -o go-template="{{ .spec.ruleSelector }}"
map[matchLabels:map[app:kube-prometheus-stack release:prometheus-operator]]
```
Although a little cryptic, this is showing:
```yaml
matchLabels:
  app: kube-prometheus-stack
  release: prometheus-operator
```
If the above yaml PrometheusRule configuration was stored in a file called `example_rule.yaml`, we could
deploy it by running:
```bash
kubectl create -n monitoring -f example_rule.yaml
```
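We can confirm the object exists before checking Prometheus itself (the name
matches the `metadata.name` in the yaml above):
```bash
# Show the PrometheusRule object we just created in the monitoring namespace.
kubectl -n monitoring get prometheusrules prometheus-example-rules -o yaml
```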
To check whether our rules are showing up in Prometheus we can forward the
service's port to the cluster node and then forward that port to our local
machine with SSH. Note you'll need to use the name of the service in your
installation.
```bash
kubectl -n monitoring port-forward service/prometheus-operated 9995:9090
```
If we go to `http://localhost:9995/alerts` we will see the current alerts
Prometheus is aware of. We can click on "Graph" at the top and try out the
PromQL queries that we might want to use in our other rules.
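The same information is also available from Prometheus' HTTP API over the
forwarded port, which can be useful for scripting. A small sketch (the port
matches the port-forward above):
```bash
# List the rule groups and alerting rules Prometheus has loaded.
curl -s http://localhost:9995/api/v1/rules | python3 -m json.tool

# Run an ad-hoc PromQL query; here the always-true expression from the example rule.
curl -s 'http://localhost:9995/api/v1/query?query=vector(1)' | python3 -m json.tool
```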
We can do a similar check for firing alerts in the alertmanager by forwarding
another port:
```bash
kubectl -n monitoring port-forward service/prometheus-operator-kube-p-alertmanager 9993:9093
```
And going to `http://localhost:9993`.
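Alertmanager exposes an HTTP API on the same port, so active alerts can also be
listed without the web UI. A sketch, assuming the v2 API:
```bash
# List the alerts alertmanager currently knows about over the forwarded port.
curl -s http://localhost:9993/api/v2/alerts | python3 -m json.tool
```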
### Customizing Prometheus Alerts
Now that the rules should have been picked up, we need to configure the
alertmanager to do something when these alerts are fired. The below
instructions are one approach to configuring the alertmanager. The available
methods are changing over time as the prometheus community evolves the helm
chart used above. Other solutions may involve ConfigMap resources or mounting
additional volumes for alertmanager. The below approach is the simplest but
does require "upgrading" the Prometheus Operator installation whenever the
configuration changes.
To configure how alerts are handled by alertmanager we need to modify the
alertmanager configuration. Below we've embedded our alertmanager
configuration in a YAML file that we will provide to our helm chart upgrade
as the new "values" file.
```yaml
alertmanager:
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: "https://hooks.slack.com/services/blah/blah/blah"
    route:
      group_by: ["instance", "severity"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: "null"
      routes:
      - match:
          alertname: ExampleAlert
        receiver: "geosphere-dev-team"
    receivers:
    - name: "null"
    - name: "geosphere-dev-team"
      slack_configs:
      - channel: "#geo2grid"
        text: "summary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"
```
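Before upgrading, the embedded configuration can optionally be syntax-checked
with `amtool`, which ships with alertmanager. The sketch below assumes `yq`
(v4) is available to extract the `alertmanager.config` section from the values
file:
```bash
# Pull only the alertmanager configuration out of the helm values file and
# run amtool's configuration check against it.
yq '.alertmanager.config' custom_prom_values.yaml > /tmp/alertmanager.yaml
amtool check-config /tmp/alertmanager.yaml
```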
To upgrade the prometheus operator installation, assuming the above is in a
file called `custom_prom_values.yaml`:
```bash
helm upgrade --reuse-values -n monitoring -f custom_prom_values.yaml prometheus-operator prometheus-community/kube-prometheus-stack
```
You can verify that the upgrade updated the related secret with:
```bash
kubectl -n monitoring get secrets alertmanager-prometheus-operator-kube-p-alertmanager -o jsonpath="{.data.alertmanager\.yaml}" | base64 -d
```
You should also see the config-reloader for alertmanager eventually pick up
the new config:
```bash
kubectl -n monitoring logs pod/alertmanager-prometheus-operator-kube-p-alertmanager-0 -c config-reloader --tail 50 -f
```
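With the alertmanager port-forward from earlier still running, one way to
exercise the new route end to end is to post a synthetic alert directly to
alertmanager and watch for the Slack notification. This is only a sketch using
alertmanager's v2 API; adjust the labels to match your routing rules:
```bash
# Fire a fake ExampleAlert at alertmanager; it should be routed to the
# geosphere-dev-team receiver and appear in the configured Slack channel.
curl -s -XPOST http://localhost:9993/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "ExampleAlert", "severity": "warning", "instance": "test"},
        "annotations": {"summary": "Test alert", "description": "Manually posted test alert"}}]'
```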
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    app: kube-prometheus-stack
    release: prometheus-operator
  name: prometheus-example-rules
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)
      labels:
        severity: warning
      annotations:
        summary: "Example Alert"
        description: "A test prometheus rule that always fires"
alertmanager:
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: "FIXME: <https://hooks.slack.com/services/...>"
    route:
      group_by: ["instance", "severity"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: "null"
      routes:
      - match_re:
          ruleGroup: "geosphere-.*"
        receiver: "geosphere-dev-team"
    receivers:
    - name: "null"
    - name: "geosphere-dev-team"
      slack_configs:
      - channel: "#geo2grid"
        send_resolved: true
        icon_emoji: '{{ if eq .Status "firing" }}:fearful:{{ else }}:excellent:{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}