(1) Create a DingTalk robot (robots can only be added to internal groups)

Click [Group Settings] - [Robot] - [Add Robot].

Select [Custom] - [Add].

Give the robot a name, select [Sign], then click [Finish]. Copy the signing secret: SEC75e13f72c3573f501cfe9dc1d84e20532a74924b68fe0536eb4a481029217d91

Copy the Webhook URL: https://oapi.dingtalk.com/robot/send?access_token=fe670c17883f0190a7a38f0079b463173392ebfe352513f6df9a7e97e196be85
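
For a robot with signing enabled, each request to the Webhook URL must carry a `timestamp` and a `sign` query parameter derived from the signing secret. prometheus-webhook-dingtalk is expected to do this itself once `secret` is set in its config, so the sketch below is only meant to show what the secret is used for. The function names `dingtalk_sign` and `urlencode_b64` are made up for this illustration, the secret is a placeholder, and the script assumes `openssl`, `base64` and `sed` are installed.

```shell
#!/bin/sh
# Sketch only: reproduces DingTalk's signing scheme for a sign-protected robot.
# Assumes openssl, base64 and sed are installed.

# dingtalk_sign TIMESTAMP_MS SECRET
# Prints base64(HMAC-SHA256(key=SECRET, message="TIMESTAMP_MS\nSECRET")).
dingtalk_sign() {
  printf '%s\n%s' "$1" "$2" \
    | openssl dgst -sha256 -hmac "$2" -binary \
    | base64
}

# urlencode_b64 STRING
# Percent-encodes '+', '/' and '=', the only base64 characters
# that are unsafe inside a query string.
urlencode_b64() {
  printf '%s' "$1" | sed -e 's/+/%2B/g' -e 's|/|%2F|g' -e 's/=/%3D/g'
}

# Example: build the query-string suffix for a signed request.
secret='SECxxxxxxxx'                     # placeholder, not a real secret
timestamp=$(( $(date +%s) * 1000 ))      # current time in milliseconds
sign=$(urlencode_b64 "$(dingtalk_sign "$timestamp" "$secret")")
echo "&timestamp=${timestamp}&sign=${sign}"
```

The printed suffix would be appended to the Webhook URL after the `access_token` parameter.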

(2) Deploy prometheus-webhook-dingtalk (installed from the binary release on a host, not deployed inside k8s)

prometheus-webhook-dingtalk is a plugin that delivers Alertmanager alerts to DingTalk. GitHub: https://github.com/timonwong/prometheus-webhook-dingtalk

$ wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
$ tar zxf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz -C /opt
$ ln -s /opt/prometheus-webhook-dingtalk-2.0.0.linux-amd64  /opt/prometheus-webhook-dingtalk

2、Define the systemd unit file

$ vi /lib/systemd/system/prometheus-webhook.service

[Unit]
Description=Prometheus Dingding Webhook
[Service]
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target

$ vi /opt/prometheus-webhook-dingtalk/config.yml

# Request timeout
timeout: 5s

# Uncomment following line in order to write template from scratch (be careful!)
no_builtin_template: true

# Customizable templates path
templates:
  - /opt/prometheus-webhook-dingtalk/ding.tmpl

# You can also override default template using `default_message`
# The following example to use the 'legacy' template from v0.3.0
default_message:
  title: '{{ template "legacy.title" . }}'
  text: '{{ template "legacy.content" . }}'

# Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=fe670c17883f0190a7a38f0079b463173392ebfe352513f6df9a7e97e196be85
    # secret for signature
    secret: SEC75e13f72c3573f501cfe9dc1d84e20532a74924b68fe0536eb4a481029217d91

    message:
      title: '{{ template "ops.title" . }}'   # title template for this webhook ("ops.title" is defined in the template file shown below)
      text: '{{ template "ops.content" . }}'  # body template for this webhook ("ops.content" is defined in the template file shown below)

3、Define the template file

$ vi /opt/prometheus-webhook-dingtalk/ding.tmpl

{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}

{{ define "__alert_list" }}{{ range . }}
---
    **Alert type**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Instance**: {{ .Labels.instance }}
    **Description**: {{ .Annotations.description }}
    **Fired at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}
---
    **Alert type**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Instance**: {{ .Labels.instance }}
    **Fired at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "ops.title" }}
{{ template "__subject" . }}
{{ end }}

{{ define "ops.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====Detected {{ .Alerts.Firing | len  }} firing alert(s)====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
**====Resolved {{ .Alerts.Resolved | len  }} alert(s)====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

{{ define "ops.link.title" }}{{ template "ops.title" . }}{{ end }}
{{ define "ops.link.content" }}{{ template "ops.content" . }}{{ end }}
{{ template "ops.title" . }}
{{ template "ops.content" . }}
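
The expression `.StartsAt.Add 28800e9` in the template shifts the alert's UTC timestamp by 28800e9 nanoseconds before formatting it. The arithmetic below shows that this is an 8-hour offset, presumably chosen to display UTC+8 local time. The helper name `shift_utc8` is hypothetical, and the `date -d @EPOCH` form assumes GNU date.

```shell
#!/bin/sh
# 28800e9 is a duration in nanoseconds; convert it to hours.
offset_ns=28800000000000
offset_hours=$(( offset_ns / 1000000000 / 3600 ))
echo "offset: ${offset_hours} hours"

# shift_utc8 EPOCH_SECONDS
# Formats a UTC epoch timestamp shifted by +8 hours, mirroring
# (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" in the template.
# Assumes GNU date (the -d @EPOCH form).
shift_utc8() {
  date -u -d "@$(( $1 + 28800 ))" '+%Y-%m-%d %H:%M:%S'
}

# The Unix epoch (1970-01-01 00:00:00 UTC) shifted by 8 hours.
shift_utc8 0
```

A fixed offset like this ignores the host's time zone settings, which keeps the message format independent of where the webhook runs.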

Start the service

$ systemctl daemon-reload
$ systemctl enable prometheus-webhook.service
$ systemctl start prometheus-webhook.service
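
Once the service is running, it can be checked without Alertmanager by posting a hand-written alert to the `webhook1` target. The sketch below builds a payload in the shape of Alertmanager's webhook format (version 4) and posts it with `curl`. The function names, the `ManualTest` alert and the example URLs in the payload are invented for this test, and the script assumes the service listens on 127.0.0.1:8060 on the host where it is run.

```shell
#!/bin/sh
# build_test_alert: print a minimal Alertmanager-style webhook payload.
# All label and annotation values are made up for this manual test.
build_test_alert() {
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  cat <<EOF
{
  "version": "4",
  "status": "firing",
  "receiver": "dingtalk_webhook",
  "groupLabels": {"alertname": "ManualTest"},
  "commonLabels": {"alertname": "ManualTest", "severity": "warning"},
  "commonAnnotations": {},
  "externalURL": "http://alertmanager.example",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "ManualTest", "severity": "warning", "instance": "192.168.1.35"},
      "annotations": {"description": "manual test alert"},
      "startsAt": "${now}",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus.example"
    }
  ]
}
EOF
}

# send_test_alert [URL]: post the payload to the webhook1 target.
# The default URL assumes the service runs locally on port 8060.
send_test_alert() {
  url="${1:-http://127.0.0.1:8060/dingtalk/webhook1/send}"
  build_test_alert \
    | curl -sS -X POST -H 'Content-Type: application/json' --data-binary @- "$url"
}
```

Running `send_test_alert` on the webhook host should make a message appear in the DingTalk group if the URL, secret and template are all correct.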

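4、Create the Endpoints object

Because prometheus-webhook-dingtalk runs outside k8s, the most convenient way to let pods in the cluster reach it is to create an Endpoints object together with a Service of the same name.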

$ vi  prometheus-webhook-dingtalk.yaml

apiVersion: v1
kind: Endpoints
metadata:
  name: dingtalk
subsets:
  - addresses:
    - ip: 192.168.1.31
    ports:
      - port: 8060

---
apiVersion: v1
kind: Service  # Note: this Service needs no selector; its name only has to match the Endpoints name
metadata:
  name: dingtalk
spec:
  ports:
    - port: 8060

Apply it

$ kubectl apply -f prometheus-webhook-dingtalk.yaml

5、Configure Alertmanager

$ vi alertmanager_config.yaml

apiVersion: v1
data:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    templates:
      - '/bitnami/alertmanager/data/template/ding.tmpl'
    receivers:
      - name: 'dingtalk_webhook'
        webhook_configs:
        - url: 'http://dingtalk.default.svc.cluster.local:8060/dingtalk/webhook1/send'
          send_resolved: true
    route:
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'dingtalk_webhook'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: prometheus
    app.kubernetes.io/version: 0.26.0
    helm.sh/chart: prometheus-0.3.2
  name: prometheus-alertmanager
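
The receiver URL in this config is assembled from the Service name, its namespace, the port, and the target name from the webhook's config.yml. A small sketch of that composition follows; the function name `dingtalk_webhook_url` is hypothetical, and `cluster.local` is assumed to be the cluster domain.

```shell
#!/bin/sh
# dingtalk_webhook_url SERVICE NAMESPACE PORT TARGET
# Builds the in-cluster URL that Alertmanager posts to.
# Assumes the default cluster domain "cluster.local".
dingtalk_webhook_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/dingtalk/%s/send\n' "$1" "$2" "$3" "$4"
}

# Service "dingtalk" in namespace "default", port 8060, target "webhook1".
dingtalk_webhook_url dingtalk default 8060 webhook1
```

If the Service were created in another namespace, or the target renamed in config.yml, the URL in `webhook_configs` would need to change accordingly.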

6、Alertmanager's data directory is mounted from NFS, so /bitnami/alertmanager/data/ maps to a directory on the NFS server. On the NFS server, 192.168.1.34, write the template file:

$ cd /data/nfs/default-data-prometheus-alertmanager-0-pvc-105e6608-d0e4-4304-af09-a93b124424fe/template
$ vi ding.tmpl

{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}

{{ define "__alert_list" }}{{ range . }}
---
    **Alert type**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Instance**: {{ .Labels.instance }}
    **Description**: {{ .Annotations.description }}
    **Fired at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}
---
    **Alert type**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Instance**: {{ .Labels.instance }}
    **Fired at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "ops.title" }}
{{ template "__subject" . }}
{{ end }}

{{ define "ops.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====Detected {{ .Alerts.Firing | len  }} firing alert(s)====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
**====Resolved {{ .Alerts.Resolved | len  }} alert(s)====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

{{ define "ops.link.title" }}{{ template "ops.title" . }}{{ end }}
{{ define "ops.link.content" }}{{ template "ops.content" . }}{{ end }}
{{ template "ops.title" . }}
{{ template "ops.content" . }}

7、Re-import the configuration

$ kubectl delete cm prometheus-alertmanager; kubectl apply -f alertmanager_config.yaml

8、On the k8s-master01 node, restart Alertmanager by deleting its pod

$ kubectl get po | grep 'prometheus-alertmanager' | awk '{print $1}' | xargs -I{} kubectl delete po {}

9、On 192.168.1.35, simulate high CPU usage; the command needs to be run twice

$ cat /dev/zero > /dev/null  &
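
The command above has to be run twice, presumably once per CPU core so that overall usage crosses the alert threshold. That reading is an assumption, as are the helper names `start_cpu_load` and `stop_cpu_load` in the sketch below, which starts a chosen number of busy loops and cleans them up afterwards.

```shell
#!/bin/sh
# PIDs of the busy loops started by start_cpu_load.
LOAD_PIDS=""

# start_cpu_load N: start N busy loops in the background (default 2).
start_cpu_load() {
  n="${1:-2}"
  i=0
  while [ "$i" -lt "$n" ]; do
    cat /dev/zero > /dev/null &
    LOAD_PIDS="$LOAD_PIDS $!"
    i=$(( i + 1 ))
  done
}

# stop_cpu_load: kill every busy loop started above.
stop_cpu_load() {
  for pid in $LOAD_PIDS; do
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
  done
  LOAD_PIDS=""
}

# Make sure the load generators do not outlive this shell.
trap stop_cpu_load EXIT
```

A typical session would call `start_cpu_load 2`, wait for the alert to fire and reach DingTalk, then call `stop_cpu_load` and wait for the resolved notification.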

10、Open http://192.168.1.31:31093 in a browser to access Prometheus, and check the alert on the Prometheus page

11、Log in to DingTalk and check the alert message
