(1) Create a DingTalk robot (custom robots can only be added to internal groups)
Click [Group Settings] - [Robots] - [Add Robot].


Select [Custom] - [Add].

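Give the robot a name, select the [Sign] (additional signature) security option, then click [Finish]. Copy the signing secret, which is needed later: SEC75e13f72c3573f501cfe9dc1d84e20532a74924b68fe0536eb4a481029217d91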

Copy the Webhook URL: https://oapi.dingtalk.com/robot/send?access_token=fe670c17883f0190a7a38f0079b463173392ebfe352513f6df9a7e97e196be85
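
For context on what the signing secret is for: in DingTalk's "sign" security mode, each request to the Webhook URL carries a `timestamp` (in milliseconds) and a `sign` parameter. The signature is the HMAC-SHA256 of `timestamp + "\n" + secret`, keyed with the secret, then Base64-encoded and URL-encoded. prometheus-webhook-dingtalk handles this itself once `secret` is configured, so the sketch below is only an illustration of the computation. The function name `signed_webhook_url` is hypothetical, not part of any library.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def signed_webhook_url(webhook_url, secret, timestamp_ms=None):
    """Append DingTalk's `timestamp` and `sign` parameters to a robot Webhook URL.

    Assumes webhook_url already contains `?access_token=...`.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # DingTalk expects milliseconds
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # Base64 first, then URL-encode so that '+', '/' and '=' survive in a query string
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

Passing the Webhook URL and the `SEC...` secret copied above returns a URL that is valid only for a short window around the timestamp, which is why the signature has to be recomputed for every request.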

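(2) Deploy prometheus-webhook-dingtalk (installed as a binary on a host, not deployed into k8s)
prometheus-webhook-dingtalk is a service that receives Alertmanager webhook notifications and forwards them to DingTalk robots. GitHub: https://github.com/timonwong/prometheus-webhook-dingtalk
1. Download and extract the release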
$ wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
$ tar zxf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz -C /opt
$ ln -s /opt/prometheus-webhook-dingtalk-2.0.0.linux-amd64 /opt/prometheus-webhook-dingtalk
2. Define the systemd unit file
$ vi /lib/systemd/system/prometheus-webhook.service
[Unit]
Description=Prometheus Dingding Webhook
[Service]
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
$ vi /opt/prometheus-webhook-dingtalk/config.yml
# Request timeout
timeout: 5s
# Uncomment following line in order to write template from scratch (be careful!)
no_builtin_template: true
# Customizable templates path
templates:
  - /opt/prometheus-webhook-dingtalk/ding.tmpl
# You can also override default template using `default_message`
# The following example to use the 'legacy' template from v0.3.0
default_message:
  title: '{{ template "legacy.title" . }}'
  text: '{{ template "legacy.content" . }}'
# Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=fe670c17883f0190a7a38f0079b463173392ebfe352513f6df9a7e97e196be85
    # secret for signature
    secret: SEC75e13f72c3573f501cfe9dc1d84e20532a74924b68fe0536eb4a481029217d91
    message:
      title: '{{ template "ops.title" . }}'  # title for this webhook (ops.title is defined in the template file shown below)
      text: '{{ template "ops.content" . }}'  # body for this webhook (ops.content is defined in the template file shown below)
3. Define the template file
$ vi /opt/prometheus-webhook-dingtalk/ding.tmpl
{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}
{{ define "__alert_list" }}{{ range . }}
---
**Alert type**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Affected host**: {{ .Labels.instance }}
**Description**: {{ .Annotations.description }}
**Triggered at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
{{ define "__resolved_list" }}{{ range . }}
---
**Alert type**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Affected host**: {{ .Labels.instance }}
**Triggered at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
{{ define "ops.title" }}
{{ template "__subject" . }}
{{ end }}
{{ define "ops.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====Detected {{ .Alerts.Firing | len }} firing alert(s)====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
**====Resolved {{ .Alerts.Resolved | len }} alert(s)====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "ops.link.title" }}{{ template "ops.title" . }}{{ end }}
{{ define "ops.link.content" }}{{ template "ops.content" . }}{{ end }}
{{ template "ops.title" . }}
{{ template "ops.content" . }}
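
A note on the `28800e9` constant in the template: Go durations are counted in nanoseconds, so `28800e9` is 28800 seconds, i.e. 8 hours. Adding it to `.StartsAt` and `.EndsAt` shifts the UTC timestamps sent by Alertmanager to UTC+8 before formatting. The Python sketch below reproduces the same arithmetic; the function name `to_local_display` and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the constant used in the template, in nanoseconds


def to_local_display(starts_at_utc, offset_ns=OFFSET_NS):
    """Shift a UTC timestamp by offset_ns and format it like the template does."""
    shifted = starts_at_utc + timedelta(seconds=offset_ns / 1e9)
    # "%Y-%m-%d %H:%M:%S" corresponds to Go's layout "2006-01-02 15:04:05"
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    sample = datetime(2024, 1, 1, 20, 30, 0, tzinfo=timezone.utc)
    print(OFFSET_NS / 1e9 / 3600)    # hours represented by the constant
    print(to_local_display(sample))  # the same instant shown in UTC+8
```

For a different time zone, the constant would be the offset in seconds multiplied by 1e9.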
Start the service
$ systemctl daemon-reload
$ systemctl enable prometheus-webhook.service
$ systemctl start prometheus-webhook.service
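
Before wiring up Alertmanager, it can help to confirm that the service accepts alerts. The sketch below builds a hand-made payload in Alertmanager's webhook format and posts it to the `webhook1` target. It assumes the service listens on its default port 8060 on 192.168.1.31 (the address used for the Endpoints object in step 4). The alert name, instance label, and description are made-up test values, and both function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(alertname="TestAlert", instance="192.168.1.35:9100"):
    """Build a minimal payload shaped like an Alertmanager webhook notification."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {"alertname": alertname, "severity": "warning", "instance": instance}
    annotations = {"description": "manual test alert"}
    return {
        "version": "4",
        "status": "firing",
        "receiver": "dingtalk_webhook",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": annotations,
                "startsAt": now,
                "endsAt": "0001-01-01T00:00:00Z",  # zero time: not resolved yet
                "generatorURL": "",
            }
        ],
    }


def send_test_alert(url, payload, timeout=5):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_test_alert("http://192.168.1.31:8060/dingtalk/webhook1/send", build_test_payload())` should post a real message to the DingTalk group if the target is configured correctly, so use it sparingly.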
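4. Create the Endpoints object
Because prometheus-webhook-dingtalk runs outside k8s, the most convenient way for pods inside the cluster to reach it is to create an Endpoints object, paired with a Service of the same name.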
$ vi prometheus-webhook-dingtalk.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: dingtalk
subsets:
  - addresses:
      - ip: 192.168.1.31
    ports:
      - port: 8060
---
apiVersion: v1
kind: Service  # Note: this Service needs no selector; its name only has to match the Endpoints name
metadata:
  name: dingtalk
spec:
  ports:
    - port: 8060
Apply it
$ kubectl apply -f prometheus-webhook-dingtalk.yaml
5. Configure Alertmanager
$ vi alertmanager_config.yaml
apiVersion: v1
data:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    templates:
      - '/bitnami/alertmanager/data/template/ding.tmpl'
    receivers:
      - name: 'dingtalk_webhook'
        webhook_configs:
          - url: 'http://dingtalk.default.svc.cluster.local:8060/dingtalk/webhook1/send'
            send_resolved: true
    route:
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'dingtalk_webhook'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: prometheus
    app.kubernetes.io/version: 0.26.0
    helm.sh/chart: prometheus-0.3.2
  name: prometheus-alertmanager
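
The receiver URL in this ConfigMap is assembled from values defined earlier: the Service name `dingtalk`, its namespace, port 8060, and the target name `webhook1` from config.yml. The sketch below spells out that mapping. It assumes the Service was created in the `default` namespace and that the cluster uses the default DNS domain `cluster.local`. The helper name `webhook_url` is hypothetical.

```python
def webhook_url(service, namespace, port, target, cluster_domain="cluster.local"):
    """Compose the in-cluster URL Alertmanager uses to reach a webhook target."""
    # Kubernetes Service DNS name: <service>.<namespace>.svc.<cluster-domain>
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    # prometheus-webhook-dingtalk exposes each target at /dingtalk/<target>/send
    return f"http://{host}:{port}/dingtalk/{target}/send"


if __name__ == "__main__":
    print(webhook_url("dingtalk", "default", 8060, "webhook1"))
```

If a second target were added under `targets:` in config.yml, only the `target` argument would change.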
6. Alertmanager's data directory is mounted from NFS, so /bitnami/alertmanager/data/ maps to a directory on the NFS server. On the NFS server (192.168.1.34), write the template file:
$ cd /data/nfs/default-data-prometheus-alertmanager-0-pvc-105e6608-d0e4-4304-af09-a93b124424fe/template
$ vi ding.tmpl
{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}
{{ define "__alert_list" }}{{ range . }}
---
**Alert type**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Affected host**: {{ .Labels.instance }}
**Description**: {{ .Annotations.description }}
**Triggered at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
{{ define "__resolved_list" }}{{ range . }}
---
**Alert type**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Affected host**: {{ .Labels.instance }}
**Triggered at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
{{ define "ops.title" }}
{{ template "__subject" . }}
{{ end }}
{{ define "ops.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====Detected {{ .Alerts.Firing | len }} firing alert(s)====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
**====Resolved {{ .Alerts.Resolved | len }} alert(s)====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "ops.link.title" }}{{ template "ops.title" . }}{{ end }}
{{ define "ops.link.content" }}{{ template "ops.content" . }}{{ end }}
{{ template "ops.title" . }}
{{ template "ops.content" . }}
7. Re-import the configuration
$ kubectl delete cm prometheus-alertmanager; kubectl apply -f alertmanager_config.yaml
8. On the k8s-master01 node, restart Alertmanager by deleting its pod
$ kubectl get po | grep 'prometheus-alertmanager' | awk '{print $1}' | xargs -I {} kubectl delete po {}
9. On 192.168.1.35, simulate high CPU usage; the command needs to be run twice
$ cat /dev/zero > /dev/null &
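
The background `cat` jobs keep running until they are killed. As a time-bounded alternative, the sketch below busy-loops on one core for a fixed number of seconds and then exits on its own. It is an illustration rather than a replacement for the command above; the function name `burn_cpu` and the 300-second duration are arbitrary choices.

```python
import time


def burn_cpu(seconds):
    """Busy-loop on one core for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work that keeps the core busy
    return iterations


if __name__ == "__main__":
    burn_cpu(300)  # long enough for a typical alert rule to fire; adjust as needed
```

Each process loads one core, so starting it twice mirrors running the `cat` command twice.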
10. Open http://192.168.1.31:31093 in a browser to access Prometheus, and check the alert on the Prometheus alerts page
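
The same check can be scripted against the Prometheus HTTP API, whose `/api/v1/alerts` endpoint returns the currently active alerts. The sketch below assumes the base URL from this step and uses two hypothetical helper names, `fetch_alerts` and `firing_alerts`.

```python
import json
import urllib.request


def fetch_alerts(base_url, timeout=5):
    """Fetch the active alerts from the Prometheus HTTP API as a dict."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as response:
        return json.load(response)


def firing_alerts(api_response):
    """Return (alertname, instance) pairs for alerts in the 'firing' state."""
    alerts = api_response.get("data", {}).get("alerts", [])
    return [
        (alert["labels"].get("alertname"), alert["labels"].get("instance"))
        for alert in alerts
        if alert.get("state") == "firing"
    ]


if __name__ == "__main__":
    for name, instance in firing_alerts(fetch_alerts("http://192.168.1.31:31093")):
        print(name, instance)
```

Alerts still in the `pending` state are filtered out, which matches what Alertmanager would actually forward to DingTalk.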

11. Log in to DingTalk and check the alert message
