I. Fast Service Recovery After a Node Failure

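Note: when a node fails, a k8s cluster by default waits 5 minutes (tolerationSeconds: 300) before evicting the pods on that node, and only then are they rescheduled onto other nodes.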

1.1 Environment preparation

1. Remove the taint from the node02 node

[root@k8s-master01 ~]# k taint node k8s-node02 ingress-
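
The transcripts in this walkthrough use the short commands k, kaf, kgp and kg. The original does not define them; they are presumably shell aliases for kubectl. One possible set of definitions, given purely as an assumption:

```shell
# Assumed alias definitions (not shown in the original walkthrough).
alias k='kubectl'              # k taint node ...  -> kubectl taint node ...
alias kg='kubectl get'         # kg node           -> kubectl get node
alias kgp='kubectl get pods'   # kgp -owide        -> kubectl get pods -owide
alias kaf='kubectl apply -f'   # kaf file.yaml     -> kubectl apply -f file.yaml
```

With these in place, the command above expands to kubectl taint node k8s-node02 ingress-, where the trailing "-" removes every taint whose key is "ingress".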

2. Create a test application

[root@k8s-master01 ~]# vim  test-deploy.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
  labels:
    app: test-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deploy
  template:
    metadata:
      labels:
        app: test-deploy
    spec:
      containers:
      - name: nginx
        image: registry.cn-hangzhou.aliyuncs.com/zq-demo/nginx:1.14.2

Apply it

[root@k8s-master01 ~]# kaf  test-deploy.yaml 

1.2 Fast service recovery after a node failure

1. Confirm that the test application is running on node02

[root@k8s-master01 ~]# kgp -owide
NAME                          READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
test-deploy-77d76d744-pg948   1/1     Running   0          2m58s   192.168.58.195   k8s-node02   <none>           <none>

2. Shut down the node02 host to simulate a node02 failure

[root@k8s-node02 ~]# shutdown -h now

3. On the master node, confirm that the node status has changed to NotReady

[root@k8s-master01 ~]# kg node | grep node02
k8s-node02     NotReady   <none>          10d   v1.32.3
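
When a node stops reporting its status, the node controller also taints it (for an unreachable node, with the key node.kubernetes.io/unreachable), and it is this taint that later triggers eviction. A minimal sketch for listing the taints on a node; the helper name show_node_taints is hypothetical:

```shell
# Hypothetical helper: print each taint on a node as "key:effect".
# Defined as a function so nothing runs until it is called.
show_node_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{"\n"}{end}'
}

# Usage (requires access to the cluster):
#   show_node_taints k8s-node02
```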

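Now check the pod on node02 again. Although the host is down, the pod is still listed as Running on node02, because the default wait before pods are evicted from a failed node is 5 minutes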

[root@k8s-master01 ~]# kgp  -owide
NAME                          READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
test-deploy-77d76d744-pg948   1/1     Running   0          30m   192.168.58.195   k8s-node02   <none>           <none>
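
The 5-minute wait is not stored in the Deployment manifest. For pods that do not declare their own tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable, Kubernetes normally adds them with tolerationSeconds: 300. A small sketch for inspecting what a pod actually carries; the helper name show_pod_tolerations is hypothetical and the label selector assumes the app label from test-deploy.yaml:

```shell
# Hypothetical helper: print the tolerations of all pods with label app=<name>.
# Defined as a function so nothing runs until it is called.
show_pod_tolerations() {
  kubectl get pods -l "app=$1" -o jsonpath='{.items[*].spec.tolerations}'
}

# Usage (requires access to the cluster):
#   show_pod_tolerations test-deploy
```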

4. Power node02 back on and confirm that the node status returns to Ready

[root@k8s-master01 ~]# kg node  | grep node02 
k8s-node02     Ready    <none>          10d   v1.32.3

5. Add the tolerationSeconds parameter to the test application running on node02 to shorten the eviction wait to 10s

Add the following configuration under spec.template.spec

      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
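
Note that tolerationSeconds is counted from the moment the taint is added to the node, not from the moment the host goes down. A rough estimate of the total delay before eviction therefore adds the node controller's detection delay (the kube-controller-manager node-monitor-grace-period setting) to tolerationSeconds. The value 40 below is only an assumed example; check the setting in your own cluster:

```shell
# Rough estimate only: detection delay + tolerationSeconds.
# $1 = seconds before the node is marked NotReady (assumed input)
# $2 = tolerationSeconds configured on the pod
estimate_failover_seconds() {
  grace_period="$1"
  toleration_seconds="$2"
  echo $(( grace_period + toleration_seconds ))
}

estimate_failover_seconds 40 10    # prints 50
```

Under the same assumption, the default tolerationSeconds of 300 gives about 340 seconds.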

Complete configuration file

[root@k8s-master01 ~]# vim  test-deploy.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
  labels:
    app: test-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deploy
  template:
    metadata:
      labels:
        app: test-deploy
    spec:
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      containers:
      - name: nginx
        image: registry.cn-hangzhou.aliyuncs.com/zq-demo/nginx:1.14.2

Re-apply

[root@k8s-master01 ~]# kaf  test-deploy.yaml 
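
Changing the pod template triggers a rolling update, so the pod is recreated. Before shutting the node down again, it may be worth confirming that the rollout has finished and that the new tolerations are present. A minimal sketch; the helper name verify_rollout and the 60s timeout are arbitrary choices:

```shell
# Hypothetical helper: wait for the rollout, then print the tolerations
# stored in the Deployment's pod template.
# Defined as a function so nothing runs until it is called.
verify_rollout() {
  kubectl rollout status "deployment/$1" --timeout=60s &&
    kubectl get deployment "$1" -o jsonpath='{.spec.template.spec.tolerations}'
}

# Usage (requires access to the cluster):
#   verify_rollout test-deploy
```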

6. Simulate the node failure again

[root@k8s-node02 ~]# shutdown -h now

Check the node and confirm that its status has changed to NotReady

[root@k8s-master01 ~]# kg node | grep node02
k8s-node02     NotReady   <none>          10d   v1.32.3

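Check the test application: a new pod has been scheduled onto node01, and the old pod on node02 is in the Terminating state, marked for deletion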

[root@k8s-master01 ~]# kgp -owide
NAME                           READY   STATUS        RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
test-deploy-67df4dfb6b-dkx22   1/1     Running       0          3m1s    192.168.85.217   k8s-node01   <none>           <none>
test-deploy-7bb955dd46-s4f4h   1/1     Terminating   0          6m35s   192.168.58.197   k8s-node02   <none>           <none>
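
The old pod stays in Terminating because the kubelet on the powered-off node cannot confirm that the container has stopped; it normally disappears once node02 comes back. If the node is known to be permanently gone, the pod object can be removed by force. A sketch of that option, to be used with care since it skips confirmation from the node; the helper name force_delete_pod is hypothetical:

```shell
# Hypothetical helper: force-delete a pod stuck in Terminating.
# Only appropriate when the node is known to be down for good.
# Defined as a function so nothing runs until it is called.
force_delete_pod() {
  kubectl delete pod "$1" --grace-period=0 --force
}

# Usage (requires access to the cluster):
#   force_delete_pod test-deploy-7bb955dd46-s4f4h
```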