Rebuilding the Calico Component in K8s

Background: on Monday morning I found that the calico-node pods had restarted more than 100 times. The logs showed the following warning:

Number of node(s) with BGP peering established = 2 calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused

I. Problem logs

Frequent restarts

[root@master ~]# kubectl get pods -A -o wide
NAMESPACE              NAME                                         READY   STATUS    RESTARTS          AGE     IP               NODE     NOMINATED NODE   READINESS GATES
aliang-cka             web-5dc86dfc-t7nrb                           1/1     Running   0                 2d16h   10.244.140.72    node02   <none>           <none>
calico-apiserver       calico-apiserver-bb689689-b5v88              1/1     Running   0                 2d19h   10.244.196.131   node01   <none>           <none>
calico-apiserver       calico-apiserver-bb689689-dwlf4              1/1     Running   0                 2d19h   10.244.140.66    node02   <none>           <none>
calico-system          calico-kube-controllers-58d9bdcc64-tfqgx     1/1     Running   0                 2d23h   10.244.219.65    master   <none>           <none>
calico-system          calico-node-dr6ch                            1/1     Running   128 (64m ago)     2d23h   192.168.0.12     node01   <none>           <none>
calico-system          calico-node-lj89c                            1/1     Running   140 (2m44s ago)   2d23h   192.168.0.13     node02   <none>           <none>
calico-system          calico-node-vrz58                            1/1     Running   138 (45s ago)     2d23h   192.168.0.11     master   <none>           <none>
calico-system          calico-typha-578cfdc69-95f9b                 1/1     Running   167 (2s ago)      2d23h   192.168.0.13     node02   <none>           <none>
calico-system          calico-typha-578cfdc69-zhffj                 1/1     Running   121 (108m ago)    2d23h   192.168.0.12     node01   <none>           <none>
calico-system          csi-node-driver-5ntdf                        2/2     Running   0                 2d23h   10.244.219.68    master   <none>           <none>
calico-system          csi-node-driver-9psnp                        2/2     Running   0                 2d23h   10.244.140.65    node02   <none>           <none>
calico-system          csi-node-driver-fz67c                        2/2     Running   0                 2d23h   10.244.196.129   node01   <none>           <none>
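With a listing this long it can help to filter for the pods that are actually restarting. The following is a minimal sketch, not part of the original procedure; the function name `high_restarts` and the default threshold are assumptions made for illustration. It relies only on the column order shown above (NAMESPACE, NAME, READY, STATUS, RESTARTS).

```shell
# List pods whose restart count exceeds a threshold (hypothetical helper).
# Reads `kubectl get pods -A` style output on stdin; the fifth column is RESTARTS.
high_restarts() {
  threshold="${1:-10}"
  awk -v t="$threshold" 'NR > 1 && $5 + 0 > t { print $1 "/" $2, $5 }'
}

# Usage against a live cluster:
#   kubectl get pods -A | high_restarts 100
```

The header line is skipped with `NR > 1`, and a value such as `128 (64m ago)` still parses because awk splits on whitespace and only the leading number lands in the fifth field.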

calico-node Events log

Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  23m   kubelet  Readiness probe failed: 2024-07-15 01:27:04.839 [INFO][3310] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  23m  kubelet  Readiness probe failed: 2024-07-15 01:27:14.839 [INFO][3320] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  20m  kubelet  Readiness probe failed: 2024-07-15 01:30:24.839 [INFO][3553] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  16m  kubelet  Readiness probe failed: 2024-07-15 01:34:44.839 [INFO][3867] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  9m (x1666 over 2d18h)    kubelet  Liveness probe failed: Get "http://localhost:9099/liveness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  110s (x3936 over 2d18h)  kubelet  (combined from similar events): Readiness probe failed: 2024-07-15 01:49:04.836 [INFO][4911] confd/health.go 180: Number of node(s) with BGP peering established = 2

II. Solution:

1. Completely remove the Calico pods and services.

# On the master node, delete the Calico-related pods, services, deployments, and namespaces
kubectl delete -f tigera-operator.yaml
kubectl delete -f custom-resources.yaml

# If the commands above return an Error, check for leftover Calico pods, services, deployments, and namespaces and delete them by hand, i.e. remove everything under the calico-system namespace (use the pod names from your own cluster)
kubectl delete pod -n calico-system csi-node-driver-jhdvh csi-node-driver-9nmrb csi-node-driver-2w8p8 calico-node-x7spm calico-node-8z8rm calico-node-78ffv
kubectl delete deployment -n calico-system calico-typha calico-kube-controllers
kubectl delete deployment -n calico-apiserver calico-apiserver

kubectl delete svc -n calico-system calico-typha calico-kube-controllers
kubectl delete svc -n calico-apiserver calico-apiserver

kubectl delete ns calico-apiserver
kubectl delete ns calico-system

# More often than not, the calico-system namespace cannot be deleted at this point and stays stuck in the Terminating state
[root@master ]# kubectl get ns -A
NAME                   STATUS        AGE
calico-system          Terminating   3d1h
default                Active        3d1h
kube-node-lease        Active        3d1h
kube-public            Active        3d1h
kube-system            Active        3d1h
kubernetes-dashboard   Active        2d19h

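# How to fix a namespace that will not delete:
# 1. Export the namespace definition first
kubectl get ns calico-system -o json > tmp.json

# 2. Edit the exported file: delete the finalizers entry, leave everything else unchanged, and save.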
....
        "resourceVersion": "624892",
        "uid": "fa96ef83-497e-4bc7-a98a-39660e90fd32"
    },
    "spec": {
        "finalizers": [   # delete this finalizers array
            "kubernetes"  
        ]
    },
    "status": {
        "phase": "Active"
    }
}
....
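The manual edit in step 2 can also be scripted. The sketch below is an assumption-laden illustration rather than the original method: the helper name `strip_finalizers` is made up, and it expects the pretty-printed layout that `kubectl get ... -o json` produces. Instead of deleting the key it replaces the array with an empty one, which has the same effect for the finalize call and keeps the JSON valid. If `jq` is installed, `jq '.spec.finalizers = []' tmp.json` is the more robust choice.

```shell
# Print the JSON file given as $1 with every "finalizers" array emptied.
# Hypothetical helper; assumes kubectl's pretty-printed JSON layout.
strip_finalizers() {
  awk '
    /"finalizers"[ \t]*:[ \t]*\[/ {
        # Remember the indentation of the key and start skipping the array.
        match($0, /^[ \t]*/)
        indent = substr($0, 1, RLENGTH)
        in_array = 1
    }
    in_array {
        if ($0 ~ /\]/) {
            # Closing bracket reached: emit an empty array, keep a trailing comma if present.
            comma = ($0 ~ /\],[ \t]*$/) ? "," : ""
            printf "%s\"finalizers\": []%s\n", indent, comma
            in_array = 0
        }
        next
    }
    { print }
  ' "$1"
}

# Usage:
#   strip_finalizers tmp.json > tmp.clean.json
```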

# 3. Start a proxy in the current terminal with kubectl proxy
[root@master ]# kubectl proxy
Starting to serve on 127.0.0.1:8001

# 4. Open a second terminal and call the API with curl to finish the deletion (no output in my case)
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/calico-system/finalize
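As an alternative to running the proxy and curl in two terminals, the same PUT can be sent through kubectl itself with `kubectl replace --raw`. This is a sketch under the assumption that tmp.json has already been edited as in step 2; the function names `finalize_path` and `finalize_namespace` are hypothetical.

```shell
# Build the API path of the finalize subresource for a namespace.
finalize_path() {
  printf '/api/v1/namespaces/%s/finalize\n' "$1"
}

# PUT the edited namespace JSON straight to the API server, no proxy needed.
# Arguments: namespace name, path to the edited JSON file.
finalize_namespace() {
  kubectl replace --raw "$(finalize_path "$1")" -f "$2"
}

# Usage:
#   finalize_namespace calico-system tmp.json
```

Because kubectl reuses the credentials from the current kubeconfig, there is no need to keep a second terminal open.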

# 5. Check the namespaces again: calico-system is gone.
[root@master ~]# kubectl get ns -A
NAME                   STATUS   AGE
default                Active   3d1h
kube-node-lease        Active   3d1h
kube-public            Active   3d1h
kube-system            Active   3d1h
kubernetes-dashboard   Active   2d19h

# 6. Empty the /etc/cni/net.d/ directory on every node, then restart kubelet on every node
rm -rf /etc/cni/net.d/*
systemctl restart kubelet
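Since step 6 has to run on every node, a small loop saves logging in to each machine. This is a sketch only: it assumes passwordless SSH as root to the three node names seen in the listings above (master, node01, node02), and the `DRY_RUN` switch and function name are inventions for illustration.

```shell
# Node names as they appear in this cluster; adjust for your own.
NODES="master node01 node02"

# Run the CNI cleanup and kubelet restart on every node over SSH.
# With DRY_RUN=1 the commands are printed instead of executed.
clean_cni_all_nodes() {
  for node in $NODES; do
    cmd='rm -rf /etc/cni/net.d/* && systemctl restart kubelet'
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh root@$node \"$cmd\""
    else
      ssh "root@$node" "$cmd"
    fi
  done
}

# Usage:
#   DRY_RUN=1 clean_cni_all_nodes   # preview
#   clean_cni_all_nodes             # execute
```

Previewing with `DRY_RUN=1` first is worthwhile here because the command deletes files.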

# 7. The coredns pods will restart and go into the Pending state. Calico removal is complete!

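2. Rebuild the Calico components

# 2.1 Before rebuilding, check time synchronization on every node; any node that is out of sync must be synchronized first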
ntpdate ntp.aliyun.com
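To see whether a node actually needs the sync, the clocks can be compared first. The helper below is a hypothetical sketch: the name `skew_ok` and the 2-second default tolerance are assumptions, not values taken from the Calico documentation.

```shell
# Succeed if two epoch timestamps differ by at most max seconds (default 2).
# Arguments: timestamp A, timestamp B, optional max skew in seconds.
skew_ok() {
  a="$1"; b="$2"; max="${3:-2}"
  diff=$(( a - b ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  [ "$diff" -le "$max" ]
}

# Usage (assumes SSH access to the nodes):
#   for node in node01 node02; do
#     skew_ok "$(date +%s)" "$(ssh "root@$node" date +%s)" || echo "$node clock is off"
#   done
```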

# 2.2 Rebuild the Calico services
# Download the manifests
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/custom-resources.yaml

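# In custom-resources.yaml, change the CIDR. The default is 192.168.0.0/16; set it to the pod network range used when the cluster was created.
# I created this cluster with 10.244.0.0/16. If your cluster's range matches the default in the official manifest, no change is needed.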
....
calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16  # change this
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
....
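The CIDR change can be made with sed instead of an editor. A minimal sketch, assuming the manifest contains a single `cidr:` line as in the excerpt above; the function name `set_pool_cidr` is hypothetical. The original file is kept next to it with a `.bak` suffix.

```shell
# Replace the pool CIDR in a manifest; the original is kept as <file>.bak.
# Arguments: manifest path, new CIDR (for example 10.244.0.0/16).
set_pool_cidr() {
  sed -i.bak -E "s#^([[:space:]]*cidr:[[:space:]]*)[0-9./]+#\1$2#" "$1"
}

# Usage:
#   set_pool_cidr custom-resources.yaml 10.244.0.0/16
```

Only the address part of the line is rewritten, so the indentation, which matters in YAML, stays as it was.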

# Apply the Calico manifests
kubectl create -f tigera-operator.yaml
kubectl create -f custom-resources.yaml

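# Wait for the pods to start. If the old images were not removed, the rebuild is fairly quick; otherwise the images are pulled again, which takes a while.
# Rebuild complete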
[root@master calico-operator]# kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-58d9bdcc64-vzm9r   1/1     Running   0          5m15s
calico-node-5p7qf                          1/1     Running   0          5m16s
calico-node-9lnmn                          1/1     Running   0          5m16s
calico-node-hpxdr                          1/1     Running   0          5m16s
calico-typha-65b4547c94-46fll              1/1     Running   0          5m8s
calico-typha-65b4547c94-qb2tx              1/1     Running   0          5m16s
csi-node-driver-jrx88                      2/2     Running   0          5m16s
csi-node-driver-kw6d6                      2/2     Running   0          5m16s
csi-node-driver-wdhk7                      2/2     Running   0          5m16s
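Rather than re-running the listing by hand until everything is up, the check can be automated. The sketch below parses the same output format as above; the function name `all_ready` is an assumption. kubectl also has a built-in for this: `kubectl wait --for=condition=Ready pods --all -n calico-system --timeout=300s`.

```shell
# Succeed only if every pod line on stdin is Running with all containers ready.
# Reads `kubectl get pods -n calico-system` output (NAME READY STATUS ...).
# Note: empty input (no pods yet) also counts as success.
all_ready() {
  awk 'NR > 1 {
         split($2, r, "/")
         if (r[1] != r[2] || $3 != "Running") bad = 1
       }
       END { exit bad + 0 }'
}

# Usage: poll until the rebuild has settled
#   until kubectl get pods -n calico-system | all_ready; do sleep 5; done
```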
