Background: when I came in on Monday, the calico-node pods had restarted more than 100 times. The logs showed this warning:

Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused

I. Problem logs

  • Frequent restarts

[root@master ~]# kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aliang-cka web-5dc86dfc-t7nrb 1/1 Running 0 2d16h 10.244.140.72 node02 <none> <none>
calico-apiserver calico-apiserver-bb689689-b5v88 1/1 Running 0 2d19h 10.244.196.131 node01 <none> <none>
calico-apiserver calico-apiserver-bb689689-dwlf4 1/1 Running 0 2d19h 10.244.140.66 node02 <none> <none>
calico-system calico-kube-controllers-58d9bdcc64-tfqgx 1/1 Running 0 2d23h 10.244.219.65 master <none> <none>
calico-system calico-node-dr6ch 1/1 Running 128 (64m ago) 2d23h 192.168.0.12 node01 <none> <none>
calico-system calico-node-lj89c 1/1 Running 140 (2m44s ago) 2d23h 192.168.0.13 node02 <none> <none>
calico-system calico-node-vrz58 1/1 Running 138 (45s ago) 2d23h 192.168.0.11 master <none> <none>
calico-system calico-typha-578cfdc69-95f9b 1/1 Running 167 (2s ago) 2d23h 192.168.0.13 node02 <none> <none>
calico-system calico-typha-578cfdc69-zhffj 1/1 Running 121 (108m ago) 2d23h 192.168.0.12 node01 <none> <none>
calico-system csi-node-driver-5ntdf 2/2 Running 0 2d23h 10.244.219.68 master <none> <none>
calico-system csi-node-driver-9psnp 2/2 Running 0 2d23h 10.244.140.65 node02 <none> <none>
calico-system csi-node-driver-fz67c 2/2 Running 0 2d23h 10.244.196.129 node01 <none> <none>
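
On a busy cluster the restart counts are easy to miss in a listing like the one above. The following is a minimal sketch, not part of the original troubleshooting, that filters `kubectl get pods` table output for pods at or above a restart threshold. The function name `high_restarts` and the default threshold of 10 are assumptions made for illustration.

```shell
#!/bin/sh
# Print "<restarts> <pod-name>" for every pod whose restart count is at least
# the given threshold. Reads `kubectl get pods` table output on stdin.
# NOTE: high_restarts is a hypothetical helper, not a kubectl feature.
high_restarts() {
  threshold="${1:-10}"
  awk -v min="$threshold" '
    NR == 1 {
      # Locate the NAME and RESTARTS columns from the header, so the function
      # works with or without the leading NAMESPACE column.
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "RESTARTS") restarts_col = i
      }
      next
    }
    # RESTARTS may look like "128 (64m ago)"; the first field is the count.
    restarts_col && ($restarts_col + 0) >= (min + 0) {
      print $restarts_col, $name_col
    }
  '
}

# Example (requires cluster access):
#   kubectl get pods -A -o wide | high_restarts 100
```

The column lookup relies on the columns before RESTARTS never containing spaces, which holds for the default table output.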

  • calico-node Events log

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 23m kubelet Readiness probe failed: 2024-07-15 01:27:04.839 [INFO][3310] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 23m kubelet Readiness probe failed: 2024-07-15 01:27:14.839 [INFO][3320] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 20m kubelet Readiness probe failed: 2024-07-15 01:30:24.839 [INFO][3553] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 16m kubelet Readiness probe failed: 2024-07-15 01:34:44.839 [INFO][3867] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 9m (x1666 over 2d18h) kubelet Liveness probe failed: Get "http://localhost:9099/liveness": dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 110s (x3936 over 2d18h) kubelet (combined from similar events): Readiness probe failed: 2024-07-15 01:49:04.836 [INFO][4911] confd/health.go 180: Number of node(s) with BGP peering established = 2

II. Solution:

  • 1. Completely remove the calico-node pods and the rest of the Calico deployment.

# On the master node, delete the Calico pods, services, deployments, and namespaces
kubectl delete -f tigera-operator.yaml
kubectl delete -f custom-resources.yaml

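# If the commands above return errors, check the Calico pods, services, deployments, and namespaces and delete them by hand, i.e. remove everything in the calico-system namespace (substitute your own pod names)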
kubectl delete pod -n calico-system csi-node-driver-jhdvh csi-node-driver-9nmrb csi-node-driver-2w8p8 calico-node-x7spm calico-node-8z8rm calico-node-78ffv
kubectl delete deployment -n calico-system calico-typha calico-kube-controllers
kubectl delete deployment -n calico-apiserver calico-apiserver

kubectl delete svc -n calico-system calico-typha calico-kube-controllers
kubectl delete svc -n calico-apiserver calico-apiserver

kubectl delete ns calico-apiserver
kubectl delete ns calico-system

# As expected, deleting the calico-system namespace hangs: its status changes to Terminating
[root@master ]# kubectl get ns -A
NAME STATUS AGE
calico-system Terminating 3d1h
default Active 3d1h
kube-node-lease Active 3d1h
kube-public Active 3d1h
kube-system Active 3d1h
kubernetes-dashboard Active 2d19h

# How to fix a namespace that will not delete:
# 1. Export the namespace definition first
kubectl get ns calico-system -o json > tmp.json

# 2. Edit the exported file: delete the finalizers entry, leave everything else unchanged, and save.
....
        "resourceVersion": "624892",
        "uid": "fa96ef83-497e-4bc7-a98a-39660e90fd32"
    },
    "spec": {
        "finalizers": [   # remove this finalizers array
            "kubernetes"
        ]
    },
    "status": {
        "phase": "Active"
    }
}
....

# 3. Start a proxy in the current terminal with kubectl proxy
[root@master ]# kubectl proxy
Starting to serve on 127.0.0.1:8001

# 4. Open a second terminal and call the finalize API with curl (no output)
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/calico-system/finalize

# 5. List the namespaces again: calico-system is gone.
[root@master ~]# kubectl get ns -A
NAME STATUS AGE
default Active 3d1h
kube-node-lease Active 3d1h
kube-public Active 3d1h
kube-system Active 3d1h
kubernetes-dashboard Active 2d19h

# 6. Empty /etc/cni/net.d/ on every node, then restart kubelet on every node
rm -rf /etc/cni/net.d/*
systemctl restart kubelet

# 7. The coredns pods restart and go Pending. Calico is now fully removed!
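
Hand-editing the exported JSON to clear the namespace finalizers is easy to get wrong. As a rough sketch, assuming kubectl's pretty-printed JSON layout with one array element per line, the edit could be scripted as follows. The helper name `strip_finalizers` is hypothetical, and it empties the array instead of deleting the key, which should have the same effect on the finalize call.

```shell
#!/bin/sh
# Rewrite every pretty-printed "finalizers": [ ... ] array as an empty array.
# Reads JSON on stdin, writes the edited JSON to stdout.
# NOTE: strip_finalizers is a hypothetical helper; it assumes the multi-line
# layout produced by `kubectl get ns <name> -o json`.
strip_finalizers() {
  awk '
    # Opening line of a multi-line finalizers array: hold it back for now.
    /"finalizers": \[[ \t]*$/ {
      sub(/\[[ \t]*$/, "[]")
      pending = $0
      next
    }
    # Inside the array: drop the entries; on the closing bracket, emit the
    # emptied array and keep a trailing comma if there was one.
    pending != "" {
      if ($0 ~ /^[ \t]*\],?[ \t]*$/) {
        comma = ($0 ~ /,/) ? "," : ""
        print pending comma
        pending = ""
      }
      next
    }
    { print }
  '
}

# Example (requires cluster access):
#   kubectl get ns calico-system -o json | strip_finalizers > tmp.json
#   kubectl replace --raw "/api/v1/namespaces/calico-system/finalize" -f tmp.json
```

The `kubectl replace --raw` call in the example comment is an alternative to the `kubectl proxy` plus `curl` pair and needs only one terminal. Where `jq` is installed, `jq '.spec.finalizers = []'` is a sturdier way to make the same edit.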
  • 2. Rebuild the Calico components

# 2.1 Before rebuilding, check time synchronization on every node; sync any node that is out of step first
ntpdate ntp.aliyun.com

# 2.2 Redeploy Calico
# Download the manifests
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/custom-resources.yaml

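# Change the CIDR in custom-resources.yaml. The default is 192.168.0.0/16; set it to the pod network range used when the cluster was created.
# My cluster was created with 10.244.0.0/16. If your cluster's range matches the default in the official manifest, no change is needed.
....
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16   # change this line
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
....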

# Apply the Calico manifests
kubectl create -f tigera-operator.yaml
kubectl create -f custom-resources.yaml

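# Wait for the pods to start. If the images were not removed earlier, the rebuild is fairly quick; otherwise they are pulled again, which takes a while.
# Rebuild complete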
[root@master calico-operator]# kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-58d9bdcc64-vzm9r 1/1 Running 0 5m15s
calico-node-5p7qf 1/1 Running 0 5m16s
calico-node-9lnmn 1/1 Running 0 5m16s
calico-node-hpxdr 1/1 Running 0 5m16s
calico-typha-65b4547c94-46fll 1/1 Running 0 5m8s
calico-typha-65b4547c94-qb2tx 1/1 Running 0 5m16s
csi-node-driver-jrx88 2/2 Running 0 5m16s
csi-node-driver-kw6d6 2/2 Running 0 5m16s
csi-node-driver-wdhk7 2/2 Running 0 5m16s
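
Instead of checking the listing by eye, the "everything is Running and fully ready" condition can be tested in a script. This is a small sketch under the assumption that the input is the default `kubectl get pods` table; the function name `all_pods_ready` is made up for illustration.

```shell
#!/bin/sh
# Succeed (exit 0) only if every pod row on stdin has STATUS Running and a
# READY value of n/n. Reads `kubectl get pods` table output.
# NOTE: all_pods_ready is a hypothetical helper, not a kubectl feature.
all_pods_ready() {
  awk '
    NR == 1 {
      # Find the READY and STATUS columns from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "READY") ready_col = i
        if ($i == "STATUS") status_col = i
      }
      next
    }
    NF == 0 { next }
    {
      rows++
      split($ready_col, parts, "/")
      if ($status_col != "Running" || parts[1] != parts[2]) not_ready++
    }
    END {
      # An empty listing counts as "not ready".
      if (rows > 0 && not_ready == 0) exit 0
      exit 1
    }
  '
}

# Example (requires cluster access):
#   kubectl get pods -n calico-system | all_pods_ready && echo "calico-system is healthy"
```

To block until the pods come up rather than polling, `kubectl wait --for=condition=Ready pods --all -n calico-system --timeout=300s` is another option.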