
Resolving Kubernetes Cluster Startup Failures

Because of the recent HVV exercise and the National Day holiday in between, the servers were shut down for a while. After powering them back on, containerd on the master node refused to start. Below is a summary of how the problem was tracked down and resolved (this is a test environment, but hopefully it offers some pointers for production environments hitting similar issues).

Containerd Startup Failure: Symptoms and Diagnosis

After the reboot, the containerd service on the Kubernetes master node kept failing to start. Checking the service status:

```
~]# systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2024-10-09 15:46:36 CST; 4s ago
     Docs: https://containerd.io
  Process: 1641 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 1639 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 1641 (code=exited, status=2)
```

There is no useful error message here, only (code=exited, status=2). Running journalctl -f -u containerd shows the following:

```
~]# journalctl -f -u containerd
-- Logs begin at Wed 2024-10-09 15:46:11 CST. --
……
10月 09 15:55:02 k8s130-node190 systemd[1]: containerd.service: Service RestartSec=5s expired, scheduling restart.
10月 09 15:55:02 k8s130-node190 systemd[1]: containerd.service: Scheduled restart job, restart counter is at 97.
10月 09 15:55:02 k8s130-node190 systemd[1]: Stopped containerd container runtime.
10月 09 15:55:02 k8s130-node190 systemd[1]: Starting containerd container runtime...
10月 09 15:55:02 k8s130-node190 containerd[3456]: time="2024-10-09T15:55:02.862285381+08:00" level=info msg="starting containerd" revision=8b3b7ca2e5ce38e8f31a34f35b2b68ceb8470d89 version=1.6.32
10月 09 15:55:02 k8s130-node190 containerd[3456]: time="2024-10-09T15:55:02.906085764+08:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
……
10月 09 15:55:02 k8s130-node190 containerd[3456]: time="2024-10-09T15:55:02.912801252+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.cri\"..." type=io.containerd.grpc.v1
10月 09 15:55:02 k8s130-node190 containerd[3456]: panic: invalid page type: 345: 10
10月 09 15:55:02 k8s130-node190 containerd[3456]: goroutine 66 [running]:
10月 09 15:55:02 k8s130-node190 containerd[3456]: go.etcd.io/bbolt.(*Cursor).search(0xc0003f3b10, {0x55e669fae988, 0x6, 0x6}, 0x3fe0000000000000?)
……
10月 09 15:55:02 k8s130-node190 containerd[3456]: created by github.com/containerd/containerd/runtime/restart/monitor.init.0.func1 in goroutine 14
10月 09 15:55:02 k8s130-node190 containerd[3456]:         /root/rpmbuild/BUILD/runtime/restart/monitor/monitor.go:96 +0x1a5
10月 09 15:55:02 k8s130-node190 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
10月 09 15:55:02 k8s130-node190 systemd[1]: containerd.service: Failed with result 'exit-code'.
10月 09 15:55:02 k8s130-node190 systemd[1]: Failed to start containerd container runtime.
```

The journal does not look much more helpful at first glance; it mainly expands status=2 into INVALIDARGUMENT. Could the containerd configuration file be at fault? I checked /etc/containerd/config.toml and found nothing wrong. To rule out some overlooked detail in the config, I backed it up and reinstalled containerd:

  1. mv /etc/containerd/config.toml{,_bak}
  2. yum -y erase containerd.io
  3. yum -y install containerd.io --disableexcludes=docker-ce-stable

However, containerd still failed to start with exactly the same error, so it clearly was not a configuration file or argument problem.

Fixing the Containerd Startup Failure

Searching online for the same problem did turn up a solution. In hindsight, the go.etcd.io/bbolt panic in the journal is the real clue: containerd keeps its metadata in a bbolt database under /var/lib/containerd, and that database was most likely corrupted when the server was powered off, which is why the fix below wipes that directory:

  1. yum -y erase containerd.io
  2. rm -rf /var/lib/containerd
  3. Reboot the system
  4. yum -y install containerd.io --disableexcludes=docker-ce-stable
  5. mv /etc/containerd/config.toml{,_docker}  (set the freshly installed default config aside)
  6. mv /etc/containerd/config.toml{_bak,}  (restore the original config backed up earlier)
  7. systemctl restart containerd

However, this fix has a painful side effect: deleting /var/lib/containerd wipes all of containerd's data, including containers and images. It was very much a last resort.
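
Once containerd restarts cleanly in step 7, a quick health check confirms the daemon really is back; a minimal sketch (crictl is assumed to be configured to point at containerd's socket):

```
systemctl is-active containerd   # should print "active"
ctr version                      # prints client and server versions if the daemon responds
crictl info | head -n 20         # CRI plugin status, assuming crictl points at containerd
```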

Kubernetes API Server Access Failure: Symptoms and Resolution

The containerd service was now starting successfully, and the lost images were recovered in one of two ways:

1. If the node can reach the Internet, or a local private registry is configured, the images can simply be re-pulled:

kubeadm config images pull --config=kubeadm-init.default.yaml 

2. If there is no Internet access and the images are not in the private registry either, pull them on another machine, save them into a tar archive, and load it on this node:

docker save registry.aliyuncs.com/google_containers/kube-apiserver:v1.30.0 registry.aliyuncs.com/google_containers/kube-controller-manager:v1.30.0 registry.aliyuncs.com/google_containers/kube-scheduler:v1.30.0 registry.aliyuncs.com/google_containers/kube-proxy:v1.30.0 registry.aliyuncs.com/google_containers/etcd:3.5.12-0 registry.aliyuncs.com/google_containers/coredns:v1.11.1 -o kubernetes.tar
ctr -n k8s.io images import kubernetes.tar  --platform linux/amd64
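
After the import, it is worth confirming that the images actually landed in the k8s.io namespace that the kubelet uses; a minimal sketch:

```
# Images must live in the k8s.io namespace for the kubelet/CRI to see them
ctr -n k8s.io images ls | grep google_containers
crictl images   # the same view through the CRI, if crictl is configured
```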

However, once apiserver and the other components were up, access to the cluster failed:

```
 kubectl get cs
E1011 15:23:37.066185    2683 memcache.go:265] couldn't get current server API group list: Get "https://192.168.11.190:6443/api?timeout=32s": Forbidden
……
Unable to connect to the server: Forbidden
```

At this point I did not feel like digging into the root cause; there was no data worth keeping anyway, so I simply reset Kubernetes:

kubeadm reset && rm -rf /etc/cni/net.d && ipvsadm --clear && rm -rf $HOME/.kube && rm -rf /etc/kubernetes/* && rm -rf /var/lib/etcd

Looking back, the failure was most likely caused by the web proxy I had configured, because the following warnings showed up when I re-ran init after the reset:

```
kubeadm init --config=kubeadm-init.default.yaml |tee kubeadm-init.log
[init] Using Kubernetes version: v1.30.0
[preflight] Running pre-flight checks
        [WARNING HTTPProxy]: Connection to "https://192.168.11.190" uses proxy "http://用户名:密码@192.168.XX.229:3128". If that is not intended, adjust your proxy settings
        [WARNING HTTPProxyCIDR]: connection to "10.254.0.0/16" uses proxy "http://用户名:密码@192.168.XX.229:3128". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
        [WARNING HTTPProxyCIDR]: connection to "2408:822a:730:af01::/112" uses proxy "http://用户名:密码@192.168.XX.229:3128". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
        [WARNING HTTPProxyCIDR]: connection to "172.254.0.0/16" uses proxy "http://用户名:密码@192.168.XX.229:3128". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
        [WARNING HTTPProxyCIDR]: connection to "fa00:cafe:42::/56" uses proxy "http://用户名:密码@192.168.XX.229:3128". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
```
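
In hindsight, the less drastic fix would probably have been to exclude the node address and the cluster CIDRs from the proxy instead of resetting. A minimal sketch, using values from the warnings above and assuming the proxy is configured through environment variables (adjust to wherever your proxy settings actually live, e.g. a profile script or the systemd drop-ins for kubelet/containerd):

```
# Node IP and cluster CIDRs taken from the kubeadm warnings above (IPv6 ranges omitted)
export NO_PROXY="127.0.0.1,localhost,192.168.11.190,10.254.0.0/16,172.254.0.0/16,.svc,.cluster.local"
export no_proxy="$NO_PROXY"
```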

After resetting the master, the other nodes also have to be reset and rejoined to the cluster; the detailed procedure is not repeated here, see the《kubernetes集群部署:环境准备及master节点部署(二)》series (if the original join command is gone, it can be regenerated as sketched below). Two problems came up during this process that are worth recording.
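
A minimal sketch for regenerating the join command on the master, in case the one printed by kubeadm init is no longer at hand (the token it creates is valid for 24 hours by default):

```
# Run on the master; prints a ready-to-use "kubeadm join ..." command with a fresh token
kubeadm token create --print-join-command
```

On the node that runs cri-dockerd (see problem 1 below), append --cri-socket unix:///var/run/cri-dockerd.sock to the printed join command so kubeadm knows which runtime to use.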

1. On a node running k8s + docker + cri-dockerd, kubeadm reset failed:

```
~]# kubeadm reset -f kubeadm-join.default.yaml && rm -rf /etc/cni/net.d && ipvsadm --clear && rm -rf $HOME/.kube && rm -rf /etc/kubernetes/* && rm -rf /var/lib/etcd
Found multiple CRI endpoints on the host. Please define which one do you wish to use by setting the 'criSocket' field in the kubeadm configuration file: unix:///var/run/containerd/containerd.sock, unix:///var/run/cri-dockerd.sock
To see the stack trace of this error execute with --v=5 or higher
```

Cause: cri-dockerd is installed on this node, so besides the native unix:///var/run/containerd/containerd.sock there is a second endpoint, unix:///var/run/cri-dockerd.sock, and kubeadm has to be told explicitly which one to use:

```
kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock && rm -rf /etc/cni/net.d && ipvsadm --clear && rm -rf $HOME/.kube && rm -rf /etc/kubernetes/*


# Alternatively: generate a reset config, set criSocket: unix:///run/cri-dockerd.sock in it, then reset with that config
kubeadm config print reset-defaults > kubeadm-reset.yaml
kubeadm reset --config kubeadm-reset.yaml
```
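
To double-check which CRI sockets actually exist on a node before picking one, a quick look at the socket files is enough; a minimal sketch:

```
# Both sockets being present is exactly what triggers the "multiple CRI endpoints" error
ls -l /var/run/containerd/containerd.sock /var/run/cri-dockerd.sock 2>/dev/null
```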

2. After deploying the calico components, the master stayed NotReady.

Probably some stale data was left behind by kubeadm reset. Deleting the calico resources (kubectl delete -f against the calico manifest), rebooting the server, and re-applying the manifest (kubectl apply -f) brought the master back to Ready.
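
Once calico is re-applied, the recovery can be confirmed by watching the node and the calico pods; a minimal sketch, assuming the labels from the standard calico manifest:

```
kubectl get nodes                                                # the master should move to Ready
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide   # calico-node DaemonSet pods
```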


Original article: https://blog.csdn.net/avatar_2009/article/details/142863118
