[Case Study] Kubernetes Troubleshooting in Practice: What to Do When a Pod Won't Start


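A Pod that fails to start is one of the most common problems in Kubernetes (K8s). This article walks through the complete troubleshooting process for a real case.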

Symptoms

$ kubectl get pods
NAME                      READY   STATUS             RESTARTS   AGE
web-app-7d9f4b8c5-x2v9p   0/1     ImagePullBackOff   0          5m

# The Pod never starts; its status stays at ImagePullBackOff

Troubleshooting Process

Step 1: Inspect the Pod details

$ kubectl describe pod web-app-7d9f4b8c5-x2v9p

# The key information is in the Events section:
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  2m                 default-scheduler  Successfully assigned default/web-app-7d9f4b8c5-x2v9p to node-1
  Normal   Pulling    2m                 kubelet            Pulling image "myregistry.com/web-app:v1.2.3"
  Warning  Failed     1m (x3 over 2m)    kubelet            Failed to pull image "myregistry.com/web-app:v1.2.3": rpc error: code = Unknown desc = Error response from daemon: unauthorized: authentication required
  Warning  Failed     1m (x3 over 2m)    kubelet            Error: ImagePullBackOff

Finding: the image pull failed because the registry rejected the request as unauthenticated.
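
When the Events list is long, it can help to pull out only the warnings for the one Pod. The helper below is a minimal sketch, assuming the Pod lives in the current namespace; the function name `pod_warnings` is made up for illustration.

```shell
# Hypothetical helper: list only Warning events for one Pod, oldest first.
# Assumes the Pod is in the current namespace.
pod_warnings() {
  kubectl get events \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage: pod_warnings web-app-7d9f4b8c5-x2v9p
```

For the Pod in this case, the output should contain the same "Failed to pull image" warnings that kubectl describe showed.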

Step 2: Check the image registry secret

# Check whether the Pod has an imagePullSecret
$ kubectl get pod web-app-7d9f4b8c5-x2v9p -o yaml | grep -A5 imagePullSecrets
# The output is empty: no image pull secret is configured

# List the existing secrets
$ kubectl get secrets
NAME                  TYPE                                  DATA   AGE
default-token-xxx     kubernetes.io/service-account-token   3      30d

# There is no registry-related secret
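
Both checks can also be made more targeted than grep and a full secret listing. The following is a sketch only, assuming the current namespace; the function names are hypothetical.

```shell
# Hypothetical helpers, assuming the current namespace.

# Print the names of the imagePullSecrets on a Pod (empty output means none).
pod_pull_secrets() {
  kubectl get pod "$1" -o jsonpath='{.spec.imagePullSecrets[*].name}'
}

# List only Secrets that hold registry credentials in the
# kubernetes.io/dockerconfigjson format (the legacy kubernetes.io/dockercfg
# type is not matched by this selector).
registry_secrets() {
  kubectl get secrets --field-selector type=kubernetes.io/dockerconfigjson
}

# Usage:
#   pod_pull_secrets web-app-7d9f4b8c5-x2v9p
#   registry_secrets
```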

Step 3: Create the image registry secret

# Create a secret of type docker-registry
$ kubectl create secret docker-registry regcred \
    --docker-server=myregistry.com \
    --docker-username=admin \
    --docker-password=your-password \
    --docker-email=admin@example.com

secret/regcred created
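
To confirm that the secret holds the server and credentials you intended, you can decode it. This is a minimal sketch; the function name is hypothetical, and the output contains the password in clear text, so avoid running it in a shared terminal or a logged session.

```shell
# Hypothetical helper: print the decoded registry config stored in a
# docker-registry Secret. The output contains credentials in clear text.
show_registry_secret() {
  kubectl get secret "$1" -o jsonpath='{.data.\.dockerconfigjson}' | base64 --decode
}

# Usage: show_registry_secret regcred
```

The decoded JSON has an "auths" object keyed by registry server; the key should match the registry host in the image name (myregistry.com in this case).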

Step 4: Update the Deployment to reference the secret

# Edit the deployment
$ kubectl edit deployment web-app

# Add the following under spec.template.spec:
spec:
  template:
    spec:
      imagePullSecrets:
      - name: regcred  # reference the secret created above
      containers:
      - name: web-app
        image: myregistry.com/web-app:v1.2.3
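
The same change can be made without opening an editor, which is easier to script and review. The sketch below uses kubectl patch with its default strategic merge patch; the function name is hypothetical, and it assumes the Deployment is in the current namespace.

```shell
# Hypothetical helper: attach an image pull secret to a Deployment's Pod
# template without opening an editor.
# Arguments: deployment name, secret name.
add_pull_secret() {
  kubectl patch deployment "$1" \
    -p "{\"spec\":{\"template\":{\"spec\":{\"imagePullSecrets\":[{\"name\":\"$2\"}]}}}}"
}

# Usage: add_pull_secret web-app regcred
```

Because the patch changes the Pod template, the Deployment rolls out new Pods that pull the image with the secret.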

Step 5: Verify the fix

# Check the Pod status
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
web-app-7d9f4b8c5-abc12   1/1     Running   0          30s

# Check the Events to confirm the image was pulled successfully
$ kubectl describe pod web-app-7d9f4b8c5-abc12 | tail -10
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  1m    default-scheduler  Successfully assigned...
  Normal  Pulling    1m    kubelet            Pulling image "myregistry.com/web-app:v1.2.3"
  Normal  Pulled     30s   kubelet            Successfully pulled image
  Normal  Created    30s   kubelet            Created container web-app
  Normal  Started    30s   kubelet            Started container web-app
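
Instead of polling kubectl get pods by hand, you can let kubectl wait for the rollout to finish. This is a sketch under the assumption that the Deployment is named as in this case; the function name and the 120s default are illustrative choices.

```shell
# Hypothetical helper: block until the Deployment's rollout finishes,
# or fail once the timeout (default 120s) expires.
# Arguments: deployment name, optional timeout such as 30s or 2m.
wait_for_rollout() {
  kubectl rollout status "deployment/$1" --timeout="${2:-120s}"
}

# Usage: wait_for_rollout web-app
```

A non-zero exit status means the rollout did not complete in time, which makes the helper usable as a check in a script.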

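Pod Status Quick Reference

Status             Meaning                    Where to look
Pending            Waiting to be scheduled    Insufficient resources / node selector / affinity
ContainerCreating  Container being created    Image pull / volume mounts / CNI
ImagePullBackOff   Image pull failed          Image name / registry authentication / network
CrashLoopBackOff   Container keeps crashing   Application errors / health checks / resource limits
Error              Startup error              Check the container logs
Completed          Exited normally            Normal state for Job Pods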
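
The table can also be kept at hand as a small lookup function. This is only a sketch that mirrors the rows above; the function name `triage_hint` is hypothetical.

```shell
# Hypothetical helper: map a Pod STATUS value to the first things to check,
# mirroring the quick-reference table.
triage_hint() {
  case "$1" in
    Pending)           echo "resources, node selector, affinity" ;;
    ContainerCreating) echo "image pull, volume mounts, CNI" ;;
    ImagePullBackOff)  echo "image name, registry authentication, network" ;;
    CrashLoopBackOff)  echo "application errors, health checks, resource limits" ;;
    Error)             echo "container logs" ;;
    Completed)         echo "nothing to fix for Job Pods" ;;
    *)                 echo "unknown status: run kubectl describe pod" ;;
  esac
}

# Usage: triage_hint ImagePullBackOff
```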

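Troubleshooting Command Quick Reference

# Show Pod details and events
kubectl describe pod <pod-name>

# Show container logs
kubectl logs <pod-name>

# Show logs from the previous container instance (after a crash)
kubectl logs <pod-name> --previous

# Open a shell in the container for debugging
kubectl exec -it <pod-name> -- /bin/sh

# Show node resource usage
kubectl top node

# Show Pod resource usage
kubectl top pod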

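Summary

Kubernetes troubleshooting follows the flow "check status → check events → read logs → enter the container". Once you are comfortable with the describe and logs commands, you can resolve roughly 80% of Pod problems.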
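
The first three steps of that flow can be bundled into one helper. This is a minimal sketch, assuming the Pod is in the current namespace; the function name and the 50-line log tail are illustrative choices, not part of the original case.

```shell
# Hypothetical helper: run the first three steps of the flow for one Pod.
# Entering the container (kubectl exec) is left as a manual last step.
triage_pod() {
  echo "== status =="
  kubectl get pod "$1" -o wide
  echo "== events =="
  kubectl describe pod "$1"
  echo "== logs =="
  # If the current container's logs cannot be read, try the previous instance.
  kubectl logs "$1" --tail=50 || kubectl logs "$1" --previous --tail=50
}

# Usage: triage_pod web-app-7d9f4b8c5-x2v9p
```

For a Pod stuck in ImagePullBackOff, the logs step is expected to fail because no container has started yet; the events section is where the cause shows up.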
