Kubernetes in Production: 10 Real-World Failure Cases and How to Fix Them
Once a K8s cluster goes live, failures are inevitable. This article collects 10 real production cases covering Pod scheduling, networking, storage, RBAC, and other high-frequency problem areas, to save you years of trial and error.
Background: Why Does K8s Fail So Often?
Kubernetes is complex because it spans multiple layers:
```
┌─────────────────────────────────────────┐
│ Application layer                       │
│ Pod / Deployment / StatefulSet          │
├─────────────────────────────────────────┤
│ Scheduling layer                        │
│ Scheduler / Node / Taint / Toleration   │
├─────────────────────────────────────────┤
│ Network layer                           │
│ CNI / Service / Ingress / DNS           │
├─────────────────────────────────────────┤
│ Storage layer                           │
│ PV / PVC / StorageClass                 │
├─────────────────────────────────────────┤
│ Control plane                           │
│ API Server / etcd / Controller          │
└─────────────────────────────────────────┘
```
A failure at any one of these layers can take your application down. Below are pitfalls we actually hit.
Case 1: Pod Stuck in Pending
Symptom

```bash
kubectl get pods -n production
# Output:
# NAME                      READY   STATUS    RESTARTS   AGE
# web-app-7d9f8b6c5-x2n4p   0/1     Pending   0          5m
```

The Pod stays Pending and is never scheduled.
Troubleshooting

```bash
# 1. Inspect the Pod
kubectl describe pod web-app-7d9f8b6c5-x2n4p -n production

# 2. Common causes:
# - Insufficient resources (not enough CPU/memory)
# - Affinity/anti-affinity constraints
# - Taints blocking scheduling
# - PVC not bound

# 3. Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# 4. Check taints
kubectl get nodes -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
```
Fixes
Scenario A: insufficient resources
```yaml
# Set per-container defaults and caps with a LimitRange
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "4"
      memory: 8Gi
    default:
      cpu: 500m
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 512Mi
    type: Container
```
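The scheduler only places a Pod on a node whose remaining allocatable resources cover the Pod's requests. A minimal sketch of that fit check, with hypothetical numbers (this is an illustration, not the real scheduler code):

```python
def fits(allocatable, allocated, pod_requests):
    """Return True if the node can still accommodate the Pod's requests."""
    return all(
        allocated.get(res, 0) + need <= allocatable.get(res, 0)
        for res, need in pod_requests.items()
    )

node_allocatable = {"cpu_m": 4000, "memory_mi": 8192}  # 4 CPUs, 8Gi
node_allocated = {"cpu_m": 3800, "memory_mi": 4096}    # sum of existing requests
pod = {"cpu_m": 500, "memory_mi": 1024}                # the Pending Pod

print(fits(node_allocatable, node_allocated, pod))  # False: only 200m CPU left
```

If this check fails on every node, the Pod stays Pending; `kubectl describe pod` will show a `FailedScheduling` event with the per-node reason.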
Scenario B: taint blocking scheduling

```bash
# Inspect taints
kubectl describe node node-1 | grep Taints

# Temporarily remove the taint (testing only)
kubectl taint nodes node-1 dedicated-
```

Or add a toleration to the workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"
```
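The rule behind this: every NoSchedule taint on the node must be matched by some toleration on the Pod, or the Pod cannot land there. A simplified sketch of the matching logic (ignoring NoExecute handling and some edge cases):

```python
def tolerates(taint, toleration):
    """Does a single toleration match a single taint? (simplified)"""
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def schedulable(taints, tolerations):
    """Every NoSchedule taint must be tolerated by at least one toleration."""
    return all(
        any(tolerates(t, tol) for tol in tolerations)
        for t in taints if t["effect"] == "NoSchedule"
    )

taints = [{"key": "dedicated", "effect": "NoSchedule"}]
tols = [{"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}]
print(schedulable(taints, tols))  # True: the toleration above matches
print(schedulable(taints, []))    # False: no tolerations at all
```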
Case 2: Pod CrashLoopBackOff
Symptom

```bash
kubectl get pods -n production
# Output:
# NAME                       READY   STATUS             RESTARTS   AGE
# api-server-5f4d7c8b9-abc   0/1     CrashLoopBackOff   3          2m
```

The Pod keeps restarting.
Troubleshooting

```bash
# Check the logs of the previous (crashed) container
kubectl logs api-server-5f4d7c8b9-abc -n production --previous

# Common causes:
# - Application fails to start (bad configuration)
# - Liveness probe keeps failing
# - OOMKilled (memory limit exceeded)
# - A dependency service is unreachable
```
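CrashLoopBackOff means the kubelet is deliberately delaying restarts with exponential backoff: roughly 10s, 20s, 40s, ... capped at 5 minutes (and reset after the container runs stably for a while). A quick sketch of the approximate delay sequence:

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Approximate restart delays in seconds: doubles each time, capped at 5 min."""
    delays, d = [], base
    for _ in range(restarts):
        delays.append(min(d, cap))
        d *= 2
    return delays

print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a crashing Pod seems to "do nothing" for minutes at a time: it is waiting out the backoff, not hung.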
Real case: misconfigured health check
Note: a failing readinessProbe only removes the Pod from Service endpoints; it is a failing livenessProbe that restarts the container, which is what drives the restart loop here.

```yaml
# Broken config: the probe starts firing before the app is up
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 0  # Too short!
      periodSeconds: 5
```
```yaml
# Fixed config
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # allow enough startup time
      periodSeconds: 10
      failureThreshold: 3
    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"
```
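It helps to compute the worst case before tuning: a never-healthy container is only restarted after initialDelaySeconds plus failureThreshold consecutive failing checks, periodSeconds apart. Comparing the two configs above (the broken one falls back to the default failureThreshold of 3):

```python
def time_to_restart(initial_delay, period, failure_threshold):
    """Seconds from container start until a never-healthy container is restarted."""
    return initial_delay + period * failure_threshold

print(time_to_restart(30, 10, 3))  # 60s with the fixed config
print(time_to_restart(0, 5, 3))    # 15s with the broken one -- far too aggressive
```

If your app takes 20s to boot, a 15-second kill window guarantees a restart loop.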
Case 3: Service Unreachable
Symptom

```bash
# Test from inside a Pod
kubectl exec -it nginx-pod -- sh
/ # curl http://api-service:8080/health
# curl: couldn't connect to host
```
Troubleshooting

```bash
# 1. Does the Service exist?
kubectl get svc -n production | grep api

# 2. Check the Endpoints
kubectl get endpoints api-service -n production
# If empty, the selector matched no Pods

# 3. Compare Pod labels with the Service selector
kubectl get pods -n production --show-labels | grep app
kubectl get svc api-service -n production -o yaml | grep -A 5 selector
```
Real case: label mismatch

```yaml
# Deployment labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
      version: v2
  template:
    metadata:
      labels:
        app: api-server
        version: v2  # new release uses v2
```

```yaml
# The Service selector still says v1
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
    version: v1  # ❌ here!
  ports:
  - port: 8080
    targetPort: 8080
```
Fix: update either the Service selector or the Deployment labels so they match.
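A Service selects a Pod only if every key/value pair in its selector is present in the Pod's labels (a subset match; extra Pod labels are fine). A sketch using the labels from this case:

```python
def selector_matches(selector, pod_labels):
    """True if every selector key/value pair appears in the Pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

pod_labels = {"app": "api-server", "version": "v2"}
print(selector_matches({"app": "api-server", "version": "v1"}, pod_labels))  # False -> empty Endpoints
print(selector_matches({"app": "api-server", "version": "v2"}, pod_labels))  # True after the fix
```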
Case 4: DNS Resolution Failures
Symptom

```bash
kubectl exec -it test-pod -- nslookup kubernetes
# Server: 10.96.0.10
# ** server can't find kubernetes.default: NXDOMAIN
```
Troubleshooting

```bash
# 1. Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# 3. Does external DNS work?
kubectl exec -it test-pod -- nslookup www.baidu.com
```
Fixes

```yaml
# Option 1: scale CoreDNS up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 3  # at least 2 replicas in production
```

```yaml
# Option 2: tune the Pod's DNS policy
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  dnsPolicy: ClusterFirst  # the default
  # or custom DNS
  dnsConfig:
    nameservers:
    - 8.8.8.8
    searches:
    - default.svc.cluster.local
    options:
    - name: ndots
      value: "2"
```
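The ndots option decides lookup order: a name with fewer dots than ndots is first tried with each search-list suffix appended, which is why short in-cluster names like api-service resolve, and why lowering ndots cuts wasted lookups for external FQDNs. A simplified sketch of the resolver's candidate ordering (the search list here is an assumed example):

```python
def lookup_order(name, ndots, search):
    """Order in which the resolver tries candidate names (simplified)."""
    if name.endswith("."):           # trailing dot: absolute name, no search
        return [name]
    candidates = [f"{name}.{s}" for s in search]
    if name.count(".") >= ndots:     # "looks absolute enough": try as-is first
        return [name] + candidates
    return candidates + [name]       # otherwise walk the search list first

search = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(lookup_order("api-service", 2, search)[0])    # api-service.production.svc.cluster.local
print(lookup_order("www.baidu.com", 2, search)[0])  # www.baidu.com (2 dots >= ndots)
```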
Case 5: PVC Won't Bind
Symptom

```bash
kubectl get pvc -n production
# NAME       STATUS    VOLUME   CAPACITY
# data-pvc   Pending
# Stuck in Pending!
```
Troubleshooting

```bash
# 1. Inspect the PVC
kubectl describe pvc data-pvc -n production

# Common causes:
# - StorageClass does not exist
# - Storage quota exceeded
# - Volume type not supported by the cloud provider

# 2. Check StorageClasses
kubectl get storageclass

# 3. Check cloud-provider limits
# Tencent Cloud CBS: at most 20 cloud disks per node
```
Fixes

```yaml
# Option 1: use a StorageClass that actually exists
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: "cbs-balanced"  # Tencent Cloud CBS
  resources:
    requests:
      storage: 50Gi
```

```bash
# Option 2: clean up unused claims and Released volumes
# (Released is a PV status, not a PVC status)
kubectl get pv | grep Released
kubectl delete pvc <pvc-name> -n <namespace>
```
Case 6: Ingress Returns 502 Bad Gateway
Symptom
The browser gets a 502, but the Pods themselves are running fine.
Troubleshooting

```bash
# 1. Check the Ingress Controller
kubectl get pods -n ingress-nginx

# 2. Check the backend
kubectl describe ingress my-ingress -n production

# 3. Test direct Pod access from the controller
kubectl exec -it nginx-ingress-xxx -n ingress-nginx -- curl -v http://<pod-ip>:8080/health
```
Real case: wrong health-check path

```yaml
# Original config
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
```

```yaml
# Fix: answer the health check at the nginx layer via a server-snippet
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/server-snippet: |
      location /health {
        return 200 'OK';
      }
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
```
Case 7: Insufficient RBAC Permissions
Symptom

```
# Application error
Forbidden: User "system:serviceaccount:default:my-app" cannot list pods in namespace "production"
```
Troubleshooting

```bash
# 1. Check the ServiceAccount
kubectl get sa my-app -n default

# 2. Check Roles
kubectl get role -n production

# 3. Check RoleBindings
kubectl get rolebinding -n production

# 4. Test the exact permission
kubectl auth can-i list pods -n production --as=system:serviceaccount:default:my-app
```
Fix

```yaml
# Create a Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
```

```yaml
# Bind it to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```
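Conceptually, the RBAC check walks the bound Role's rules and allows the request if any single rule covers the apiGroup, resource, and verb. A sketch evaluated against the pod-reader rules above (illustration only, not the real authorizer):

```python
def allowed(rules, api_group, resource, verb):
    """True if any rule grants this (apiGroup, resource, verb) combination."""
    return any(
        (api_group in r["apiGroups"] or "*" in r["apiGroups"])
        and (resource in r["resources"] or "*" in r["resources"])
        and (verb in r["verbs"] or "*" in r["verbs"])
        for r in rules
    )

rules = [
    {"apiGroups": [""], "resources": ["pods", "pods/log"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list"]},
]
print(allowed(rules, "", "pods", "list"))               # True: covered by the first rule
print(allowed(rules, "apps", "deployments", "delete"))  # False: delete was never granted
```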
Case 8: OOMKilled
Symptom

```bash
kubectl get pods -n production
# NAME      READY   STATUS      RESTARTS   AGE
# api-xxx   0/1     OOMKilled   2          10m
```
Troubleshooting and fix

```bash
# Check resource limits
kubectl describe pod api-xxx -n production | grep -A 5 "Limits"
```

```yaml
# Fix: raise the memory limit
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          limits:
            memory: "2Gi"  # raised from 512Mi
            cpu: "1"
          requests:
            memory: "1Gi"
            cpu: "500m"
```
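Which Pods get killed first under node memory pressure also depends on QoS class: Guaranteed (requests equal limits for every container) is evicted last, Burstable next, BestEffort (no requests or limits) first. A simplified single-container sketch of the classification:

```python
def qos_class(requests, limits):
    """Simplified QoS classification for a single-container Pod."""
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "500m", "memory": "1Gi"}, {"cpu": "1", "memory": "2Gi"}))  # Burstable
print(qos_class({"cpu": "1", "memory": "2Gi"}, {"cpu": "1", "memory": "2Gi"}))     # Guaranteed
```

The Deployment above is Burstable; setting requests equal to limits would make it Guaranteed at the cost of reserving more capacity.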
Case 9: Node NotReady
Symptom

```bash
kubectl get nodes
# NAME     STATUS     ROLES    AGE   VERSION
# node-1   NotReady   worker   30d   v1.28.0
```
Troubleshooting

```bash
# 1. SSH into the affected node
ssh root@node-1

# 2. Check kubelet
systemctl status kubelet

# 3. Check its logs
journalctl -u kubelet -n 100 --no-pager

# Common causes:
# - Out of disk space
# - Out of memory
# - Expired kubelet certificates
```
Fixes

```bash
# Free up disk space (removes all unused images, containers, and volumes)
docker system prune -a --volumes

# Restart kubelet
systemctl restart kubelet

# If certificates have expired (kubeadm clusters)
kubeadm certs renew all
systemctl restart kubelet
```
Case 10: HPA Won't Scale Out
Symptom

```bash
kubectl get hpa -n production
# NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS
# api-hpa   Deployment/api   85%/80%   2         10        2
# CPU is at 85% > 80%, yet the replica count is still 2!
```
Troubleshooting

```bash
# 1. Inspect the HPA
kubectl describe hpa api-hpa -n production

# 2. Common causes:
# - Pods have no resource requests (CPU/memory)
# - Metrics Server is not running
# - The replica count already hit the maximum

# 3. Check Metrics Server
kubectl get pods -n kube-system | grep metrics

# 4. Verify metrics collection
kubectl top pods -n production
```
Fix

```yaml
# Pods must declare resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          requests:
            cpu: "500m"  # required! HPA computes utilization against requests
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```
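The HPA's core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. For the symptom above (2 replicas at 85% against an 80% target), a working HPA would scale to 3:

```python
import math

def desired_replicas(current, metric, target, min_r, max_r):
    """HPA core formula: ceil(current * metric/target), clamped to [min, max]."""
    return max(min_r, min(max_r, math.ceil(current * metric / target)))

print(desired_replicas(2, 85, 80, 2, 10))  # 3
```

If the Pods have no CPU requests, utilization cannot be computed at all and the HPA simply does nothing, which is exactly what this case showed.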
Production Checklist
Verify before every deployment:

```bash
# 1. Resource requests/limits are set on every container
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].resources}{"\n"}{end}'

# 2. Liveness/readiness probes are configured
kubectl get deploy -n production -o yaml | grep -E "livenessProbe|readinessProbe"

# 3. Network connectivity
kubectl run net-test --rm -it --image=busybox -- wget -qO- http://api-service:8080/health

# 4. Storage is available
kubectl get pvc -n production

# 5. RBAC permissions
kubectl auth can-i list pods -n production --as=system:serviceaccount:default:my-app
```
Summary: Key Takeaways

| Category | Common Problem | Prevention |
|---|---|---|
| **Scheduling** | Pod Pending | Plan capacity ahead, keep a buffer |
| **Lifecycle** | CrashLoopBackOff | Sensible probes + resource limits |
| **Networking** | Service unreachable | Verify label matching, test DNS |
| **Storage** | PVC Pending | Confirm the StorageClass exists |
| **Permissions** | RBAC errors | Least privilege, verify with tests |
| **Autoscaling** | HPA inactive | Always set resource requests |
Golden rule: validate thoroughly in a test environment before going to production!
👤 About the Author
A public-cloud seller based in Henan, in the heart of the Central Plains, working mainly with Tencent Cloud / Alibaba Cloud / Huawei Cloud. Having stepped into countless pitfalls, now focused on putting large AI models into production. Follow the WeChat account "AI热点日报" for AI news.
Blog: yunduancloud.icu
