Kubernetes in Production: 10 Real-World Failure Cases and How to Fix Them
Once a K8s cluster goes live, failures are inevitable. This article collects 10 real production cases covering Pod scheduling, networking, storage, RBAC, and other high-frequency problem areas, to save you years of trial and error.
Background: Why Does K8s Fail So Often?
Kubernetes is complex because it spans multiple layers:
```
┌─────────────────────────────────────────┐
│ Application layer                       │
│ Pod / Deployment / StatefulSet          │
├─────────────────────────────────────────┤
│ Scheduling layer                        │
│ Scheduler / Node / Taint / Toleration   │
├─────────────────────────────────────────┤
│ Network layer                           │
│ CNI / Service / Ingress / DNS           │
├─────────────────────────────────────────┤
│ Storage layer                           │
│ PV / PVC / StorageClass                 │
├─────────────────────────────────────────┤
│ Control plane                           │
│ API Server / etcd / Controller          │
└─────────────────────────────────────────┘
```
A failure at any one of these layers can take your application down. Below are pitfalls we actually hit.
Case 1: Pod Stuck in Pending
Symptom

```bash
kubectl get pods -n production
# Output:
# NAME                      READY   STATUS    RESTARTS   AGE
# web-app-7d9f8b6c5-x2n4p   0/1     Pending   0          5m
```

The Pod stays Pending and is never scheduled.
Troubleshooting

```bash
# 1. Inspect the Pod
kubectl describe pod web-app-7d9f8b6c5-x2n4p -n production

# 2. Common causes:
# - Insufficient resources (not enough CPU/memory)
# - Affinity/anti-affinity constraints
# - Taints blocking scheduling
# - PVC not bound

# 3. Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# 4. Check taints
kubectl get nodes -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
```
Fixes
Scenario A: insufficient resources
```yaml
# Set per-container defaults and caps with a LimitRange
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "4"
      memory: 8Gi
    default:
      cpu: 500m
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 512Mi
    type: Container
```
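The scheduler only places a Pod on a node whose remaining allocatable resources cover the Pod's requests. A minimal sketch of that fit check, with hypothetical numbers (this is an illustration, not the real scheduler code):

```python
def fits(allocatable, allocated, pod_requests):
    """Return True if the node can still accommodate the Pod's requests."""
    return all(
        allocated.get(res, 0) + need <= allocatable.get(res, 0)
        for res, need in pod_requests.items()
    )

node_allocatable = {"cpu_m": 4000, "memory_mi": 8192}  # 4 CPUs, 8Gi
node_allocated = {"cpu_m": 3800, "memory_mi": 4096}    # sum of existing requests
pod = {"cpu_m": 500, "memory_mi": 1024}                # the Pending Pod

print(fits(node_allocatable, node_allocated, pod))  # False: only 200m CPU left
```

If this check fails on every node, the Pod stays Pending; `kubectl describe pod` will show a `FailedScheduling` event with the per-node reason.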
Scenario B: taint blocking scheduling

```bash
# Inspect taints
kubectl describe node node-1 | grep Taints

# Temporarily remove the taint (testing only)
kubectl taint nodes node-1 dedicated-
```

Or add a toleration to the workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"
```
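The rule behind this: every NoSchedule taint on the node must be matched by some toleration on the Pod, or the Pod cannot land there. A simplified sketch of the matching logic (ignoring NoExecute handling and some edge cases):

```python
def tolerates(taint, toleration):
    """Does a single toleration match a single taint? (simplified)"""
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def schedulable(taints, tolerations):
    """Every NoSchedule taint must be tolerated by at least one toleration."""
    return all(
        any(tolerates(t, tol) for tol in tolerations)
        for t in taints if t["effect"] == "NoSchedule"
    )

taints = [{"key": "dedicated", "effect": "NoSchedule"}]
tols = [{"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}]
print(schedulable(taints, tols))  # True: the toleration above matches
print(schedulable(taints, []))    # False: no tolerations at all
```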
Case 2: Pod CrashLoopBackOff
Symptom

```bash
kubectl get pods -n production
# Output:
# NAME                       READY   STATUS             RESTARTS   AGE
# api-server-5f4d7c8b9-abc   0/1     CrashLoopBackOff   3          2m
```

The Pod keeps restarting.
Troubleshooting

```bash
# Check the logs of the previous (crashed) container
kubectl logs api-server-5f4d7c8b9-abc -n production --previous

# Common causes:
# - Application fails to start (bad configuration)
# - Liveness probe keeps failing
# - OOMKilled (memory limit exceeded)
# - A dependency service is unreachable
```
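CrashLoopBackOff means the kubelet is deliberately delaying restarts with exponential backoff: roughly 10s, 20s, 40s, ... capped at 5 minutes (and reset after the container runs stably for a while). A quick sketch of the approximate delay sequence:

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Approximate restart delays in seconds: doubles each time, capped at 5 min."""
    delays, d = [], base
    for _ in range(restarts):
        delays.append(min(d, cap))
        d *= 2
    return delays

print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a crashing Pod seems to "do nothing" for minutes at a time: it is waiting out the backoff, not hung.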
Real case: misconfigured health check
Note: a failing readinessProbe only removes the Pod from Service endpoints; it is a failing livenessProbe that restarts the container, which is what drives the restart loop here.

```yaml
# Broken config: the probe starts firing before the app is up
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 0  # Too short!
      periodSeconds: 5
```
```yaml
# Fixed config
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # allow enough startup time
      periodSeconds: 10
      failureThreshold: 3
    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"
```
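It helps to compute the worst case before tuning: a never-healthy container is only restarted after initialDelaySeconds plus failureThreshold consecutive failing checks, periodSeconds apart. Comparing the two configs above (the broken one falls back to the default failureThreshold of 3):

```python
def time_to_restart(initial_delay, period, failure_threshold):
    """Seconds from container start until a never-healthy container is restarted."""
    return initial_delay + period * failure_threshold

print(time_to_restart(30, 10, 3))  # 60s with the fixed config
print(time_to_restart(0, 5, 3))    # 15s with the broken one -- far too aggressive
```

If your app takes 20s to boot, a 15-second kill window guarantees a restart loop.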
Case 3: Service Unreachable
Symptom

```bash
# Test from inside a Pod
kubectl exec -it nginx-pod -- sh
/ # curl http://api-service:8080/health
# curl: couldn't connect to host
```
Troubleshooting

```bash
# 1. Does the Service exist?
kubectl get svc -n production | grep api

# 2. Check the Endpoints
kubectl get endpoints api-service -n production
# If empty, the selector matched no Pods

# 3. Compare Pod labels with the Service selector
kubectl get pods -n production --show-labels | grep app
kubectl get svc api-service -n production -o yaml | grep -A 5 selector
```
Real case: label mismatch

```yaml
# Deployment labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
      version: v2
  template:
    metadata:
      labels:
        app: api-server
        version: v2  # new release uses v2
```

```yaml
# The Service selector still says v1
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
    version: v1  # ❌ here!
  ports:
  - port: 8080
    targetPort: 8080
```
Fix: update either the Service selector or the Deployment labels so they match.
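A Service selects a Pod only if every key/value pair in its selector is present in the Pod's labels (a subset match; extra Pod labels are fine). A sketch using the labels from this case:

```python
def selector_matches(selector, pod_labels):
    """True if every selector key/value pair appears in the Pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

pod_labels = {"app": "api-server", "version": "v2"}
print(selector_matches({"app": "api-server", "version": "v1"}, pod_labels))  # False -> empty Endpoints
print(selector_matches({"app": "api-server", "version": "v2"}, pod_labels))  # True after the fix
```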
Case 4: DNS Resolution Failures
Symptom

```bash
kubectl exec -it test-pod -- nslookup kubernetes
# Server: 10.96.0.10
# ** server can't find kubernetes.default: NXDOMAIN
```
Troubleshooting

```bash
# 1. Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# 3. Does external DNS work?
kubectl exec -it test-pod -- nslookup www.baidu.com
```
Fixes

```yaml
# Option 1: scale CoreDNS up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 3  # at least 2 replicas in production
```

```yaml
# Option 2: tune the Pod's DNS policy
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  dnsPolicy: ClusterFirst  # the default
  # or custom DNS
  dnsConfig:
    nameservers:
    - 8.8.8.8
    searches:
    - default.svc.cluster.local
    options:
    - name: ndots
      value: "2"
```
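The ndots option decides lookup order: a name with fewer dots than ndots is first tried with each search-list suffix appended, which is why short in-cluster names like api-service resolve, and why lowering ndots cuts wasted lookups for external FQDNs. A simplified sketch of the resolver's candidate ordering (the search list here is an assumed example):

```python
def lookup_order(name, ndots, search):
    """Order in which the resolver tries candidate names (simplified)."""
    if name.endswith("."):           # trailing dot: absolute name, no search
        return [name]
    candidates = [f"{name}.{s}" for s in search]
    if name.count(".") >= ndots:     # "looks absolute enough": try as-is first
        return [name] + candidates
    return candidates + [name]       # otherwise walk the search list first

search = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(lookup_order("api-service", 2, search)[0])    # api-service.production.svc.cluster.local
print(lookup_order("www.baidu.com", 2, search)[0])  # www.baidu.com (2 dots >= ndots)
```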
Case 5: PVC Won't Bind
Symptom

```bash
kubectl get pvc -n production
# NAME       STATUS    VOLUME   CAPACITY
# data-pvc   Pending
# Stuck in Pending!
```
Troubleshooting

```bash
# 1. Inspect the PVC
kubectl describe pvc data-pvc -n production

# Common causes:
# - StorageClass does not exist
# - Storage quota exceeded
# - Volume type not supported by the cloud provider

# 2. Check StorageClasses
kubectl get storageclass

# 3. Check cloud-provider limits
# Tencent Cloud CBS: at most 20 cloud disks per node
```
Fixes

```yaml
# Option 1: use a StorageClass that actually exists
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: "cbs-balanced"  # Tencent Cloud CBS
  resources:
    requests:
      storage: 50Gi
```

```bash
# Option 2: clean up unused claims and Released volumes
# (Released is a PV status, not a PVC status)
kubectl get pv | grep Released
kubectl delete pvc <pvc-name> -n <namespace>
```
Case 6: Ingress Returns 502 Bad Gateway
Symptom
The browser gets a 502, but the Pods themselves are running fine.
Troubleshooting

```bash
# 1. Check the Ingress Controller
kubectl get pods -n ingress-nginx

# 2. Check the backend
kubectl describe ingress my-ingress -n production

# 3. Test direct Pod access from the controller
kubectl exec -it nginx-ingress-xxx -n ingress-nginx -- curl -v http://<pod-ip>:8080/health
```
Real case: wrong health-check path

```yaml
# Original config
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
```

```yaml
# Fix: answer the health check at the nginx layer via a server-snippet
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/server-snippet: |
      location /health {
        return 200 'OK';
      }
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
```
Case 7: Insufficient RBAC Permissions
Symptom

```
# Application error
Forbidden: User "system:serviceaccount:default:my-app" cannot list pods in namespace "production"
```
Troubleshooting

```bash
# 1. Check the ServiceAccount
kubectl get sa my-app -n default

# 2. Check Roles
kubectl get role -n production

# 3. Check RoleBindings
kubectl get rolebinding -n production

# 4. Test the exact permission
kubectl auth can-i list pods -n production --as=system:serviceaccount:default:my-app
```
Fix

```yaml
# Create a Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
```

```yaml
# Bind it to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```
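Conceptually, the RBAC check walks the bound Role's rules and allows the request if any single rule covers the apiGroup, resource, and verb. A sketch evaluated against the pod-reader rules above (illustration only, not the real authorizer):

```python
def allowed(rules, api_group, resource, verb):
    """True if any rule grants this (apiGroup, resource, verb) combination."""
    return any(
        (api_group in r["apiGroups"] or "*" in r["apiGroups"])
        and (resource in r["resources"] or "*" in r["resources"])
        and (verb in r["verbs"] or "*" in r["verbs"])
        for r in rules
    )

rules = [
    {"apiGroups": [""], "resources": ["pods", "pods/log"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list"]},
]
print(allowed(rules, "", "pods", "list"))               # True: covered by the first rule
print(allowed(rules, "apps", "deployments", "delete"))  # False: delete was never granted
```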
Case 8: OOMKilled
Symptom

```bash
kubectl get pods -n production
# NAME      READY   STATUS      RESTARTS   AGE
# api-xxx   0/1     OOMKilled   2          10m
```
Troubleshooting and fix

```bash
# Check resource limits
kubectl describe pod api-xxx -n production | grep -A 5 "Limits"
```

```yaml
# Fix: raise the memory limit
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          limits:
            memory: "2Gi"  # raised from 512Mi
            cpu: "1"
          requests:
            memory: "1Gi"
            cpu: "500m"
```
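Which Pods get killed first under node memory pressure also depends on QoS class: Guaranteed (requests equal limits for every container) is evicted last, Burstable next, BestEffort (no requests or limits) first. A simplified single-container sketch of the classification:

```python
def qos_class(requests, limits):
    """Simplified QoS classification for a single-container Pod."""
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "500m", "memory": "1Gi"}, {"cpu": "1", "memory": "2Gi"}))  # Burstable
print(qos_class({"cpu": "1", "memory": "2Gi"}, {"cpu": "1", "memory": "2Gi"}))     # Guaranteed
```

The Deployment above is Burstable; setting requests equal to limits would make it Guaranteed at the cost of reserving more capacity.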
Case 9: Node NotReady
Symptom

```bash
kubectl get nodes
# NAME     STATUS     ROLES    AGE   VERSION
# node-1   NotReady   worker   30d   v1.28.0
```
Troubleshooting

```bash
# 1. SSH into the affected node
ssh root@node-1

# 2. Check kubelet
systemctl status kubelet

# 3. Check its logs
journalctl -u kubelet -n 100 --no-pager

# Common causes:
# - Out of disk space
# - Out of memory
# - Expired kubelet certificates
```
Fixes

```bash
# Free up disk space (removes all unused images, containers, and volumes)
docker system prune -a --volumes

# Restart kubelet
systemctl restart kubelet

# If certificates have expired (kubeadm clusters)
kubeadm certs renew all
systemctl restart kubelet
```
Case 10: HPA Won't Scale Out
Symptom

```bash
kubectl get hpa -n production
# NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS
# api-hpa   Deployment/api   85%/80%   2         10        2
# CPU is at 85% > 80%, yet the replica count is still 2!
```
Troubleshooting

```bash
# 1. Inspect the HPA
kubectl describe hpa api-hpa -n production

# 2. Common causes:
# - Pods have no resource requests (CPU/memory)
# - Metrics Server is not running
# - The replica count already hit the maximum

# 3. Check Metrics Server
kubectl get pods -n kube-system | grep metrics

# 4. Verify metrics collection
kubectl top pods -n production
```
Fix

```yaml
# Pods must declare resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          requests:
            cpu: "500m"  # required! HPA computes utilization against requests
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```
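The HPA's core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. For the symptom above (2 replicas at 85% against an 80% target), a working HPA would scale to 3:

```python
import math

def desired_replicas(current, metric, target, min_r, max_r):
    """HPA core formula: ceil(current * metric/target), clamped to [min, max]."""
    return max(min_r, min(max_r, math.ceil(current * metric / target)))

print(desired_replicas(2, 85, 80, 2, 10))  # 3
```

If the Pods have no CPU requests, utilization cannot be computed at all and the HPA simply does nothing, which is exactly what this case showed.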
Production Checklist
Verify before every deployment:

```bash
# 1. Resource requests/limits are set on every container
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].resources}{"\n"}{end}'

# 2. Liveness/readiness probes are configured
kubectl get deploy -n production -o yaml | grep -E "livenessProbe|readinessProbe"

# 3. Network connectivity
kubectl run net-test --rm -it --image=busybox -- wget -qO- http://api-service:8080/health

# 4. Storage is available
kubectl get pvc -n production

# 5. RBAC permissions
kubectl auth can-i list pods -n production --as=system:serviceaccount:default:my-app
```
Summary: Key Takeaways

| Category | Common Problem | Prevention |
|---|---|---|
| **Scheduling** | Pod Pending | Plan capacity ahead, keep a buffer |
| **Lifecycle** | CrashLoopBackOff | Sensible probes + resource limits |
| **Networking** | Service unreachable | Verify label matching, test DNS |
| **Storage** | PVC Pending | Confirm the StorageClass exists |
| **Permissions** | RBAC errors | Least privilege, verify with tests |
| **Autoscaling** | HPA inactive | Always set resource requests |
Golden rule: validate thoroughly in a test environment before going to production!
👤 About the Author
A public-cloud seller based in Henan, in the heart of the Central Plains, working mainly with Tencent Cloud / Alibaba Cloud / Huawei Cloud. Having stepped into countless pitfalls, now focused on putting large AI models into production. Follow the WeChat account "AI热点日报" for AI news.
Blog: yunduancloud.icu
