[Deep Dive] Kubernetes Production Troubleshooting Handbook: OOM, Scheduling Failures, and Network Issues Explained



K8s troubleshooting is a required skill for every operations engineer. This article covers the most common classes of production failures: OOMKilled, Pod scheduling failures, and broken service networking, and gives a systematic diagnostic approach and concrete fixes for each.

1. Basic Troubleshooting Methodology

# The four-step troubleshooting method
# 1. Events:  kubectl describe
# 2. Logs:    kubectl logs
# 3. Shell into the container: kubectl exec
# 4. Metrics: kubectl top

# Quick cluster-wide overview
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# List every unhealthy Pod
# (excluding Succeeded filters out normally completed Jobs)

kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Sort events by time and show the latest 30 to spot anomalies quickly
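
The four steps above can be wrapped into one helper that collects the most useful signals for a single Pod in one shot. This is a minimal sketch: the script name and the `triage()` function are made up for illustration, and it assumes a reachable cluster (and metrics-server for step 4).

```shell
#!/bin/bash
# triage.sh: run the four-step method against a single Pod in one shot.
# Sketch only - the script name and the triage() helper are illustrative.
# Step 3 (kubectl exec) is interactive, so it is left out and run by hand.

triage() {
  local pod="$1" ns="${2:-default}"

  echo "--- 1. Events ---"
  kubectl describe pod "$pod" -n "$ns" | grep -A10 Events

  echo "--- 2. Logs (last 50 lines) ---"
  kubectl logs "$pod" -n "$ns" --tail=50

  echo "--- 4. Metrics ---"
  kubectl top pod "$pod" -n "$ns" 2>/dev/null \
    || echo "metrics-server not installed"
}

# Usage: triage mypod [namespace]
```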

2. OOMKilled: Diagnosing and Preventing Out-of-Memory Kills

# Symptom: the Pod is OOMKilled and keeps restarting
kubectl describe pod myapp-xxx | grep -A5 'Last State'
# Last State: Terminated
#   Reason: OOMKilled   <- confirms an out-of-memory kill
#   Exit Code: 137      <- 128 + 9, i.e. killed by SIGKILL

# Check actual memory usage
kubectl top pod myapp-xxx
# NAME        CPU(cores)   MEMORY(bytes)
# myapp-xxx   200m         450Mi

# If usage is close to the limit, the limit is too small
kubectl get pod myapp-xxx -o jsonpath='{.spec.containers[0].resources}'
# {"limits":{"memory":"512Mi"},"requests":{"memory":"256Mi"}}

# Fix: adjust the resource limits
# deployment.yaml
resources:
  requests:
    memory: '512Mi'   # raise to the observed usage
    cpu: '200m'
  limits:
    memory: '1Gi'     # leave plenty of headroom
    cpu: '500m'

# Apply the change
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
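
To judge how close a Pod is running to its limit before the OOM killer fires, compare the `kubectl top` value against the configured limit. A pure-shell sketch of that unit math (the `to_mib`/`mem_pct` helper names are made up; only the common Ki/Mi/Gi suffixes plus bare bytes are handled):

```shell
#!/bin/bash
# mem-headroom.sh: compare `kubectl top` memory usage against the limit.
# Sketch only - helper names are illustrative.

to_mib() {            # normalize a Kubernetes quantity to whole MiB
  local q="$1"
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo "${q%Mi}" ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *)   echo $(( q / 1024 / 1024 )) ;;   # bare bytes
  esac
}

mem_pct() {           # <usage> <limit> -> integer percent used
  local used limit
  used=$(to_mib "$1")
  limit=$(to_mib "$2")
  echo $(( used * 100 / limit ))
}

# With the numbers from the `kubectl top` output above:
mem_pct 450Mi 512Mi   # prints 87 - close to the limit
```

A common rule of thumb is that sustained usage above roughly 80% of the limit is a signal to raise it before the OOM killer does it for you.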

3. Pod Scheduling Failures: Insufficient Resources

# Symptom: the Pod stays Pending
kubectl describe pod myapp-xxx | grep -A10 Events
# Warning  FailedScheduling  0/3 nodes are available:
#   1 Insufficient cpu, 2 Insufficient memory

# Check resource usage across all nodes
kubectl describe nodes | grep -A5 'Allocated resources'

# Check allocatable resources per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU-ALLOC:.status.allocatable.cpu,\
MEM-ALLOC:.status.allocatable.memory

# If the nodes are genuinely out of resources, there are a few options:
# 1. Add nodes (cloud-provider autoscaling)
# 2. Lower the Pods' resource requests (but keep the limits)
# 3. Clean up idle Pods

# List each Pod's requests to spot unreasonable settings (needs python3)
kubectl get pods -A -o json | python3 -c '
import json, sys
for p in json.load(sys.stdin)["items"]:
    print(p["metadata"]["namespace"], p["metadata"]["name"],
          p["spec"]["containers"][0].get("resources", {}))
'

4. Broken Service Networking: Service and DNS Diagnosis

# Start a debug container
kubectl run debug --image=nicolaka/netshoot -it --rm -- bash

# Inside the debug container:
# 1. Test DNS resolution
nslookup myservice.mynamespace.svc.cluster.local
# If it fails: check whether CoreDNS is healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Test Service connectivity
curl -v http://myservice.mynamespace.svc.cluster.local:8080/health

# 3. Test the Pod directly (bypassing the Service)
POD_IP=$(kubectl get pod myapp-xxx -o jsonpath='{.status.podIP}')
curl -v http://$POD_IP:8080/health

# If the Pod responds directly but the Service does not,
# check that the Service selector matches the Pod labels
kubectl describe service myservice | grep Selector
kubectl get pods --show-labels | grep myapp
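
When the selector and labels look consistent but traffic still fails, it is worth checking whether the Service has any ready endpoints at all: an empty endpoints list means no Pod matched the selector, or no matching Pod passed its readiness probe. A small sketch (the `check_endpoints` helper and `myservice` are placeholder names):

```shell
#!/bin/bash
# check_endpoints: warn when a Service has no ready endpoints.
# Sketch only - function and Service names are illustrative.

check_endpoints() {
  local svc="$1" ns="${2:-default}"
  local eps
  eps=$(kubectl get endpoints "$svc" -n "$ns" \
          -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$eps" ]; then
    echo "WARN: $svc has no ready endpoints - check selector/readiness probes"
  else
    echo "OK: $svc -> $eps"
  fi
}

# Usage: check_endpoints myservice mynamespace
```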

5. A Practical Diagnostic Script

#!/bin/bash
# k8s-health-check.sh: K8s cluster health check

echo '=== Node status ==='
kubectl get nodes

echo '=== Abnormal Pods ==='
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo '=== Recent events (Warning) ==='
kubectl get events -A --field-selector=type=Warning --sort-by='.lastTimestamp' | tail -20

echo '=== Resource usage ==='
kubectl top nodes 2>/dev/null || echo 'metrics-server not installed'

echo '=== PVC status ==='
kubectl get pvc -A | grep -v Bound

Summary: the key to K8s troubleshooting is locating the problem layer by layer. Start at the cluster level (node status), then the workload level (Deployment/Pod), and finally the network level (Service/DNS). Mastering the four core tools, describe/logs/exec/top, will get you through 90% of production issues.
