Kubernetes Pod Scheduling Strategies Explained

Pod scheduling is a core function of Kubernetes cluster management: it decides which Node each Pod runs on. Well-configured scheduling policies enable efficient use of hardware resources and highly available deployments.

1. How the Default Scheduler Works

The default Kubernetes scheduler (kube-scheduler) selects a target Node in two phases:

  1. Filtering: eliminate Nodes that cannot run the Pod (insufficient resources, unmatched taints, port conflicts, etc.)
  2. Scoring: rank the remaining Nodes by factors such as resource balance, data locality, and Pod affinity, then pick the highest-scoring Node
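The scoring phase can be tuned through the scheduler's own configuration file, passed to kube-scheduler via --config. A minimal sketch, assuming the kubescheduler.config.k8s.io/v1 API (available in recent K8s versions; the weight value here is illustrative, not a recommendation):

```yaml
# KubeSchedulerConfiguration: adjust the weight of a built-in score plugin
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 2     # illustrative: weigh balanced CPU/memory usage more heavily
```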

2. Resource Requests and Limits

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:         # resources guaranteed at scheduling time
        cpu: "500m"     # half a CPU core
        memory: "256Mi"
      limits:           # maximum allowed at runtime
        cpu: "1000m"    # 1 core (exceeding it is throttled, not killed)
        memory: "512Mi" # exceeding it triggers an OOM kill

Recommendation: set requests to the workload's actual average usage and limits to roughly twice its peak. Requests that are too small lead to overloaded Nodes; requests that are too large waste resources.
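Pods that omit requests/limits entirely bypass this accounting; a namespace can supply defaults via a LimitRange. A sketch (the name and values below are illustrative):

```yaml
# LimitRange: namespace-level defaults for containers that omit resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits     # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: "250m"
      memory: "128Mi"
    default:               # applied when a container omits limits
      cpu: "500m"
      memory: "256Mi"
```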

3. Node Selection (NodeSelector & NodeAffinity)

# NodeSelector: simple label matching
spec:
  nodeSelector:
    disk-type: ssd         # only schedule onto nodes labeled disk-type=ssd
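The disk-type=ssd label must already exist on a node before the selector can match it; assuming a node named node01:

```shell
# Label the node so the nodeSelector above can match it
kubectl label nodes node01 disk-type=ssd

# Verify which nodes carry the label
kubectl get nodes -l disk-type=ssd
```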

# NodeAffinity: more flexible node affinity rules
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # hard requirement
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["cn-shanghai-a", "cn-shanghai-b"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft preference
      - weight: 100
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values: ["high-memory"]

4. Pod Affinity and Anti-Affinity

# Pod anti-affinity: spread replicas of the same app across nodes (high availability)
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-web
        topologyKey: "kubernetes.io/hostname"  # spread per Node
        
# Pod affinity: co-locate frontend Pods with backend Pods where possible (lower latency)
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-backend
          topologyKey: "kubernetes.io/hostname"

5. Taints and Tolerations (Taints & Tolerations)

# Taint a Node (keeps ordinary Pods off dedicated nodes)
kubectl taint nodes gpu-node01 dedicated=gpu:NoSchedule

# A Pod needs a matching toleration to be scheduled onto a tainted node
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

Common uses: dedicating GPU nodes to AI training workloads; reserving high-spec nodes for databases; marking a Node for maintenance (a NoExecute taint evicts Pods already running on it).
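For the maintenance case, a Node would be tainted with an effect of NoExecute rather than NoSchedule (e.g. kubectl taint nodes node01 maintenance=true:NoExecute, where node01 and the maintenance key are placeholders). A tolerating Pod can bound how long it stays with tolerationSeconds, which applies only to NoExecute taints; the 300 here is illustrative:

```yaml
# Toleration with a time limit: the Pod survives the NoExecute taint
# for at most 300 seconds, then is evicted
spec:
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 300
```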

6. Topology Spread Constraints (topologySpreadConstraints)

# Keep Pods evenly distributed across availability zones (K8s 1.19+)
spec:
  topologySpreadConstraints:
  - maxSkew: 1            # per-zone Pod counts may differ by at most 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
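Multiple constraints can be stacked, for example a hard spread across zones combined with a soft spread across nodes. A sketch:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule     # hard: refuse to schedule if zones would skew
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway    # soft: prefer node-level spread, but don't block
    labelSelector:
      matchLabels:
        app: my-app
```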

7. Priority and Preemption

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority  # may preempt lower-priority Pods

# Reference it from a Pod
spec:
  priorityClassName: high-priority

8. Summary

Scheduling policy is key to high availability and efficient resource use in Kubernetes. For production, at minimum configure: resource requests and limits (to prevent resource contention); Pod anti-affinity (to spread replicas of the same service across nodes); and high priority classes for critical Pods. As workloads grow more complex, gradually adopt advanced features such as node affinity and topology spread constraints.