DESCRIPTION:
     If the affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to a pod label update), the system may or may not try to eventually evict the pod from its node. When there are multiple elements, the lists of nodes corresponding to each podAffinityTerm are intersected, i.e. all terms must be satisfied.

     Defines a set of pods (namely those matching the labelSelector relative to the given namespace(s)) that this pod should be co-located (affinity) or not co-located (anti-affinity) with, where co-located is defined as running on a node whose value of the label with key <topologyKey> matches that of any node on which a pod of the set of pods is running

FIELDS:
   labelSelector <Object>
     A label query over a set of resources, in this case pods.

   namespaceSelector <Object>
     A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces.

   namespaces <[]string>
     namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace".

   topologyKey <string> -required-
     This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.
$ kubectl explain pod.spec.affinity.podAffinity.preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm
KIND:     Pod
VERSION:  v1

RESOURCE: podAffinityTerm <Object>

DESCRIPTION:
     Required. A pod affinity term, associated with the corresponding weight.

     Defines a set of pods (namely those matching the labelSelector relative to the given namespace(s)) that this pod should be co-located (affinity) or not co-located (anti-affinity) with, where co-located is defined as running on a node whose value of the label with key <topologyKey> matches that of any node on which a pod of the set of pods is running

FIELDS:
   labelSelector <Object>
     A label query over a set of resources, in this case pods.

   namespaceSelector <Object>
     A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces.

   namespaces <[]string>
     namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace".

   topologyKey <string> -required-
     This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.
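topologyKey is required and specifies how nodes are grouped into topology domains.
Because the decision is based on Pod labels, and Pods are namespaced, only Pods in the same namespace are compared by default. If you need something different, use namespaceSelector or namespaces to pick the target namespaces; every Pod in those namespaces is then taken into consideration.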
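The namespace rules quoted in the field documentation above (union of the two fields, fallback to the Pod's own namespace, empty selector matching everything) can be sketched roughly as follows. This is only an illustration, not the scheduler's code: `resolve_namespaces` and its parameters are hypothetical names, and the selector is simplified to a plain dict of required labels (matchLabels only).

```python
from typing import Dict, List, Optional, Set


def resolve_namespaces(
    pod_namespace: str,
    namespaces: Optional[List[str]],
    namespace_selector: Optional[Dict[str, str]],
    all_namespaces: Dict[str, Dict[str, str]],
) -> Set[str]:
    """Return the namespaces a pod affinity term applies to.

    all_namespaces maps namespace name -> its labels (hypothetical input).
    namespace_selector is simplified to matchLabels semantics only.
    """
    # Null selector and null/empty list: fall back to the pod's own namespace.
    if namespace_selector is None and not namespaces:
        return {pod_namespace}

    # Start from the static list, then add whatever the selector matches.
    selected = set(namespaces or [])
    if namespace_selector is not None:
        # An empty selector ({}) has no requirements, so it matches everything.
        for name, labels in all_namespaces.items():
            if all(labels.get(k) == v for k, v in namespace_selector.items()):
                selected.add(name)
    return selected
```

Passing `None` for both fields yields only the Pod's own namespace, while `{}` as the selector pulls in every namespace, which is why an empty selector should be used deliberately.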
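For Anti-Affinity, if service A does not want to be scheduled together with service B, that implies service B does not want to be with service A either. Affinity has no such symmetry, so the two are handled by slightly different algorithms. The rules below are excerpted from the official design document.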
Anti-Affinity
if S1 has the aforementioned RequiredDuringScheduling anti-affinity rule
    if a node is empty, you can schedule S1 or S2 onto the node
    if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
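This means that if service A uses Anti-Affinity to restrict being scheduled with service B, then deploying either service A or service B triggers a check for rule violations. If nothing is violated, the Pod can be placed freely.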
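The symmetric check described by the anti-affinity rules can be sketched as below. This is a simplified model under stated assumptions, not the scheduler's implementation: `violates_anti_affinity` is a hypothetical helper, services are identified by plain names instead of label selectors, and a "node" stands in for a whole topology domain.

```python
from typing import Dict, Set


def violates_anti_affinity(
    incoming: str,
    running: Set[str],
    anti_affinity: Dict[str, Set[str]],
) -> bool:
    """Return True if placing `incoming` next to `running` breaks a rule.

    anti_affinity maps a service name to the services it refuses to share
    a topology domain with.
    """
    # Forward check: the incoming pod's own rule rejects a pod already there.
    if anti_affinity.get(incoming, set()) & running:
        return True
    # Symmetric check: a pod already there has a rule rejecting the newcomer.
    return any(incoming in anti_affinity.get(svc, set()) for svc in running)
```

Only S1 declares a rule in the excerpt, yet the second check is what stops S2 from landing next to an existing S1.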
Affinity
if S1 has the aforementioned RequiredDuringScheduling affinity rule
    if a node is empty, you can schedule S2 onto the node
    if a node is empty, you cannot schedule S1 onto the node
    if a node is running S2, you can schedule S1 onto the node
    if a node is running S1+S2 and S1 terminates, S2 continues running
    if a node is running S1+S2 and S2 terminates, the system terminates S1 (eventually)
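In the same example, service A uses Affinity to require being scheduled together with service B. By the second rule, if service B does not exist yet, service A cannot be scheduled and stays in the Pending state.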
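The scheduling-time part of the affinity rules, including the lack of symmetry, can be sketched like this. It is an illustrative model only: `can_schedule_with_affinity` is a hypothetical helper, each service has at most one required partner, and the eviction behaviour from the last two rules is not modelled.

```python
from typing import Dict, Optional, Set


def can_schedule_with_affinity(
    incoming: str,
    running: Set[str],
    affinity: Dict[str, str],
) -> bool:
    """Return True if `incoming` may join a domain currently running `running`.

    affinity maps a service name to the single service it must be
    co-located with (a simplification of a labelSelector).
    """
    required: Optional[str] = affinity.get(incoming)
    if required is None:
        # No rule of its own: affinity is not symmetric, so nothing to check.
        return True
    # The required partner must already be running in the domain.
    return required in running
```

With `{"S1": "S2"}`, S2 can go anywhere while S1 is rejected by every domain until S2 is running somewhere, which is the Pending situation described above.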
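After deploying, you can observe that services A and B complete their scheduling decisions almost simultaneously and are placed in the same kind.zone.
Looking at the process step by step, service A initially cannot find any service B Pod to match against, so all of its Pods are stuck in Pending.
Service B declares no Affinity rules of its own, so it is scheduled without any problem.
Once service B has been scheduled to kind.zone=zone1, the stuck service A Pods finally have a reference to compare against, so all of them are deployed right away.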
PodTopologySpread
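The NodeAffinity and Inter-Pod (Anti-)Affinity discussed above cover many needs for controlling Pod scheduling, but a few problems show up in practice:
NodeAffinity does not guarantee that Pods are spread evenly across nodes, so the distribution can end up uneven (66981).
Inter-Pod Anti-Affinity runs into trouble during a Deployment rolling upgrade: the new Pod has to be created first, but the Anti-Affinity constraint leaves no node available, so the new-version Pod stays Pending and the whole Deployment update stalls (40358).
These problems led to the development of Pod Topology Spread. Its most important concept is Skew, a value that captures how unevenly Pods are distributed. It is defined as: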
skew = (number of matching Pods in the current topology domain) - (minimum number of matching Pods in any topology domain)
Each time PodTopologySpread places a Pod, it computes the current Skew for every node and uses that value to influence the scheduling decision.
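The skew definition can be worked through with a small sketch. `compute_skew` is a hypothetical helper, and the zone names and Pod counts in the usage note are made-up illustration values.

```python
from typing import Dict


def compute_skew(pods_per_domain: Dict[str, int]) -> Dict[str, int]:
    """Return the skew of each topology domain.

    pods_per_domain maps a topology value (e.g. a zone name) to the number
    of pods matching the labelSelector in that domain. It must contain at
    least one domain.
    """
    # The least-loaded domain is the baseline every other domain is compared to.
    minimum = min(pods_per_domain.values())
    return {domain: count - minimum for domain, count in pods_per_domain.items()}
```

For example, `compute_skew({"zone1": 3, "zone2": 1, "zone3": 2})` gives a skew of 2 for zone1, 0 for zone2, and 1 for zone3.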
$ kubectl explain pod.spec.topologySpreadConstraints
KIND:     Pod
VERSION:  v1

RESOURCE: topologySpreadConstraints <[]Object>

DESCRIPTION:
     TopologySpreadConstraints describes how a group of pods ought to spread across topology domains. Scheduler will schedule pods in a way which abides by the constraints. All topologySpreadConstraints are ANDed.

     TopologySpreadConstraint specifies how to spread matching pods among the given topology.

FIELDS:
   labelSelector <Object>
     LabelSelector is used to find matching pods. Pods that match this label selector are counted to determine the number of pods in their corresponding topology domain.

   maxSkew <integer> -required-
     MaxSkew describes the degree to which pods may be unevenly distributed. required field. Default value is 1 and 0 is not allowed.

   topologyKey <string> -required-
     TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We consider each <key, value> as a "bucket", and try to put balanced number of pods into each bucket.

   whenUnsatisfiable <string> -required-
     WhenUnsatisfiable indicates how to deal with a pod if it doesn't satisfy
     Possible enum values:
     - `"DoNotSchedule"` instructs the scheduler not to schedule the pod when constraints are not satisfied.
     - `"ScheduleAnyway"` instructs the scheduler to schedule the pod even if constraints are not satisfied.
maxSkew sets the upper bound on Skew: if placing the Pod on a node would push the Skew above this limit, that node is skipped.
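How maxSkew filters candidates can be sketched as follows. This is a rough model of the DoNotSchedule case, not the scheduler's code: `feasible_domains` is a hypothetical helper, it works per topology domain, and it assumes the minimum is taken from the counts before the new Pod is placed.

```python
from typing import Dict, List


def feasible_domains(pods_per_domain: Dict[str, int], max_skew: int) -> List[str]:
    """Return the domains where one more matching pod keeps skew <= max_skew."""
    if max_skew < 1:
        # The field documentation states that 0 is not allowed.
        raise ValueError("maxSkew must be at least 1")
    global_min = min(pods_per_domain.values())
    feasible = []
    for domain, count in pods_per_domain.items():
        # Skew this domain would have once the new pod lands on it.
        if (count + 1) - global_min <= max_skew:
            feasible.append(domain)
    return feasible
```

With counts of 2, 1 and 1 across three zones and maxSkew set to 1, the first zone is skipped because a third Pod there would raise its skew to 2. Raising maxSkew to 2 makes all three zones eligible again.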