Part 2
Topics:
POD
- Single Container pod
- MultiContainer pod
- Login to a pod
- Copy to a pod and from a pod(container)
- How to check logs from a container
- Environment Variables of pod
- initContainer
- Command and Argument of a pod
- pullSecret of a pod
- pod restart policy
- imagePull policy
- How to delete a pod
- Pod priority
- Pod Resources
- Pod Quality of Services(QOS)
NameSpace
- Create a namespace
- Switch from one ns to another
- Resource Quota
- Resource Limits
Advanced Scheduling of Pods
- Scheduler
- Nodename
- nodeSelector
- Node Affinity
- Taints and Tolerations
- Pod affinity and anti affinity
- Priority and PriorityClass
- Preemption
- Disruption Budget
- Topology and Constraints
- Descheduler
POD
- A pod is the smallest deployable unit in Kubernetes that represents a single instance of an application.
- For example, if you want to run the Nginx application, you run it in a pod.
- A container is a single unit. However, a pod can contain more than one container. You can think of pods as a box that can hold one or more containers together.
- Each pod gets a single unique IP address; pods communicate with each other using these IP addresses (you can verify a pod's IP with the commands after this list).
- Containers inside a pod connect to each other over localhost on different ports.
- Containers running inside a pod should use different port numbers to avoid port clashes.
- You can set CPU and memory resources for each container running inside the pod.
- Containers inside a pod share the same volume mount.
- All the containers inside a pod are scheduled on the same node; It cannot span multiple nodes.
- If there is more than one container, all the main (app) containers start in parallel during pod startup, whereas the init containers run in sequence before them.
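For instance, you can check the IP assigned to a pod and the node it runs on (the pod name is a placeholder):
kubectl get pod <pod-name> -o jsonpath='{.status.podIP}'
kubectl get pods -o wide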
Internal Process of Pod Creation
Submit Pod Specification
- User Action: You provide a Pod specification (YAML or JSON) using the kubectl apply -f command or through an API request.
- Kubernetes API Server: The specification is sent to the Kubernetes API server, which is the central management component of Kubernetes.
API Server Processing
- Validation: The API server validates the Pod specification to ensure it conforms to the Kubernetes schema.
- Persistence: If valid, the Pod specification is stored in etcd, the cluster's key-value store. This acts as the source of truth for all cluster data.
Scheduler
- Pod Scheduling: The Kubernetes Scheduler continuously looks for newly created Pods that don't have a Node assigned and need scheduling.
- Resource Evaluation: The Scheduler evaluates available Nodes based on resource requirements (CPU, memory), constraints, and other scheduling policies.
- Binding: Once a suitable Node is found, the Scheduler binds the Pod to that Node by updating the Pod's status in etcd.
Kubelet
- Node-Level Management: Each Node runs a Kubelet, an agent responsible for managing Pods on that Node.
- Pod Fetching: The Kubelet watches the API server for updates and finds out that a Pod has been scheduled to its Node.
- Container Runtime Interaction: The Kubelet interacts with the container runtime (like Docker or containerd) to pull container images and create containers based on the Pod specification.
Container Creation
- Image Pulling: The container runtime pulls the required container images from the specified container registry if they are not already cached on the Node.
- Container Start: The container runtime creates and starts the containers according to the Pod's configuration (e.g., environment variables, volume mounts).
Pod Initialization
- Lifecycle Hooks: Any defined lifecycle hooks (like initContainers) are executed. Init containers run before the main containers and must complete successfully for the Pod to start.
- Readiness and Liveness Probes: The Kubelet performs readiness and liveness checks as defined in the Pod specification to ensure containers are running properly and are ready to accept traffic.
Pod Running
- Status Update: Once the containers are running, the Kubelet updates the Pod's status in the API server.
- Communication: The Pod is now part of the cluster network, and other Pods or services can communicate with it based on defined network policies and service configurations.
Detailed Lifecycle of a Pod
- Pending: When the Pod is first created, it is in the Pending state until the Scheduler assigns it to a Node.
- Running: Once the containers are started, the Pod transitions to the Running state. This state means the Pod is actively running on the assigned Node.
- Succeeded: If the Pod’s containers complete their tasks and exit successfully (for example, a Job or a single-run task), the Pod moves to the Succeeded state.
- Failed: If the Pod’s containers exit with an error or fail to start, it transitions to the Failed state.
- Unknown: If the Kubernetes system cannot determine the Pod’s state (for example, due to communication issues with the Node), the Pod status may be marked as Unknown.
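To check which phase a pod is currently in (the pod name is a placeholder):
kubectl get pod <pod-name> -o jsonpath='{.status.phase}'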
Summary
The creation and management of a Pod in Kubernetes involve several key components: the API server, the Scheduler, the Kubelet, and the container runtime. Each plays a role in ensuring that the Pod is properly scheduled, created, and maintained according to the specifications provided.
Purpose of the Pause Container in Kubernetes
Overview
In Kubernetes, a pause container is a special container used primarily for managing the network namespace of a Pod. It serves as a “parent” container, ensuring that the Pod’s network namespace remains active and stable.
Key Purposes
- Network Namespace Management
  - Network Namespace: Each Pod in Kubernetes is assigned a network namespace, which isolates its network resources.
  - Pause Container Role: The pause container holds the network namespace open. Without it, the namespace could be terminated if the main container(s) in the Pod stop running. The pause container ensures that the network namespace persists as long as the Pod is alive.
- Pod Lifecycle Stability
  - Main Containers: Pods can have one or more main containers. When these containers complete their tasks or exit, the pause container ensures the Pod's network namespace is not destroyed prematurely.
  - Pod Deletion: The Pod itself is only deleted when the pause container is removed. This ensures that network cleanup and other associated resources are handled correctly.
- Efficient Resource Management
  - Minimal Resource Usage: The pause container does not perform any significant work. It typically runs an idle process that uses minimal resources, which helps in maintaining an efficient resource footprint.
  - Pod Reuse: By keeping the network namespace alive, the pause container allows Kubernetes to manage and reassign the network resources efficiently when Pods are scaled or recreated.
Implementation
- Container Image: The pause container usually uses a lightweight image that does nothing but keep the namespace active. A common image used is k8s.gcr.io/pause.
- Pod Specification: The pause container is automatically added by Kubernetes when a Pod is created. Users typically do not interact with or configure the pause container directly.
Summary
The pause container is a crucial component in Kubernetes that maintains the network namespace of a Pod, ensuring stability and efficient resource management. It supports the lifecycle of Pods by keeping network resources active and ready for use, even if the main containers in the Pod stop running.
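If you have access to a node, you can see the pause containers directly through the container runtime. For example (which commands work depends on the runtime your cluster uses):
# Docker runtime: one pause container per pod
docker ps | grep pause
# containerd/CRI-O: each pod sandbox is backed by a pause container
crictl pods
crictl images | grep pause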
- Create a pod
kubectl run pod --image=nginx
- Check newly created pod
kubectl get pods
- Check more info about a pod, like where it is scheduled and its IP address
kubectl get pods -o wide
- Check pod details like events and resources
kubectl describe pod <podname>
- Check the name of all pods
kubectl get pods -o name
- Create a pod using yaml file
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: example-pod
labels:
app: example
spec:
containers:
- name: my-container
image: nginx:latest
EOF
- Check Pod Logs
kubectl logs <pod-name>
- Follow Pod Logs in Real-time
kubectl logs -f <pod-name>
- How to check logs for a specific container in a multi-container pod
kubectl logs -c <containername> <podname>
- Check the logs for all containers
kubectl logs <podname> --all-containers=true
- Execute Command in a Pod:
kubectl exec -it <pod-name> -- <command>
- Copy Files to/from Pod:
kubectl cp <local-path> <pod-name>:<pod-path>
kubectl cp <pod-name>:<pod-path> <local-path>
- Delete a Pod:
kubectl delete pod <pod-name>
- How to delete a pod forcefully
kubectl delete pod <pod-name> --force --grace-period=0
- Port Forwarding to Pod:
kubectl port-forward <pod-name> <local-port>:<pod-port>
- Port forwarding on ip address not on the localhost
kubectl port-forward --address 0.0.0.0 pod/mypod 8888:5000
- Get YAML Definition of a Running Pod:
kubectl get pod <pod-name> -o yaml
Some useful commands in real-world (production) scenarios
- Find out all the images of all the pods
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}'
- Get all container names
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}: {.spec.containers[*].name}{"\n"}{end}'
Define Environment Variables for a Container
A Kubernetes environment variable is a dynamic value that configures some aspect of the environment in which a Kubernetes-based application runs.
env:
- name: SERVICE_PORT
value: "8080"
- name: SERVICE_IP
value: "192.168.100.1"
Problem statement: the pod is failing because of a missing environment variable
- Deploy one mysql pod and see if the pod is failing
kubectl run mysql --image=mysql:5.6
- Check the logs and fix the issue
Environment Variables
- Apply required variables using below yaml file
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-env
spec:
containers:
- name: my-container
image: mysql:5.6
env:
- name: MYSQL_ROOT_PASSWORD
value: "root"
EOF
Command and Arguments with Kubernetes Pod
Here’s a table summarizing the field names used by Docker and Kubernetes:
Description | Docker field name | Kubernetes field name |
---|---|---|
The command run by the container | Entrypoint | command |
The arguments passed to the command | Cmd | args |
When you override the default Entrypoint and Cmd, these rules apply:
- If you do not supply command or args for a Container, the defaults defined in the Docker image are used.
- If you supply a command but no args for a Container, only the supplied command is used. The default EntryPoint and the default Cmd defined in the Docker image are ignored.
- If you supply only args for a Container, the default Entrypoint defined in the Docker image is run with the args that you supplied.
- If you supply a command and args, the default Entrypoint and the default Cmd defined in the Docker image are ignored. Your command is run with your args.
Example 1: Command Override
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-command
spec:
containers:
- name: my-container
image: nginx:latest
command: ["echo"]
args: ["Hello, Kubernetes!"]
EOF
Example 2: Command and Arguments
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-command-args
spec:
containers:
- name: my-container
image: busybox:latest
command: ["sh", "-c"]
args: ["echo Hello from Kubernetes! && sleep 3600"]
EOF
Example 3: Passing Environment Variables to Commands
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-env-vars
spec:
containers:
- name: my-container
image: alpine:latest
command: ["/bin/sh", "-c"]
args: ["echo \$GREETING"]
env:
- name: GREETING
value: "Hello, Kubernetes!"
EOF
Example 4: Passing Arguments to Docker Entrypoint
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-entrypoint-args
spec:
containers:
- name: my-container
image: ubuntu:latest
command: ["/bin/echo"]
args: ["Hello", "Kubernetes!"]
EOF
MultiContainer pod
Use Cases for Multi-Container Pods
- Pods that run multiple containers that need to work together.
- A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources.
- These co-located containers form a single cohesive unit.
- Here is an example of a multi-container pod.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: nginx-redis-pod
spec:
containers:
- name: nginx-container
image: nginx:latest
ports:
- containerPort: 80
- name: redis-container
image: redis:latest
ports:
- containerPort: 6379
EOF
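A quick way to confirm that both containers share the pod's network namespace is to reach Redis over localhost from inside the pod (this assumes redis-cli is present in the redis image, which it normally is):
kubectl exec nginx-redis-pod -c redis-container -- redis-cli -h 127.0.0.1 ping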
InitContainer
- A Pod can have multiple containers running apps within it, but it can also have one or more init containers, which are run before the app containers are started. Init containers are exactly like regular containers, except:
- Init containers always run to completion.
- Each init container must complete successfully before the next one starts.
- If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds.
- Regular init containers (in other words: excluding sidecar containers) do not support the lifecycle, livenessProbe, readinessProbe, or startupProbe fields.
- Init containers must run to completion before the Pod can be ready.
- If you specify multiple init containers for a Pod, the kubelet runs each init container sequentially; each init container must succeed before the next can run.
Here is an example of an initContainer:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: init-container-pod
spec:
containers:
- name: main-container
image: nginx:latest
ports:
- containerPort: 80
initContainers:
- name: init-wait-nginx
image: busybox:latest
command: ["sh", "-c", "until nc -zv nginx 80; do echo 'Waiting for nginx to be ready'; sleep 1; done"]
EOF
- Make a test by creating a pod and service
kubectl run nginx --image=nginx
kubectl expose pod/nginx --port 80
- Now check the pod status; the initContainer should complete successfully
- Also check the logs for the initContainer (see the commands below)
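For example, using the names from the manifest above:
kubectl get pod init-container-pod -w
kubectl logs init-container-pod -c init-wait-nginx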
Sidecar container
- A Sidecar container extends and enhances the functionality of a preexisting container without changing it.
- This pattern is one of the fundamental container patterns that allows single-purpose containers to cooperate closely together.
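A minimal sketch of the sidecar pattern, with illustrative names: the main container writes a log file to a shared emptyDir volume and a sidecar container streams it.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo
spec:
  volumes:
  - name: logs
    emptyDir: {}
  containers:
  - name: app
    image: busybox:latest
    # the "application": appends a timestamp to a log file every 5 seconds
    command: ["sh", "-c", "while true; do date >> /var/log/app.log; sleep 5; done"]
    volumeMounts:
    - name: logs
      mountPath: /var/log
  - name: log-sidecar
    image: busybox:latest
    # the sidecar: streams the shared log file to its own stdout
    command: ["sh", "-c", "touch /var/log/app.log; tail -f /var/log/app.log"]
    volumeMounts:
    - name: logs
      mountPath: /var/log
EOF
kubectl logs sidecar-demo -c log-sidecar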
Adapter Container
- The Adapter pattern takes a heterogeneous containerized system and makes it conform to a consistent, unified interface with a standardized and normalized format that can be consumed by the outside world.
- The Adapter pattern inherits all its characteristics from the Sidecar pattern.
Resources definition in pod
Burstable QoS
- Kubernetes assigns the Burstable class to a Pod when at least one container has a resource request or limit, but the Pod does not meet the Guaranteed criteria (for example, the limit is higher than the request).
- A pod in this category has the following characteristics:
  - The Pod has not met the criteria for the Guaranteed QoS class.
  - A container in the Pod has an unequal memory or CPU request and limit.
An example resources block is given below:
resources:
  limits:
    memory: "300Mi"
    cpu: "800m"
  requests:
    memory: "100Mi"
    cpu: "600m"
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
EOF
- Check whether the pod is running
- Check the QoS class of the pod
kubectl describe pod <podname> | grep -i qos
Guaranteed QoS
- Kubernetes considers Pods classified as Guaranteed a top priority. It won't evict them unless they exceed their limits.
- A Pod with the Guaranteed class has the following characteristics:
- All containers in the Pod have a memory limit and request.
- All containers in the Pod have a memory limit equal to the memory request.
- All containers in the Pod have a CPU limit and a CPU request.
- All containers in the Pod have a CPU limit equal to the CPU request.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: guaranteed
spec:
containers:
- name: my-container
image: nginx:latest
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "64Mi"
cpu: "250m"
EOF
Best Effort QoS
- Kubernetes assigns the BestEffort class to a Pod when none of its containers have any resource requests or limits.
- A pod in this category has the following characteristics:
  - No container in the Pod defines memory or CPU requests or limits.
- Pods without a resources section are considered BestEffort.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: best-efforts
spec:
containers:
- name: my-container
image: nginx:latest
EOF
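To see which QoS class Kubernetes assigned to each pod in the current namespace:
kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass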
imagePullSecret
- A Pod can use a Secret to pull an image from a private container image registry or repository.
- There are many private registries in use.
- This task uses Docker Hub as an example registry.
Steps to Use a Pull Secret in a Pod YAML:
- Create a Docker config JSON file:
  - Create a Docker configuration file (~/.docker/config.json) with the credentials for your private registry.
  - You can use the docker login command to generate this file.
docker login
- Create a Kubernetes Secret:
kubectl create secret generic my-pull-secret --from-file=.dockerconfigjson=$HOME/.docker/config.json --type=kubernetes.io/dockerconfigjson
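Alternatively, a pull secret can be created directly from registry credentials using the docker-registry secret type (the values below are placeholders):
kubectl create secret docker-registry my-pull-secret \
  --docker-server=<registry-server> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email>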
- Use the YAML below to create a pod that pulls from the private registry
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-pull-secret
spec:
containers:
- name: my-container
image: myregistry.example.com/my-image:latest
imagePullSecrets:
- name: my-pull-secret
EOF
- Add the pull secret to the default service account so that you do not have to specify imagePullSecrets in every Pod spec
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "my-pull-secret"}]}'
imagePullPolicy Examples
1. IfNotPresent
- Description: Pull the image only if it does not already exist locally.
- Usage: Use this policy when you want to minimize image pulls and rely on locally cached images.
- Example:
spec:
  containers:
  - name: my-container
    image: my-image:latest
    imagePullPolicy: IfNotPresent
2. Always
- Description : Always pull the latest version of the image, even if it already exists locally.
- Usage: Use this policy when you want to ensure that the container runs the latest version of the image.
- Example:
spec:
containers:
- name: my-container
image: my-image:latest
imagePullPolicy: Always
3. Never
- Description: Never pull the image; only use the locally cached version if available.
- Usage: Use this policy when you want to prevent Kubernetes from pulling the image; the container only starts if the image is already present on the node.
- Example:
spec:
containers:
- name: my-container
image: my-image:latest
imagePullPolicy: Never
Note: Always, IfNotPresent, and Never are the only valid values for imagePullPolicy.
4. Default (when imagePullPolicy is not set)
- Description: If imagePullPolicy is not explicitly set, Kubernetes defaults to IfNotPresent, except when the image tag is :latest or no tag is given, in which case the default is Always.
- Usage: The default behavior is suitable for many use cases where you want to minimize image pulls. Example:
spec:
containers:
- name: my-container
image: my-image:latest
Pod restartPolicy (applies to all containers in the Pod)
1. Always
- Description: Always restart the container regardless of the exit status or reason.
- Usage: Use this policy for critical services that should always be running.
- Example:
spec:
restartPolicy: Always
2. OnFailure
- Description: Restart the container only if it exits with a non-zero status.
- Usage: Use this policy for jobs or batch processes that should be retried on failure.
Example:
spec:
restartPolicy: OnFailure
3. Never
- Description: Never restart the container, regardless of the exit status or reason.
- Usage: Use this policy for containers that are expected to run to completion and not be restarted.
Example:
spec:
restartPolicy: Never
4. Default (Always)
- Description: Kubernetes default behavior if restartPolicy is not explicitly set. Always restart the container.
- Usage: This is the default behavior and is suitable for many long-running services.
Example
spec:
restartPolicy: Always   # default behavior if not specified
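A small experiment (pod and container names are just illustrative) to observe restartPolicy: the container exits with a non-zero status, so with OnFailure the kubelet keeps restarting it and the RESTARTS count grows.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo
spec:
  restartPolicy: OnFailure
  containers:
  - name: failing-container
    image: busybox:latest
    # exits with status 1 after 5 seconds, triggering a restart under OnFailure
    command: ["sh", "-c", "echo failing; sleep 5; exit 1"]
EOF
kubectl get pod restart-demo -w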
Advanced Scheduling of Pods
Using nodeName
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-scheduled-node
spec:
nodeName: <node-name>
containers:
- name: my-container
image: nginx:latest
EOF
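List the node names in your cluster so you can substitute one into nodeName:
kubectl get nodes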
LAB 1
- Pod Management
- Create a pod
- Login to a pod
- Check logs
- Create multi container pod
- Login to a specific container
- Check logs for specific container
- Describe a pod
- Check the events
- Set the resources for a container
- requests and limits
- Configure QoS for a pod
- Best effort
- Burstable
- Guaranteed
- Set environment variables
- Use a pullSecret to download the image
Namespace
- Namespaces are logical divisions of a Kubernetes cluster.
- Namespaces provide a mechanism for isolating groups of resources within a single cluster.
- Names of resources need to be unique within a namespace, but not across namespaces.
- Combined with NetworkPolicy, namespaces help achieve multitenancy.
- Namespaces are also a scope for configuring RBAC.
- Check all namespaces
kubectl get ns
- Create a new namespace
kubectl create ns <name>
- Switch from one namespace to another (using the kubens tool)
kubens
kubens <ns name>
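If the kubens tool is not installed, you can change the default namespace of the current context with kubectl itself:
kubectl config set-context --current --namespace=<ns-name>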
To see which Kubernetes resources are and aren’t in a namespace:
- In a namespace
kubectl api-resources --namespaced=true
- Not in a namespace
kubectl api-resources --namespaced=false
What is a Resource Quota?
A Resource Quota in Kubernetes is a way to limit the amount of resources that can be consumed by a namespace. It helps ensure fair usage of resources among different teams or applications and prevents any single namespace from consuming all available resources in a cluster.
Key Concepts
- Namespace: Resource quotas are applied on a per-namespace basis.
- Resource Limits: Quotas can specify limits on various resources such as CPU, memory, and storage.
- Usage Tracking: Quotas help track and control resource usage within namespaces.
Types of Quotas
- Resource Quotas: Limits on CPU, memory, and storage.
- Limit Ranges: Define minimum and maximum resource limits for individual pods or containers.
Common Resources Managed by Quotas
- CPU: Maximum amount of CPU that can be requested.
- Memory: Maximum amount of memory that can be requested.
- Storage: Maximum amount of persistent volume storage that can be requested.
- Pods: Maximum number of pods that can be created in a namespace.
- Services: Maximum number of services that can be created in a namespace.
- ConfigMaps: Maximum number of ConfigMaps that can be created.
- If a ResourceQuota is defined for CPU or memory, Pods cannot be created without resource requests and limits. The solution is to create a LimitRange that provides defaults.
Creating a Resource Quota
- Configure memory and CPU quotas for a namespace
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
name: mem-cpu-demo
spec:
hard:
requests.cpu: "1"
requests.memory: 1Gi
limits.cpu: "2"
limits.memory: 2Gi
EOF
As per the above example:
- The ResourceQuota places these requirements on the namespace it is created in:
  - For every Pod in the namespace, each container must have a memory request, memory limit, CPU request, and CPU limit.
  - The memory request total for all Pods in that namespace must not exceed 1 GiB.
  - The memory limit total for all Pods in that namespace must not exceed 2 GiB.
  - The CPU request total for all Pods in that namespace must not exceed 1 CPU.
  - The CPU limit total for all Pods in that namespace must not exceed 2 CPU.
- Check the applied quota
kubectl get quota
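kubectl describe shows current usage against the hard limits (quota name taken from the example above):
kubectl describe quota mem-cpu-demo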
- Create namespace using a yaml file
kubectl apply -f - <<EOF
kind: Namespace
apiVersion: v1
metadata:
name: test
labels:
name: test
EOF
Resource Limits
A Kubernetes cluster can be divided into namespaces. Once you have a namespace that has a default memory limit, and you then try to create a Pod with a container that does not specify its own memory limit, then the control plane assigns the default memory limit to that container.
Kubernetes Resource Limits
What are Resource Limits?
Resource Limits in Kubernetes define the maximum amount of resources (CPU and memory) that a container can use. They help manage and constrain resource usage to ensure fair sharing of resources across multiple containers and prevent a single container from consuming excessive resources.
Key Concepts
- Requests: The amount of CPU or memory that a container is guaranteed to have.
- Limits: The maximum amount of CPU or memory a container can use. If a container tries to use more than its limit, it may be throttled (for CPU) or terminated and potentially restarted (for memory).
Why Set Resource Limits?
- Prevent Resource Exhaustion: Ensure that no single container can consume all available resources, affecting other containers.
- Improve Stability: Avoid situations where containers cause resource contention and degrade overall cluster performance.
- Optimize Resource Usage: Better manage and allocate resources based on the actual needs of applications.
Specifying Resource Limits
You can specify resource limits in the container specification of a Pod's manifest. Resource requests and limits are defined in the resources field of the container.
Example
Here’s an example of a Pod configuration that sets both resource requests and limits for CPU and memory:
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
containers:
- name: example-container
image: nginx
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
- Create a LimitRange
kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
name: example-limits
spec:
limits:
- max:
cpu: "2" # Maximum CPU limit for a container (2 cores)
memory: "1Gi" # Maximum memory limit for a container (1 GiB)
min:
cpu: "200m" # Minimum CPU request for a container (200 millicores, or 0.2 cores)
memory: "256Mi" # Minimum memory request for a container (256 MiB)
default:
cpu: "500m" # Default CPU request for a container (500 millicores, or 0.5 cores)
memory: "512Mi" # Default memory request for a container (512 MiB)
defaultRequest:
cpu: "300m" # Default CPU request if not specified (300 millicores, or 0.3 cores)
memory: "384Mi" # Default memory request if not specified (384 MiB)
type: Container
EOF
A simpler LimitRange that only sets default requests and limits:
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
spec:
limits:
- default:
cpu: "500m"
memory: "256Mi"
defaultRequest:
cpu: "250m"
memory: "128Mi"
type: Container
- Now create a pod with no resources section and check whether the default requests and limits have been applied
kubectl describe pod <podname>
LAB 2
- Create a namespace
- Create a resource quota
- Check if you can create a pod
- Create resource limits (a LimitRange)
- Create a pod now
- Check whether the default CPU and memory have been applied
Pod Advanced Scheduling
Using nodeName
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
nodeName: kmaster
containers:
- name: example-container
image: nginx
Steps to Use nodeSelector:
- Label the node
kubectl label nodes specific-node-name diskType=ssd
- Use this label to schedule the pod
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-node-selector
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    diskType: ssd
- Check where this pod has been scheduled
kubectl get pods -o wide
Internal Process of the Kubernetes Scheduler
The Kubernetes Scheduler is a core component responsible for assigning Pods to Nodes in a Kubernetes cluster. It ensures that Pods are placed on Nodes based on resource requirements and constraints. Here’s an overview of its internal process:
1. Pod Submission and API Server
- Pod Creation: When a new Pod is created, its specification is submitted to the Kubernetes API server.
- Storage: The API server stores the Pod’s configuration in etcd, the cluster’s key-value store.
2. Scheduler Activation
- Scheduling Loop: The Scheduler continuously watches for new or unscheduled Pods by querying the API server for Pods that do not have a Node assigned.
- Notification: When it detects an unscheduled Pod, it begins the scheduling process.
3. Filtering and Scoring
- Filtering: The Scheduler filters out Nodes that are not suitable for the Pod based on several criteria:
- Resource Requests: Checks if the Node has enough CPU, memory, and other resources to meet the Pod’s requests.
- Node Affinity: Considers Node affinity rules specified in the Pod’s configuration.
- Taints and Tolerations: Ensures that the Pod can tolerate any taints present on the Node.
- Pod Affinity/Anti-Affinity: Evaluates if the Pod should or should not be placed near other Pods based on affinity rules.
- Scoring: After filtering, the Scheduler scores the remaining Nodes to determine which is the best fit for the Pod. Scoring is based on various factors such as:
- Resource Utilization: Prefers Nodes with more available resources or balanced resource usage.
- Inter-Pod Affinity/Anti-Affinity: Considers how well the Node meets the Pod’s affinity/anti-affinity rules.
- Custom Scoring: Some scheduling plugins can apply additional scoring rules.
4. Binding
- Select Node: The Scheduler selects the best Node based on the highest score.
- Update API Server: It updates the Pod’s status in the API server with the chosen Node’s name.
- Binding Object: The Scheduler creates a binding object in etcd to record the Node assignment.
5. Kubelet Notification
- Node Update: The Kubelet on the selected Node detects the updated Pod specification (through periodic polling or a watch mechanism).
- Container Creation: The Kubelet pulls the necessary container images and creates containers based on the Pod’s specification.
6. Pod Initialization
- Lifecycle Hooks: Any defined lifecycle hooks, such as initContainers, are executed.
- Health Checks: The Kubelet performs readiness and liveness checks to ensure that the Pod and its containers are running properly.
7. Pod Status Update
- API Server Update: The Kubelet updates the Pod status in the API server, reflecting its current state (e.g., Running, Pending, Failed).
8. Monitoring and Re-scheduling
- Continuous Monitoring: The Scheduler continues to monitor the cluster for changes that might require re-scheduling, such as node failures or resource constraints.
- Re-scheduling: If necessary, the Scheduler can reassign Pods to different Nodes to maintain optimal resource utilization and availability.
Summary
The Kubernetes Scheduler plays a vital role in managing Pod placement within a cluster. It:
- Watches for unscheduled Pods.
- Filters Nodes based on the Pod’s requirements and constraints.
- Scores and selects the most suitable Node.
- Binds the Pod to the selected Node by updating the Pod’s status in the API server.
- Informs the Kubelet on the chosen Node to create and run the containers.
This process ensures efficient distribution of Pods across the cluster, meeting both resource and policy requirements.
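You can observe these scheduling decisions through events; for example:
kubectl get events --field-selector reason=Scheduled
kubectl describe pod <pod-name> | grep -A 5 Events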
Taint and Toleration
- A taint allows a node to refuse pods to be scheduled on it unless those pods have a matching toleration.
- You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.
- Taints and tolerations consist of a key, value, and effect. The operator allows you to leave the value parameter empty (when using Exists).
Taint and Toleration key points
Parameter | Description |
---|---|
key | Any string, up to 253 characters. Must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores. |
value | Any string, up to 63 characters. Must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores. |
effect | One of: NoSchedule, PreferNoSchedule, NoExecute |
operator | One of: Equal, Exists |
- How to apply taint on a node
kubectl taint nodes <node-name> <key>=<value>:<effect>
kubectl taint nodes specific-node-name disktype=ssd:NoSchedule
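To view the taints on a node, and to remove a taint again (note the trailing dash):
kubectl describe node <node-name> | grep -i taint
kubectl taint nodes <node-name> disktype=ssd:NoSchedule-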
- Now use a matching toleration in the pod YAML
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: toleration-pod
spec:
containers:
- name: my-container
image: nginx:latest
tolerations:
- key: disktype
operator: Equal
value: ssd
effect: NoSchedule
EOF
Example: Exists operator without a key (tolerates any taint)
Note: the toleration operator only supports Equal and Exists; there is no NotEqual operator.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: toleration-exists-any-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
  - operator: Exists
    effect: NoSchedule
EOF
Example Exists Operator with Key
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: toleration-exists-key-pod
spec:
containers:
- name: my-container
image: nginx:latest
tolerations:
- key: disktype
operator: Exists
effect: NoSchedule
EOF
- Check where the pods have been scheduled
kubectl get pod -o wide
LAB 3
- Create a pod to schedule on a specific node
- Set a label on the node
- Test using nodeName and nodeSelector
Node Affinity
- Label Nodes
kubectl label nodes kworker1 example-label=value1
kubectl label nodes kworker2 example-label=value2
- Yaml for pod creation
cat <<EOF > pod-with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-with-node-affinity
spec:
containers:
- name: nginx-container
image: nginx
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: example-label
operator: In
values:
- value1
- value2
EOF
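Apply the generated manifest and check which node the pod lands on:
kubectl apply -f pod-with-node-affinity.yaml
kubectl get pod pod-with-node-affinity -o wide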
Example 2 with the NotIn operator
cat <<EOF > pod-with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-with-node-affinity
spec:
containers:
- name: nginx-container
image: nginx
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: example-label
operator: NotIn
values:
- value3
- value4
EOF
Example Using Exists Operator
cat <<EOF > pod-with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-with-node-affinity
spec:
containers:
- name: nginx-container
image: nginx
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: example-label
operator: Exists
EOF
- Check the pods status
Example with Preferred Affinity
cat <<EOF > pod-with-preferred-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-with-preferred-node-affinity
spec:
containers:
- name: nginx-container
image: nginx
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: example-label
operator: Exists
EOF
Pod Affinity and Anti Affinity
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-pod-affinity
labels:
app: web
spec:
containers:
- name: nginx-container
image: nginx
---
apiVersion: v1
kind: Pod
metadata:
name: pod-with-pod-affinity-rule
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web
topologyKey: kubernetes.io/hostname
containers:
- name: nginx-container
image: nginx
EOF
Another Example with weight
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-preferred-pod-affinity
labels:
app: database
spec:
containers:
- name: postgres-container
image: postgres
---
apiVersion: v1
kind: Pod
metadata:
name: pod-with-preferred-pod-affinity-rule
spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- database
topologyKey: kubernetes.io/hostname
containers:
- name: nginx-container
image: nginx
EOF
Pod antiAffinity
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-anti-affinity
labels:
app: web
spec:
containers:
- name: nginx-container
image: nginx
---
apiVersion: v1
kind: Pod
metadata:
name: pod-with-anti-affinity-rule
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web
topologyKey: kubernetes.io/hostname
containers:
- name: nginx-container
image: nginx
EOF
Soft AntiAffinity
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: pod-with-preferred-anti-affinity
labels:
app: database
spec:
containers:
- name: postgres-container
image: postgres
---
apiVersion: v1
kind: Pod
metadata:
name: pod-with-preferred-anti-affinity-rule
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- database
topologyKey: kubernetes.io/hostname
containers:
- name: nginx-container
image: nginx
EOF
Pod Priority
Pod priority classes
- You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
- A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Numbers larger than one billion are reserved for critical system pods that should not be preempted or evicted. By default, Kubernetes ships with two reserved priority classes to guarantee scheduling of critical system pods:
  - system-node-critical: This priority class has a value of 2000001000 and is used for pods that should never be evicted from a node, such as node-level networking daemons.
  - system-cluster-critical: This priority class has a value of 2000000000 (two billion) and is used for pods that are important to the cluster as a whole. Pods with this priority class can be evicted from a node in certain circumstances (for example, system-node-critical pods can take precedence), but the class still ensures guaranteed scheduling. Examples include logging agents and add-on components like the descheduler.
- Sample PriorityClass object
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
EOF
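Verify the priority classes in the cluster (the two system-* classes ship with Kubernetes by default):
kubectl get priorityclass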
- Sample pod specification with priority class name
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
priorityClassName: high-priority
EOF
Pod Disruption Budget:
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: example-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: example-app
EOF
Using a PDB with a Deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: example-deployment
spec:
replicas: 3
selector:
matchLabels:
app: example-app
template:
metadata:
labels:
app: example-app
spec:
containers:
- name: nginx-container
image: nginx:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: example-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: example-app
EOF
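Check the PDB status; the ALLOWED DISRUPTIONS column shows how many pods may be evicted voluntarily right now:
kubectl get pdb example-pdb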
Pod Preemption and Pod Disruption Budget
- If you enable pod priority and preemption, consider how they interact with your other scheduler settings:
Pod priority and pod disruption budget
- A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, Kubernetes respects them when preempting pods on a best-effort basis. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod priority and pod affinity
Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
Pointer Regarding Pod Scheduling
- Preemption removes existing Pods from a cluster under resource pressure to make room for higher-priority pending Pods.
- The default priority for all pods is zero (0).
- Supported operators for affinity:
  - The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameters in the pod specification.
  - The value can be one of: In, NotIn, Exists, DoesNotExist, Lt, Gt.
- For preferred affinity, specify a weight for the node between 1 and 100. The node with the highest weight is preferred.
- A taint on a node instructs the node to repel all pods that do not tolerate the taint.
- Taint effects are given below. The effect is one of the following:
NoSchedule
New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
PreferNoSchedule
New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
Existing pods on the node remain.
NoExecute
New pods that do not match the taint cannot be scheduled onto that node.
Existing pods on the node that do not have a matching toleration are removed.
operator
Equal
The key/value/effect parameters must match. This is the default.
Exists
The key/effect parameters must match. You must leave a blank value parameter, which matches any.
LAB 4
- Create a pod with node affinity
- Create a pod with pod affinity and pod anti-affinity
- Create a pod with the highest priority
- Also list all the pods in your cluster that use the highest priority class