Kubernetes Foundations — Architecture and Core Components

Kubernetes has become a foundational system for modern cloud-native infrastructure, orchestrating containers at scale across industries, from e-commerce to fintech. Born from Google’s internal system, Borg, Kubernetes inherits decades of lessons from running global-scale services such as Gmail and Google Compute Engine (GCE). Kernel primitives that Google engineers helped contribute, such as cgroups (merged into the Linux kernel in 2008 with version 2.6.24) and Linux namespaces, laid the groundwork for the containerization technologies that underpin modern runtimes.

This article is tailored for DevOps engineers, ML engineers, and developers seeking a deep understanding of Kubernetes’ foundational architecture. We’ll explore its core components—control planes, worker nodes, Pods, and Services—focusing on their roles, interactions, and operational details.

Kubernetes Architecture

Kubernetes is a distributed orchestration platform that manages containerized applications across a cluster of control plane and worker nodes.

Its architecture coordinates containers through a distributed system of cooperating components. The diagram provides a high-level overview, omitting some elements such as the kubelet and kube-proxy on the worker nodes.


Kubernetes consists of the following main components:

  • Control plane(s) and worker node(s).

  • Operators (controllers).

  • Services.

  • Pods of containers.

  • Namespaces and quotas.

  • Network and policies (often managed by modern plugins like Cilium).

  • Storage.





A Kubernetes cluster comprises one or more control plane nodes (also referred to as cp nodes) and a set of worker nodes, which were called "minions" in early versions of Kubernetes (the term is now deprecated). The cluster operates through API calls to the kube-apiserver, which operators (or controllers) use to manage the cluster’s desired state. A network plugin, such as Cilium (a popular choice due to its eBPF-based capabilities), handles both internal (Pod-to-Pod) and external (outside the cluster) traffic. Using an API-based communication scheme enables Kubernetes to support non-Linux worker nodes and containers.

Stable support for Windows Server 2019 as worker nodes has been available since Kubernetes 1.14 (March 2019), allowing Windows-based workloads, such as .NET applications, to run seamlessly alongside Linux workloads. However, control plane nodes, which host critical components like kube-apiserver and etcd, remain Linux-only, requiring Linux systems for cluster management. This hybrid support facilitates mixed-OS clusters, common in enterprises with legacy Windows applications.


Kubernetes Components

Control Plane

The control plane is the cluster’s command center, managing its state and operations. It runs on cp nodes, coordinating scheduling, state persistence, and API requests.

Components Overview

The control plane includes key components:

  • kube-apiserver: Central hub for API interactions.

  • kube-scheduler: Places Pods on nodes.

  • kube-controller-manager: Runs controllers to manage state.

  • etcd: Stores cluster state and settings.

API Access

The kube-apiserver exposes the API, accessible via kubectl (e.g., kubectl get pods) or curl for programmatic interaction.
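
For programmatic access, a common pattern is to run a local authenticated proxy and query the REST API with curl; a minimal sketch (8001 is kubectl proxy’s default port, and the path shown is the standard core-API endpoint for Pods):

kubectl proxy --port=8001 &                                # authenticated local proxy to the kube-apiserver
curl http://127.0.0.1:8001/api/v1/namespaces/default/pods  # list Pods in the default namespace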

Production Add-Ons

Production clusters require add-ons:

  • CoreDNS: the default cluster DNS server since Kubernetes 1.13, resolving Service and Pod names.

  • Logging: Fluentd or Fluent Bit for log collection.

  • Monitoring: Prometheus with Grafana for metrics.

Kubernetes does not ship integrated logging or monitoring components, so production clusters rely on these add-ons.

Bootstrapping with kubeadm

Using kubeadm, a widely adopted tool for Kubernetes cluster initialization, the kubelet (managed by systemd) starts Pods defined in /etc/kubernetes/manifests/. These static Pod manifests configure critical control plane components, such as kube-apiserver, kube-controller-manager, and etcd, to launch the cluster. To bootstrap a cluster, administrators execute kubeadm init on the control plane node, which generates TLS certificates, a cluster CA, and configuration files, including /etc/kubernetes/admin.conf (typically copied to ~/.kube/config) for secure API access.
For example:

kubeadm init --pod-network-cidr=192.168.0.0/16  

This command initializes the control plane and outputs a kubeadm join command with a token and CA certificate hash, which worker nodes use to join the cluster securely:

kubeadm join 10.128.0.3:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>  

These commands ensure a secure cluster setup, with certificates enabling TLS-encrypted communication between components.
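
kubeadm init also prints the follow-up steps for giving a regular user kubectl access, which copy the generated admin kubeconfig into place (the paths shown are the kubeadm defaults):

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config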

While kubeadm is ideal for straightforward cluster setups and learning environments, alternative tools like kubespray and kops offer additional automation for complex or cloud-based deployments. Kubespray, a Kubernetes community project maintained under kubernetes-sigs, uses Ansible playbooks to deploy production-ready clusters on various infrastructures, including on-premises servers and cloud providers like AWS, Azure, and GCP. It supports advanced configurations, such as high-availability (HA) control planes and customizable network plugins (e.g., Calico, Cilium). Kops (Kubernetes Operations) is tailored for cloud environments, particularly AWS, simplifying cluster creation, scaling, and upgrades with commands like:

kops create cluster --name=my-cluster.k8s.local --zones=us-west-2a  
kops update cluster --name=my-cluster.k8s.local --yes  

These tools complement kubeadm by addressing enterprise needs, such as automated multi-node deployments and cloud-native integrations, making them popular choices for production clusters in large organizations.

Kube-apiserver

The kube-apiserver is the cluster’s gateway, managing all API interactions. It connects components and users to the cluster’s state.

  • It validates actions, serves REST operations, configures objects (e.g., Pods), and acts as the sole gateway to etcd.

  • Clients use kubectl (e.g., kubectl get pods) or curl; components like kubelet report statuses via the API.

Kube-scheduler

The kube-scheduler assigns Pods to nodes based on resource needs and constraints. It ensures optimal Pod placement across the cluster.

  • It evaluates resources (CPU, memory, volumes), quotas, taints/tolerations, and labels (e.g., disk=ssd). A Pod may stay Pending if constraints aren’t met.

  • Supports TopologySpreadConstraints (stable since Kubernetes 1.19) and custom schedulers via schedulerName; see the example after this list.

  • Pods can be bound to a node using nodeName.
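
A minimal sketch of a PodSpec that combines these scheduling controls; the disk=ssd node label and all names are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: scheduled-pod
  labels:
    app: web
spec:
  nodeSelector:
    disk: ssd                      # only nodes labeled disk=ssd are candidates
  topologySpreadConstraints:       # spread app=web Pods evenly across nodes
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx:1.14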

Kube-controller-manager

The kube-controller-manager oversees the cluster’s state through controllers. It ensures objects like Deployments match their desired specifications.

  • It runs multiple controllers, such as:

    • ReplicaSet controller: Maintains the desired number of Pod replicas for each ReplicaSet (the legacy Replication Controller plays the same role for ReplicationControllers).

    • Endpoints Controller: Manages Service endpoints for connectivity.

    • Namespace Controller: Handles namespace lifecycle.

  • Controllers operate as watch-loops, querying the kube-apiserver to align the current state with the desired state (e.g., scaling a Deployment to 3 replicas, as shown below).
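
For example, the Deployment and ReplicaSet controllers keep reconciling a manifest like the following sketch (names and image are illustrative), recreating Pods until three replicas are running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 3                # desired state the controllers reconcile toward
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
      - name: app
        image: nginx:1.14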

etcd

etcd serves as the cluster’s memory, storing its state, networking, and persistent configuration in a distributed key-value store. It ensures consistency and durability using a b+tree structure, appending new values and marking old ones for compaction to optimize storage.

Updates are serialized through the kube-apiserver, which acts as the sole gateway to etcd. Concurrent update requests are handled sequentially; if a client’s request references an outdated object version (resourceVersion), the kube-apiserver rejects it with a 409 Conflict error, and the client must re-read the object and retry.

etcd operates using a Leader-Follower model: the Leader processes all write operations, while Followers replicate data and serve read requests. Learners, non-voting nodes in the process of joining the cluster, sync data until ready to become Followers. If the Leader fails, Followers elect a new Leader via a consensus protocol, ensuring high availability.

Given etcd’s critical role, regular backups are essential, especially before cluster upgrades or maintenance. Administrators use etcdctl snapshot save to create a backup and etcdctl snapshot restore to recover a cluster’s state. For example:

etcdctl snapshot save /backup/etcd-snapshot-$(date +%F).db  
etcdctl snapshot restore /backup/etcd-snapshot-2025-05-29.db --data-dir /var/lib/etcd-restored  

These commands safeguard against data loss, ensuring cluster recoverability. While early versions of kubeadm (pre-2021) faced challenges with etcd stability during upgrades, these issues have been resolved in modern releases, making etcd a robust component for production clusters as of Kubernetes 1.30.
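
On a kubeadm-built cluster, etcd is protected by TLS, so in practice the snapshot command also needs endpoint and certificate flags; a sketch assuming the default kubeadm certificate paths:

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key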

Worker Nodes

Worker nodes execute containers and manage networking for the cluster’s workloads, hosting Pods and ensuring their seamless operation. Each worker node runs several key components to manage Pods and their connectivity, detailed below.

Kubelet

The kubelet is a systemd-managed process (or equivalent) that serves as the primary agent on each worker node. It receives PodSpecs from the kube-apiserver, manages local resources (e.g., volumes, Secrets, ConfigMaps), and coordinates with the container runtime to start, stop, or restart containers. For example, the kubelet ensures a Pod’s containers adhere to their specified resource limits and reports their status back to the control plane. Since the dockershim integration was removed in Kubernetes 1.24 (2022), the kubelet typically interfaces with containerd or CRI-O via the Container Runtime Interface (CRI).

Kube-proxy

Kube-proxy manages the networking rules that expose containers both internally (within the cluster) and externally (to outside traffic). It programs the node’s packet-forwarding rules to route traffic to Pods based on Service definitions, using iptables mode by default or ipvs (IP Virtual Server) mode, which has been generally available since Kubernetes 1.11 and scales better in clusters with many Services. For instance, kube-proxy configures rules to forward traffic from a Service’s ClusterIP to the appropriate Pod IPs, providing load balancing and connectivity.

Container Runtime

The container runtime is responsible for executing containers on the worker node. It handles low-level tasks such as pulling container images, creating containers, and managing their lifecycle. Popular runtimes include containerd and CRI-O, which comply with the Container Runtime Interface (CRI) standard. Since the removal of dockershim in Kubernetes 1.24, containerd has become the dominant choice in production due to its lightweight design and robust integration with Kubernetes.
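
To confirm which runtime each node uses, the wide node listing includes a container runtime column:

kubectl get nodes -o wide    # the CONTAINER-RUNTIME column shows, e.g., containerd://1.7.x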

Service Operator

The Service controller ensures stable Pod connectivity with persistent IPs and label-based routing. It enables communication across the cluster and beyond.

  • Works with the Endpoints controller to assign IPs and manage traffic (e.g., app=backend selector).

  • Handles access policies for security and resource control.

  • Other controllers like Jobs (one-time) and CronJobs (recurring, e.g., nightly backups) support specific tasks.


Kubernetes’ architecture forms a robust foundation for container orchestration, seamlessly integrating control planes, worker nodes, Pods, and Services. By mastering these core components—from the command center of the Control Plane to the workload execution on Worker Nodes—you’re equipped to harness Kubernetes’ power for scalable, cloud-native applications.


Pods: The Smallest Deployable Unit

Kubernetes is an orchestration system designed to deploy and manage containers, but it does not manage containers individually. Instead, containers are grouped into a larger object called a Pod, the smallest deployable unit in Kubernetes. A Pod consists of one or more containers that share:

  • An IP address.

  • Access to storage.

  • Linux kernel namespaces (most importantly, the network namespace).

Pods enable tight coupling between an application container and the supporting tasks that run alongside it:

  • One container typically runs the main app, others (e.g., Istio sidecars) handle logging/security.

  • Controllers like Deployments manage ReplicaSets, which create or terminate Pods according to a PodSpec; the kubelet then uses that PodSpec to manage the Pod’s containers.

  • Namespaces isolate resources; Services enable cross-namespace communication.

  • Labels (e.g., app=frontend), taints/tolerations (e.g., node-role.kubernetes.io/control-plane:NoSchedule), and annotations (e.g., build version) manage Pods.

Here’s an example of a Pod specification demonstrating resource limits, labels, and tolerations:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: frontend
spec:
  containers:
  - name: nginx
    image: nginx:1.14
    resources:
      limits:
        cpu: "1"
        memory: "4Gi"
      requests:
        cpu: "0.5"
        memory: "500Mi"
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"

Containers in Pods

Kubernetes declares container resource requirements in the PodSpec rather than managing containers directly. The resources section of each container entry sets runtime parameters on the scheduled node, ensuring efficient resource allocation and preventing resource starvation.

Resource Limits and Requests:

resources:
  limits:
    cpu: "1"
    memory: "4Gi"
  requests:
    cpu: "0.5"
    memory: "500Mi"

This configuration ensures a container can use up to 1 CPU core and 4GiB of memory (limits), while reserving at least 0.5 CPU cores and 500MiB of memory (requests) for predictable scheduling.

ResourceQuota: ResourceQuota enforces namespace-level constraints on CPU, memory, and object counts (e.g., number of Pods or Services). For example, a ResourceQuota might limit a namespace to 10 CPU cores and 20GiB of memory, preventing overconsumption by development teams in a multi-tenant cluster. Here’s a sample ResourceQuota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    limits.cpu: "10"
    limits.memory: "20Gi"
    pods: "50"

This quota restricts the dev namespace to 50 Pods with a total of 10 CPU cores and 20GiB of memory for all containers.

PriorityClass: The priorityClassName field in a PodSpec assigns a priority to Pods, influencing scheduling and preemption. PriorityClass, stable since Kubernetes 1.14 (2019), ensures critical workloads (e.g., production services) take precedence over lower-priority tasks (e.g., batch jobs). For example:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority workloads, such as production services."
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx:1.14

In this example, the high-priority PriorityClass assigns a value of 1,000,000, ensuring the critical-app Pod can preempt lower-priority Pods if resources are scarce. The scopeSelector field in ResourceQuota (stable since Kubernetes 1.17) allows quotas to apply selectively based on priority, enabling fine-grained resource management in production environments.
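
For instance, a quota can be scoped so that it only counts Pods using the high-priority class; a minimal sketch (the quota name and Pod limit are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota
  namespace: dev
spec:
  hard:
    pods: "10"                       # at most 10 high-priority Pods in this namespace
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high-priority"]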

Init Containers for Startup Sequencing

Init Containers ensure ordered startup for Pods, completing before main containers. They run to completion or restart on failure, delaying the main containers.

  • Unlike standard containers, which start simultaneously in any order, Init Containers run sequentially in the order listed, and each can use its own image, storage, and security settings, enabling utilities that are unavailable to the main containers.

  • For example, this Init Container waits for a directory before starting a database container:

spec:
  containers:
  - name: main-app
    image: databaseD
  initContainers:
  - name: wait-database
    image: busybox
    command: ['sh', '-c', 'until ls /db/dir; do sleep 5; done;']

LivenessProbes, ReadinessProbes, and StatefulSets can also enforce sequencing but add complexity.
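
For comparison, a readinessProbe holds traffic back from a container until a check passes; a minimal sketch assuming the application exposes a /healthz endpoint on port 8080:

spec:
  containers:
  - name: main-app
    image: nginx:1.14
    readinessProbe:
      httpGet:
        path: /healthz            # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10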


Namespaces for Resource Isolation

Kubernetes uses namespaces to segregate objects for resource control and multi-tenancy. There are two types of object scopes:

  • Cluster-scoped objects: Exist globally across the cluster (e.g., Nodes, PersistentVolumes).

  • Namespace-scoped objects: Exist only within a specific namespace (e.g., Pods, Services).

As namespaces isolate resources, Pods in different namespaces communicate via Services, which abstract the underlying Pod IPs. For instance, a Pod in the frontend namespace can communicate with a Pod in the backend namespace through a Service that exposes the backend Pods.
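
With the default cluster DNS, for instance, a Pod in the frontend namespace can reach a Service named backend-svc in the backend namespace through its fully qualified name (the Service name and port are hypothetical):

curl http://backend-svc.backend.svc.cluster.local:8080/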

Orchestration with Controllers

Controllers in Kubernetes enforce desired states for workloads using watch-loops to manage objects like Pods and Services.

Orchestration in Kubernetes is managed through a series of watch-loops, also known as controllers or operators, which continuously query the kube-apiserver for an object’s state and act on it until the current state matches the desired state. Default controllers are compiled into the kube-controller-manager, but custom controllers can be added using Custom Resource Definitions (CRDs), enabling tailored workload management (e.g., the Prometheus Operator for monitoring).
For container workloads, the primary controller is a Deployment, which manages ReplicaSets. A ReplicaSet creates or terminates Pods according to a PodSpec, a YAML or JSON description of the Pod, ensuring the specified number of replicas is running. The PodSpec is sent to the kubelet on the target node, which interacts with the container runtime (e.g., containerd or CRI-O, since dockershim was removed in Kubernetes 1.24) to:

  • Download resources, such as container images.

  • Start or stop containers until the desired state is achieved.
    Kubernetes provides several other workload controllers to address diverse use cases:

  • Jobs: For one-time tasks, such as running a database migration or batch processing. For example, a Job might execute a script to preprocess data for a machine learning pipeline.

  • CronJobs: For recurring tasks, such as scheduling a Pod to run a backup script every midnight. CronJobs use cron-like syntax to define schedules, e.g., 0 0 * * * for daily execution; a minimal example follows the StatefulSet manifest below.

  • DaemonSets: Ensure a single Pod runs on every node in the cluster, ideal for cluster-wide services like logging, monitoring, or networking agents. For instance, a DaemonSet might deploy a Fluentd Pod on each node to collect logs, automatically scaling with cluster changes (new nodes trigger Pod creation, node removal triggers Pod deletion). DaemonSets can be configured to skip specific nodes using taints and tolerations.

  • StatefulSets: Manage stateful applications requiring stable identities, such as databases (e.g., MySQL, MongoDB). Unlike Deployments, StatefulSets assign unique, persistent identities to Pods (e.g., db-0, db-1) with stable storage and network identifiers. Pods are deployed sequentially, ensuring each is ready before the next starts, which is critical for applications with strict ordering needs. For example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql        # headless Service that provides the Pods' stable network identities
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
These controllers provide flexibility to handle stateless, stateful, one-time, recurring, and node-specific workloads, making Kubernetes a versatile orchestration platform.
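
As a concrete illustration of the CronJob controller mentioned above, here is a minimal sketch of a nightly backup job; the image and command are hypothetical placeholders for a real backup script:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 0 * * *"            # every night at midnight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "echo running nightly backup"]   # placeholder backup script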

Services for Network Connectivity

Services provide stable networking for Pods through persistent IPs and label-based routing. The Service controller requests IPs and information from the Endpoints controller to manage connectivity.

  • Services enable communication within/across namespaces or externally (e.g., a Service with selector app=backend routes traffic to matching Pods, as shown in the sketch below).
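
A minimal Service sketch that routes traffic to Pods labeled app=backend; the name and ports are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: backend-svc
spec:
  selector:
    app: backend               # traffic goes to Pods carrying this label
  ports:
  - port: 80                   # port exposed on the Service's ClusterIP
    targetPort: 8080           # port the backend containers listen on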

Managing Complexity with Labels, Taints, Annotations, and CRDs

Labels, taints, annotations, and Custom Resource Definitions (CRDs) simplify managing thousands of Pods across hundreds of nodes. These metadata tools enable flexible selection, scheduling, documentation, and extensibility in Kubernetes clusters.

Labels: Arbitrary key-value pairs to select objects (e.g., app=frontend) for operations like scaling, monitoring, or routing traffic. For example, a Service might target Pods with app=frontend to route traffic to a web application.
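
Label selectors also drive day-to-day operations; for example, acting only on the Pods labeled app=frontend:

kubectl get pods -l app=frontend       # list only Pods carrying the app=frontend label
kubectl delete pods -l app=frontend    # act on the same selection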

Taints/Tolerations: Nodes use taints to restrict scheduling (e.g., node-role.kubernetes.io/control-plane:NoSchedule). Pods require matching tolerations to be scheduled on tainted nodes, ensuring critical workloads avoid control plane nodes unless explicitly allowed.
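
For example, a node reserved for GPU workloads can be tainted so that only Pods carrying a matching toleration are scheduled there (the node name and key are hypothetical):

kubectl taint nodes node-1 dedicated=gpu:NoSchedule

The Pods allowed on that node then declare a matching toleration:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"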

Annotations: Add metadata for containers, tools, or external systems (e.g., build.version=1.2.3), not used for selection. Annotations are often used for integration with CI/CD pipelines, monitoring systems, or Helm charts. For example:

apiVersion: v1
kind: Pod
metadata:
  name: monitored-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  containers:
  - name: app
    image: nginx:1.14

These annotations follow the common prometheus.io/* convention, telling a Prometheus scrape configuration that honors it to collect metrics from the Pod on port 8080, enabling automated monitoring.

Custom Resource Definitions (CRDs): Extend Kubernetes’ API to define custom resources, enabling tailored workload management. CRDs allow operators to manage complex applications declaratively. For instance, the Prometheus Operator uses a CRD to define a Prometheus resource:

apiVersion: monitoring.coreos.com/v1  
kind: Prometheus  
metadata:  
  name: example  
spec:  
  replicas: 2  
  serviceMonitorSelector:  
    matchLabels:  
      app: monitored-app  

This CRD simplifies deploying and scaling Prometheus instances, integrating seamlessly with Kubernetes’ control plane. CRDs are widely used for service meshes (e.g., Istio’s VirtualService), databases, or custom business logic, making Kubernetes highly extensible for enterprise needs.


