Kubernetes in Production: The 2025 Survival Guide
Hard-won lessons on running Kubernetes at scale, from cluster architecture and RBAC to cost optimization and real production pitfalls.
Most teams adopt Kubernetes before they need it. That's not an insult. It's just how infrastructure decisions get made when everyone's excited about cloud-native. The problem is that running Kubernetes badly is genuinely hard, and most guides skip straight to the YAML without telling you which problems it actually solves or what you're signing up for when you run it wrong at 3 AM.
I've seen clusters across multiple cloud providers, thousands of nodes, and enough post-mortems to know where the real pain lives. This is the guide I wish existed before I learned most of this the hard way.
Managed vs. Self-Managed: Make the Decision Once
Here's my take: don't run your own control plane. For the vast majority of teams in 2025, EKS, GKE Autopilot, and AKS have matured to the point where the engineering cost of managing etcd quorum, API server upgrades, and control-plane HA just doesn't pay off. The exception is air-gapped environments or genuinely extreme scale, and most teams are not at that scale.
If you do self-manage with kubeadm or Cluster API, treat your control plane nodes as immutable. Never SSH in to patch them: rebuild. Use etcd snapshots to object storage on a cron and test restores quarterly. A control plane you've never restored from backup is a control plane you don't actually have.
The control plane is not an app server. If you're SSHing into it regularly, something has gone wrong in your operational model.
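The snapshot cron described above can be sketched as a Kubernetes CronJob. This is a sketch under assumptions: a kubeadm-style cluster with etcd client certs under /etc/kubernetes/pki/etcd, a hypothetical image that bundles etcdctl and the aws CLI, and a placeholder bucket name.

```yaml
# Sketch: nightly etcd snapshot shipped to object storage.
# Image name and bucket are placeholders; cert paths assume kubeadm defaults.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          hostNetwork: true        # reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: snapshot
              image: your-registry/etcdctl-awscli:latest   # placeholder image
              command:
                - /bin/sh
                - -c
                - |
                  ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd.db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  aws s3 cp /tmp/etcd.db s3://YOUR_BUCKET/etcd/$(date +%F).db
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
```

The restore-test half can't be automated away as easily: schedule it as a quarterly runbook exercise against a scratch cluster.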
Namespace Strategy and Multi-Tenancy
Namespaces are a logical boundary, not a security boundary. Don't make the mistake of assuming that separating teams into namespaces prevents a compromised workload from pivoting to another team's secrets or service accounts. Pair namespace isolation with network policies, RBAC, and for strict tenants, separate node pools or clusters entirely.
A practical namespace taxonomy for a mid-size platform team:
- system-*: cluster-level infrastructure (ingress, cert-manager, monitoring)
- shared-*: cross-team shared services (databases, message queues)
- team-[name]-[env]: per-team, per-environment workloads
- sandbox-*: ephemeral developer namespaces, hard quota enforced
Enforce LimitRanges and ResourceQuotas on every non-system namespace from day one. Retrofitting them onto a cluster that already has workloads is painful and politically charged. You don't want to have that conversation after someone's rogue job ate all the node memory at 2 AM.
```yaml
# ResourceQuota for a team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments-prod
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    count/pods: "100"
    count/services: "20"
    count/persistentvolumeclaims: "10"
---
# LimitRange: assigns defaults to containers that omit resource specs
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments-prod
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
```

Resource Requests, Limits, and the QoS Trap
Kubernetes assigns every pod a Quality of Service class based on how you set requests and limits. Get this wrong and you'll see unexpected evictions under pressure, which is one of those things that looks like a mystery until you understand the model.
- Guaranteed: every container sets requests == limits for both CPU and memory. These pods are last to be evicted.
- Burstable: at least one container sets a request or limit, but the pod doesn't qualify as Guaranteed. Evicted under memory pressure after BestEffort, starting with the pods that most exceed their requests.
- BestEffort: no requests or limits anywhere in the pod. Evicted first, always.
For stateful workloads and anything on the critical path, target Guaranteed QoS. For batch jobs and background workers, Burstable is fine. Never run BestEffort in production unless the workload is genuinely throwaway.
CPU limits are more nuanced than most people realize. CFS throttling means a container can be CPU-throttled even when the node has idle capacity, purely because it hit its cgroup quota. For latency-sensitive services, consider setting CPU requests but omitting CPU limits. Monitor with container_cpu_cfs_throttled_seconds_total in Prometheus and tune accordingly.
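To make throttling visible before users feel it, you can alert on the throttled-periods ratio, which is easier to threshold than raw throttled seconds. A sketch, assuming the Prometheus Operator's PrometheusRule CRD; the 25% threshold and namespace are assumptions to tune against your latency budget:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling
  namespace: monitoring       # assumes your monitoring stack lives here
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          # Fraction of CFS periods in which the container was throttled
          expr: |
            sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
              /
            sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod)
              > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} throttled in >25% of CFS periods"
```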
Autoscaling: HPA Is Under-Tuned 90% of the Time
The Horizontal Pod Autoscaler is table stakes, but most teams misconfigure it and then wonder why it doesn't work well under load. The default metric is CPU utilization, which is a lagging indicator for I/O-bound or queue-driven workloads. If you're scaling a payment processor based on CPU, you're scaling on the wrong signal. Always prefer custom or external metrics when they're available.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: External
      external:
        metric:
          name: rabbitmq_queue_messages_ready
          selector:
            matchLabels:
              queue: payments
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

The Vertical Pod Autoscaler (VPA) helps right-size pods over time, but don't run it in Auto mode for production deployments: the pod restarts it causes are disruptive. Use it in Off mode to generate recommendations, then apply them during your next deploy cycle.
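A minimal VPA in recommendation-only mode, assuming a Deployment named api-server:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; never evicts or restarts pods
```

With this in place, `kubectl describe vpa api-server-vpa` surfaces the lower-bound, target, and upper-bound recommendations you fold into your next deploy.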
KEDA is the right answer for queue-based workers, cron-driven jobs, and anything that needs scale-to-zero. It integrates with SQS, Kafka, RabbitMQ, Datadog metrics, and dozens of other sources. If you're not using KEDA for queue-driven workloads, you're leaving a lot of efficiency on the table.
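A KEDA ScaledObject for a queue-driven worker might look like this. A sketch under assumptions: the Deployment name, queue name, per-replica target, and the TriggerAuthentication holding the AMQP connection string are all hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-worker
  namespace: team-payments-prod
spec:
  scaleTargetRef:
    name: payments-worker          # Deployment to scale (assumed name)
  minReplicaCount: 0               # scale to zero when the queue is empty
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: payments
        mode: QueueLength          # scale on backlog depth
        value: "50"                # target messages per replica
      authenticationRef:
        name: rabbitmq-trigger-auth   # TriggerAuthentication with the AMQP URL
```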
RBAC Done Right
Kubernetes RBAC is additive: there's no deny rule, only grant. Your threat model is about minimizing blast radius when a service account gets compromised. The rules that matter:
- Every workload gets its own ServiceAccount. Never use the default ServiceAccount.
- Disable automatic ServiceAccount token mounting unless the pod actually calls the API server.
- Use ClusterRole sparingly; prefer Role (namespace-scoped) for almost everything.
- Audit bindings quarterly with `kubectl get rolebindings,clusterrolebindings -A -o wide` and remove unused grants.
- Never grant `get` on `secrets` at the cluster level. That's equivalent to handing out root.
```yaml
# Minimal ServiceAccount for a read-only API consumer
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-reader
  namespace: team-analytics-prod
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-metrics-reader
  namespace: team-analytics-prod
rules:
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metrics-reader-binding
  namespace: team-analytics-prod
subjects:
  - kind: ServiceAccount
    name: metrics-reader
    namespace: team-analytics-prod
roleRef:
  kind: Role
  name: pod-metrics-reader
  apiGroup: rbac.authorization.k8s.io
```

Secrets Management: ESO Is the Right Call Now
Kubernetes Secrets are base64-encoded, not encrypted. Out of the box they live in etcd unencrypted unless you configure etcd encryption at rest. That's a starting point, not a solution. You have three credible options for production:
- HashiCorp Vault with the Vault Agent Injector or Vault Secrets Operator: The gold standard for dynamic secrets, fine-grained access policies, and audit logging. Complex to operate but unmatched for compliance-heavy environments.
- Bitnami Sealed Secrets: Encrypts secrets into a `SealedSecret` CRD that's safe to commit to Git. Simple, GitOps-native, but static: secrets don't rotate automatically. Good for teams that want GitOps without a full secrets backend.
- External Secrets Operator (ESO): Syncs secrets from AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, or Vault into Kubernetes Secrets on a schedule. Best balance of simplicity and rotation support for cloud-native teams.
My take: if you're on a cloud provider and don't have a specific Vault mandate, use ESO. It's the simplest path to automatic rotation and it integrates cleanly with the secrets managers you're probably already paying for.
If you're putting raw Secrets into your Git repository, you've already had a breach. You just don't know it yet. Treat secret management as a day-one infrastructure concern.
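A minimal ESO setup looks like this. The store name, remote secret path, and key names are assumptions; the pattern is what matters: the only thing in Git is a pointer to the secret, never the value:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db-creds
  namespace: team-payments-prod
spec:
  refreshInterval: 1h              # re-sync (and pick up rotations) hourly
  secretStoreRef:
    name: aws-secrets-manager      # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: payments-db-creds        # Kubernetes Secret ESO creates and maintains
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/payments/db      # assumed path in AWS Secrets Manager
        property: password
```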
Network Policies and Zero-Trust Networking
By default, every pod can reach every other pod in a cluster. In a multi-tenant or security-conscious environment, that's unacceptable. Network policies enforce L3/L4 segmentation, but they require a CNI that actually supports them: Calico, Cilium, or Weave, not Flannel in its default configuration.
Start with a default-deny posture in every namespace and add explicit allow rules:
```yaml
# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow ingress from the ingress controller only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: team-payments-prod
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```

If you're starting a new cluster in 2025, evaluate Cilium first. It takes network policy further with L7 enforcement (HTTP path and method filtering) and eBPF-based observability via Hubble. The visibility alone is worth it.
PodDisruptionBudgets and Node Management
Without a PodDisruptionBudget, a routine node drain for upgrades can take down your entire service. I've seen this happen in production. It's avoidable. Define PDBs for every stateless workload with more than one replica:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: "66%"   # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server
```

For node pool strategy, separate workloads by failure domain and cost profile. Run system components on on-demand nodes. Run stateless, fault-tolerant workloads on Spot/Preemptible instances with a mix of instance types to reduce interruption correlation. Use taints and tolerations to enforce placement:
```shell
# Taint spot nodes (GKE example; label key differs per provider)
kubectl taint nodes -l cloud.google.com/gke-spot=true spot=true:NoSchedule
```

```yaml
# Tolerate spot in the workload spec
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
# Anti-affinity to spread across zones. Note: a *required* rule means no two
# replicas may share a zone, so pods stay Pending once replicas exceed zones;
# prefer topologySpreadConstraints when you need a softer spread.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api-server
        topologyKey: topology.kubernetes.io/zone
```

Monitoring: The Prometheus-Grafana Stack
kube-prometheus-stack is the de facto standard: it bundles Prometheus Operator, Grafana, Alertmanager, and a curated set of recording rules into one Helm chart. Deploy it with persistent storage and a retention policy that fits your budget. For long-term metrics storage at scale, replace Prometheus's local TSDB with Thanos or Cortex to enable multi-cluster aggregation and cheap object-storage retention.
The alerts that matter most in production:
- Pod CrashLoopBackOff: immediate, any namespace
- PVC near capacity: warn at 80%, critical at 90%
- Node memory pressure or disk pressure conditions
- HPA at max replicas: signals your ceiling is too low
- API server error rate: watch for 5xx spikes on the apiserver
- etcd leader changes: instability indicator on self-managed clusters
- Certificate expiry: use cert-manager and alert 30 days out
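Two of the alerts above as a PrometheusRule sketch. It assumes the Prometheus Operator, kubelet volume stats, and kube-state-metrics; thresholds are starting points, not gospel:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-alerts
  namespace: monitoring            # assumes your monitoring stack lives here
spec:
  groups:
    - name: capacity
      rules:
        - alert: PVCNearCapacity   # warn at 80%; add a critical rule at 0.90
          expr: |
            kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80
          for: 10m
          labels:
            severity: warning
        - alert: HPAAtMaxReplicas  # your scaling ceiling is too low
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
              >= kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
```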
Dashboards are for humans. Alerts are for computers. If a human is reading a dashboard to decide whether something is wrong, your alerting is insufficient.
Cost Optimization: What Bad Kubernetes Sprawl Costs You
Kubernetes waste is real money. Over-provisioned requests and idle workloads are the two biggest culprits, and both compound with scale. I've seen clusters where 40% of reserved CPU was never actually used, just requested. That's vendor cost going straight to waste. A few high-ROI levers:
- Goldilocks: Runs VPA in recommendation mode and surfaces right-sizing suggestions per workload via a UI. Use it monthly to catch drift.
- Karpenter: AWS-native cluster autoscaler replacement that provisions nodes just-in-time based on pending pod requirements, choosing the cheapest compatible instance type. Dramatically reduces over-provisioned node headroom.
- Kubecost: Allocates costs to namespaces, teams, and labels. Essential for chargeback and identifying budget outliers. The first time you show a team their actual infrastructure cost, behavior changes.
- Namespace TTLs for ephemeral environments: Automatically delete PR preview or dev namespaces after 24 hours of inactivity. A single forgotten preview environment running a database can cost hundreds of dollars a month. This happens all the time.
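One lightweight way to implement namespace TTLs is kube-janitor, which honors a `janitor/ttl` annotation. A sketch; the namespace name is hypothetical:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: sandbox-pr-1234            # hypothetical PR preview namespace
  annotations:
    janitor/ttl: "24h"             # kube-janitor deletes it 24h after creation
```

Note that `janitor/ttl` counts from object creation, not last activity; inactivity-based cleanup needs extra tooling on top.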
Common Production Pitfalls
These are the failure modes that show up in post-mortems more than anything else:
- Missing readiness probes: traffic hits pods that aren't ready, causing errors during rollouts. Define readiness probes that actually test the application's ability to serve, not just that the process is alive.
- terminationGracePeriodSeconds too short: pods are SIGKILL'd before they finish in-flight requests. Default is 30 seconds; increase for long-running request handlers.
- Missing preStop hook: Kubernetes removes the pod from the endpoints list concurrently with sending SIGTERM, so there's a race. Add a `preStop` exec hook (e.g. `sleep 5`) to give kube-proxy time to propagate the endpoint removal before your process shuts down.
- Huge images: 4 GB images create 30-second cold starts. Use multi-stage builds and distroless base images. Target under 200 MB for most services.
- No topology spread constraints: without explicit spread, the scheduler may place all replicas on the same node, making a single node failure catastrophic.
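Most of these fixes live in the pod template. A consolidated sketch; the probe path, port, sleep duration, grace period, and image are assumptions:

```yaml
# Pod template fragment addressing the pitfalls above
spec:
  terminationGracePeriodSeconds: 60          # room for in-flight requests after SIGTERM
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway      # soft spread across nodes
      labelSelector:
        matchLabels:
          app: api-server
  containers:
    - name: api-server
      image: your-registry/api-server:latest # placeholder
      readinessProbe:
        httpGet:
          path: /healthz/ready               # should exercise real dependencies
          port: 8080
        periodSeconds: 5
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]          # let endpoint removal propagate first
```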
Kubernetes rewards the teams that treat it as a platform to be operated, not a tool to be deployed once and forgotten. The clusters that run reliably for years have runbooks, regular upgrade cycles, quarterly chaos engineering exercises, and a culture of studying every production incident. The technology is mature. The discipline of operating it well is what separates resilient platforms from fragile ones.