Kubernetes AKS Production Checklist for Architects
Azure Kubernetes Service (AKS) is fantastic for running containerized workloads, but the transition from dev/test environments to full-scale production requires careful preparation. This checklist is based on my real-world experience deploying AKS clusters for FinTech and enterprise clients.
Networking – The Foundation of Everything
Azure CNI vs Kubenet
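For production, use Azure CNI (or Azure CNI Overlay for larger clusters, where pods draw IPs from a separate overlay range instead of the VNet). Kubenet is fine for dev, but in production you need:
- Direct pod IP addressing within the Azure VNet (Azure CNI without Overlay)
- Network Policy support with either engine, Calico or Azure (Kubenet supports Calico only)
- Integration with Azure Private Link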
resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: 'aks-prod-westeurope'
  location: 'westeurope'
  // Fragment: identity, dnsPrefix, and agentPoolProfiles are omitted for brevity
  properties: {
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'calico'
      serviceCidr: '10.250.0.0/16'
      dnsServiceIP: '10.250.0.10' // must fall inside serviceCidr
      loadBalancerSku: 'standard'
      outboundType: 'userDefinedRouting' // egress follows the subnet's route table, e.g. to a firewall
    }
    apiServerAccessProfile: {
      enablePrivateCluster: true
      privateDNSZone: 'system'
    }
  }
}
Private Cluster is Mandatory
A publicly accessible API server in production? Absolutely not. Private cluster + Azure Private DNS Zone + VPN/ExpressRoute for on-premises access.
Identity & RBAC
Workload Identity (not Pod Identity!)
Pod Identity is deprecated. We're moving to Workload Identity Federation:
- Create a Managed Identity in Azure
- Set up federated credentials for a Kubernetes Service Account
- The application automatically obtains Azure tokens without stored secrets
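On the Kubernetes side, the steps above come down to one annotation and one label. The manifest below is a minimal sketch, assuming the cluster has the OIDC issuer and Workload Identity enabled; the names `payments` and `payments-api`, the client ID, and the image are placeholders, not values from a real deployment.

```yaml
# ServiceAccount bound to a user-assigned managed identity.
# All names and IDs here are hypothetical placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # Client ID of the managed identity created in Azure
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
  labels:
    # Opts the pod into projected-token injection by the workload identity webhook
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: payments-api
  containers:
    - name: app
      image: myregistry.azurecr.io/payments-api:1.0.0  # placeholder image
```

The federated credential on the managed identity has to use the subject `system:serviceaccount:payments:payments-api` for this pairing to work; a mismatch in namespace or ServiceAccount name is the most common reason token exchange fails.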
Kubernetes RBAC + Entra ID
# ClusterRoleBinding – only the Entra ID group has admin access
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aks-cluster-admins
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: Group
    name: "00000000-0000-0000-0000-000000000000" # Entra ID Group Object ID
    apiGroup: rbac.authorization.k8s.io
Scaling & High Availability
Node Pool Strategy
- System pool: Minimum 3 nodes, dedicated to system pods (CoreDNS, metrics-server) by tainting it with `CriticalAddonsOnly=true:NoSchedule`
- User pool(s): Separate pools for different workloads, with taints and tolerations
- Spot pool: For batch jobs and non-critical workloads (up to 90% savings)
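A workload has to opt in before it lands on a Spot pool, because AKS taints Spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`. The Job below is a sketch of that opt-in; the Job name and image are hypothetical.

```yaml
# Batch Job that tolerates the AKS Spot taint and requires Spot nodes.
# Job name and image are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  backoffLimit: 4            # retries absorb Spot evictions
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values: ["spot"]
      containers:
        - name: report
          image: myregistry.azurecr.io/report:1.0.0
```

The toleration alone only permits scheduling on Spot nodes; the node affinity is what keeps the Job off the regular user pools. Drop the affinity if falling back to on-demand capacity is acceptable.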
Cluster Autoscaler + KEDA
Use the Cluster Autoscaler to add and remove nodes as pending pods and idle capacity dictate, and KEDA for event-driven pod scaling based on external metrics (Azure Service Bus queue depth, HTTP request rate).
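As an illustration of the KEDA side, here is a sketch of a ScaledObject driven by Service Bus queue depth. It assumes KEDA is installed and can authenticate through Workload Identity; the Deployment name `order-processor`, the queue `orders`, the Service Bus namespace, and the thresholds are all placeholders.

```yaml
# Authentication for the scaler via Workload Identity (no connection strings).
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: servicebus-auth
  namespace: payments
spec:
  podIdentity:
    provider: azure-workload
---
# Scales the Deployment on the number of messages waiting in the queue.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: payments
spec:
  scaleTargetRef:
    name: order-processor          # hypothetical Deployment name
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders                 # hypothetical queue
        namespace: sb-prod-westeurope     # hypothetical Service Bus namespace
        messageCount: "50"                # target messages per replica
      authenticationRef:
        name: servicebus-auth
```

The two scalers work together: KEDA raises the replica count, the new pods go Pending when the pool is full, and the Cluster Autoscaler then adds nodes.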
Monitoring & Observability
Required Stack
- Azure Monitor + Container Insights – node and pod metrics
- Prometheus + Grafana (managed via Azure Monitor) – custom dashboards
- Alerting – node pool CPU or memory utilization above 80%, rising pod restart counts, OOMKilled events
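The alerts in the last bullet can be sketched as Prometheus rules. The group below uses the standard Prometheus rule-file format and assumes node-exporter and kube-state-metrics are being scraped; with Azure Monitor managed Prometheus the same expressions go into a rule group resource instead of a file. The `for` durations and the restart threshold are my assumptions, so tune them to your workloads.

```yaml
groups:
  - name: aks-production-alerts
    rules:
      - alert: NodeCpuHigh
        # Percentage of non-idle CPU time per node over 5 minutes
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node CPU above 80% for 10 minutes"
      - alert: NodeMemoryHigh
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node memory above 80% for 10 minutes"
      - alert: PodRestartingFrequently
        # More than 3 container restarts in 15 minutes (assumed threshold)
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Container restarted more than 3 times in 15 minutes"
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: critical
        annotations:
          summary: "Container was last terminated by the OOM killer"
```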
Log Aggregation
Send all application logs to a Log Analytics workspace. Set retention to at least 90 days, and longer if your compliance regime (for example NIS2) or internal audit policy requires it.
Backup & Disaster Recovery
- Velero or Azure Backup for AKS – backing up PVs and cluster state
- Multi-region deployment – Active/Passive with Azure Traffic Manager or Front Door
- GitOps (ArgoCD/Flux) – entire cluster state versioned in Git, recovery = git push
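If you go the Velero route, a scheduled backup is a single custom resource. This is a sketch, assuming Velero is installed in the `velero` namespace with a working backup storage location; the schedule, scope, and 30-day retention are example values.

```yaml
# Daily backup of all namespaces at 02:00, kept for 30 days.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"        # cron syntax
  template:
    includedNamespaces:
      - "*"
    snapshotVolumes: true      # include persistent volume snapshots
    ttl: 720h0m0s              # 30 days
```

A backup you have never restored is only a hope, so pair the schedule with a periodic restore test into a scratch cluster.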
Production Checklist (Summary)
| Area | Requirement | Priority |
|---|---|---|
| Networking | Azure CNI + Private Cluster | Critical |
| Identity | Workload Identity + Entra ID RBAC | Critical |
| Scaling | Cluster Autoscaler + min 3 system nodes | High |
| Monitoring | Container Insights + alerting | High |
| Security | Network Policy (Calico) + Pod Security Standards | High |
| Backup | Velero/Azure Backup + GitOps | Medium |
| Cost | Spot node pools + right-sizing | Medium |
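The Security row pairs Network Policy with Pod Security Standards. A minimal sketch of both for one namespace follows; the namespace name `payments` is a placeholder, and the `restricted` level is an assumption that may be too strict for some workloads.

```yaml
# Namespace that enforces the "restricted" Pod Security Standard.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Default-deny for ingress: selects every pod and allows no inbound traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
```

With the default-deny in place, each service then needs an explicit allow policy for the traffic it should receive.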
Conclusion
AKS is a great platform, but it requires an architectural approach. Don't underestimate networking and identity – these two pillars determine whether your cluster will survive its first security audit. For securing AKS access with identity-based controls, see our Zero Trust Conditional Access guide.
Need help designing an AKS architecture for your project? Explore our full range of cloud architecture services or reach out for a free consultation.
About the author

Martin Rylko
Senior Cloud Architect & DevOps Engineer
14+ years in IT – from on-premises datacenters and Hyper-V clustering to cloud infrastructure on Microsoft Azure. I specialize in Landing Zones, IaC automation, Kubernetes and security compliance.