AKS for Production: A Checklist for Cloud Architects
Azure Kubernetes Service (AKS) is fantastic for running containerized workloads, but the transition from dev/test environments to full-scale production requires careful preparation. This checklist is based on my real-world experience deploying AKS clusters for FinTech and enterprise clients.
Networking – The Foundation of Everything
Azure CNI vs Kubenet
For production, always use Azure CNI (or Azure CNI Overlay for larger clusters). Kubenet is fine for dev, but in production you need:
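- Direct pod IP addressing within the Azure VNet (with Overlay, pods instead draw IPs from a separate private CIDR and leave the cluster through the node IP)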
- Network Policy support (Calico/Azure)
- Integration with Azure Private Link
```bicep
// Fragment: identity, dnsPrefix and agentPoolProfiles are omitted for brevity
resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: 'aks-prod-westeurope'
  location: 'westeurope'
  properties: {
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'calico'
      serviceCidr: '10.250.0.0/16'
      dnsServiceIP: '10.250.0.10'
      loadBalancerSku: 'standard'
      outboundType: 'userDefinedRouting'
    }
    apiServerAccessProfile: {
      enablePrivateCluster: true
      privateDNSZone: 'system'
    }
  }
}
```

Private Cluster is Mandatory
A publicly accessible API server in production? Absolutely not. Private cluster + Azure Private DNS Zone + VPN/ExpressRoute for on-premises access.
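One networking item the cluster definition above only enables is Network Policy: with `networkPolicy: 'calico'` set, nothing is restricted until policies exist. A minimal sketch of a common starting point, assuming a hypothetical namespace `payments` and the default AKS CoreDNS labels: deny all traffic by default, then explicitly re-allow DNS.

```yaml
# Hypothetical namespace "payments"; adjust names to your workloads.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow DNS lookups so pods can still resolve service names.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns   # label carried by CoreDNS pods
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

From this baseline, each workload gets its own narrowly scoped allow rules instead of relying on open east-west traffic.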
Identity & RBAC
Workload Identity (not Pod Identity!)
Pod Identity is deprecated. We're moving to Workload Identity Federation:
- Create a Managed Identity in Azure
- Set up federated credentials for a Kubernetes Service Account
- The application automatically obtains Azure tokens without stored secrets
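On the Kubernetes side, the steps above could look roughly like this. It is a sketch that assumes the cluster has the OIDC issuer and the workload identity add-on enabled; the names `payments-api` and `payments`, the client ID placeholder, and the image are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # Client ID of the user-assigned managed identity (placeholder value)
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
        # Opts the pod in to token injection by the workload identity webhook
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: payments-api
      containers:
        - name: api
          image: myregistry.azurecr.io/payments-api:1.0.0   # hypothetical image
```

The federated credential on the managed identity has to name this exact Service Account as its subject, here `system:serviceaccount:payments:payments-api`. The webhook then injects `AZURE_CLIENT_ID`, `AZURE_TENANT_ID` and `AZURE_FEDERATED_TOKEN_FILE` into the pod, which the Azure Identity SDKs pick up without any stored secret.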
Kubernetes RBAC + Entra ID
```yaml
# ClusterRoleBinding – only the Entra ID group has admin access
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aks-cluster-admins
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: Group
    name: "00000000-0000-0000-0000-000000000000" # Entra ID Group Object ID
    apiGroup: rbac.authorization.k8s.io
```

Scaling & High Availability
Node Pool Strategy
- System pool: Minimum 3 nodes, dedicated to system pods (CoreDNS, metrics-server)
- User pool(s): Separate pools for different workloads, with taints and tolerations
- Spot pool: For batch jobs and non-critical workloads (up to 90% savings)
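A sketch of how a batch workload could be steered onto the Spot pool. AKS taints Spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule` and applies a matching label, so the pod needs both a toleration and a node affinity rule. The job name and image are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report            # hypothetical job
spec:
  backoffLimit: 4                 # tolerate a few failed attempts
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        # Allows scheduling onto Spot nodes despite their taint
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      affinity:
        nodeAffinity:
          # Restricts the pod to Spot nodes only
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values: ["spot"]
      containers:
        - name: report
          image: myregistry.azurecr.io/report:1.0.0   # hypothetical image
```

The same mechanism works in the other direction: tainting the system pool with `CriticalAddonsOnly=true:NoSchedule` keeps application pods off it, since only system add-ons tolerate that taint.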
Cluster Autoscaler + KEDA
Cluster Autoscaler for horizontal node scaling. KEDA for event-driven pod scaling based on metrics (Azure Service Bus queue depth, HTTP requests).
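For the Service Bus case, a KEDA configuration might look like the following. This is a sketch under assumptions: KEDA is installed (for example via the AKS add-on), the identity KEDA runs with can read the Service Bus namespace, and the names `orders-consumer`, `orders` and `sb-prod-example` are hypothetical, as are the replica limits and the queue threshold.

```yaml
# Tells KEDA to authenticate with workload identity instead of a connection string
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: servicebus-workload-identity
  namespace: orders
spec:
  podIdentity:
    provider: azure-workload
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
  namespace: orders
spec:
  scaleTargetRef:
    name: orders-consumer         # Deployment to scale
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        namespace: sb-prod-example   # Service Bus namespace, not the Kubernetes one
        messageCount: "50"           # target messages per replica
      authenticationRef:
        name: servicebus-workload-identity
```

When KEDA adds replicas that no longer fit on existing nodes, the pending pods are what prompts Cluster Autoscaler to add nodes, so the two layers complement each other.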
Monitoring & Observability
Required Stack
- Azure Monitor + Container Insights – node and pod metrics
- Prometheus + Grafana (managed via Azure Monitor) – custom dashboards
- Alerting – node pool CPU/memory usage > 80%, pod restart count, OOMKilled events
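The alert conditions above can be sketched in Prometheus rule-file syntax. The metric names assume the usual node exporter and kube-state-metrics exporters; the `for` durations and the restart threshold are my own illustrative choices, and the CPU and memory rules fire per node rather than per node pool. Azure Monitor managed Prometheus defines rules as Azure rule group resources, so the expressions would need to be carried over into that format.

```yaml
groups:
  - name: aks-node-and-pod-health
    rules:
      - alert: NodeCpuHigh
        # Share of non-idle CPU time per node over the last 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"
      - alert: NodeMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 80% on {{ $labels.instance }}"
      - alert: PodRestartingFrequently
        # More than 3 container restarts within one hour (illustrative threshold)
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
      - alert: ContainerOOMKilled
        # Last termination reason of the container was OOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
```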
Log Aggregation
Send all application logs to a Log Analytics workspace. Set retention to at least 90 days for compliance (NIS2).
Backup & Disaster Recovery
- Velero or Azure Backup for AKS – backing up PVs and cluster state
- Multi-region deployment – Active/Passive with Azure Traffic Manager or Front Door
- GitOps (ArgoCD/Flux) – entire cluster state versioned in Git, recovery = `git push`
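The backup and GitOps items could be wired up along these lines. This is a sketch assuming Velero and Argo CD are already installed in the `velero` and `argocd` namespaces; the schedule, retention period, repository URL and path are hypothetical.

```yaml
# Velero: nightly backup of all namespaces, including volume snapshots
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-full
  namespace: velero
spec:
  schedule: "0 2 * * *"           # cron expression: 02:00 every day
  template:
    includedNamespaces:
      - "*"
    snapshotVolumes: true
    ttl: 720h0m0s                 # keep each backup for 30 days
---
# Argo CD: continuously reconcile the cluster against a path in Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prod-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/aks-gitops.git   # hypothetical repository
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                 # delete resources that were removed from Git
      selfHeal: true              # revert manual drift in the cluster
```

The two cover different things: Git restores manifests, while the data inside persistent volumes only comes back from the backups.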
Production Checklist (Summary)
| Area | Requirement | Priority |
|------|-------------|----------|
| Networking | Azure CNI + Private Cluster | Critical |
| Identity | Workload Identity + Entra ID RBAC | Critical |
| Scaling | Cluster Autoscaler + min 3 system nodes | High |
| Monitoring | Container Insights + alerting | High |
| Security | Network Policy (Calico) + Pod Security Standards | High |
| Backup | Velero/Azure Backup + GitOps | Medium |
| Cost | Spot node pools + right-sizing | Medium |
Conclusion
AKS is a great platform, but it requires an architectural approach. Don't underestimate networking and identity – these two pillars determine whether your cluster will survive its first security audit.
About the author

Martin Rylko
Senior Cloud Architect & DevOps Engineer
14+ years in IT – from on-premises datacenters and Hyper-V clustering to cloud infrastructure on Microsoft Azure. I specialize in Landing Zones, IaC automation, Kubernetes and security compliance.