
Kubernetes AKS Production Checklist for Architects

11/15/2025 · 3 min read
#Kubernetes #Azure #AKS #DevOps

Azure Kubernetes Service (AKS) is fantastic for running containerized workloads, but the transition from dev/test environments to full-scale production requires careful preparation. This checklist is based on my real-world experience deploying AKS clusters for FinTech and enterprise clients.

Networking – The Foundation of Everything

Azure CNI vs Kubenet

For production, always use Azure CNI (or Azure CNI Overlay for larger clusters). Kubenet is fine for dev, but in production you need:

  • Direct pod IP addressing within the Azure VNet
  • Network Policy support (Calico/Azure)
  • Integration with Azure Private Link
A minimal Bicep sketch of this network profile (dnsPrefix, identity and agentPoolProfiles omitted for brevity):

resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: 'aks-prod-westeurope'
  location: 'westeurope'
  properties: {
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'calico'
      serviceCidr: '10.250.0.0/16'
      dnsServiceIP: '10.250.0.10'
      loadBalancerSku: 'standard'
      outboundType: 'userDefinedRouting'
    }
    apiServerAccessProfile: {
      enablePrivateCluster: true
      privateDNSZone: 'system'
    }
  }
}

Private Cluster is Mandatory

A publicly accessible API server in production? Absolutely not. Private cluster + Azure Private DNS Zone + VPN/ExpressRoute for on-premises access.
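As a sketch, the same setup via the Azure CLI (resource group, cluster name and subnet ID are placeholders):

```shell
# Create a private AKS cluster with a system-managed private DNS zone.
az aks create \
  --resource-group rg-aks-prod \
  --name aks-prod-westeurope \
  --enable-private-cluster \
  --private-dns-zone system \
  --network-plugin azure \
  --network-policy calico \
  --vnet-subnet-id "<aks-subnet-resource-id>" \
  --outbound-type userDefinedRouting
```

With userDefinedRouting, remember that the subnet's route table must already send egress through your firewall or NVA, otherwise node provisioning fails.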

Identity & RBAC

Workload Identity (not Pod Identity!)

Pod Identity is deprecated. We're moving to Workload Identity Federation:

  1. Create a Managed Identity in Azure
  2. Set up federated credentials for a Kubernetes Service Account
  3. The application automatically obtains Azure tokens without stored secrets

Kubernetes RBAC + Entra ID

# ClusterRoleBinding – only the Entra ID group has admin access
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aks-cluster-admins
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: Group
    name: "00000000-0000-0000-0000-000000000000"  # Entra ID Group Object ID
    apiGroup: rbac.authorization.k8s.io
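For the binding above to mean anything, Entra ID integration has to be enabled on the cluster. A hedged sketch (group ID is a placeholder):

```shell
# Enable Entra ID integration and designate an admin group
az aks update \
  --resource-group rg-aks-prod \
  --name aks-prod-westeurope \
  --enable-aad \
  --aad-admin-group-object-ids 00000000-0000-0000-0000-000000000000

# Disable local accounts so Entra ID is the only way in
az aks update -g rg-aks-prod -n aks-prod-westeurope --disable-local-accounts
```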

Scaling & High Availability

Node Pool Strategy

  • System pool: Minimum 3 nodes, dedicated to system pods (CoreDNS, metrics-server) via the CriticalAddonsOnly taint
  • User pool(s): Separate pools for different workloads, with taints and tolerations
  • Spot pool: For batch jobs and non-critical workloads (up to 90% savings)
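A sketch of the user and spot pools (pool names and VM sizes are placeholders; AKS automatically taints spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule):

```shell
# Dedicated user pool with a workload taint
az aks nodepool add \
  --resource-group rg-aks-prod --cluster-name aks-prod-westeurope \
  --name userpool --node-count 3 --node-vm-size Standard_D4s_v5 \
  --node-taints "workload=apps:NoSchedule"

# Spot pool for batch and non-critical workloads
az aks nodepool add \
  --resource-group rg-aks-prod --cluster-name aks-prod-westeurope \
  --name spotpool --priority Spot --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler --min-count 0 --max-count 10
```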

Cluster Autoscaler + KEDA

Cluster Autoscaler for horizontal node scaling. KEDA for event-driven pod scaling based on metrics (Azure Service Bus queue depth, HTTP requests).
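Both pieces can be switched on per cluster or per pool; a sketch with placeholder names and limits:

```shell
# Cluster Autoscaler on the user pool
az aks nodepool update \
  --resource-group rg-aks-prod --cluster-name aks-prod-westeurope \
  --name userpool --enable-cluster-autoscaler --min-count 3 --max-count 12

# Managed KEDA add-on for event-driven pod scaling
az aks update -g rg-aks-prod -n aks-prod-westeurope --enable-keda
```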

Monitoring & Observability

Required Stack

  1. Azure Monitor + Container Insights – node and pod metrics
  2. Prometheus + Grafana (managed via Azure Monitor) – custom dashboards
  3. Alerting – node pool CPU/memory > 80%, pod restart counts, OOMKilled events
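One of those alerts as a sketch, using the platform metric for node CPU (action group ID and scope are placeholders):

```shell
# Alert when average node CPU exceeds 80% over a 5-minute window
az monitor metrics alert create \
  --name "aks-node-cpu-high" \
  --resource-group rg-aks-prod \
  --scopes "<aks-cluster-resource-id>" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m --evaluation-frequency 1m \
  --action "<action-group-resource-id>"
```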

Log Aggregation

All application logs into Log Analytics Workspace. Set retention to at least 90 days for compliance (NIS2).
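Setting the retention is a one-liner (workspace name is a placeholder):

```shell
# Raise Log Analytics retention to 90 days for compliance
az monitor log-analytics workspace update \
  --resource-group rg-aks-prod \
  --workspace-name law-aks-prod \
  --retention-time 90
```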

Backup & Disaster Recovery

  • Velero or Azure Backup for AKS – backing up PVs and cluster state
  • Multi-region deployment – Active/Passive with Azure Traffic Manager or Front Door
  • GitOps (ArgoCD/Flux) – entire cluster state versioned in Git, recovery = git push
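As a sketch of the Velero option (assumes Velero is already installed with the Azure plugin; namespace and schedule are placeholders):

```shell
# Nightly backup of an application namespace, including persistent volumes,
# retained for 30 days
velero schedule create daily-apps \
  --schedule "0 2 * * *" \
  --include-namespaces myapp \
  --snapshot-volumes \
  --ttl 720h
```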

Production Checklist (Summary)

Area       | Requirement                                      | Priority
-----------|--------------------------------------------------|---------
Networking | Azure CNI + Private Cluster                      | Critical
Identity   | Workload Identity + Entra ID RBAC                | Critical
Scaling    | Cluster Autoscaler + min 3 system nodes          | High
Monitoring | Container Insights + alerting                    | High
Security   | Network Policy (Calico) + Pod Security Standards | High
Backup     | Velero/Azure Backup + GitOps                     | Medium
Cost       | Spot node pools + right-sizing                   | Medium

Conclusion

AKS is a great platform, but it requires an architectural approach. Don't underestimate networking and identity – these two pillars determine whether your cluster will survive its first security audit. For securing AKS access with identity-based controls, see our Zero Trust Conditional Access guide.

Need help designing an AKS architecture for your project? Explore our full range of cloud architecture services or reach out for a free consultation.

Tags: #Kubernetes #Azure #AKS #DevOps

About the author

Martin Rylko

Senior Cloud Architect & DevOps Engineer

14+ years in IT – from on-premises datacenters and Hyper-V clustering to cloud infrastructure on Microsoft Azure. I specialize in Landing Zones, IaC automation, Kubernetes and security compliance.


Frequently Asked Questions

How much does AKS cost compared to self-managed Kubernetes on Azure VMs?
The AKS control plane itself is free -- you only pay for the worker node VMs, storage, and networking. A typical production cluster with 3 Standard_D4s_v5 nodes runs approximately $350-450/month in West Europe. Self-managed K8s on VMs adds 15-25% overhead for etcd management, API server HA, and certificate rotation that AKS handles automatically.
Should I use AKS managed control plane or deploy my own Kubernetes control plane?
Use AKS managed control plane for virtually all production scenarios. Microsoft handles etcd backups, API server scaling, and Kubernetes version patches. The only case for self-managed is extreme regulatory environments that require full control plane isolation, and even then Azure Dedicated Hosts with AKS covers most compliance requirements.
What is the recommended AKS cluster upgrade strategy to avoid downtime?
Use node image auto-upgrade with the "node-image" channel and Kubernetes version upgrades with the "patch" channel for automatic security patches. For minor version upgrades (e.g., 1.28 to 1.29), schedule manual upgrades with PodDisruptionBudgets configured. Always maintain a blue-green node pool strategy: add a new pool on the target version, cordon and drain the old pool, then remove it.
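The blue-green swap can be sketched as follows (pool names and the target version are placeholders; AKS labels each node with agentpool=&lt;pool-name&gt;):

```shell
# 1. Add a new pool on the target Kubernetes version
az aks nodepool add -g rg-aks-prod --cluster-name aks-prod-westeurope \
  --name green --node-count 3 --kubernetes-version 1.29.4

# 2. Cordon and drain the old pool's nodes (PDBs throttle the drain)
kubectl cordon -l agentpool=blue
kubectl drain -l agentpool=blue --ignore-daemonsets --delete-emptydir-data

# 3. Remove the old pool
az aks nodepool delete -g rg-aks-prod --cluster-name aks-prod-westeurope \
  --name blue
```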
What networking plugin should I use for AKS in production -- Azure CNI or Kubenet?
Always use Azure CNI (or Azure CNI Overlay for clusters exceeding 400 nodes). Azure CNI gives pods first-class VNet IP addresses, enabling direct integration with NSGs, UDRs, and Private Endpoints. Kubenet relies on NAT and route tables, which limits your network policy options and complicates Azure service integration. CNI Overlay is the best choice for large clusters since it decouples the pod CIDR from the VNet address space.

You might also like

AKS Breaking Changes: What Is Retiring in March 2026 and How to Migrate

Windows Server 2019, Azure Linux 2.0, and kubelet certificate rotation – three AKS retirements with March 2026 deadlines. Practical migration guide with CLI commands and Bicep templates.


Terraform Azure Modules: Private Registry and Testing

Build reusable Terraform modules for Azure with private registry publishing, automated testing with Terratest, and versioned module consumption in production.


Terraform Azure Best Practices: Modules & CI/CD

Terraform Azure best practices for production projects. Covers remote state locking, module structure, drift detection, naming conventions, and testing.
