AKS Cilium NetworkPolicy: Migrating From Calico Without Production Downtime
When Microsoft announced Azure CNI Powered by Cilium as the default for new AKS clusters at Build 2026, it opened a question we had been deferring for a year at Creditas: when and how to migrate existing production clusters off Calico. This article is the distilled version of what we did, where we got stuck, and what I would do differently next time.
Why Cilium and Not Calico
Calico works fine on AKS, but architecturally it sits on top of iptables. That means:
- CPU overhead grows linearly with the number of network policies – above 50 policies it starts to show
- The conntrack table is a single bottleneck at high throughput
- L7 filtering requires a separate sidecar (Envoy via Calico Enterprise)
Cilium uses eBPF programs directly in the kernel instead of iptables. The result:
| Aspect | Calico (iptables) | Cilium (eBPF) |
|---|---|---|
| Network policy enforcement | iptables chain traversal | eBPF program in the kernel |
| L7 HTTP filtering | Requires Envoy sidecar | Built-in |
| Identity-based policies | No | Yes (via ServiceAccount) |
| FQDN-based policies | No | Yes |
| Conntrack | Single kernel table | Cilium-managed BPF maps |
| Pod-to-pod overhead | ~10–15% CPU at 1k pps | ~3–5% CPU at 1k pps |
| Multi-cluster mesh | Calico Enterprise (paid) | Cilium Cluster Mesh (open source) |
In my Creditas measurements (8 nodes, ~200 pods, ~30 000 connections/s peak) the result was clear:
p50 latency service-to-service:
  Calico: 1.8 ms
  Cilium: 1.2 ms (-33%)
p99 latency:
  Calico: 18 ms
  Cilium: 11 ms (-39%)
Worker node CPU (idle network policy load):
  Calico: 14% baseline
  Cilium: 8% baseline (-43% relative)
Migration Architecture: New Cluster vs In-Place
There is no single-step upgrade from Calico to Cilium on AKS. You have two options:
Option A – in-place migration (officially supported since summer 2024):
- az aks update --network-policy cilium – enables the Cilium control plane
- Create new node pools with --enable-cilium-dataplane
- Drain the old nodes
- Remove the old node pools
The catch: every node pool must be recreated. For a cluster with 10 specialized pools that is three weeks of careful operations. And during migration Cilium and Calico run side by side, which adds unexpected complexity.
Option B – parallel new cluster (what we did at Creditas):
- Provision a new AKS cluster with Cilium from the start
- GitOps (Flux) copies workloads into the new cluster
- Gradual traffic cutover (DNS + Front Door)
- Delete the old cluster
At Creditas the cutover took 6 weeks with zero production downtime. I recommend option B for anyone running GitOps – it is cleaner and reversible.
New Cluster With Cilium: Bicep Template
// Assumed parameter: the original snippet references location without declaring it
param location string = resourceGroup().location

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: 'aks-prod-cilium'
  location: location
  identity: { type: 'SystemAssigned' }
  sku: {
    name: 'Base'
    tier: 'Standard'
  }
  properties: {
    kubernetesVersion: '1.31.0'
    dnsPrefix: 'aksprodcil'
    networkProfile: {
      networkPlugin: 'azure'
      networkPluginMode: 'overlay'
      networkDataplane: 'cilium' // KEY – Cilium dataplane
      networkPolicy: 'cilium' // Cilium NetworkPolicy enforcement
      loadBalancerSku: 'standard'
      serviceCidr: '10.0.0.0/16'
      dnsServiceIP: '10.0.0.10'
      podCidr: '10.244.0.0/16' // overlay pod CIDR
    }
    agentPoolProfiles: [
      {
        name: 'system'
        count: 3
        vmSize: 'Standard_D4s_v5'
        osSKU: 'AzureLinux'
        mode: 'System'
        availabilityZones: ['1', '2', '3']
      }
    ]
    addonProfiles: {
      azureKeyvaultSecretsProvider: {
        enabled: true
        config: { enableSecretRotation: 'true' }
      }
    }
    securityProfile: {
      workloadIdentity: { enabled: true }
    }
    oidcIssuerProfile: { enabled: true }
  }
}
Three critical properties:
- networkDataplane: 'cilium' – activates the Cilium eBPF dataplane (vs 'azure' = iptables)
- networkPolicy: 'cilium' – must be cilium (not calico, not empty)
- networkPluginMode: 'overlay' – Cilium supports overlay or non-overlay; overlay is recommended for new clusters (decouples pod CIDR from VNet)
Migrating Existing NetworkPolicy
Good news: existing kind: NetworkPolicy manifests work unchanged. Cilium fully implements the Kubernetes NetworkPolicy API.
# Existing Calico policy – works on Cilium unchanged
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: frontend }
      ports:
        - protocol: TCP
          port: 8080
The migration check is trivial – kubectl apply to the new cluster, then cilium policy get inside a Cilium agent pod to confirm the policy was imported. Note that this is the agent's own CLI (cilium-dbg on newer releases), not the standalone cilium-cli.
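To make the semantics of the manifest above concrete, here is a minimal Python sketch of how a standard NetworkPolicy ingress rule selects traffic. It is an illustration only, not Cilium's implementation: the function names are hypothetical, and it assumes same-namespace podSelector peers with matchLabels and numeric ports.

```python
def labels_match(selector: dict, labels: dict) -> bool:
    """True when every key/value of a matchLabels selector is present on the pod."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policy: dict, src_labels: dict, dst_labels: dict,
                    protocol: str, port: int) -> bool:
    """Evaluate a single policy for one connection attempt (hypothetical helper)."""
    spec = policy["spec"]
    # The policy only restricts pods chosen by spec.podSelector.
    if not labels_match(spec["podSelector"].get("matchLabels", {}), dst_labels):
        return True  # destination not selected: this policy does not restrict it
    for rule in spec.get("ingress", []):
        peers = rule.get("from", [])
        ports = rule.get("ports", [])
        # An empty "from" or "ports" list means "any peer" or "any port".
        peer_ok = not peers or any(
            "podSelector" in peer
            and labels_match(peer["podSelector"].get("matchLabels", {}), src_labels)
            for peer in peers
        )
        port_ok = not ports or any(
            p.get("protocol", "TCP") == protocol and p.get("port") == port
            for p in ports
        )
        if peer_ok and port_ok:
            return True
    return False  # destination selected, no rule matched: ingress is denied


# The api-allow-frontend policy from the article, as a parsed manifest.
POLICY = {
    "spec": {
        "podSelector": {"matchLabels": {"app": "api"}},
        "policyTypes": ["Ingress"],
        "ingress": [{
            "from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}],
            "ports": [{"protocol": "TCP", "port": 8080}],
        }],
    }
}

print(ingress_allowed(POLICY, {"app": "frontend"}, {"app": "api"}, "TCP", 8080))
print(ingress_allowed(POLICY, {"app": "batch"}, {"app": "api"}, "TCP", 8080))
```

The first call models the frontend reaching the API on port 8080; the second models an unrelated pod, which the policy rejects because the API pod is selected and no rule matches.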
Adding CiliumNetworkPolicy for Advanced Use Cases
This is where it gets fun. Standard NetworkPolicy is L3/L4 (IP + port). Cilium adds L7 (HTTP method, path, headers):
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-restrictions
  namespace: prod
spec:
  endpointSelector:
    matchLabels: { app: api }
  ingress:
    - fromEndpoints:
        - matchLabels: { app: frontend }
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/v1/.*"
              - method: POST
                path: "/api/v1/items"
                headers:
                  - "X-Tenant-ID: .+"
What this policy enforces:
- The frontend can call the API over HTTP
- Only GET /api/v1/* and POST /api/v1/items
- POST must carry an X-Tenant-ID header (multi-tenancy)
- Everything else (PUT, DELETE, other paths) is rejected by Cilium before it reaches the application
No application change, no per-pod sidecar, no API gateway. The eBPF datapath redirects matching connections to the Envoy proxy that Cilium runs on each node, and that proxy makes the L7 decision.
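The rule list above can be read as an allow-list: a request passes if any one rule matches in full. The Python sketch below mimics that evaluation for illustration. It is not how Cilium implements it; the names are hypothetical, method and path are assumed to be anchored regular expressions, and the header rule is simplified to "header present with a non-empty value".

```python
import re

# Hypothetical in-memory form of the two HTTP rules from the policy above.
HTTP_RULES = [
    {"method": "GET", "path": "/api/v1/.*"},
    {"method": "POST", "path": "/api/v1/items", "headers": ["X-Tenant-ID"]},
]


def http_allowed(method: str, path: str, headers: dict, rules=HTTP_RULES) -> bool:
    """Return True if at least one rule matches method, path and required headers."""
    # HTTP header names are case-insensitive, so compare them lowercased.
    present = {name.lower(): value for name, value in headers.items()}
    for rule in rules:
        if "method" in rule and not re.fullmatch(rule["method"], method):
            continue
        if "path" in rule and not re.fullmatch(rule["path"], path):
            continue
        # Simplification: a required header must exist and be non-empty.
        if all(present.get(name.lower()) for name in rule.get("headers", [])):
            return True
    return False  # no rule matched: the request is rejected


print(http_allowed("GET", "/api/v1/users", {}))
print(http_allowed("POST", "/api/v1/items", {"X-Tenant-ID": "acme"}))
print(http_allowed("POST", "/api/v1/items", {}))
print(http_allowed("DELETE", "/api/v1/items", {"X-Tenant-ID": "acme"}))
```

Running requests from your own integration tests through a model like this is a cheap way to spot paths the allow-list would block before the policy reaches a cluster.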
Identity-Aware Policies (Game Changer)
The second Cilium killer feature for us at Creditas: policies by ServiceAccount identity instead of pod labels.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: db-access-by-sa
  namespace: prod
spec:
  endpointSelector:
    matchLabels: { app: postgres }
  ingress:
    - fromEndpoints:
        - matchLabels:
            # Cilium-specific: ServiceAccount as identity
            io.cilium.k8s.policy.serviceaccount: api-sa
      toPorts:
        - ports: [{ port: "5432", protocol: TCP }]
Why it matters: pod labels can be spoofed (compromised pod, RBAC hole). ServiceAccount identity is bound to the Kubernetes auth subsystem and cannot be spoofed from a compromised pod. For a regulated workload (PCI, GDPR) this is a fundamental difference.
Cutover Plan: 6 Weeks at Creditas
| Week | Activity | Risk |
|---|---|---|
| 1 | Provision new cluster with Cilium, dry-run GitOps sync | None |
| 2 | Sync all namespaces, smoke tests, performance baseline | None |
| 3 | Cutover dev/test traffic via Front Door | Low |
| 4 | Cutover staging traffic, integration test suite | Low |
| 5 | Canary 10% of production traffic | Medium |
| 6 | Full cutover, monitoring, delete old cluster | Low |
Key enabler: Front Door routing rules with percentage-based split allowed granular cutover without DNS TTL pain. If you do not use Front Door, the same strategy works via Application Gateway or an external load balancer.
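As a rough illustration of what a percentage-based split does to traffic, the sketch below simulates routing each request to the old or new cluster with a given weight. It is a toy model under stated assumptions, not Front Door's actual algorithm; the function names and the fixed seed are hypothetical.

```python
import random


def pick_cluster(new_cluster_percent: int, rng: random.Random) -> str:
    """Route one request: 'new' with the given probability, otherwise 'old'."""
    if not 0 <= new_cluster_percent <= 100:
        raise ValueError("new_cluster_percent must be between 0 and 100")
    return "new" if rng.random() * 100 < new_cluster_percent else "old"


def simulate(new_cluster_percent: int, requests: int = 10_000, seed: int = 42) -> float:
    """Return the observed share of requests that reached the new cluster."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    hits = sum(pick_cluster(new_cluster_percent, rng) == "new" for _ in range(requests))
    return hits / requests


# Weights for weeks 5 and 6 of the plan above: canary first, then full cutover.
for week, percent in ((5, 10), (6, 100)):
    print(f"week {week}: target {percent}%, observed {simulate(percent):.1%}")
```

The useful property is reversibility: setting the weight back to 0 sends all traffic to the old cluster again without waiting for DNS caches to expire.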
Three Traps We Got Stuck In
- CoreDNS in a Cilium cluster does not enable NodeLocal DNS Cache automatically – we had it enabled in the existing cluster, not in the new one. Detected after a week – some DNS lookups were 5–8 ms slower. Fix: az aks update --enable-node-local-dns
- Cilium Hubble (observability) is not on by default – you must explicitly enable --enable-hubble. Without Hubble you lose flow visibility, which makes debugging policy issues much harder
- A CiliumNetworkPolicy syntax error blocks the deploy – Calico is more lenient. CiliumNetworkPolicy validation is strict – any CRD error fails the deployment. I recommend cilium policy validate as a pre-commit hook
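Alongside cilium policy validate, a pre-commit hook can run a cheap structural check that needs no cluster or Cilium binary. The sketch below is a hypothetical helper, not part of any Cilium tooling: it inspects an already-parsed manifest (loading the YAML is left to the hook), handles only the single spec form, and checks a few common mistakes such as an unquoted port number.

```python
VALID_PROTOCOLS = {"TCP", "UDP", "SCTP", "ANY"}


def check_cnp(doc: dict) -> list:
    """Return a list of human-readable problems found in one parsed manifest."""
    errors = []
    if doc.get("apiVersion") != "cilium.io/v2":
        errors.append("apiVersion must be cilium.io/v2")
    if doc.get("kind") != "CiliumNetworkPolicy":
        errors.append("kind must be CiliumNetworkPolicy")
    if not doc.get("metadata", {}).get("name"):
        errors.append("metadata.name is required")
    spec = doc.get("spec", {})
    if "endpointSelector" not in spec and "nodeSelector" not in spec:
        errors.append("spec needs an endpointSelector or nodeSelector")
    for direction in ("ingress", "egress"):
        for i, rule in enumerate(spec.get(direction, [])):
            for j, to_port in enumerate(rule.get("toPorts", [])):
                for k, port in enumerate(to_port.get("ports", [])):
                    where = f"spec.{direction}[{i}].toPorts[{j}].ports[{k}]"
                    # Unlike NetworkPolicy, CiliumNetworkPolicy ports are strings.
                    if not isinstance(port.get("port"), str):
                        errors.append(f"{where}.port must be a quoted string")
                    if port.get("protocol", "ANY") not in VALID_PROTOCOLS:
                        errors.append(f"{where}.protocol is not a known protocol")
    return errors


# Example: the db-access-by-sa policy with the port accidentally left unquoted.
BROKEN = {
    "apiVersion": "cilium.io/v2",
    "kind": "CiliumNetworkPolicy",
    "metadata": {"name": "db-access-by-sa", "namespace": "prod"},
    "spec": {
        "endpointSelector": {"matchLabels": {"app": "postgres"}},
        "ingress": [{"toPorts": [{"ports": [{"port": 5432, "protocol": "TCP"}]}]}],
    },
}

for problem in check_cnp(BROKEN):
    print(problem)
```

A check like this catches the typos that would otherwise fail the deployment, while the real validator remains the authority on the full schema.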
Conclusion
Migrating AKS from Calico to Cilium is not trivial, but 2026 is the year it pays off. A 39% reduction in p99 latency and a drop in baseline worker node CPU from 14% to 8% justify a 6-week migration in any environment with serious traffic. CiliumNetworkPolicy with L7 and identity-aware filtering opens use cases we used to handle with Istio sidecars.
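The relative figures can be recomputed from the raw Calico and Cilium measurements quoted earlier in the article. The short Python check below does only that arithmetic; the helper name is hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from 'before' to 'after' (negative means a reduction)."""
    return (after - before) / before * 100


# (Calico, Cilium) pairs taken from the measurements section.
MEASUREMENTS = {
    "p50 latency (ms)": (1.8, 1.2),
    "p99 latency (ms)": (18, 11),
    "node CPU baseline (%)": (14, 8),
}

for name, (calico, cilium) in MEASUREMENTS.items():
    print(f"{name}: {relative_change(calico, cilium):+.0f}%")
```

Note that the CPU figure is a relative change: in absolute terms the baseline fell by 6 percentage points.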
If you are planning a similar migration or a fresh AKS cluster in 2026, check out our cloud architecture services or reach out for a Cilium migration walkthrough.
About the author
Martin Rylko
Senior Cloud Architect & DevOps Engineer
14+ years in IT – from on-premises datacenters and Hyper-V clustering to cloud infrastructure on Microsoft Azure. I specialize in Landing Zones, IaC automation, Kubernetes and security compliance.