
AKS Cilium NetworkPolicy: Migrating From Calico Without Production Downtime

3/2/2026, 5 min read
#AKS #Kubernetes #Cilium #Networking #Azure


When Microsoft announced Azure CNI Powered by Cilium as the default for new AKS clusters at Build 2026, it opened a question we had been deferring for a year at Creditas: when and how to migrate existing production clusters off Calico. This article is the distilled version of what we did, where we got stuck, and what I would do differently next time.

Why Cilium and Not Calico

Calico works fine on AKS, but architecturally it sits on top of iptables. That means:

  • CPU overhead grows linearly with the number of network policies – above 50 policies it starts to show
  • The conntrack table is a single bottleneck at high throughput
  • L7 filtering requires a separate sidecar (Envoy via Calico Enterprise)

Cilium uses eBPF programs directly in the kernel instead of iptables. The result:

Aspect | Calico (iptables) | Cilium (eBPF)
Network policy enforcement | iptables chain traversal | eBPF program in the kernel
L7 HTTP filtering | Requires Envoy sidecar | Built-in
Identity-based policies | No | Yes (via ServiceAccount)
FQDN-based policies | No | Yes
Conntrack | Single kernel table | Cilium-managed BPF maps
Pod-to-pod overhead | ~10–15% CPU at 1k pps | ~3–5% CPU at 1k pps
Multi-cluster mesh | Calico Enterprise (paid) | Cilium Cluster Mesh (open source)

In my Creditas measurements (8 nodes, ~200 pods, ~30 000 connections/s peak) the result was clear:

p50 latency service-to-service:
  Calico:  1.8 ms
  Cilium:  1.2 ms  (-33%)
 
p99 latency:
  Calico:  18 ms
  Cilium:  11 ms   (-39%)
 
Worker node CPU (idle network policy load):
  Calico:  14% baseline
  Cilium:  8% baseline   (-43% relative)
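
The percentages in parentheses are relative changes. A short Python check, using only the figures quoted above, reproduces them:

```python
# Relative change between the Calico and Cilium figures quoted above.
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100

# (Calico, Cilium) pairs copied from the measurements in this section.
measurements = {
    "p50 latency (ms)": (1.8, 1.2),
    "p99 latency (ms)": (18, 11),
    "idle node CPU (%)": (14, 8),
}

for name, (calico, cilium) in measurements.items():
    print(f"{name}: {relative_change(calico, cilium):+.0f}%")
```

Note that the CPU figure is a relative change: in absolute terms the idle baseline drops by 6 percentage points (14% to 8%).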

Migration Architecture: New Cluster vs In-Place

There is no one-step upgrade from Calico to Cilium on AKS. You have two options:

Option A – in-place migration (officially supported since summer 2024):

  1. az aks update --network-policy cilium – enables the Cilium control plane
  2. Create new node pools with --enable-cilium-dataplane
  3. Drain the old nodes
  4. Remove the old node pools

The catch: every node pool must be recreated. For a cluster with 10 specialized pools that is three weeks of careful operations. And during migration Cilium and Calico run side by side, which adds unexpected complexity.

Option B – parallel new cluster (what we did at Creditas):

  1. Provision a new AKS cluster with Cilium from the start
  2. GitOps (Flux) copies workloads into the new cluster
  3. Gradual traffic cutover (DNS + Front Door)
  4. Delete the old cluster

At Creditas the cutover took 6 weeks with zero production downtime. I recommend option B for anyone running GitOps – it is cleaner and reversible.

New Cluster With Cilium: Bicep Template

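// Assumed default: deploy to the resource group's region unless overridden
param location string = resourceGroup().location

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {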
  name: 'aks-prod-cilium'
  location: location
  identity: { type: 'SystemAssigned' }
  sku: {
    name: 'Base'
    tier: 'Standard'
  }
  properties: {
    kubernetesVersion: '1.31.0'
    dnsPrefix: 'aksprodcil'
    networkProfile: {
      networkPlugin: 'azure'
      networkPluginMode: 'overlay'
      networkDataplane: 'cilium'      // KEY – Cilium dataplane
      networkPolicy: 'cilium'          // Cilium NetworkPolicy enforcement
      loadBalancerSku: 'standard'
      serviceCidr: '10.0.0.0/16'
      dnsServiceIP: '10.0.0.10'
      podCidr: '10.244.0.0/16'        // overlay pod CIDR
    }
    agentPoolProfiles: [
      {
        name: 'system'
        count: 3
        vmSize: 'Standard_D4s_v5'
        osSKU: 'AzureLinux'
        mode: 'System'
        availabilityZones: ['1', '2', '3']
      }
    ]
    addonProfiles: {
      azureKeyvaultSecretsProvider: {
        enabled: true
        config: { enableSecretRotation: 'true' }
      }
    }
    securityProfile: {
      workloadIdentity: { enabled: true }
    }
    oidcIssuerProfile: { enabled: true }
  }
}

Three critical properties:

  1. networkDataplane: 'cilium' – activates the Cilium eBPF dataplane (vs 'azure' = iptables)
  2. networkPolicy: 'cilium' – must be cilium (not calico, not empty)
  3. networkPluginMode: 'overlay' – Cilium supports overlay or non-overlay; overlay is recommended for new clusters (decouples pod CIDR from VNet)
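
These three properties are easy to guard in CI. The sketch below is one hypothetical way to do it in Python, assuming the template has been compiled to ARM JSON (for example with az bicep build) and the networkProfile object has been loaded into a dict. The function name and messages are illustrative and not part of any Azure tooling.

```python
def check_network_profile(profile: dict) -> list[str]:
    """Return a list of problems found in an AKS networkProfile dict."""
    problems = []
    if profile.get("networkDataplane") != "cilium":
        problems.append("networkDataplane must be 'cilium'")
    if profile.get("networkPolicy") != "cilium":
        problems.append("networkPolicy must be 'cilium'")
    if profile.get("networkPluginMode") != "overlay":
        # Overlay is the recommendation from this article, not a hard requirement.
        problems.append("networkPluginMode is not 'overlay' (recommended)")
    return problems


# Values taken from the Bicep template above.
good = {
    "networkPlugin": "azure",
    "networkPluginMode": "overlay",
    "networkDataplane": "cilium",
    "networkPolicy": "cilium",
}
# A Calico-era profile, for comparison.
legacy = {"networkPlugin": "azure", "networkPolicy": "calico"}

print(check_network_profile(good))
print(check_network_profile(legacy))
```

Treating the overlay check as a warning rather than a failure would also be reasonable, since non-overlay mode is supported.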

Migrating Existing NetworkPolicy

Good news: existing kind: NetworkPolicy manifests work unchanged. Cilium fully implements the Kubernetes NetworkPolicy API.

# Existing Calico policy – works on Cilium unchanged
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels: { app: frontend }
    ports:
    - protocol: TCP
      port: 8080

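The migration check is trivial: kubectl apply the manifests to the new cluster, then confirm they were imported with cilium policy get, run inside a Cilium agent pod (the command belongs to the agent's own CLI, named cilium-dbg in recent releases, not to cilium-cli).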

Adding CiliumNetworkPolicy for Advanced Use Cases

This is where it gets fun. Standard NetworkPolicy is L3/L4 (IP + port). Cilium adds L7 (HTTP method, path, headers):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-restrictions
  namespace: prod
spec:
  endpointSelector:
    matchLabels: { app: api }
  ingress:
  - fromEndpoints:
    - matchLabels: { app: frontend }
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/api/v1/.*"
        - method: POST
          path: "/api/v1/items"
          headers:
          - "X-Tenant-ID: .+"

What this policy enforces:

  • The frontend can call the API over HTTP
  • Only GET /api/v1/* and POST /api/v1/items
  • POST must carry an X-Tenant-ID header (multi-tenancy)
  • Everything else (PUT, DELETE, other paths) is rejected by Cilium before it reaches the application

No application change, no per-pod Envoy sidecar, no API gateway. eBPF in the kernel redirects traffic covered by an L7 rule to the Envoy proxy that Cilium runs on each node, and that proxy makes the L7 decision.
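
To make the rule semantics concrete, here is a small Python model of the allow-list above. It only illustrates which requests the policy is meant to admit, not how Cilium evaluates them. It assumes the path regex must match the whole path and that the header rule means "header present with a non-empty value"; the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpRule:
    method: str
    path: str                                # regex, matched against the full path
    required_headers: tuple[str, ...] = ()   # headers that must be present


# Mirrors the two http rules in the CiliumNetworkPolicy above.
RULES = [
    HttpRule("GET", "/api/v1/.*"),
    HttpRule("POST", "/api/v1/items", ("X-Tenant-ID",)),
]


def is_allowed(method: str, path: str, headers: dict[str, str]) -> bool:
    """Return True if any rule admits the request; everything else is denied."""
    # HTTP header names are case-insensitive, so compare in lower case.
    present = {name.lower() for name, value in headers.items() if value}
    for rule in RULES:
        if method != rule.method:
            continue
        if re.fullmatch(rule.path, path) is None:
            continue
        if all(h.lower() in present for h in rule.required_headers):
            return True
    return False


print(is_allowed("GET", "/api/v1/items/42", {}))
print(is_allowed("POST", "/api/v1/items", {"X-Tenant-ID": "acme"}))
print(is_allowed("POST", "/api/v1/items", {}))
print(is_allowed("DELETE", "/api/v1/items/42", {}))
```

A table of such cases is a useful starting point for connectivity tests against the real cluster.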

Identity-Aware Policies (Game Changer)

The second Cilium killer feature for us at Creditas: policies by ServiceAccount identity instead of pod labels.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: db-access-by-sa
  namespace: prod
spec:
  endpointSelector:
    matchLabels: { app: postgres }
  ingress:
  - fromEndpoints:
    - matchLabels:
        # Cilium-specific: ServiceAccount as identity
        io.cilium.k8s.policy.serviceaccount: api-sa
    toPorts:
    - ports: [{ port: "5432", protocol: TCP }]

Why it matters: pod labels can be spoofed (compromised pod, RBAC hole). ServiceAccount identity is bound to the Kubernetes auth subsystem and cannot be spoofed from a compromised pod. For a regulated workload (PCI, GDPR) this is a fundamental difference.

Cutover Plan: 6 Weeks at Creditas

Week | Activity | Risk
1 | Provision new cluster with Cilium, dry-run GitOps sync | None
2 | Sync all namespaces, smoke tests, performance baseline | None
3 | Cutover dev/test traffic via Front Door | Low
4 | Cutover staging traffic, integration test suite | Low
5 | Canary 10% of production traffic | Medium
6 | Full cutover, monitoring, delete old cluster | Low

Key enabler: Front Door routing rules with percentage-based split allowed granular cutover without DNS TTL pain. If you do not use Front Door, the same strategy works via Application Gateway or an external load balancer.
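
A percentage-based split boils down to relative weights on the two origins. The following Python sketch simulates that idea for the week 5 canary. It is not a Front Door API call; the origin names and function names are hypothetical, and the random seed is fixed only to make the run repeatable.

```python
import random


def origin_weights(new_cluster_percent: int) -> dict[str, int]:
    """Translate a canary percentage into relative weights for two origins."""
    if not 0 <= new_cluster_percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "old-cluster": 100 - new_cluster_percent,
        "new-cluster": new_cluster_percent,
    }


def simulate(weights: dict[str, int], requests: int, seed: int = 0) -> dict[str, int]:
    """Route `requests` simulated requests by weight and count hits per origin."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=requests)
    return {name: picks.count(name) for name in names}


canary = origin_weights(10)          # week 5: 10% of production traffic
print(canary)
print(simulate(canary, 10_000))      # roughly one request in ten hits the new cluster
```

Rolling back is the same operation in reverse: setting the new cluster's share back to 0 returns all traffic to the old cluster.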

Three Traps We Got Stuck In

  1. CoreDNS in a Cilium cluster does not enable NodeLocal DNS Cache automatically – we had it enabled in the existing cluster, not in the new one. Detected after a week – some DNS lookups were 5–8 ms slower. Fix: az aks update --enable-node-local-dns
  2. Cilium Hubble (observability) is not on by default – you must explicitly enable --enable-hubble. Without Hubble you lose flow visibility, which makes debugging policy issues much harder
  3. A CiliumNetworkPolicy syntax error blocks the deploy – validation is strict and any CRD schema error fails the deployment, whereas Calico is more lenient. I recommend cilium policy validate as a pre-commit hook
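
For the third trap, another way to catch schema errors before they reach the deploy is a server-side dry run. The Python sketch below shows one possible shape for such a hook. It assumes kubectl is on PATH and that the current context points at a cluster with the Cilium CRDs installed; the function names are illustrative.

```python
import subprocess


def dry_run_command(manifest_path: str) -> list[str]:
    """Build a kubectl command that validates a manifest server-side without applying it."""
    return ["kubectl", "apply", "--dry-run=server", "-f", manifest_path]


def validate_manifests(paths: list[str]) -> int:
    """Return 0 if every manifest passes the dry run, 1 otherwise."""
    failed = False
    for path in paths:
        result = subprocess.run(dry_run_command(path), capture_output=True, text=True)
        if result.returncode != 0:
            # Surface the API server's validation message next to the file name.
            print(f"{path}: rejected\n{result.stderr.strip()}")
            failed = True
    return 1 if failed else 0
```

A pre-commit hook would call validate_manifests with the staged policy files and exit with its return value. Because the check needs cluster access, it may fit better in a CI job than on every developer machine.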

Conclusion

Migrating AKS from Calico to Cilium is not trivial, but 2026 is the year it pays off. A 39% reduction in p99 latency (18 ms to 11 ms in our measurements) and a 10% reduction in worker node CPU justify a 6-week migration in any environment with serious traffic. CiliumNetworkPolicy with L7 and identity-aware filtering opens use cases we used to handle with Istio sidecars.

If you are planning a similar migration or a fresh AKS cluster in 2026, check out our cloud architecture services or reach out for a Cilium migration walkthrough.


About the author

Martin Rylko

Senior Cloud Architect & DevOps Engineer

14+ years in IT – from on-premises datacenters and Hyper-V clustering to cloud infrastructure on Microsoft Azure. I specialize in Landing Zones, IaC automation, Kubernetes and security compliance.


Frequently Asked Questions

Why migrate from Calico to Cilium on AKS?
Three main reasons: performance (Cilium with eBPF has 2–3× faster conntrack and lower CPU overhead on network policy enforcement), identity-aware policies (allowing policies based on ServiceAccount identity, not just IP/label), and the future – since Build 2026 Microsoft has made Cilium the default CNI for new AKS clusters, so Calico support will gradually become legacy. In my Creditas tests we cut worker node CPU by 8–12% post-migration.
Can I migrate an existing AKS cluster, or do I have to create a new one?
In-place migration is possible but requires recreating every node pool: after enabling Cilium on the cluster, you create new node pools, drain the old ones, and remove them. For most production clusters it is cleaner to create a new cluster with Cilium and migrate workloads via GitOps. At Creditas we picked the new-cluster approach; the full cutover took 6 weeks of parallel running.
Do existing NetworkPolicy YAML manifests work with Cilium?
Yes, Cilium fully supports the Kubernetes NetworkPolicy API – existing manifests work unchanged. Cilium also adds the CiliumNetworkPolicy CRD with extended capabilities (L7 HTTP filtering, identity-based policies, FQDN matching). Migrating existing policies is therefore straightforward, and you can layer CiliumNetworkPolicy on top where standard NetworkPolicy is not enough.
What is the real performance difference between Cilium and Calico?
In our Creditas production load (8 nodes, ~200 pods, ~30 000 connections/s peak) Cilium showed 39% lower p99 latency on service-to-service calls (18 ms down to 11 ms) and 10% lower CPU usage on worker nodes. Conntrack table efficiency was higher – where Calico started allocating extra memory for connections, Cilium kept it stable. Details and graphs are in our internal performance write-up.
