Deep Network GmbH Developers' Blog

Understanding Networking Options in Azure AKS-Engine [Part 1]

AKS Engine provides convenient tooling to quickly bootstrap Kubernetes clusters on Azure. By leveraging ARM (Azure Resource Manager), AKS Engine helps you create, destroy and maintain clusters provisioned with basic IaaS resources in Azure. AKS Engine is also the library that AKS itself uses to perform these operations for its managed service.

In this document, we are going to use AKS Engine to deploy a brand new cluster with two different networking options (kubenet and Azure CNI) into a pre-created virtual network.

Pre-requisites

Infrastructure

The Virtual Network

We will deploy a virtual network that contains two subnets:

  • 10.10.0.0/24
  • 10.20.0.0/24

The first one will be used for the master nodes and the second one for the agent nodes.

The Azure Resource Manager template used to deploy this virtual network is:

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {  },
  "variables": {  },
  "resources": [
    {
      "apiVersion": "2017-06-01",
      "location": "[resourceGroup().location]",
      "name": "aks-vnet",
      "properties": {
        "addressSpace": {
          "addressPrefixes": [
            "10.10.0.0/24",
            "10.20.0.0/24"
          ]
        },
        "subnets": [
          {
            "name": "master-subnet",
            "properties": {
              "addressPrefix": "10.10.0.0/24"
            }
          },
          {
            "name": "agent-subnet",
            "properties": {
              "addressPrefix": "10.20.0.0/24"
            }
          }
        ]
      },
      "type": "Microsoft.Network/virtualNetworks"
    }
  ]
}

If you want to try different subnet IP ranges, you can change the address prefixes in the aks-vnet.json file.
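If you would like to deploy this template by hand (the build script introduced below does it for you), the following Azure CLI commands should work; the location is an assumption, and dev-aks-rg is the resource group name used later in this walkthrough:

$ az group create --name dev-aks-rg --location westeurope
$ az group deployment create --resource-group dev-aks-rg --template-file aks-vnet.json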

API Model

The API model file provides the configuration that aks-engine uses to create a cluster. We’ll use the aks.json API model file.

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.10",
      "kubernetesConfig": {
        "networkPlugin": "azure",
        "networkPolicy": "azure",
	      "apiServerConfig": {
          "--enable-admission-plugins": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,AlwaysPullImages"
	      }
      }
    },
    "masterProfile": {
      "count": 1,
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 2,
        "vmSize": "Standard_D2_v2"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureadmin",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId":"",
      "secret": ""
    }
  }
}
  • keyData: must contain the public portion of the SSH key we generated - this will be associated with the adminUsername value found in the same section of the cluster definition (e.g. ‘ssh-rsa AAAAB3NzaC1yc2EAAAADAQABA….’)
  • clientId: this is the service principal’s appId
  • secret: this is the service principal’s password

You only need to fill in the keyData, clientId and secret fields in aks.json to be able to run the build script.
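If you don’t have these values yet, they can be generated roughly as follows; the key path matches the ake_rsa key we copy into a pod later in this document, and the subscription ID is a placeholder:

$ ssh-keygen -t rsa -b 2048 -f ~/.ssh/ake_rsa
$ az ad sp create-for-rbac --role Contributor --scopes /subscriptions/<subscription-id>

The appId and password fields printed by the second command map to clientId and secret, and the contents of ~/.ssh/ake_rsa.pub go into keyData.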

Build Script

To build and deploy the Kubernetes cluster on a custom Azure VNET, we’ll use the build.sh script (a rough sketch of it follows the list below). The script will:

  • create the resource group on Azure,
  • deploy a custom Azure VNET from aks-vnet.json,
  • generate ARM templates,
  • deploy kubernetes cluster on Azure,
  • merge newly created kube config to the older,
  • and, finally, print the cluster-info.
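For illustration only, here is a minimal sketch of what such a script might look like; the location, the DNS-prefix output path and the resource group naming below are assumptions, not the actual script:

#!/bin/bash
# Hypothetical sketch of build.sh; the real script is in the repository.
# Usage: sh build.sh <environment> <network-plugin>, e.g. sh build.sh dev azure
ENV=$1
PLUGIN=$2
RG="${ENV}-aks-rg"

# Create the resource group and deploy the custom VNET
az group create --name "$RG" --location westeurope
az group deployment create --resource-group "$RG" --template-file aks-vnet.json

# Generate ARM templates from the API model (the real script also injects
# $PLUGIN into the networkPlugin field of aks.json) and deploy the cluster
aks-engine generate aks.json
az group deployment create --resource-group "$RG" \
  --template-file "_output/<dns-prefix>/azuredeploy.json" \
  --parameters "@_output/<dns-prefix>/azuredeploy.parameters.json"

# Make the new kubeconfig visible alongside the existing one and print the cluster info
export KUBECONFIG=~/.kube/config:"$PWD/_output/<dns-prefix>/kubeconfig/kubeconfig.westeurope.json"
kubectl cluster-info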

NOTE: You can find the build script and the other files needed to deploy a custom Kubernetes cluster here. To follow along, download all of the content to your local machine and change into the aksengine-advanced-networking directory in your terminal.

The build.sh script takes two parameters: the environment (e.g. dev) and a networking plugin supported by aks-engine.

There are 5 different network plugin options for aks-engine:

  • Azure Container Networking (default)
  • Kubenet
  • Flannel
  • Cilium
  • Antrea
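The plugin is selected through the networkPlugin field in the API model’s kubernetesConfig. For a kubenet cluster, that part of aks.json would look roughly like this (note that the "networkPolicy": "azure" setting shown earlier applies only to the azure plugin):

"kubernetesConfig": {
  "networkPlugin": "kubenet"
}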

HINT: In this document, we only explain the details of the kubenet and Azure CNI networking options.

A sample usage of the build.sh script is as follows:

$ sh build.sh dev azure

In the following sections, we will create Kubernetes clusters with different networking options in Azure.

Networking Options of AKS-Engine

Azure Container Networking

The default networking plugin of aks-engine is Azure CNI. When the azure plugin is used, the pods get their own private IPs, which are assigned as secondary IPs on the VMs’ NICs, so the pods are directly reachable within the virtual network. (With kubenet, in contrast, pods are not exposed to the virtual network and get no such private IPs.)

To get a k8s cluster with azure CNI networking option, run the following command:

$ sh build.sh dev azure

(When the execution ends, cluster info is printed on the command prompt.)

Now, type the command below to get the IP addresses of the nodes on your cluster:

$ kubectl get nodes -o json | jq '.items[].status.addresses[].address'

(If you don’t already have jq installed on your computer, it can be downloaded from here.)

You should get output similar to the following:

"k8s-agentpool1-37464322-vmss000000"
"10.20.0.4"
"k8s-agentpool1-37464322-vmss000001"
"10.20.0.35"
"k8s-master-37464322-0"
"10.10.0.5"

As you can see, our nodes get their IPs from the Azure VNET. Since the azure networking plugin was used for the deployment, we expect the pods to also get IP addresses from our custom Azure VNET. Let’s check, by creating some pods.

Open the pods/ssh-pod-a-node-0.yaml file in an editor and change the kubernetes.io/hostname field to the name of your first agent node (the node with IP address 10.20.0.4). In my case, it’s k8s-agentpool1-37464322-vmss000000.

You can find all pod definitions here.
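For reference, the pod definition is roughly the following sketch; the pod name, image and node name are taken from the outputs in this walkthrough, the rest is an assumption:

apiVersion: v1
kind: Pod
metadata:
  name: ssh-pod-a-node-0
spec:
  nodeSelector:
    kubernetes.io/hostname: k8s-agentpool1-37464322-vmss000000
  containers:
  - name: ssh-pod-a-node-0
    image: nginx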

Deploy a sample pod on node-0 with the following command:

$ kubectl apply -f pods/ssh-pod-a-node-0.yaml

Check the status of the pod with:

$ kubectl get pods -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP           NODE
ssh-pod-a-node-0   1/1     Running   0          15m   10.20.0.31   k8s-agentpool1-37464322-vmss000000

We see that the IP of the pod is 10.20.0.31 and it’s from our custom Azure VNET.
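You can also verify this from the Azure side: the pod IPs appear as secondary IP configurations on the scale set NICs. A sketch, assuming the VMSS is named after the agent nodes listed above:

$ az vmss nic list --resource-group dev-aks-rg --vmss-name k8s-agentpool1-37464322-vmss \
    | jq '.[].ipConfigurations[].privateIpAddress'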

Let’s check the connectivity between pods. We currently have a pod on node-0; deploy a new pod on node-1 by executing the following command:

$ kubectl apply -f pods/ssh-pod-c-node-1.yaml

Run kubectl get pods -o wide again to list the running pods:

NAME               READY   STATUS    RESTARTS   AGE   IP           NODE
ssh-pod-a-node-0   1/1     Running   0          20m   10.20.0.31   k8s-agentpool1-37464322-vmss000000
ssh-pod-c-node-1   1/1     Running   0          19s   10.20.0.43   k8s-agentpool1-37464322-vmss000001

The IP of the pod on node-1 is 10.20.0.43. Now, we can connect to the first pod and try to ping the second pod.

$ kubectl exec -it ssh-pod-a-node-0 -- bash

Run the ping command to check the connectivity:

root@ssh-pod-a-node-0:/# ping 10.20.0.43
PING 10.20.0.43 (10.20.0.43) 56(84) bytes of data.
64 bytes from 10.20.0.43: icmp_seq=1 ttl=64 time=1.86 ms
64 bytes from 10.20.0.43: icmp_seq=2 ttl=64 time=0.630 ms
...

We can see that the ICMP (ping) packets successfully travel from one pod to the other. Now, let’s look into what is going on behind the scenes.

To copy your SSH private key into the pod on node-0, run the following command from your host:

$ kubectl cp ~/.ssh/ake_rsa ssh-pod-a-node-0:id_rsa

Connect to pod-a and check that the SSH key was copied correctly. Then, try to connect to agent node-0.

$ kubectl exec -it ssh-pod-a-node-0 -- bash
root@ssh-pod-a-node-0:/# ls
bin   dev  home    lib	  media  opt   root  sbin  sys	usr
boot  etc  id_rsa  lib64  mnt	 proc  run   srv   tmp	var

root@ssh-pod-a-node-0:/# ssh -i id_rsa azureadmin@10.20.0.4

If everything went well, we’re now connected to node-0. You can list the network interfaces of the node with the ifconfig or ip a commands. When you run them, you will see several interfaces whose names begin with azv. These NICs are the host ends of the veth pairs (a veth pair is like a cable with two ends: whatever data goes in one end comes out the other, and vice versa) of the pods we created on node-0.
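For example, to list only those host-side veth endpoints:

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ ip -o link show | grep azv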

So, we need to find the veth pair of pod-a and listen on that interface to see whether the ping packets arrive there. To do that, first list the running Docker containers on the node.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ docker ps
CONTAINER ID        IMAGE                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
4e4237e6912e        nginx                                                  "nginx -g 'daemon of…"   2 days ago          Up 2 days                               k8s_ssh-pod-a-node-0_ssh-pod-a-node-0_default_badd4f8e-369f-11e9-aeec-000d3a22d65a_0
c144e2deae85        k8s.gcr.io/pause-amd64:3.1                             "/pause"                 2 days ago          Up 2 days                               k8s_POD_ssh-pod-a-node-0_default_badd4f8e-369f-11e9-aeec-000d3a22d65a_0

Copy the name of the pause container (the container that holds the network namespace for the pod) and change the following command accordingly.

$ docker inspect k8s_POD_ssh-pod-a-node-0_default_badd4f8e-369f-11e9-aeec-000d3a22d65a_0 | jq '.[].NetworkSettings.SandboxKey'
"/var/run/docker/netns/55568bb81358"

Run the command below after replacing the namespace path with the one you got in the previous step.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo nsenter --net=/var/run/docker/netns/55568bb81358 ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
16: eth0@if17: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:14:84:b5:9b:3c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.20.0.31/24 scope global eth0
       valid_lft forever preferred_lft forever

The interface name eth0@if17 means that the eth0 interface of pod-a is paired with interface index 17 on node-0. So, let’s look at interface 17 on the node.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ ip a | grep 17
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
17: azv32fe06c7c68@if16: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master azure0 state UP group default qlen 1000

We have learned that the node-0 end of the veth pair is azv32fe06c7c68. Now, let’s send ping packets from pod-a and capture them on interface azv32fe06c7c68.

On a separate tab, connect to pod-a and run the following:

root@ssh-pod-a-node-0:/# ping  10.20.0.43
PING 10.20.0.43 (10.20.0.43) 56(84) bytes of data.
64 bytes from 10.20.0.43: icmp_seq=1 ttl=64 time=3.38 ms
64 bytes from 10.20.0.43: icmp_seq=2 ttl=64 time=0.707 ms
...

And, on node-0, execute the tcpdump command to see the ICMP packets.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo tcpdump -i azv32fe06c7c68 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on azv32fe06c7c68, link-type EN10MB (Ethernet), capture size 262144 bytes
09:45:55.141131 IP 10.20.0.31 > 10.20.0.43: ICMP echo request, id 466, seq 42, length 64
09:45:55.141754 IP 10.20.0.43 > 10.20.0.31: ICMP echo reply, id 466, seq 42, length 64
...

All veth interfaces are bridged to the azure0 NIC. You can verify this with the brctl show command. (You can install it with sudo apt install bridge-utils.)

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ brctl show
bridge name	bridge id		STP enabled	interfaces
azure0		8000.000d3a21af12	no		azv04420ae93c6
							azv2c81fb847bd
							azv32fe06c7c68
							azv60b26fe7924
							azv7c82e9d3c90
							eth0
docker0		8000.0242fcd1dfa9	no
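If you prefer not to install bridge-utils, iproute2 can list the bridge members as well:

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ ip link show master azure0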

Since the ICMP packets coming from pod-a go to the azure0 interface, let’s capture the packets there as well.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo tcpdump -i azure0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on azure0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:00:48.366965 IP 10.20.0.31 > 10.20.0.43: ICMP echo request, id 466, seq 924, length 64
10:00:48.368352 IP 10.20.0.43 > 10.20.0.31: ICMP echo reply, id 466, seq 924, length 64

Finally, let’s listen on the eth0 interface of node-0 and see the ping packets.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo tcpdump -i eth0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:05:28.421475 IP 10.20.0.31 > 10.20.0.43: ICMP echo request, id 472, seq 1, length 64
10:05:28.422588 IP 10.20.0.43 > 10.20.0.31: ICMP echo reply, id 472, seq 1, length 64

That’s it. Since all the nodes and pods are directly connected to the Azure VNET, no packet translation is needed and all packets are sent as-is. You can see the connected NICs of aks-vnet on the Azure Portal.

Kubenet

Kubenet, the default Kubernetes networking provider, is a simple network plugin that works with various cloud providers. Kubenet is a very basic provider, and while basic is good, it does not offer many features. Moreover, kubenet has several limitations. For instance, when running kubenet in the AWS cloud, you are limited to 50 EC2 instances: route tables are used to configure network traffic between Kubernetes nodes, and they are limited to 50 entries per VPC.

To learn about the maximum number of routes you can add to a route table and the maximum number of user-defined route tables you can create per Azure subscription, see Azure limits.

Let’s create a Kubernetes cluster with the kubenet networking plugin by running the command below.

$ sh build.sh dev kubenet

After the execution finishes, the cluster info of our Kubernetes cluster is printed to the screen. List the nodes with the command:

$ kubectl get nodes -o wide
NAME                                 STATUS   ROLES    AGE   VERSION    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-agentpool1-37464322-vmss000000   Ready    agent    16m   v1.10.12   <none>        Ubuntu 16.04.5 LTS   4.15.0-1036-azure   docker://3.0.1
k8s-agentpool1-37464322-vmss000001   Ready    agent    16m   v1.10.12   <none>        Ubuntu 16.04.5 LTS   4.15.0-1036-azure   docker://3.0.1
k8s-master-37464322-0                Ready    master   16m   v1.10.12   <none>        Ubuntu 16.04.5 LTS   4.15.0-1036-azure   docker://3.0.1

Let’s continue by creating two separate pods, one on each agent node. First, we need to change the node selector field of the YAML files. Open the pods/ssh-pod-a-node-0.yaml file in an editor and change the kubernetes.io/hostname field to the name of your first agent node (the node with IP address 10.20.0.4). In my case, it’s k8s-agentpool1-37464322-vmss000000.

Now, deploy pod-a on node-0 and pod-c on node-1.

$ kubectl apply -f pods/ssh-pod-a-node-0.yaml && kubectl apply -f pods/ssh-pod-c-node-1.yaml
pod/ssh-pod-a-node-0 created
pod/ssh-pod-c-node-1 created

$ kubectl get pods -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP           NODE
ssh-pod-a-node-0   1/1     Running   0          47s   10.244.1.6   k8s-agentpool1-37464322-vmss000000
ssh-pod-c-node-1   1/1     Running   0          41s   10.244.0.7   k8s-agentpool1-37464322-vmss000001

Notice that the IPs of the pods are not from the Azure VNET; they are from 10.244.0.0/16, which is the default for the kubenet plugin. If you open the dev-aks-rg resource group in the Azure Portal, you will see a resource type that does not appear with Azure CNI networking: the Route Table. For Kubernetes clusters with kubenet networking, we need to update the Azure VNET to attach it to the route table. This is a known bug and is actually documented.

Fortunately, the build.sh script has already associated the route table with the agent-subnet. You can verify this by opening the route table in the Azure Portal; you should see the agent-subnet in the Subnets section of the route table.
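For reference, the association can be made with a command along these lines; the route table name is generated by aks-engine, so treat it as a placeholder:

$ az network vnet subnet update --resource-group dev-aks-rg \
    --vnet-name aks-vnet --name agent-subnet \
    --route-table <aks-engine-generated-route-table>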

Now, run the command below to send ICMP packets from pod-a to pod-c.

$ kubectl exec -it ssh-pod-a-node-0 -- ping 10.244.0.7
PING 10.244.0.7 (10.244.0.7) 56(84) bytes of data.
64 bytes from 10.244.0.7: icmp_seq=1 ttl=62 time=1.43 ms
64 bytes from 10.244.0.7: icmp_seq=2 ttl=62 time=0.791 ms
64 bytes from 10.244.0.7: icmp_seq=3 ttl=62 time=0.733 ms
...

Let’s connect to node-0 and capture packets on the NICs of the node.

$ kubectl cp ~/.ssh/ake_rsa ssh-pod-a-node-0:id_rsa
$ kubectl exec -it ssh-pod-a-node-0 -- bash
root@ssh-pod-a-node-0:/# ssh -i id_rsa azureadmin@10.20.0.4

If you list the network interfaces of node-0, you should see a bunch of veth interfaces along with the others. We can find the veth pair of the eth0 interface of pod-a in the same way we did in the Azure CNI section.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ docker ps
CONTAINER ID        IMAGE                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5e7a48217e87        nginx                                                  "nginx -g 'daemon of…"   About an hour ago   Up About an hour                        k8s_ssh-pod-a-node-0_ssh-pod-a-node-0_default_6a292568-38f1-11e9-b0fa-000d3a255d62_0
230a64c000be        k8s.gcr.io/pause-amd64:3.1                             "/pause"                 About an hour ago   Up About an hour                        k8s_POD_ssh-pod-a-node-0_default_6a292568-38f1-11e9-b0fa-000d3a255d62_0
...
azureadmin@k8s-agentpool1-37464322-vmss000000:~$ docker inspect k8s_POD_ssh-pod-a-node-0_default_6a292568-38f1-11e9-b0fa-000d3a255d62_0  | jq '.[].NetworkSettings.SandboxKey'
"/var/run/docker/netns/ccfde56348fc"
azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo nsenter --net=/var/run/docker/netns/ccfde56348fc ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether b6:ed:2a:70:e5:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.1.6/24 scope global eth0
       valid_lft forever preferred_lft forever

From the interface name eth0@if10, we understand that the eth0 NIC of pod-a is paired with interface index 10 on node-0. List the NICs of node-0 to find the name of interface 10.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ ip a | grep 10
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 10.20.0.4/24 brd 10.20.0.255 scope global eth0
3: enP1p0s2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
5: cbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
    inet 10.244.1.1/24 scope global cbr0
10: veth923df63b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cbr0 state UP group default

Now, we can capture the packets on the veth923df63b interface. When you run tcpdump, you should get an output similar to the following:

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo tcpdump -i veth923df63b icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth923df63b, link-type EN10MB (Ethernet), capture size 262144 bytes
12:32:18.961826 IP 10.244.1.6 > 10.244.0.7: ICMP echo request, id 463, seq 1, length 64
12:32:18.962876 IP 10.244.0.7 > 10.244.1.6: ICMP echo reply, id 463, seq 1, length 64
...

This is good: it means that we’ve found the right veth pair of pod-a. To find the bridge that the veth NIC is attached to, we can run the brctl show command.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
cbr0            8000.a61b61df1786       no              veth2bbca7d4
                                                        veth3c3e12f3
                                                        veth7264149b
                                                        veth923df63b
                                                        vethc59025dd
docker0         8000.0242e1d44187       no

We can capture the ICMP packets on the cbr0 and eth0 interfaces, respectively.

azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo tcpdump -i cbr0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:40:24.871135 IP 10.244.1.6 > 10.244.0.7: ICMP echo request, id 468, seq 1, length 64
12:40:24.872458 IP 10.244.0.7 > 10.244.1.6: ICMP echo reply, id 468, seq 1, length 64
...
azureadmin@k8s-agentpool1-37464322-vmss000000:~$ sudo tcpdump -i eth0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:40:31.973569 IP 10.244.1.6 > 10.244.0.7: ICMP echo request, id 468, seq 8, length 64
12:40:31.974275 IP 10.244.0.7 > 10.244.1.6: ICMP echo reply, id 468, seq 8, length 64

This means that the ICMP (ping) packets leave the node without any translation. So, how do the packets find the right destination? The answer is the Azure route table. If you open the route table in the Azure Portal, you can see the rule that routes packets destined for 10.244.0.0/24 to node 10.20.0.5 (node-1, which hosts pod-c, in this deployment).
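You can inspect the same routes from the CLI as well; again, the route table name is a placeholder:

$ az network route-table route list --resource-group dev-aks-rg \
    --route-table-name <aks-engine-generated-route-table> -o table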

References

  • AKS-Engine Quickstart Guide [1]
  • Azure Networking Limits [2]
  • Cluster Definitions [3]
  • AKS Engine the Long Way [4]
  • Attaching Cluster Route Table to VNET [5]

Author

Senior Software Developer