Dustin Kirkland
on 22 June 2015
If you read my last post, perhaps you followed the embedded instructions and ran hundreds of LXD system containers on your own Ubuntu machine.
Or perhaps you’re already a Docker enthusiast and your super savvy microservice architecture orchestrates dozens of applications among a pile of process containers.
Either way, the massive multiplication of containers everywhere introduces an interesting networking problem:
How do thousands of containers interact with thousands of other containers efficiently over a network? What if every one of those containers could just route to one another?
Canonical is pleased to introduce today an innovative solution that addresses this problem in perhaps the most elegant and efficient manner to date! We call it “The Fan” — an extension of the network tunnel driver in the Linux kernel. The fan was conceived by Mark Shuttleworth and John Meinel, and implemented by Jay Vosburgh and Andy Whitcroft.
A Basic Overview
Each container host has a “fan bridge” that enables all of its containers to deterministically map network traffic to any other container on the fan network. I say “deterministically”, in that there are no distributed databases, no consensus protocols, and no more overhead than IP-IP tunneling. A more detailed technical description can be found here.
Quite simply, a /16 network gets mapped onto an unused /8 network, and container traffic is routed by the host via an IP tunnel.
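To make that mapping concrete, here is a minimal shell sketch of the address arithmetic, assuming the 250.0.0.0/8 overlay and /16 underlay used in the demo in this post. The function name fan_subnet is hypothetical and is not part of fanctl. It only mirrors what the fan bridge addresses in this post suggest: the low 16 bits of the host address become the middle 16 bits of the overlay address, leaving the last octet for containers.

```shell
#!/bin/sh
# Hypothetical helper (not part of fanctl): map a host's underlay address
# (inside a /16) to the /24 slice of the /8 overlay that its containers use.
fan_subnet() (
    overlay_octet=$1   # first octet of the /8 overlay, e.g. 250
    host_ip=$2         # host address on the /16 underlay, e.g. 172.30.0.28
    # Split the dotted quad into $1..$4; the subshell keeps the IFS change local.
    IFS=.
    set -- $host_ip
    # Keep the overlay's first octet, then reuse the host's last two octets.
    echo "${overlay_octet}.$3.$4.0/24"
)

fan_subnet 250 172.30.0.28   # prints 250.0.28.0/24
fan_subnet 250 172.30.0.27   # prints 250.0.27.0/24
```

Because the result depends only on the host's own address, every host can work out every other host's container subnet without asking anyone.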
A Demo
Interested yet? Let’s take it for a test drive in AWS…
First, launch two instances in EC2 (or your favorite cloud) in the same VPC. Ben Howard has created special test images for AWS and GCE, which include a modified Linux kernel, a modified iproute2 package, a new fanctl package, and Docker installed by default. You can find the right AMIs here.
Build and Publish report for trusty 20150621.1228.
-----------------------------------
BUILD INFO:
VERSION=14.04-LTS
STREAM=testing
BUILD_DATE=
BUG_NUMBER=1466602
STREAM="testing"
CLOUD=CustomAWS
SERIAL=20150621.1228
-----------------------------------
PUBLICATION REPORT:
NAME=ubuntu-14.04-LTS-testing-20150621.1228
SUITE=trusty
ARCH=amd64
BUILD=core
REPLICATE=1
IMAGE_FILE=/var/lib/jenkins/jobs/CloudImages-Small-CustomAWS/workspace/ARCH/amd64/trusty-server-cloudimg-CUSTOM-AWS-amd64-disk1.img
VERSION=14.04-LTS-testing-20150621.1228
INSTANCE_BUCKET=ubuntu-images-sandbox
INSTANCE_eu-central-1=ami-1aac9407
INSTANCE_sa-east-1=ami-59a22044
INSTANCE_ap-northeast-1=ami-3ae2453a
INSTANCE_eu-west-1=ami-d76623a0
INSTANCE_us-west-1=ami-238d7a67
INSTANCE_us-west-2=ami-53898c63
INSTANCE_ap-southeast-2=ami-ab95ef91
INSTANCE_ap-southeast-1=ami-98e9edca
INSTANCE_us-east-1=ami-b1a658da
EBS_BUCKET=ubuntu-images-sandbox
VOL_ID=vol-678e2c29
SNAP_ID=snap-efaa288b
EBS_eu-central-1=ami-b4ac94a9
EBS_sa-east-1=ami-e9a220f4
EBS_ap-northeast-1=ami-1aee491a
EBS_eu-west-1=ami-07602570
EBS_us-west-1=ami-318c7b75
EBS_us-west-2=ami-858b8eb5
EBS_ap-southeast-2=ami-558bf16f
EBS_ap-southeast-1=ami-faeaeea8
EBS_us-east-1=ami-afa25cc4
----
6cbd6751-6dae-4da7-acf3-6ace80c01acc
Next, ensure that those two instances can talk to one another. Here, I tested that in both directions, using both ping and nc.
ubuntu@ip-172-30-0-28:~$ ifconfig eth0
eth0 Link encap:Ethernet HWaddr 0a:0a:8f:f8:cc:21
inet addr:172.30.0.28 Bcast:172.30.0.255 Mask:255.255.255.0
inet6 addr: fe80::80a:8fff:fef8:cc21/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:2904565 errors:0 dropped:0 overruns:0 frame:0
TX packets:9919258 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:13999605561 (13.9 GB) TX bytes:14530234506 (14.5 GB)
ubuntu@ip-172-30-0-28:~$ ping -c 3 172.30.0.27
PING 172.30.0.27 (172.30.0.27) 56(84) bytes of data.
64 bytes from 172.30.0.27: icmp_seq=1 ttl=64 time=0.289 ms
64 bytes from 172.30.0.27: icmp_seq=2 ttl=64 time=0.201 ms
64 bytes from 172.30.0.27: icmp_seq=3 ttl=64 time=0.192 ms
--- 172.30.0.27 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.192/0.227/0.289/0.045 ms
ubuntu@ip-172-30-0-28:~$ nc -l 1234
hi mom
─────────────────────────────────────────────────────────────────────
ubuntu@ip-172-30-0-27:~$ ifconfig eth0
eth0 Link encap:Ethernet HWaddr 0a:26:25:9a:77:df
inet addr:172.30.0.27 Bcast:172.30.0.255 Mask:255.255.255.0
inet6 addr: fe80::826:25ff:fe9a:77df/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:11157399 errors:0 dropped:0 overruns:0 frame:0
TX packets:1671239 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:16519319463 (16.5 GB) TX bytes:12019363671 (12.0 GB)
ubuntu@ip-172-30-0-27:~$ ping -c 3 172.30.0.28
PING 172.30.0.28 (172.30.0.28) 56(84) bytes of data.
64 bytes from 172.30.0.28: icmp_seq=1 ttl=64 time=0.245 ms
64 bytes from 172.30.0.28: icmp_seq=2 ttl=64 time=0.185 ms
64 bytes from 172.30.0.28: icmp_seq=3 ttl=64 time=0.186 ms
--- 172.30.0.28 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.185/0.205/0.245/0.030 ms
ubuntu@ip-172-30-0-27:~$ echo "hi mom" | nc 172.30.0.28 1234
If that doesn’t work, you might have to adjust your security group until it does.
Now, import the Ubuntu image in Docker in both instances.
$ sudo docker pull ubuntu
Pulling repository ubuntu
...
e9938c931006: Download complete
9802b3b654ec: Download complete
14975cc0f2bc: Download complete
8d07608668f6: Download complete
Now, let’s create a fan bridge on each of those two instances. We can create it on the command line using the new fanctl command, or we can put it in /etc/network/interfaces.d/eth0.cfg.
We’ll do the latter, so that the configuration is persistent across boots.
$ cat /etc/network/interfaces.d/eth0.cfg
# The primary network interface
auto eth0
iface eth0 inet dhcp
up fanctl up 250.0.0.0/8 eth0/16 dhcp
down fanctl down 250.0.0.0/8 eth0/16
$ sudo ifup --force eth0
Now, let’s look at our ifconfig…
$ ifconfig
docker0 Link encap:Ethernet HWaddr 56:84:7a:fe:97:99
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
eth0 Link encap:Ethernet HWaddr 0a:0a:8f:f8:cc:21
inet addr:172.30.0.28 Bcast:172.30.0.255 Mask:255.255.255.0
inet6 addr: fe80::80a:8fff:fef8:cc21/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:2905229 errors:0 dropped:0 overruns:0 frame:0
TX packets:9919652 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:13999655286 (13.9 GB) TX bytes:14530269365 (14.5 GB)
fan-250-0-28 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:250.0.28.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::8032:4dff:fe3b:a108/64 Scope:Link
UP BROADCAST MULTICAST MTU:1480 Metric:1
RX packets:304246 errors:0 dropped:0 overruns:0 frame:0
TX packets:245532 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:13697461502 (13.6 GB) TX bytes:37375505 (37.3 MB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:1622 errors:0 dropped:0 overruns:0 frame:0
TX packets:1622 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:198717 (198.7 KB) TX bytes:198717 (198.7 KB)
lxcbr0 Link encap:Ethernet HWaddr 3a:6b:3c:9b:80:45
inet addr:10.0.3.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::386b:3cff:fe9b:8045/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:648 (648.0 B)
tunl0 Link encap:IPIP Tunnel HWaddr
UP RUNNING NOARP MTU:1480 Metric:1
RX packets:242799 errors:0 dropped:0 overruns:0 frame:0
TX packets:302666 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:12793620 (12.7 MB) TX bytes:13697374375 (13.6 GB)
Pay special attention to the new fan-250-0-28 device! I’ve only shown this on one of my instances, but you should check both.
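If you want to know which device name to look for on each instance, the name appears to follow the host's address. The sketch below assumes the fan-<overlay octet>-<third host octet>-<fourth host octet> pattern, which is inferred only from the device names shown in this post; the helper name expected_fan_bridge is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: guess the fan bridge name for a host, assuming the
# naming pattern inferred from this post (not a documented fanctl guarantee).
expected_fan_bridge() (
    overlay_octet=$1   # first octet of the /8 overlay, e.g. 250
    host_ip=$2         # host address on the /16 underlay
    # Split the dotted quad into $1..$4 inside a subshell.
    IFS=.
    set -- $host_ip
    echo "fan-${overlay_octet}-$3-$4"
)

expected_fan_bridge 250 172.30.0.28   # prints fan-250-0-28
```

Compare the printed name against the ifconfig output on each instance.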
Now, let’s tell Docker to use that device as its default bridge.
$ fandev=$(ifconfig | grep ^fan- | awk '{print $1}')
$ echo $fandev
fan-250-0-28
$ echo "DOCKER_OPTS='-d -b $fandev --mtu=1480 --iptables=false'" |
sudo tee -a /etc/default/docker.io
Make sure you restart the docker.io service:
$ sudo service docker.io restart
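The grep and awk pipeline used to set fandev assumes there is exactly one fan- device on the host. A slightly more defensive sketch, assuming the iproute2 ip tool is available, filters the one-line-per-interface output of "ip -o link show" instead. The function name list_fan_devices is hypothetical, and the sample lines are illustrative input in that format, not captured output.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -o link show` output on stdin and print the
# name of every interface that starts with "fan-", one per line.
list_fan_devices() {
    awk -F': ' '$2 ~ /^fan-/ { print $2 }'
}

# On a fan-enabled host you would run:  ip -o link show | list_fan_devices
# Demonstration with illustrative sample lines in the same format:
printf '%s\n' \
    '2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP' \
    '5: fan-250-0-28: <BROADCAST,MULTICAST,UP> mtu 1480 qdisc noqueue state UP' |
    list_fan_devices   # prints fan-250-0-28
```

If more than one name comes back, pick the bridge you want explicitly before handing it to Docker's -b option.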
Now we can launch a Docker container in each of our two EC2 instances…
$ sudo docker run -it ubuntu
root@261ae39d90db:/# ifconfig eth0
eth0 Link encap:Ethernet HWaddr e2:f4:fd:f7:b7:f5
inet addr:250.0.28.3 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::e0f4:fdff:fef7:b7f5/64 Scope:Link
UP BROADCAST RUNNING MTU:1480 Metric:1
RX packets:7 errors:0 dropped:2 overruns:0 frame:0
TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:558 (558.0 B) TX bytes:648 (648.0 B)
And here’s a second one, on my other instance…
$ sudo docker run -it ubuntu
root@ddd943163843:/# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 66:fa:41:e7:ad:44
inet addr:250.0.27.3 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::64fa:41ff:fee7:ad44/64 Scope:Link
UP BROADCAST RUNNING MTU:1480 Metric:1
RX packets:12 errors:0 dropped:2 overruns:0 frame:0
TX packets:13 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:936 (936.0 B) TX bytes:1026 (1.0 KB)
Now, let’s send some traffic back and forth! Again, we can use ping and nc.
root@261ae39d90db:/# ping -c 3 250.0.27.3
PING 250.0.27.3 (250.0.27.3) 56(84) bytes of data.
64 bytes from 250.0.27.3: icmp_seq=1 ttl=62 time=0.563 ms
64 bytes from 250.0.27.3: icmp_seq=2 ttl=62 time=0.278 ms
64 bytes from 250.0.27.3: icmp_seq=3 ttl=62 time=0.260 ms
--- 250.0.27.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.260/0.367/0.563/0.138 ms
root@261ae39d90db:/# echo "here come the bits" | nc 250.0.27.3 9876
root@261ae39d90db:/#
─────────────────────────────────────────────────────────────────────
root@ddd943163843:/# ping -c 3 250.0.28.3
PING 250.0.28.3 (250.0.28.3) 56(84) bytes of data.
64 bytes from 250.0.28.3: icmp_seq=1 ttl=62 time=0.434 ms
64 bytes from 250.0.28.3: icmp_seq=2 ttl=62 time=0.258 ms
64 bytes from 250.0.28.3: icmp_seq=3 ttl=62 time=0.269 ms
--- 250.0.28.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.258/0.320/0.434/0.081 ms
root@ddd943163843:/# nc -l 9876
here come the bits
Alright, so now let’s really bake your noodle…
That 250.0.0.0/8 network can actually be any /8 network. It could be a 10.* network or any other /8 that you choose. I’ve chosen to use something in the reserved Class E range, 240.* – 255.* so as not to conflict with any other routable network.
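Whichever /8 you pick, the split between host bits and container bits stays the same. As a rough sketch of that arithmetic (the function name is hypothetical): a /8 overlay leaves 24 bits, the /16 underlay uses 16 of them to identify the host, and the remaining 8 bits give each host 256 overlay addresses, including the bridge's own. That matches the 255.255.255.0 mask on the fan bridge shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: number of overlay addresses available per host for a
# given overlay and underlay prefix length.
fan_addresses_per_host() {
    overlay_prefix=$1    # e.g. 8  for 250.0.0.0/8
    underlay_prefix=$2   # e.g. 16 for eth0/16
    # Bits left after the overlay prefix, minus the bits that identify the host.
    per_host_bits=$(( (32 - overlay_prefix) - (32 - underlay_prefix) ))
    echo $(( 1 << per_host_bits ))
}

fan_addresses_per_host 8 16   # prints 256
```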
Finally, let’s test the performance a bit using iperf and Amazon’s 10 Gbps instances!
So I fired up two c4.8xlarge instances, and configured the fan bridge there.
$ fanctl show
Bridge Overlay Underlay Flags
fan-250-0-28 250.0.0.0/8 172.30.0.28/16 dhcp host-reserve 1
And on the second instance:
$ fanctl show
Bridge Overlay Underlay Flags
fan-250-0-27 250.0.0.0/8 172.30.0.27/16 dhcp host-reserve 1
Would you believe 5.46 Gigabits per second, between two Docker containers, directly addressed over a network? Witness…
Server 1…
root@84364bf2bb8b:/# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 92:73:32:ac:9c:fe
inet addr:250.0.27.2 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::9073:32ff:feac:9cfe/64 Scope:Link
UP BROADCAST RUNNING MTU:1480 Metric:1
RX packets:173770 errors:0 dropped:2 overruns:0 frame:0
TX packets:107628 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6871890397 (6.8 GB) TX bytes:7190603 (7.1 MB)
root@84364bf2bb8b:/# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 250.0.27.2 port 5001 connected with 250.0.28.2 port 35165
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 6.36 GBytes 5.46 Gbits/sec
And Server 2…
root@04fb9317c269:/# ifconfig eth0
eth0 Link encap:Ethernet HWaddr c2:6a:26:13:c5:95
inet addr:250.0.28.2 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::c06a:26ff:fe13:c595/64 Scope:Link
UP BROADCAST RUNNING MTU:1480 Metric:1
RX packets:109230 errors:0 dropped:2 overruns:0 frame:0
TX packets:150164 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:28293821 (28.2 MB) TX bytes:6849336379 (6.8 GB)
root@04fb9317c269:/# iperf -c 250.0.27.2
multicast ttl failed: Invalid argument
------------------------------------------------------------
Client connecting to 250.0.27.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 250.0.28.2 port 35165 connected with 250.0.27.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 6.36 GBytes 5.47 Gbits/sec
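As a sanity check on those figures, the bandwidth can be recomputed from the transfer size. This sketch assumes iperf reports GBytes in units of 2^30 bytes and Gbits in units of 10^9 bits; under that assumption, the server's 6.36 GBytes over 10 seconds works out to the 5.46 Gbits/sec it reported. The function name gbits_per_sec is hypothetical.

```shell
#!/bin/sh
# Recompute throughput in Gbits/sec from a transfer size and an interval.
# Assumes GBytes means 2^30 bytes and Gbits means 10^9 bits.
gbits_per_sec() {
    awk -v gbytes="$1" -v seconds="$2" \
        'BEGIN { printf "%.2f\n", gbytes * 1073741824 * 8 / seconds / 1e9 }'
}

gbits_per_sec 6.36 10   # prints 5.46
```

The client's 5.47 differs only in the last digit, presumably because the displayed transfer size and interval are themselves rounded.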
Multiple containers, on separate hosts, directly addressable to one another with nothing more than a single network device on each host. Deterministic routes. Blazing fast speeds. No distributed databases. No consensus protocols. Not an SDN. This is just amazing!
Request for Comments
Give it a try and let us know what you think! We’d love to get your feedback and use cases as we work the kernel and userspace changes upstream.
Over the next few weeks, you’ll see the fan patches landing in Wily, and backported to Trusty and Vivid. We are also drafting an RFC, as we think that other operating systems and the container world and the Internet at large would benefit from Fan Networking.
I’m already a fan!