Send ICMP Echo Replies using eBPF

For my master’s thesis I am working with eBPF, the extended Berkeley Packet Filter. By now it is used by several subsystems in the Linux kernel, ranging from tracing and seccomp rules to network filtering.

As I am using it for network filtering, I wanted a small, useful and working example of how to parse and resend packets with it. Luckily, the hard part of attaching it early in the packet processing pipeline is already handled by tc, Linux’s traffic control utility from the iproute2 project.

However, it took me a while to get an ICMP ping-pong example working reliably. Now that I have, I published it to save others the trouble.
The result is online in the ebpf-icmp-ping repository. The rest of this blog post explains some of the steps in bpf.c and how it is used.

A subset of C can be compiled to eBPF bytecode, and luckily the Clang compiler has an eBPF backend to make it all work.

The usable subset is a lot more restricted than plain C and requires a bit more boilerplate to help the compiler (and the kernel verifier) produce safe programs. All memory accesses need to be bounds-checked up front. Assigning from one part of the passed buffer to another might fail (I’m not 100% sure yet whether that’s due to restrictions of eBPF or the code generation). And you can’t have loops, but luckily Clang/LLVM is quite good at unrolling loops with a fixed iteration count.

Let’s dive in.

First we define our function and put it in a specific section of the generated ELF file. tc will know how to pull it out. Our function gets a single pointer to a kernel-allocated buffer of the network packet.

SEC("action")
int pingpong(struct __sk_buff *skb)

Accessing data in this buffer can be done in different ways: either read out bytes at specified offsets or rely on the struct definitions of the kernel. We do the latter, but first we need to check that there is enough data. If not, we don’t do anything.

void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;

if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct icmphdr) > data_end)
    return TC_ACT_UNSPEC;

Once that is done, the verifier lets us use pointers to the right parts of the buffer:

struct ethhdr  *eth  = data;
struct iphdr   *ip   = (data + sizeof(struct ethhdr));
struct icmphdr *icmp = (data + sizeof(struct ethhdr) + sizeof(struct iphdr));

We do some checks to ensure we have a packet we can handle and then parse out the addresses. MAC addresses are 48 bits long, so it is best to copy them out.

__u8 src_mac[ETH_ALEN];
__u8 dst_mac[ETH_ALEN];
bpf_memcpy(src_mac, eth->h_source, ETH_ALEN);
bpf_memcpy(dst_mac, eth->h_dest, ETH_ALEN);

The IP addresses can be accessed more directly.

__u32 src_ip = ip->saddr;
__u32 dst_ip = ip->daddr;

We can then swap the MAC addresses by storing the other address at the right place.

bpf_skb_store_bytes(skb, offsetof(struct ethhdr, h_source), dst_mac, ETH_ALEN, 0);
bpf_skb_store_bytes(skb, offsetof(struct ethhdr, h_dest), src_mac, ETH_ALEN, 0);

Same goes for the IPs:

bpf_skb_store_bytes(skb, IP_SRC_OFF, &dst_ip, sizeof(dst_ip), 0);
bpf_skb_store_bytes(skb, IP_DST_OFF, &src_ip, sizeof(src_ip), 0);

The IP header is checksummed, but the checksum is a ones’ complement sum over 16-bit words, so merely swapping the two 32-bit addresses does not change it and there is no need to recalculate it. We can then modify the ICMP type, but here we do need to calculate the new checksum. The Linux kernel provides helper functions for eBPF to do this.

First recalculate the checksum:

__u8 new_type = 0;
bpf_l4_csum_replace(skb, ICMP_CSUM_OFF, ICMP_PING, new_type, ICMP_CSUM_SIZE);

Then insert the actual data (the order is not relevant here).

bpf_skb_store_bytes(skb, ICMP_TYPE_OFF, &new_type, sizeof(new_type), 0);

Last but not least we need to redirect the packet back out of the same network interface it came in on. This is done using another helper function:

bpf_clone_redirect(skb, skb->ifindex, 0);

The last argument specifies the direction: 0 is tx and thus outgoing, 1 is rx and thus incoming. Finally we set a return code to inform the kernel that the packet should not be processed any further.

The full code is in bpf.c.

To use this code we first need a qdisc to attach this program to as an action.

tc qdisc add dev eth0 ingress handle ffff:

Then we can attach the classifier (which does nothing) and our action (the ICMP pong) to the created ingress qdisc:

tc filter add dev eth0 parent ffff: bpf obj bpf.o sec classifier flowid ffff:1 \
  action bpf obj bpf.o sec action ok

If all worked correctly, tc can show some info:

$ tc filter show dev eth0 ingress
filter parent ffff: protocol all pref 49152 bpf
filter parent ffff: protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[classifier]
        action order 1: bpf bpf.o:[action] default-action pass
        index 30 ref 1 bind 1

If you enabled the debug print, the output can be viewed as well:

$ tc exec bpf dbg
Running! Hang up with ^C!

  <idle>-0  [000] ..s. 81710.218035: : [action] IP Packet, proto= 1, src= 20490432, dst= 1714989248

And that’s it. ICMP Echo Requests are now handled inside the kernel using eBPF and never travel through the rest of the network stack.

2016 in many words and some photos

Last year I summarized my year in a long blog post, and I did the same the year before as well. So here comes the 2016 edition.

My year in numbers

I was at 3 different conferences in 2 different cities across 2 different countries:

  1. View Source in Berlin, Germany (September), where I gave a Rust workshop
  2. RustFest in Berlin, Germany (September), which I co-organized
  3. Rust Belt Rust in Pittsburgh, USA (October), where I gave a workshop about Rust & Emscripten

In total I gave 5 different talks, including two at the Rust Cologne meetup, one at the Amsterdam Rust meetup and one at the Rust meetup in Stockholm. New talks are already lining up, with two confirmed for the first half of next year. Since March I have been a regular attendee of the Rust Cologne meetup, and in June I became part of the organizer team as well. We managed to have one meetup per month since March and also had the Novemb.rs Code Sprint Weekend in November.

GitHub says I made 2185 countable contributions (including some private repositories) across dozens of repositories. My number of published crates went up to 20. Yet I didn’t manage to really work on semantic-rs at all.

Last year I said I planned to release stable 1.0 versions of all hiredis derivatives, but I pretty much failed on all of them. I’m not giving up though; maybe I can find the time & motivation soon to get them out. After that they will probably stay maintained, but no longer be developed further by me.

I posted 191 photos on Instagram and my ~/photos/2016 directory now contains about 2000 photos. Because of my own stupidity and no proper backup I lost quite a few photos from my phone. :(

I wrote more than 5000 tweets, more than 3700 of them in reply to someone. That’s still 3000 tweets fewer than last year. Wow. With 11 new blog posts, this blog is in the range of what I did last year as well. Hope to keep this up!

My year in photos and words

The year started off with a trip to Canada. With temperatures as low as -25°C, lots of snow, pancakes and beer I had a wonderful time. I definitely need to go back once more in the summer.

Canada Forest

Freezing cold

Sun over frozen Canada

In May I made a short trip to Stockholm. Equipped with the absolute best weather I enjoyed a few more days away from university and work and could relax. In addition a canceled flight gave me another night in this city.

Beautiful Stockholm

After I got back to Aachen I bought my first longboard. I enjoyed multiple long trips along the Vennbahn. See more impressions in Longboarding: Vennbahn.

Longboarding on Vennbahn

In August I went back to Stockholm a second time, this time with friends for a 4-day kayaking trip through the Stockholm Archipelago. And again we had the best weather for such a trip. I could even visit the Stockholm Rust meetup and give a talk a few days later before heading back home.

Sunset in the archipelago of Stockholm

In August I took part in the Mozilla Tech Speakers program and thus became an official Tech Speaker.

Starting in October I was back at university full-time (more or less), in search of a master’s thesis to finish with. Sadly, one of the more promising opportunities fell through, so it took until late November to find an interesting topic and a supervisor to work with. Just before Christmas I officially started this thesis, which also started the 6-month countdown for finishing my work. This means that, if nothing else goes horribly wrong, I will finish my master’s degree in June.

The year then ended with my yearly trip to Hamburg, though this time I opted for not visiting the yearly Chaos Communication Congress (in full), but stay with friends, enjoy the city and wait for the end of the year.

Thanks

Again, lots of the things I did were only possible due to help and encouragement from other people.
A big thanks to the team behind RustFest: Flaki, Emma, Ben, Johanna, Florian, Katharina & Andrew. It was a pleasure to organize the conference with you.
Thanks to Pascal, Flo & Colin for all the organisation work of Rust Cologne.
Thanks to my employer rrbone and my boss Dominik once more.
Thanks to the people that traveled with me or let me stay at their place.
And thanks to the friends & family near and far away.

The Future

2017 will bring a lot of changes. As mentioned above I started my thesis and will finish in June. I already have some plans for the time after that, including bigger travel plans, but I don’t know where I will end up after that. 2017 will probably be the year I leave Aachen after more than 5 years living there.

<3

Xen - split driver, initial communication

In the previous post I explained how to initially set up a split driver for Xen with the backend in dom0 and the frontend in a domU.

This time we are taking a look at the internal states each side goes through. Most of this code is a trimmed down version of the Xen network driver.

The full code can be found in chapter 2 of the example repository.

Background

The frontend part of the driver sits in an unprivileged domU and gets its input from the kernel in this virtual machine. Depending on its use case it then passes commands and data over to the backend part, sitting in a privileged domain such as dom0. For example, in the case of the network driver, the domU kernel generates network packets which are passed over to the backend, which is then responsible for transferring this data to the actual network card.

Before all of this can happen both parts need to be able to communicate with each other. Each part probably has to set up a few things before it can do its job. Some of these things must be advanced in lock-step, so each part advances to its next state and then waits for its counterpart to advance as well.

The state machine

Internally this is all done through a state machine. Both sides start in the XenbusStateInitialising state. The goal is to reach XenbusStateConnected once fully set up.

If no setup is required at all, it is as easy as saying so:

xenbus_switch_state(dev, XenbusStateConnected);

Of course a driver rarely has to do nothing at all. Instead in each intermediate state some work can be done. This results in a fairly large state machine on both ends, but most of it is just boilerplate.

This results in about 30 lines extra in the frontend and 100 lines in the backend. In this blog post I will focus only on a few relevant lines.

Both sides gain another callback function, to be notified when the other side changes its state. This way they can advance in lock-step.

static void mydevicefront_otherend_changed(struct xenbus_device *dev,
			    enum xenbus_state backend_state)
{
	/* React to the new state of the backend here. */
}

static struct xenbus_driver mydevicefront_driver = {
	.ids  = mydevicefront_ids,
	.probe = mydevicefront_probe,
	.otherend_changed = mydevicefront_otherend_changed,
};

(The backend has a similar one.)

The passed state tells us the new state of the other side. In the frontend we can simply wait for the different state switches and switch over the frontend as well, eventually reaching XenbusStateConnected:

switch (backend_state)
{
	
	case XenbusStateInitWait:
		if (dev->state != XenbusStateInitialising)
			break;
		if (frontend_connect(dev) != 0)
			break;
		xenbus_switch_state(dev, XenbusStateConnected);

		break;
	
}

The frontend_connect function should then set up everything necessary.

The backend has a similar function:

static void mydeviceback_otherend_changed(struct xenbus_device *dev, enum xenbus_state frontend_state)
{
	switch (frontend_state) {
		
		case XenbusStateConnected:
			set_backend_state(dev, XenbusStateConnected);
			break;
		
	}
}

This defers to yet another function, actually just boilerplate to ensure the right order of state changes:

static void set_backend_state(struct xenbus_device *dev,
			      enum xenbus_state state)
{
	while (dev->state != state) {
		switch (dev->state) {
		
		case XenbusStateInitWait:
			switch (state) {
			case XenbusStateConnected:
				backend_connect(dev);
				xenbus_switch_state(dev, XenbusStateConnected);
				break;
			case XenbusStateClosing:
			case XenbusStateClosed:
				xenbus_switch_state(dev, XenbusStateClosing);
				break;
			default:
				BUG();
			}
			break;

		
		}
	}
}

Note: The whole state machine switching was taken from the network driver and may be reduced for other cases.

Again, this calls into another function backend_connect where we can handle the setup.

For every invalid state switch it will trigger the BUG() macro, which crashes the module and in turn the kernel, but at least you know where to start.

Last but not least let’s set the initial state in the backend:

static int mydeviceback_probe(struct xenbus_device *dev,
              const struct xenbus_device_id *id)
{
	xenbus_switch_state(dev, XenbusStateInitialising);
	return 0;
}

With the module code done, we need one last change: The guest domain must be able to write its state back to the XenStore. Thus we set the correct permissions on paths in the XenStore using:

xenstore-chmod $DOM0_KEY r0 b$DOMU_ID
xenstore-chmod $DOMU_KEY r$DOMU_ID b0

The XenStore permissions are a bit unusual. There are 4 different modes per file: read (r), write (w), both (b) or no access (n). However, the owner of a file always has full access (both).

The very first permission always sets the owner and the permissions for any remaining user. Every additional permission overwrites this first specified permission for the given user.

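So the above r0 b$DOMU_ID means: domain 0 owns the path and every other domain may only read it, with the exception of the domU with ID $DOMU_ID, which may both read and write.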

For the second line it is the other way around.

Run it

With everything in place in the code, we can compile the modules and load them into the kernel.

dom0# insmod mydeviceback.ko
domU# insmod mydevicefront.ko
dom0# ./activate.sh 1

In the kernel log output of dom0 you should see something along these lines:

Hello World!
Probe called. We are good to go.
Connecting the backend now.

And in the log output of domU you should see:

Hello World!
Probe called. We are good to go.
Connecting the frontend now.
Other side says it is connected as well.

At this point both sides are in XenbusStateConnected mode and can communicate.

As always, the full code can be found in chapter 2 of the example repository.

Up next

Now that we know each state to go through, we can set up communication through event channels. We take a closer look at this in the next post.

Xen - a backend/frontend driver example

Recently I began working on my master’s thesis. For this I have to get familiar with the Xen hypervisor and its implementation of drivers. As the documentation on its implementation is quite sparse I want to write down some of my findings, so others don’t have to re-read and re-learn everything. In this post I’ll focus on how to get a minimal driver running in a paravirtualized VM. Following posts will then focus on how to do communication through event channels and shared memory. These are all things I need for the project I am working on, so I need to figure out how this works anyway.

Background

The Xen hypervisor is only a minimal hypervisor implementation, which is booted and then boots a special Linux machine, the so-called dom0. This dom0 is most often just a regular Linux distribution such as Ubuntu. Using Xen-specific tools it is then possible to launch additional virtual machines (VMs). These are called domU. In the default case, dom0 is responsible for actually talking to the hardware attached to a machine, such as hard disks and the network card. However, VMs of course also need some way to store data or generate network traffic. In Xen this is handled by virtual devices attached to the domU. Generic drivers then proxy data that should be written to disk or network packets to send out through the dom0 to the actual device.

These drivers follow a split-driver model, where one part of the driver, the backend, resides in the dom0 and the other half, the frontend, is a module in the domU machine. Both parts can be implemented as kernel modules and be loaded dynamically.

What’s not documented as clearly as it should be: Activation of the virtual device and thus invoking the right methods of the kernel module is done by writing data to the XenStore. For actual hardware this is already handled automatically. For your own custom virtual device this can be done manually.

A minimal driver

Our driver won’t do anything useful besides saying “Hello” and showing a message when it is activated. The boilerplate for this example is quite large; the full code can also be found in the xen-split-driver-example repository.

I assume you already have a Xen host, you are connected to the dom0 and have at least one domU running.

The frontend driver resides in mydevicefront.c:

#include <linux/module.h>  /* Needed by all modules */
#include <linux/kernel.h>  /* Needed for KERN_ALERT */

#include <xen/xen.h>       /* We are doing something with Xen */
#include <xen/xenbus.h>

// The function is called on activation of the device
static int mydevicefront_probe(struct xenbus_device *dev,
              const struct xenbus_device_id *id)
{
	printk(KERN_NOTICE "Probe called. We are good to go.\n");
	return 0;
}

// This defines the name of the devices the driver reacts to
static const struct xenbus_device_id mydevicefront_ids[] = {
	{ "mydevice"  },
	{ ""  }
};

// We set up the callback functions
static struct xenbus_driver mydevicefront_driver = {
	.ids  = mydevicefront_ids,
	.probe = mydevicefront_probe,
};

// On loading this kernel module, we register as a frontend driver
static int __init mydevice_init(void)
{
	printk(KERN_NOTICE "Hello World!\n");

	return xenbus_register_frontend(&mydevicefront_driver);
}
module_init(mydevice_init);

// ...and on unload we unregister
static void __exit mydevice_exit(void)
{
	xenbus_unregister_driver(&mydevicefront_driver);
	printk(KERN_ALERT "Goodbye world.\n");
}
module_exit(mydevice_exit);

MODULE_LICENSE("GPL");
MODULE_ALIAS("xen:mydevice");

The backend driver is very similar and resides in mydeviceback.c:

#include <linux/module.h>  /* Needed by all modules */
#include <linux/kernel.h>  /* Needed for KERN_ALERT */

#include <xen/xen.h>       /* We are doing something with Xen */
#include <xen/xenbus.h>

// The function is called on activation of the device
static int mydeviceback_probe(struct xenbus_device *dev,
			const struct xenbus_device_id *id)
{
	printk(KERN_NOTICE "Probe called. We are good to go.\n");
	return 0;
}

// This defines the name of the devices the driver reacts to
static const struct xenbus_device_id mydeviceback_ids[] = {
	{ "mydevice" },
	{ "" }
};

// We set up the callback functions
static struct xenbus_driver mydeviceback_driver = {
	.ids  = mydeviceback_ids,
	.probe = mydeviceback_probe,
};

// On loading this kernel module, we register as a backend driver
static int __init mydeviceback_init(void)
{
	printk(KERN_NOTICE "Hello World!\n");

	return xenbus_register_backend(&mydeviceback_driver);
}
module_init(mydeviceback_init);

// ...and on unload we unregister
static void __exit mydeviceback_exit(void)
{
	xenbus_unregister_driver(&mydeviceback_driver);
	printk(KERN_ALERT "Goodbye world.\n");
}
module_exit(mydeviceback_exit);

MODULE_LICENSE("GPL");
MODULE_ALIAS("xen-backend:mydevice");

To compile each module individually, put them in their own directory and add a Makefile per module:

obj-m += mydevicefront.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Change the first line to obj-m += mydeviceback.o for the backend driver.
You can then compile each module on its respective host and will get mydeviceback.ko and mydevicefront.ko.

Next, you need to load the modules. In the dom0:

insmod mydeviceback.ko

In the domU:

insmod mydevicefront.ko

Check with dmesg that on both sides you get the “Hello World”.

Activating the driver requires adding a virtual device to the XenStore. I wrote a small script, activate.sh, to do that.

#!/bin/bash

DOMU_ID=$1

if [ -z "$DOMU_ID" ]; then
  echo "Usage: $0 [domU ID]"
  echo
  echo "Connects the new device, with dom0 as backend, domU as frontend"
  exit 1
fi

DEVICE=mydevice
DOMU_KEY=/local/domain/$DOMU_ID/device/$DEVICE/0
DOM0_KEY=/local/domain/0/backend/$DEVICE/$DOMU_ID/0

# Tell the domU about the new device and its backend
xenstore-write $DOMU_KEY/backend-id 0
xenstore-write $DOMU_KEY/backend "/local/domain/0/backend/$DEVICE/$DOMU_ID/0"

# Tell the dom0 about the new device and its frontend
xenstore-write $DOM0_KEY/frontend-id $DOMU_ID
xenstore-write $DOM0_KEY/frontend "/local/domain/$DOMU_ID/device/$DEVICE/0"

# Make sure the domU can read the dom0 data
xenstore-chmod $DOM0_KEY r

# Activate the device, dom0 needs to be activated last
xenstore-write $DOMU_KEY/state 1
xenstore-write $DOM0_KEY/state 1

This adds 3 paths per domain, setting up the virtual device and thus activating the driver. Once you have executed it, check dmesg again. You should now see the “Probe called” message.

The full code can be found in the example repository.

novemb.rs Code Sprint Weekend 2016 - Retrospective

This post is a tiny bit late, but better late than never.

So on the 19th and 20th of November, just over a week ago, we had the very first novemb.rs Code Sprint. In 10 locations in Europe and the US as well as online, people gathered to hack on projects, start new ones or just to learn Rust. Bringing people together is one goal of the Rust community, and coding and learning together is a lot of fun as well.

novemb.rs @ C4

We opened doors at Chaos Computer Club Cologne

I was part of the novemb.rs Event in Cologne. On both days about a dozen people showed up, from noon to late in the evening.

In Cologne, we had several different projects and ideas being worked on:

Thanks to our attendees and thanks to my co-organizers Colin, Flo & Pascal. And another big thanks to Mozilla for sponsoring the pizza and the C4 for offering the space.

Things that worked well

Things that we should improve

Rust @ Whiteboard

Coding on the whiteboard (only half of the attendees in the picture)

Try again?

Definitely! Given that the overall organization effort is low I’d like to make this a more regular thing. With the experience from this year we have enough points to improve and we can plan this in advance. Plus, we already have the domain anyway ;)

Until then, there are of course other opportunities to get in contact with fellow Rustaceans. We have a monthly meetup in Cologne, with a Christmas-themed impl Glühwein for RustCologne happening next week. Feel free to join us on the Weihnachtsmarkt or in January at another more regular meetup. If you are not from the Rhein area, check whether there’s a Rust User Group near you.

Rust

Signs showed the way