Thomas Leonard's blog

Optimising the Unikernel

After creating my REST queuing service as a Mirage unikernel, I reported that it could serve the data at 2.46 MB/s from my ARM CubieTruck dev board. That's fast enough for my use (it's faster than my Internet connection), but I was curious why it was slower than the Linux guest, which serves files with nc at 20 MB/s.

(This post also appeared on Hacker News and Reddit.)

The TCP test-case

To avoid confusing things by testing the disk and the network at the same time, I made a simpler test case that waits for a TCP connection and transmits a pre-allocated buffer multiple times:

unikernel.ml
module Main (C: V1_LWT.CONSOLE) (S: V1_LWT.STACKV4) = struct
  let buffer = Io_page.pages 16 |> List.map Io_page.to_cstruct

  let start c s =
    S.listen_tcpv4 s ~port:80 (fun flow ->
      let warmups = 10 in
      let iterations = 100 in
      let bytes = Cstruct.lenv buffer * iterations in
      let rec loop = function
        | 0 -> return ()
        | i ->
            lwt () = S.TCPV4.writev flow buffer in
            loop (i - 1) in

      loop warmups >>= fun () ->
      lwt time = Profile.time (fun () -> loop iterations) in
      S.TCPV4.close flow >>= fun () ->
      C.log_s c (Printf.sprintf "Wrote %d bytes in %.3f seconds (%.2f KB/s)"
        bytes time
        (float_of_int bytes /. time /. 1024.))
    );

    S.listen s
end

Profile.time just runs the function and returns how long it took in seconds. I do a few warm-up iterations at the start because TCP starts slowly and we don't want to benchmark that.
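
A minimal sketch of such a helper (assuming the same Clock.time used in the console test later in this post) would be:

open Lwt

(* Run an Lwt thread and return how long it took, in seconds. *)
let time f =
  let t0 = Clock.time () in
  f () >>= fun () ->
  Lwt.return (Clock.time () -. t0)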

Compiler optimisations

While looking at the assembler output during some earlier debugging, I'd noticed that gcc was generating very poor code. For example:

0041a550 <_xmalloc>:
  41a550:	e92d4800 	push	{fp, lr}
  41a554:	e28db004 	add	fp, sp, #4
  41a558:	e24dd028 	sub	sp, sp, #40	; 0x28
  41a55c:	e50b0028 	str	r0, [fp, #-40]	; 0x28
  41a560:	e50b102c 	str	r1, [fp, #-44]	; 0x2c
  41a564:	e3a03000 	mov	r3, #0
  41a568:	e50b300c 	str	r3, [fp, #-12]
  41a56c:	e3a03010 	mov	r3, #16
  41a570:	e50b3014 	str	r3, [fp, #-20]
  41a574:	e51b002c 	ldr	r0, [fp, #-44]	; 0x2c
  41a578:	e3a01004 	mov	r1, #4
  41a57c:	ebffff5d 	bl	41a2f8 <align_up>
  41a580:	e50b002c 	str	r0, [fp, #-44]	; 0x2c
  41a584:	e51b002c 	ldr	r0, [fp, #-44]	; 0x2c
  ...

gcc is using registers very inefficiently here. For example, it stores r1 to [fp, #-44] and then a few lines later loads from there into r0, when it could just have moved it directly. The last two lines show it saving r0 to the stack and then immediately loading it back again into the same register!

The fix here turned out to be simple: Mini-OS compiles in debug mode by default, with no optimisations. Compiling with debug=n fixes this, and I updated mirage-xen-minios accordingly.

Optimisations    TCP download speed
none             6.92 MB/s
-O3              11.93 MB/s

Even though Mirage is almost all OCaml, it does use Mini-OS's C functions for various low-level operations and these optimisations make a big difference!

Profiling support

The OCaml compiler provides a profiling option, which works the same way as gcc's -pg option for C code. To enable it, you add true: profile to your _tags file and rebuild.

I decided to see what would happen if I enabled this for my Xen unikernel:

_build/main.native.o: In function `caml_program':
:(.text+0x2): undefined reference to `__gnu_mcount_nc'

Profiling works by inserting a call to __gnu_mcount_nc at the start of every function. It looks like this:

00000000 <caml_program>:
       0:       b500            push    {lr}
       2:       f7ff fffe       bl      0 <__gnu_mcount_nc>
       ...

The __gnu_mcount_nc function gets the address of the callee function (caml_program in this example) in the link register (lr/r14) and the address of its caller on the stack (pushed by the code fragment above). Normally, the profiler would use this information to build up a call graph (saying which functions call which other functions). Using a regular timer interrupt to sample the program counter, it can estimate how much time was spent in each function, and using the call graph it can show cumulative totals (time spent in each function plus time spent in its children).

I decided to start with something a bit simpler. I wrote some ARM code for __gnu_mcount_nc that simply writes the caller, callee and current time to a trace buffer (when the buffer is full, it stops tracing). Ideally, I'd like to be notified each time we leave a function too. gcc can do that for C code with its -finstrument-functions option, but I didn't see an option for that in OCaml. Instead, I assume that every function runs until I see a call whose caller is not its parent. This works surprisingly well, though it does mean that if a function seems to take a long time, you need to check its parents too, and it might get confused by recursive calls. Also, for tail calls, we see the parent as the function we will return to rather than the function that actually called us.
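
To make the heuristic concrete, here's a rough sketch of how the call tree can be rebuilt from the (caller, callee, time) entries; this is only an illustration, not the real analysis script:

type entry = { caller : string; callee : string; time : int }
type frame = { name : string; start : int }

let call_tree entries =
  let finished = ref [] in   (* (name, start, duration) triples *)
  let stack = ref [] in
  let close t = match !stack with
    | [] -> ()
    | f :: rest ->
        finished := (f.name, f.start, t - f.start) :: !finished;
        stack := rest
  in
  List.iter (fun e ->
      (* A frame is assumed to have returned when we see a call whose
         caller is not the function currently on top of the stack. *)
      while (match !stack with f :: _ -> f.name <> e.caller | [] -> false) do
        close e.time
      done;
      stack := { name = e.callee; start = e.time } :: !stack)
    entries;
  List.rev !finished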

At the end, I dump out the trace buffer to the console with some OCaml code. Back on my laptop, I wrote some code to parse this output and look up each address in the ELF image to get the function name for each address. (This code isn't public yet as it needs a lot of cleaning up.)
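
For the address-to-name step, something along these lines is enough (a sketch only, assuming the symbol table comes from running nm -n on the unikernel image):

let load_symbols file =
  let ic = open_in file in
  let syms = ref [] in
  (try
     while true do
       let line = input_line ic in
       (* Lines look like "0041a550 T _xmalloc"; skip any without an address. *)
       (try
          Scanf.sscanf line "%x %c %s"
            (fun addr _kind name -> syms := (addr, name) :: !syms)
        with Scanf.Scan_failure _ -> ())
     done
   with End_of_file -> close_in ic);
  Array.of_list (List.rev !syms)

let name_of_addr syms addr =
  (* nm -n sorts by address, so take the last symbol at or below addr. *)
  Array.fold_left (fun acc (a, n) -> if a <= addr then n else acc) "?" syms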

One thing I quickly discovered: compiling just the unikernel with profiling isn't sufficient. As soon as you call a non-profiled function it can no longer construct the call graph and the results are useless. I manually recompiled every C and OCaml library I was using with profiling, which was quite tedious.

Update: Thomas Gazagnaire has added an OPAM profiling switch which should make this much easier in future.

Initially, the trace buffer filled up almost instantly with calls to stub_evtchn_test_and_clear. It seems that we call this once for each of the 4096 channels every time we look for work. To avoid clutter, I reduced the number of event channels to 10 (this had no noticeable effect on performance). I also tried removing the memset which zeroes out newly allocated IO pages. This also made no difference.

I measured the overhead added by the tracing, both when compiled in but inactive and when actively writing to the trace buffer. In both cases it was not too bad.

Profiling the console

The TCP code's trace was quite complicated, so I decided to start by profiling the much simpler console device, which I'd noticed was surprisingly slow at dumping the trace results.

A Xen virtual console is a pair of ring buffers (one for input from the keyboard, one for output to the screen) in a shared memory page, defined like this:

console.h
struct xencons_interface {
    char in[1024];
    char out[2048];
    XENCONS_RING_IDX in_cons, in_prod;
    XENCONS_RING_IDX out_cons, out_prod;
};

We're only interested in the "out" side here. The producer (i.e. our unikernel) writes the data to the buffer and advances the out_prod counter. The consumer (xenconsoled, running in Dom0) reads the data and advances out_cons. If the consumer catches up with the producer it sleeps until the producer notifies it there is more data. If the producer catches up with the consumer (the buffer is full) it sleeps until the consumer notifies it there is space available again.
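
Here's a toy model of the producer side, just to show the index arithmetic (the real code is in Mirage's Ring and Console modules; notify stands in for signalling the console's event channel):

type out_ring = {
  buf : Bytes.t;        (* the 2048-byte "out" buffer *)
  mutable prod : int;   (* free-running producer counter (we update this) *)
  mutable cons : int;   (* free-running consumer counter (xenconsoled updates this) *)
}

let out_size = 2048

let write_some ring notify data off len =
  let free = out_size - (ring.prod - ring.cons) in
  let n = min len free in
  for i = 0 to n - 1 do
    Bytes.set ring.buf ((ring.prod + i) mod out_size) data.[off + i]
  done;
  (* The real code needs a write barrier here, so the consumer never sees
     the new counter before the bytes themselves. *)
  ring.prod <- ring.prod + n;
  notify ();   (* wake xenconsoled if it was sleeping *)
  n            (* bytes accepted; 0 means the ring is full and we must wait *)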

Here's my console test-case - writing a string to the console in a loop:

module Main (C: V1_LWT.CONSOLE) = struct
  let start c =
    let len = 6 in
    let msg = String.make (len - 1) 'X' in
    let iterations = 10000 in
    let bytes = len * iterations in
    let rec loop = function
      | 0 -> return ()
      | i ->
          C.log_s c msg >>= fun () -> loop (i - 1) in

    loop 1000 >>= fun () ->  (* Warm up *)
    let t0 = Clock.time () in
    lwt () = loop iterations in
    let time = Clock.time () -. t0 in
    C.log_s c (Printf.sprintf "Wrote %d bytes in %.3f seconds (%.2f KB/s)"
      bytes time
      (float_of_int bytes /. time /. 1024.)) >>= fun () ->
    OS.Time.sleep 2.0
end

I got around 45 KB/s. The trace output looked like this:

Start Function Duration
118684 - - camlUnikernel__loop_1270 219
118730 - - > camlConsole__write_all_low_1123 158
118737 - - > - camlRing__repeat_1279 89
118739 - - > - - camlRing__write_1270 87
118830 - - > - camlEventchn__fun_1138 58
118831 - - > - - stub_evtchn_notify 57
118907 - - camlUnikernel__loop_1270 211
118944 - - > camlConsole__write_all_low_1123 168
118951 - - > - camlRing__repeat_1279 100
118953 - - > - - camlRing__write_1270 98
119054 - - > - camlEventchn__fun_1138 58
119055 - - > - - stub_evtchn_notify 57

The start time and duration are measured in counter clock ticks, and the counter is running at 24 MHz. The - - > - - prefix indicates the level of nesting (I vary the character to make it easier to scan vertically by eye). The output shows two iterations of the loop, taken from the middle of the sample. To make the output more readable, my analysis script prunes the tree at calls that took less than 50 ticks, and removes calls to the Lwt library (while still showing the functions they called as a result). The durations include the times for their children (including pruned children).

You can see that on each iteration we call Console.write_all_low, which writes the string to the shared memory ring and notifies the console daemon in Dom0. Each iteration is taking roughly 200 ticks, which is about 8 us per iteration. So we'd expect the speed to be around 6 bytes / 8 us, which is about 700 KB/s.
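
As a quick sanity check of that estimate, using the numbers from the trace above:

(* Roughly 200 ticks per iteration at 24 MHz, writing 6 bytes each time. *)
let tick_hz = 24e6
let secs_per_iter = 200. /. tick_hz                    (* about 8.3e-6 s *)
let expected_kb_per_s = 6. /. secs_per_iter /. 1024.   (* about 700 KB/s *)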

Looking at the cumulative time spent in each function, the top entries are:

Function                          Ticks (at 24 MHz)
caml_c_call                       9002738
caml_block_domain                 9001576
block_domain                      9001462
camlUnikernel__loop_1270          374418
camlConsole__write_all_low_1123   298735

Note: the trace only includes calls until the trace buffer was full, so these aren't the total times for the whole run. But we can immediately see that we spent most of the time in block_domain, which is what Mirage calls when it has nothing to do and is waiting for an external event. Here's a graph showing how many iterations of the test loop we had started over time:

So, we wrote 679 messages very quickly, then waited a long time, then wrote 1027 more, then waited again, etc. I thought there might be a bug in block_domain causing it to miss a wake-up event, so I limited the time it would spend blocking. It didn't make any difference; it would keep waking up, seeing that it had nothing to do, and going back to sleep again.

In case the problem was with Mirage's implementation of the shared rings or console device, I tried writing the same test directly in C in Mini-OS's test.c and got the same result (I had to modify it slightly because by default Mini-OS's console_print discards data when the buffer is full instead of waiting). Finally, I tried it from a Linux guest and got 25 KB/s (interestingly, Linux uses 100% CPU while doing this). The times were highly variable (each point on this plot is from writing the message 10,000 times and calculating the average):

After some investigation, it turned out that Xen was deliberately limiting the rate:

xen/tools/console/daemon/io.c
/* How many events are allowed in each time period */
#define RATE_LIMIT_ALLOWANCE 30
/* Duration of each time period in ms */
#define RATE_LIMIT_PERIOD 200

Mystery solved, although I don't know why the rates are so variable. Mirage wasn't doing anything except running the test case and Linux was booted with init=/bin/bash, so there was nothing else running there either.

Lessons:

  • Just scrolling through the raw trace can be misleading. It appears that it keeps calling the loop function, but in fact almost all the time was spent in the two very rare block_domain calls. Graphing iterations over time can show these problems effectively.
  • Compare with the speed on Linux. Sometimes, Xen really is that slow and it's not our fault.
  • Compile everything for profiling or the results aren't much use.

Profiling UDP

TCP involves ack packets, expanding windows and other complications, so I next looked at the simpler UDP protocol. Here, we can throw packets out continuously without worrying about the other end.

With a payload size of 1476 bytes (the maximum possible for UDP), I got 17 MB/s. All packets were successfully received on my laptop. My Linux guest got 13.4 MB/s with nc -u < /dev/zero, so we're actually faster!

Here's a sample iteration from the trace:

Start Function Duration
243418 - - > - camlUnikernel__loop_1287 938
243436 - - > - - camlUdpv4__fun_1430 266
243439 - - > - - > camlIpv4__allocate_frame_1369 190
243440 - - > - - > - camlIo_page__get_1122 72
243632 - - > - - > camlIpv4__fun_1509 69
243818 - - > - - camlNetif__fun_2893 185
243852 - - > - - > camlNetif__fun_2618 103
243895 - - > - - > - camlLwt_ring__fun_1223 58
244007 - - > - - camlNetif__fun_2931 132
244009 - - > - - > camlNetif__xmit_1509 123
244028 - - > - - > - camlNetif__fun_2618 76
244141 - - > - - camlNetif__fun_2958 177
244144 - - > - - > camlRing__push_requests_and_check_notify_1112 174
244150 - - > - - > - camlRing__sring_push_requests_1070 158
244151 - - > - - > - - caml_memory_barrier 77
244231 - - > - - > - - caml_memory_barrier 77

Here's a graph of loop iterations (packets sent) over time (each blue dot is one packet sent):

The gaps indicate places where we were not sending packets. The garbage collector shows up twice in the trace (both times in Ring.ack_responses oddly). However, we spend more time in block_domain than doing GC, indicating that we're often waiting for Xen. Looking at the trace just before it blocks, I see calls to Netif.wait_for_free_tx, which seems reasonable.

Profiling TCP

The TCP header is larger than the UDP one, making it less efficient even in the best case, and TCP needs to process acks, keep track of window sizes, and handle retransmissions. Strange, then, that the Linux guest manages 39 MB/s over TCP compared with just 13.4 MB/s for UDP! (Even stranger is that I got 47.2 MB/s for Linux when I tried it for last month's post; however, I am using a different version of Linux in dom0 now.)

I captured some packets sent by the Linux guest using tshark running in Dom0. Loading the capture into Wireshark on my laptop, I saw that all the TCP checksums were wrong, so it looks like Linux is using TCP checksum offloading.

According to Question about TCP checksum offload in Xen:

A domain has no way of knowing how any given packet is going to leave the host (or even if it is) so it can't know ahead of time whether to calculate any checksums: the skb's [socket buffers] are just marked with "checksum needed" as usual and either the egress NIC will do the job or dom0 will do it.

Getting this working on Mirage was a bit tricky. The TCP layer can avoid adding the checksum only if the network device says it's capable of doing it itself, and packets have to be flagged as needing the checksum. You can't just flag all packets because the Linux dom0 silently drops non-TCP/UDP packets with it set (e.g. ARP packets). I hacked something together and got a modest speed improvement.
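
The decision itself is simple; something like this sketch (the real flag lives in the netfront transmit request, so this only shows the logic):

let offload_checksum = function
  | `TCP | `UDP -> true      (* leave the checksum blank; dom0 or the NIC fills it in *)
  | `ARP | `Other -> false   (* dom0 drops non-TCP/UDP packets that carry the flag *)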

Here's a graph for the TCP test, where each iteration of the loop is sending one TCP packet (segment):

Note: we send many warm-up packets before starting the trace, as TCP starts slowly (which looks pretty but isn't relevant here).

Zooming in, the picture is quite interesting (where it had gaps, I searched for a typical function that occurred in the gap and added a symbol for it):

It looks like we start by transmitting packets steadily, until the current window is full. Then we start buffering the packets instead of sending them, which is very fast. At some point the TCP system stops accepting more data, which causes the main loop to block, allowing us to process other events. rx_poll response indicates one iteration of the Netif.rx_poll loop, which seems to be dealing with acks from Xen saying that our packets have been transmitted (and the memory can therefore be recycled). After a while, the TCP ack packets arrive and we process them, which opens up the transmit window again. Then we send out the buffered packets, before returning to the main loop.

So, in each cycle we spend about 60% of the time transmitting packets, a quarter dealing with acks from Xen and the rest handling TCP acks from the remote host. It might be possible to optimise things a bit here by reusing grant references, but I didn't investigate further.

Profiling disk access

My next test case reads a series of sectors sequentially from the disk and then writes them. Reading or writing one sector (4096 bytes) at a time was very slow (2.7 MB/s read, 0.7 MB/s write). Using larger buffers, so that we transfer more in each operation, helped but even at 64 sectors per op I only got 12.3 MB/s read / 5.12 MB/s write (the device is capable of 20 MB/s read and 10 MB/s write). Here's a trace where we read using 32-sector buffers (10.9 MB/s):

We spend a lot of time waiting for each block to arrive, although there are some curious ack messages, which we deal with quickly. What if we have two requests in flight at once? This gets us 18.27 MB/s:

Strangely, the two blocks arrive close together. Although it takes us longer to get the first one (I don't know why), we get them more quickly after that. Having three requests in flight doesn't help though (18.25 MB/s):

Looking at the block driver code, it batches requests into groups of 11. This probably explains why 32 sectors-per-read did well - it's very close to 33.

For writing, the number of requests in flight makes little difference, but writing 8 sectors in each request is by far the best (7 MB/s).

I don't understand why we're not getting the full speed of the card here, since we're spending most of the time blocking. However, we are pretty close (18r/7w out of a possible 20r/10w), which is good enough for today.
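
For reference, the read pipelining above amounts to something like this (a hedged sketch; read_sectors stands in for the real Block API call and is assumed to issue the request to dom0 as soon as it is invoked):

open Lwt

let read_ahead ~read_sectors ~sectors_per_req ~total ~process =
  let rec loop pending sector =
    match pending, sector >= total with
    | None, true -> return ()
    | Some p, true -> p >>= process    (* drain the last outstanding request *)
    | None, false ->
        (* Nothing in flight yet: issue the first request. *)
        loop (Some (read_sectors sector sectors_per_req)) (sector + sectors_per_req)
    | Some p, false ->
        (* Issue the next request before waiting for the previous one,
           so dom0 always has something queued. *)
        let next = read_sectors sector sectors_per_req in
        p >>= fun data ->
        process data >>= fun () ->
        loop (Some next) (sector + sectors_per_req)
  in
  loop None 0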

Update: Linux is slow too!

I originally tested with hdparm, which reports about 20 MB/s as expected:

$ hdparm -t /dev/mmcblk0
 Timing buffered disk reads:  62 MB in  3.07 seconds =  20.21 MB/sec

But testing with dd, I don't get this speed. dd's speed seems to depend a lot on the block size. Using 4096 * 11 bytes (which I assume is what dom0 would do in response to a single guest request), I get just 16.9 MB/s:

$ dd iflag=direct if=/dev/vg0/bench of=/dev/null bs=45056 count=1000
1000+0 records in
1000+0 records out
45056000 bytes (45 MB) copied, 2.65911 s, 16.9 MB/s

Repeating the dd test with different block sizes, both in dom0 and in a Linux guest:

Block size (pages)   Linux dom0   Linux domU
11                   17.0 MB/s    14.5 MB/s
16                   18.8 MB/s    16.3 MB/s
32                   20.8 MB/s    18.6 MB/s

So perhaps Mirage is doing pretty well already - it's about as fast as the Linux guest. Xen seems to be the limiting factor here, because it doesn't allow us to make large enough requests.

Profiling the queuing service

Finally, I looked at applying all this new information to my queuing service. As a baseline, wget reports that I can currently download from it at 4.6 MB/s, with profiling compiled in but disabled.

There's some complicated copying going on because we're using the HTTP Chunked encoding, which writes the size of each chunk of data followed by the data itself, then the next chunk, etc. Since we know the length at the start, we can use the simpler Fixed encoding. This increases the speed to 5.2 MB/s. It's a shame the HTTP API uses strings everywhere: we have to copy the data from the disk buffer to a string on the heap to give it to the HTTP API, which then copies it back into a new buffer to send it to the network card. If it took a stream of buffers, we could just pass them straight through.
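
For reference, the difference in framing is just this (an illustration of the wire format, not Cohttp's actual API):

(* Chunked: every block is prefixed with its length in hex and followed by
   CRLF (plus a final "0\r\n\r\n" after the last chunk). *)
let write_chunk write block =
  write (Printf.sprintf "%x\r\n" (String.length block));
  write block;
  write "\r\n"

(* Fixed: the total length goes in the Content-Length header, and each
   block can then be written straight through. *)
let write_fixed_body write block =
  write block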

Finally, I added the read-ahead support from the block profiling above, which increased the speed to 6.8 MB/s. Here's the new graph, showing that we're sending packets much faster (note the change in the Y-scale):

I used a queue length of 5, with 33 sectors per request. I tried increasing the queue length to 10, but that caused more GC work.

Conclusions

Even the unoptimised service is faster than my current (ADSL) Internet connection, so optimising it isn't currently necessary, but it's interesting to look at performance and get a feel for where the bottlenecks are.

Mirage doesn't have any specific profiling support, but the fact that the whole OS is a single executable makes profiling it quite easy. OCaml's profile option isn't a perfect fit for tracing because it doesn't record when a function finishes, but you can still get useful results from it. Graphing some metric (e.g. packets sent) over time seemed the most useful way to look at the data. I'm currently just using libreoffice's chart tool, but I should probably find something more suitable. It would be great to be able to zoom in easily, show durations (not just events), filter the trace display easily, etc. I'd also like support for following Lwt threads even when they block. Recommendations for good visualisation tools welcome!

Writing to the Xen console from Mirage is slow because xenconsoled rate limits us. Mirage still gets better performance than Linux though, and uses far less CPU (looks like Linux is just spinning). My UDP test kernel sent data faster than Linux's nc utility (probably because nc made a poor choice of payload size). Linux does very well on TCP. I don't know why it's so fast. Using Xen's TCP checksum offloading does help a bit though. SD card performance on Mirage is close to what the hardware supports when I choose the right request size and keep two requests in flight at once. It's surprising we don't manage the full speed, though. For networking and disk access, managing Xen's grant refs for the shared memory pages seems to take up a lot of time - maybe there are ways to optimise that.

With a few modifications (TCP checksum offload, HTTP fixed encoding, keeping multiple disk reads in flight and using optimal buffer sizes), I increased the download speed of my test service running on my ARM dev board from 2.46 MB/s to 7.24 MB/s (when compiled without profiling). I'm sure people more familiar with Mirage will have more suggestions.