I wanted to make a simple REST service for queuing file uploads, deployable as a virtual machine. The traditional way to do this is to download a Linux cloud image, install the software inside it, and deploy that. Instead I decided to try a unikernel.
Unikernels promise some interesting benefits. The Ubuntu 14.04 amd64-disk1.img cloud image is 243 MB unconfigured, while the unikernel ended up at just 5.2 MB (running the queue service). Ubuntu runs a large amount of C code in security-critical places, while the unikernel is almost entirely type-safe OCaml. And besides, trying new things is fun.
(This post also appeared on Reddit and Hacker News.)
Regular readers will know that a few months ago I began a new job at Cambridge University, working for an author of Real World OCaml and the leader of OCaml Labs, on a project building pure-OCaml distributed systems. Since they found me through my blog posts about learning OCaml, I thought they might want me to write some OCaml.
But no. They've actually had me porting the tiny Mini-OS kernel to ARM, using a mixture of C and assembler, to let the Mirage unikernel run on ARM devices. Of course, I got curious and wanted to write a Mirage application for myself...
Introduction
Linux, like many popular operating systems, is a multi-user system. This design dates back to the early days of computing, when a single expensive computer, running a single OS, would be shared between many users. The goal of the kernel is to protect itself from its users, and to protect the users from each other.
Today, computers are cheap and many people own several. Even when a physical computer is shared (e.g. in cloud computing), this is typically done by running multiple virtual machines, each serving a single user. Here, protecting the OS from its (usually single) application is pointless.
Removing the security barrier between the kernel and the application greatly simplifies things; we can run the whole system (kernel + application) as a single, privileged, executable - a unikernel.
And while we're rewriting everything anyway, we might as well replace C with a modern memory safe language, eliminating whole classes of bugs and security vulnerabilities, allowing decent error reporting, and providing structured data types throughout.
In the past, two things have made writing a completely new OS impractical:
- Legacy applications won't run on it.
- It probably won't support your hardware.
Virtualisation removes both obstacles: legacy applications can run in their own legacy VMs, and drivers are only needed for the virtual devices - e.g. a single network driver and a single block driver will cover all real network cards and hard drives.
A hello world kernel
The mirage tutorial starts by showing the easy, fully-automated way to build a unikernel. If you want to get started quickly you may prefer to read that and skip this section, but since one of the advantages of unikernels is their relative simplicity, let's do things the "hard" way first to understand how it works behind the scenes.
Here's the normal "hello world" program in OCaml:
let () =
  print_endline "Hello, world!"
To compile and run as a normal application, we'd do:
$ ocamlopt hw.ml -o hw
$ ./hw
Hello, world!
How can we make a unikernel that does the equivalent?
As it turns out, the above code works unmodified (though the Mirage people might frown at you for doing it this way).
We compile hw.ml to a hw.native.o file and then link with the unikernel libraries instead of the standard C library:
$ export OPAM_DIR=$(opam config var prefix)
$ export PKG_CONFIG_PATH=$OPAM_DIR/lib/pkgconfig
$ ocamlopt -output-obj -o hw.native.o hw.ml
$ ld -d -static -nostdlib --start-group \
$(pkg-config --static --libs openlibm libminios-xen) \
hw.native.o \
$OPAM_DIR/lib/mirage-xen/libocaml.a \
$OPAM_DIR/lib/mirage-xen/libxencaml.a \
--end-group \
$(gcc -print-libgcc-file-name) \
-o hw.xen
We now have a kernel image, hw.xen, which can be booted as a VM under the Xen hypervisor (as used by Amazon, Rackspace, etc. to host VMs). But first, let's look at the libraries we added:
- openlibm: a standard maths library, providing functions such as sin, cos, etc.
- libminios-xen: provides the architecture-specific boot code, a printk function for debugging, malloc for allocating memory and some low-level functions for talking to Xen.
- libocaml.a: the OCaml runtime (the garbage collector, etc.).
- libxencaml.a: OCaml bindings for libminios and some boot code.
- libgcc.a: support functions for code that gcc generates (actually, not needed on x86).
To deploy the new unikernel, we create a Xen configuration file for it (here, I'm giving it 16 MB of RAM):
name = "hw"
kernel = "hw.xen"
memory = 16
on_crash = "preserve"
on_poweroff = "preserve"
Setting on_crash and on_poweroff to preserve lets us see any output or errors, which would otherwise be missed if the VM exits too quickly.
We can now boot our new VM:
$ xl create -c hw.xl
Xen Minimal OS!
start_info: 000000000009b000(VA)
nr_pages: 0x800
shared_inf: 0x6ee97000(MA)
pt_base: 000000000009e000(VA)
nr_pt_frames: 0x5
mfn_list: 0000000000097000(VA)
mod_start: 0x0(VA)
mod_len: 0
flags: 0x0
cmd_line:
stack: 0000000000055e00-0000000000075e00
Mirage: start_kernel
MM: Init
_text: 0000000000000000(VA)
_etext: 000000000003452d(VA)
_erodata: 000000000003c000(VA)
_edata: 000000000003e4d0(VA)
stack start: 0000000000055e00(VA)
_end: 0000000000096d64(VA)
start_pfn: a6
max_pfn: 800
Mapping memory range 0x400000 - 0x800000
setting 0000000000000000-000000000003c000 readonly
skipped 0000000000001000
MM: Initialise page allocator for a8000(a8000)-800000(800000)
MM: done
Demand map pfns at 801000-2000801000.
Initialising timer interface
Initialising console ... done.
gnttab_table mapped at 0000000000801000.
xencaml: app_main_thread
getenv(OCAMLRUNPARAM) -> null
getenv(CAMLRUNPARAM) -> null
Unsupported function lseek called in Mini-OS kernel
Unsupported function lseek called in Mini-OS kernel
Unsupported function lseek called in Mini-OS kernel
Hello, world!
main returned 0
(Note: I'm testing locally by running Xen under VirtualBox. Not all of Xen's features can be used in this mode, but it works for testing unikernels. I'm also using my Git version of mirage-xen; the official one will display an error after printing the greeting because it expects you to provide a mainloop too. The warnings about lseek are just OCaml trying to find the current file offsets for stdin, stdout and stderr.)
As you can see, the boot process is quite short.
Execution begins at _start. Using objdump -d hw.xen, you can see that this just sets up the stack pointer register and calls the C function arch_init:
0000000000000000 <_start>:
0: fc cld
1: 48 8b 25 0f 00 00 00 mov 0xf(%rip),%rsp # 17 <stack_start>
8: 48 81 e4 00 00 ff ff and $0xffffffffffff0000,%rsp
f: 48 89 f7 mov %rsi,%rdi
12: e8 e2 bb 00 00 callq bbf9 <arch_init>
arch_init (in libminios) initialises the traps and FPU, prints Xen Minimal OS! and information about various addresses, and then calls start_kernel.

start_kernel (in libxencaml) sets up a few more features (events, interrupts, malloc, time-keeping and grant tables), then calls caml_startup.

caml_startup (in libocaml) initialises the garbage collector and calls caml_program, which is our hw.native.o.

We call print_endline, which libxencaml, as a convenience for debugging, forwards to libminios's console_print.
Using Mirage libraries
The above was a bit of a hack, which ended up just using the C console driver in libminios (one of the few things it provides, as it's needed for printk).
We can instead use the mirage-console-xen OCaml library.
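Something like this, sketched from the Mirage 1.x-era API (exact signatures may differ slightly):

open Lwt

let main () =
  (* Even looking up the console device is non-blocking *)
  Console.connect "0" >>= function
  | `Error _ -> fail (Failure "failed to connect to console")
  | `Ok console ->
      Console.log_s console "Hello, world!"

let () = OS.Main.run (main ())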
Mirage uses the usual Lwt library for cooperative threading, which I wrote about last year in Asynchronous Python vs OCaml; >>= means to wait for the result, allowing other code to run. Everything in Mirage is non-blocking, even looking up the console. OS.Main.run runs the main event loop.
Since we're using libraries, let's switch to ocamlbuild and give the dependencies in the _tags file, as usual for OCaml projects:
true: warn(A), strict_sequence, package(mirage-console-xen)
The only unusual thing we have to do here is tell ocamlbuild not to link in the Unix module when we build hw.native.o:
$ ocamlbuild -lflags -linkpkg,-dontlink,unix -use-ocamlfind hw.native.o
In the same way, we can use other libraries to access raw block devices (mirage-block-xen), timers (mirage-clock-xen) and network interfaces (mirage-net-xen). Other (non-Xen-specific) OCaml libraries can then be used on top of these low-level drivers. For example, fat-filesystem can provide a filesystem on a block device, while tcpip provides an OCaml TCP/IP stack on a network interface.
The mirage-unix libraries
You may have noticed that the Xen driver libraries we used above ended in -xen. In fact, each of these is just an implementation of some generic interface provided by Mirage. For example, mirage/types defines an abstract CONSOLE interface.
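Abridged, it looks something like this (a sketch; see mirage/types for the full version, which also documents each operation):

module type CONSOLE = sig
  type error = [ `Invalid_console of string ]
  include DEVICE with type error := error

  (* Write up to [len] bytes from [buffer] at [off], returning how much was written *)
  val write : t -> string -> int -> int -> int

  (* As [write], but an Lwt thread that doesn't return until all is written *)
  val write_all : t -> string -> int -> int -> int io

  (* Log a line, without and with waiting for completion *)
  val log : t -> string -> unit
  val log_s : t -> string -> unit io
end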
By linking against the -unix versions of libraries rather than the -xen ones, we can compile our code as an ordinary Unix program and run it directly. This makes testing and debugging very easy. To make sure our code is generic enough to do this, we can wrap it in a functor that takes any console module as an input.
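Along the lines of the tutorial's version (a sketch):

module Main (C : V1_LWT.CONSOLE) = struct
  (* [start] receives a connected console of whatever flavour the platform provides *)
  let start c =
    C.log_s c "Hello, world!"
end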
The code that provides a Xen or Unix console and calls this goes in main.ml.
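For Xen, the hand-written version might look something like this (a sketch; as the next section shows, this file is normally generated for you):

open Lwt

let () =
  OS.Main.run (
    Console.connect "0" >>= function
    | `Error _ -> fail (Failure "failed to connect to console")
    | `Ok c ->
        let module M = Unikernel.Main(Console) in
        M.start c)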
The mirage tool
With the platform-specific code isolated in main.ml, we can now use the mirage command-line tool to generate it automatically for the target platform. mirage takes a config.ml configuration file and generates Makefile and main.ml based on the current platform and the arguments passed.
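For this example, config.ml can be as small as this (a sketch using the Mirage 1.x configuration DSL):

open Mirage

(* Our Unikernel.Main functor takes a single device: a console *)
let main = foreign "Unikernel.Main" (console @-> job)

let () =
  register "hw" [ main $ default_console ]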
$ mirage configure --unix
$ make
$ ./mir-hw
Hello, world!
I won't describe this in detail because at this point we've reached the start of the official tutorial, and you can read that instead.
Test case
Because 0install is decentralised, it doesn't need a single centrally-managed repository (or several incompatible repositories, each trying to package every program, as is common with Linux distributions). In 0install, it's possible for every developer to run their own repository, containing just their software, with cross-repository dependencies handled automatically. But just because it's possible doesn't mean we have to go to that extreme: having medium sized repositories each managed by a team of people can be very convenient, especially where package maintainers come and go.
The general pattern for a group repository is to have a public server that accepts new package uploads from developers, and a private (firewalled) server with the repository's GPG key, which downloads from it.
Debian uses an anonymous FTP server for its incoming queue, polling it with a cron job. This turns out to be surprisingly complicated. You need to handle incomplete uploads (not processing them until they're done, or deleting them eventually if they never complete), allow contributors to overwrite or delete their own partial uploads (Debian allows you to upload a GPG-signed command file, which provides some control), and so on, as well as keeping the service fully patched. Also, the cron system can be annoying: if the package contains a mistake, it will be several minutes before the queue processor discovers this and emails the packager.
Perhaps there are some decent systems out there to handle all this, but it seemed like a good opportunity to try making a unikernel.
A particularly nice feature of this test-case is that it doesn't matter too much if it fails: the repository itself will check the developer's signature on the files, so an attacker can't compromise the repository by breaking into the queue; everything in the queue is intended to become public, so we need not worry much about confidentiality; lost uploads can be easily resubmitted; and if it goes down for a bit, it just means that new software can't be added to the repository. So, there's nothing critical about this service, which is reassuring.
Storage
The merge-queues library builds a queue abstraction on top of Irmin, a Git-inspired storage system for Mirage. But my needs are simple, and I wanted to test the more primitive libraries first, so I decided to build my queue directly on a plain filesystem.
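The first interface I came up with looked like this in outline (an abridged sketch; the original listing was longer and the exact names may differ):

(* Note the two constraints on F, explained below *)
module Make (F : V1_LWT.FS with type page_aligned_buffer = Cstruct.t
                            and type block_device_error = Fat.Fs.block_error) : sig
  type t

  type item = {
    size : int64;
    data : Cstruct.t Lwt_stream.t;
  }

  val create : F.t -> t Lwt.t

  module Upload : sig
    (* Stream an upload of the declared size into the queue *)
    val add : t -> item -> unit Lwt.t
  end

  module Download : sig
    (* Return the item at the head of the queue, waiting if it's empty *)
    val peek : t -> item Lwt.t

    (* Remove the head item, once the repository has confirmed receipt *)
    val delete : t -> unit Lwt.t
  end
end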
Our unikernel.ml will use this to make a queue, backed by a filesystem. Uploaders' HTTP POSTs will be routed to Upload.add, while the repository's GET and DELETE invocations go to the Download submodule. delete is a separate operation because we want the repository to confirm that it got the item successfully before we delete it, in case of network errors. Ideally, we might require that the DELETE comes over the same HTTP connection as the GET, just in case we accidentally run two instances of the repository software, but that's unlikely and it's convenient to test using separate curl invocations.
We're using another functor here, Upload_queue.Make, so that our queue will work over any filesystem. In theory, we can configure our unikernel with a FAT filesystem on a block device when running under Xen, while using a regular directory when running under Linux (e.g. for testing). But it doesn't work. You can see at the top that I had to restrict Mirage's abstract FS type in two ways:
- The read and write functions in FS pass the data using the abstract page_aligned_buffer type. Since we need to do something with the data, this isn't good enough. I therefore declare that this must be a Cstruct.t (basically, an array of bytes). This is actually OK; mirage-fs-unix also uses this type.
- One of the possible error codes from FS is the abstract type FS.block_device_error, and I can't see any way to turn one of these into a string using the FS interface. I therefore require a filesystem implementation that defines it to be Fat.Fs.block_error. Obviously, this means we now only support the FAT filesystem.
This doesn't prevent us from running as a normal process, because we can ask for a Unix "block" device (actually, just a plain disk.img file) and pass that to the Fat module, but it would be nice to have the option of using a real directory. I asked about this on the mailing list (Mirage questions from writing a REST service) and it looks like the FS type will change soon.
Implementation
For the curious, this initial implementation is in upload_queue.ml.
Internally, the module creates an in-memory queue to keep track of successful uploads. Uploads are streamed to the disk and when an upload completes with the declared size, the filename is added to the queue. If the upload ends with the wrong size (probably because the connection was lost), the file is deleted.
But what if our VM gets rebooted? We need to scan the file system at start-up and work out which uploads are complete and which should be deleted. My first thought was to name the files NUMBER.part during the upload and rename them on success. However, the FS interface currently lacks a rename method. Instead, I write an N byte to the start of each file and set it to Y on success. That works, but renaming would be nicer!
For downloading, the peek function returns the item at the head of the queue. If the queue is empty, it waits until something arrives. The repository just makes a GET request: if something is available then it returns immediately; otherwise the connection stays open until some data is ready, allowing the repository to respond immediately to new uploads.
Unit-testing the storage system
Because our unikernel can run as a process, testing is easy even if you don't have a local Xen deployment. A set of unit-tests exercises the upload queue module just as for any other program, and the service can be run as a normal process, listening on a normal TCP socket. A slight annoyance here is that the generated Makefile doesn't include any rules to build the tests, so you have to add them manually; if you regenerate the Makefile, it loses the new rule.
As you might expect from such a new system, testing uncovered several problems. The first (minor) problem is that when the disk becomes full, the unhelpful error reported by the filesystem is Failure("Unknown error: Failure(\"fault\")").
(I asked about this on the mailing list, in Error handling in Mirage, and there seems to be agreement that error handling should change.)
A more serious problem was that deleting files corrupted the FAT directory index. I downloaded the FAT library and added a unit-test for delete, which made it easy to track the problem down (despite my lack of knowledge of FAT). Here's the logic for marking a directory entry as deleted in the FAT library, in outline:
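(A paraphrase of the code's shape, not the library's exact source:)

(* Unmarshal the entry, set its deleted flag, and marshal the result
   into a fresh [delta] buffer to be applied to the device *)
match unmarshal entry with
| Lfn lfn -> marshal delta (Lfn { lfn with deleted = true })   (* writes to delta *)
| Dos dos -> marshal entry (Dos { dos with deleted = true })   (* writes to the input! *)
| End -> assert false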
It's supposed to take an entry, unmarshal it into an OCaml structure, set the deleted flag, and marshal the result into a new delta structure. These deltas are returned and applied to the device. The bug is a simple typo: Lfn (long filename) entries update correctly, but for old Dos ones it writes the new block to the input, not to delta.
The fix was simple enough (I also refactored it slightly to encourage the correct behaviour in future):
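(Again a paraphrase: compute the updated entry first, then marshal into delta in one place, so both branches necessarily write to the right buffer.)

let entry' =
  match unmarshal entry with
  | Lfn lfn -> Lfn { lfn with deleted = true }
  | Dos dos -> Dos { dos with deleted = true }
  | End -> assert false in
marshal delta entry'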
This demonstrates both the good and the bad of Mirage: the bug was easy to find and fix, using regular debugging tools. I'm sure fixing a filesystem corruption bug in the Linux kernel would have been vastly more difficult. On the other hand, Linux is rather well tested, whereas I appear to be the first person ever to try deleting a file in Mirage!
The HTTP server
This turned out to be quite simple. Here's the unikernel's start function, in outline:
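(A sketch; Upload_queue.FS and HTTP stand in for the real module types, and handle_uploader / handle_downloader are defined later in the file.)

module Main (C : V1_LWT.CONSOLE) (F : Upload_queue.FS) (S : HTTP) = struct
  module Q = Upload_queue.Make(F)

  let start console fs http =
    Q.create fs >>= fun queue ->
    (* Called by the HTTP server for every incoming request *)
    let callback _conn_id request body =
      match Uri.path (Cohttp.Request.uri request) with
      | "/uploader" -> handle_uploader queue request body
      | "/downloader" -> handle_downloader queue request
      | _ -> S.respond_not_found ()
    in
    http (S.make ~callback ())
end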
Here, our functor is extended to take a filesystem (using the restricted type required by our Upload_queue, as noted above) and an HTTP server module as arguments. The HTTP server calls our callback each time it receives a request, and this dispatches /uploader requests to handle_uploader and /downloader ones to handle_downloader. These are also very simple; for example, the download GET handler:
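(A sketch, assuming the cohttp API of the time; item is the record from the queue interface above:)

let get queue =
  (* Waits until an upload is available if the queue is empty *)
  Q.Download.peek queue >>= fun item ->
  let headers =
    Cohttp.Header.init_with "content-length" (Int64.to_string item.size) in
  let body = Cohttp_lwt_body.of_stream (Lwt_stream.map Cstruct.to_string item.data) in
  S.respond ~headers ~status:`OK ~body ()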
The other methods (put and delete) are similar.
Buffered reads
Running as a --unix process, I initially got a download speed of 17.2 KB/s, which was rather disappointing, especially as Apache on the same machine gets 615 MB/s! Increasing the size of the chunks I was reading from the Fat filesystem (a disk.img file) from 512 bytes to 1 MB raised this to 2.83 MB/s, and removing the O_DIRECT flag from mirage-block-unix raised it further to 15 MB/s (so this is with Linux caching the data in RAM).
To check that the filesystem was the problem, I removed the F.read call (so it would return uninitialised data instead of the actual file contents). It then managed a very respectable 514 MB/s. Nothing wrong with the HTTP code, then.
Streaming uploads
It all worked nicely running as a Unix process, so the next step was to deploy on Xen. I was hoping that most of the bugs would already have been found during the Unix testing, but in fact there were more lurking.
It worked for very small files, but when uploading larger files it quickly ran out of memory on my 64-bit x86 test system. I also tried it on my 32-bit CubieTruck ARM board, but that failed even sooner, with Invalid_argument("String.create") (on 32-bit platforms, OCaml strings are limited to 16 MB).
In both cases, the problem was that the cohttp library tried to read the entire upload in one go.
The culprit was the read function in Transfer_io, which used read_exactly to read the whole body in a single call.
I changed it to use read rather than read_exactly (read returns whatever data is available, waiting only if there isn't any at all).
I also had to change the signature to take a mutable reference (remaining) for the remaining data, otherwise it has no way to know when it's done (patch).
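The change has this shape (a paraphrase with approximate names; see the linked patch for the real diff):

let read ~remaining ic =
  if !remaining = 0L then return Done
  else (
    let size = Int64.to_int (min !remaining 32768L) in
    (* IO.read returns whatever is available, up to [size] bytes *)
    IO.read ic size >>= fun chunk ->
    remaining := Int64.sub !remaining (Int64.of_int (String.length chunk));
    return (Chunk chunk))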
Buffered writes
With the uploads now split into chunks, upload speed with --unix was 178 KB/s. Batching up the chunks (which were generally 4 KB each) into a 64 KB buffer increased the speed to 2083 KB/s. With a 1 MB buffer, I got 6386 KB/s. The buffering code looks like this, in outline:
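(A sketch with hypothetical names; error handling via the FS result type is omitted:)

type buffer = {
  fs : F.t;
  name : string;                (* the file being written *)
  data : Cstruct.t;             (* scratch space, e.g. 1 MB *)
  mutable used : int;           (* bytes currently buffered *)
  mutable file_offset : int;    (* where the next flush lands *)
}

(* Write out everything buffered so far as one large FS write *)
let flush b =
  F.write b.fs b.name b.file_offset (Cstruct.sub b.data 0 b.used) >>= fun _ ->
  b.file_offset <- b.file_offset + b.used;
  b.used <- 0;
  return ()

(* Add a chunk, flushing whenever the buffer fills *)
let rec add b chunk =
  let space = Cstruct.len b.data - b.used in
  let n = min space (Cstruct.len chunk) in
  Cstruct.blit chunk 0 b.data b.used n;
  b.used <- b.used + n;
  if n < Cstruct.len chunk then
    flush b >>= fun () -> add b (Cstruct.shift chunk n)
  else return ()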
Asking on the mailing list confirmed that Fat is not well optimised. This isn't actually a problem for my service, since it's still faster than my Internet connection, but there's clearly more work needed here.
Upload speed on Xen
Testing on my little CubieTruck board, I then got:
Upload speed: 74 KB/s
Download speed: 1.6 KB/s
Hmm. To get a feel for what the board is capable of, I ran nc -l -p 8080 < /dev/zero on the board and nc cubietruck 8080 | pv > /dev/null on my laptop, getting 29 MB/s. Still, my unikernel is running as a guest, meaning it has the overhead of using the virtual network interface (it has to pass the data to dom0, which then sends it over the real interface). So I installed a Linux guest and tried from there: 47.2 MB/s. Interesting. I have no idea why it's faster than dom0!
I loaded up Wireshark to see what was happening with the unikernel transfers. The upload transfer mostly went fast, but stalled in the middle for 15 seconds and then for 12 seconds at the end. Wireshark showed that the unikernel was ack'ing the packets but reducing the TCP window size, indicating that the packets weren't being processed by the application code. The delays corresponded to the times when we were flushing the data to the SD card, which makes sense. So, this looks like another filesystem problem (we should be able to write to the SD card much faster than this).
TCP retransmissions
For the download, Wireshark showed that many of the packets had incorrect TCP checksums and were having to be retransmitted. I was already familiar with this bug from a previous mailing list discussion: wireshark capture of failed download from mirage-www on ARM. That turned out to be a Linux bug: the privileged dom0 code responsible for sending our virtual network packets to the real network becomes confused if two packets occupy the same physical page in memory.
Here's what happens:
- We read 1 MB of data from the disk and send it to the HTTP layer as the next chunk.
- Chunked.write does the HTTP chunking and sends it to the TCP/IP channel.
- Channel.write_string writes the HTTP output into pages (aligned 4K blocks of memory).
- Pcb.writefn then determines that each page is too big for a TCP packet and splits each one into smaller chunks, sharing the single underlying page.
My original fix changed mirage-net-xen to wait until the first buffer had been read before sending the second one. That fixed the retransmissions, but all the waiting meant I still only got 56 KB/s. Instead, I changed writefn to copy remaining_bit into a new IO page, and with that I got 495 KB/s.
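(The idea, sketched with hypothetical variable names: give the tail of the split its own freshly allocated page, so the two packets no longer share one physical page.)

let copy = Io_page.to_cstruct (Io_page.get 1) in
let len = Cstruct.len remaining_bit in
Cstruct.blit remaining_bit 0 copy 0 len;
let remaining_bit = Cstruct.sub copy 0 len in
...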
Replacing the filesystem read with a simple String.create of the same length, I got 3.9 MB/s, showing that once again the FAT filesystem was the limiting factor.
Adding a block cache
I tried adding a block cache layer between mirage-block-xen and fat-filesystem.
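The idea, as a sketch (hypothetical names; a real version must also invalidate cached sectors on write):

module Cached (B : V1_LWT.BLOCK) = struct
  include B    (* everything except [read] passes through unchanged *)

  let cache : (int64, Cstruct.t) Hashtbl.t = Hashtbl.create 1024

  let clone buf =
    let copy = Cstruct.create (Cstruct.len buf) in
    Cstruct.blit buf 0 copy 0 (Cstruct.len buf);
    copy

  let read b sector_start buffers =
    match buffers with
    | [buf] when Hashtbl.mem cache sector_start ->
        (* Hit: serve the sectors from memory *)
        Cstruct.blit (Hashtbl.find cache sector_start) 0 buf 0 (Cstruct.len buf);
        return (`Ok ())
    | _ ->
        (* Miss: read from the real device, remembering single-buffer reads *)
        B.read b sector_start buffers >>= function
        | `Ok () ->
            (match buffers with
             | [buf] -> Hashtbl.replace cache sector_start (clone buf)
             | _ -> ());
            return (`Ok ())
        | `Error _ as e -> return e
end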
With this in place, upload speed remains at 76 KB/s, but the download speed increases to 1 MB/s (for a 20 MB file, which therefore doesn't fit in the cache). This suggests that the FAT filesystem is reading the same disk sectors many times. Enlarging the memory cache to cover the whole file, the download speed only increases to 1.3 MB/s, so the FAT code must be doing some inefficient calculations too.
Replacing FAT
Since most of my problems seemed to be coming from using FAT, I decided to try a new approach. I removed all the FAT code and the block cache and changed upload_queue.ml to write directly to the block device. With that (no caching), I get:
Upload speed: 2.27 MB/s
Download speed: 2.46 MB/s
That's not too bad. It's faster than my Internet connection, which means that the unikernel is no longer the limiting factor.
Here's the new version: upload_queue.ml. The big simplification comes from knowing that the queue will spend most of its time empty (another good reason to use a small VM for it). The code has a next_free_sector variable, which it advances every time an upload starts. When the queue becomes empty and there are no uploads in progress, this variable is reset back to sector 1 (sector 0 holds the index).
This does mean that we may report disk full errors to uploaders even when there is free space on the disk, but this won't happen in typical usage because the repository downloads things as soon as they're uploaded (if it does happen, it just means uploaders have to wait a couple of minutes until the repository empties the queue).
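The allocation scheme, sketched (hypothetical names):

let next_free_sector = ref 1L   (* sector 0 holds the index *)

(* Each upload claims a contiguous run of sectors up front *)
let allocate ~sectors_needed =
  let start = !next_free_sector in
  next_free_sector := Int64.add start sectors_needed;
  start

(* Once the queue drains and no uploads are in flight, start over *)
let maybe_reset ~queue_empty ~uploads_in_progress =
  if queue_empty && uploads_in_progress = 0 then
    next_free_sector := 1L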
Managing the block device manually brought a few more advantages over FAT:
- No need to generate random file names for the uploads.
- No need to delete incomplete uploads (we only write the file's index entry to disk on success).
- The system should recover automatically from filesystem corruption because invalid entries can be detected reliably at boot time and discarded.
- Disk full errors are reported correctly.
- The queue ordering isn't lost on reboot.
Conclusions
Modern operating systems are often extremely complex, but much of this is historical baggage which isn't needed on a modern system where you're running a single application as a VM under a hypervisor. Mirage allows you to create very small VMs which contain almost no C code. These VMs should be easier to write, more reliable and more secure.
Creating a bootable OCaml kernel is surprisingly easy, and from there, adding support for extra devices is just a matter of pulling in the appropriate libraries. By programming against generic interfaces, you can create code that runs under Linux/Unix/OS X or as a virtual machine under Xen, and switch between configurations using the mirage tool.
Mirage is still very young, and I found many rough edges while writing my queuing service for 0install:
- While Linux provides fast, reliable filesystems as standard, Mirage currently only provides a basic FAT implementation.
- Linux provides caching as standard, while you have to implement this yourself on Mirage.
- Error reporting should be a big improvement over C's error codes, but getting friendly error messages from Mirage is currently difficult.
- The system has clearly been designed for high performance (the APIs generally write to user-provided buffers to avoid copying, much like C libraries do), but many areas have not yet been optimised.
- Buffers often have extra requirements (e.g. must be page-aligned, a single page, immutable, etc) which are not currently captured in the type system, and this can lead to run-time errors which would ideally have been detected at compile time.
However, there is a huge amount of work happening on Mirage right now and it looks like all of these problems are being worked on. If you're interested in low-level OS programming and don't want to mess about with C, Mirage is a lot of fun, and it can be useful for practical tasks already with a bit of effort.
There are still many areas I need to find out more about. In particular, I want to use the new pure-OCaml TLS stack to secure the system, and to try Irmin's Git-like distributed, branchable storage to provide the queue instead of writing it myself. I hope to try those soon...