Thomas Leonard's blog

My First Unikernel

I wanted to make a simple REST service for queuing file uploads, deployable as a virtual machine. The traditional way to do this is to download a Linux cloud image, install the software inside it, and deploy that. Instead I decided to try a unikernel.

Unikernels promise some interesting benefits. The Ubuntu 14.04 amd64-disk1.img cloud image is 243 MB unconfigured, while the unikernel ended up at just 5.2 MB (running the queue service). Ubuntu runs a large amount of C code in security-critical places, while the unikernel is almost entirely type-safe OCaml. And besides, trying new things is fun.

( this post also appeared on Reddit and Hacker News )

Regular readers will know that a few months ago I began a new job at Cambridge University. Working for an author of Real World OCaml and leader of OCaml Labs, on a project building pure-OCaml distributed systems, who found me through my blog posts about learning OCaml, I thought they might want me to write some OCaml.

But no. They've actually had me porting the tiny Mini-OS kernel to ARM, using a mixture of C and assembler, to let the Mirage unikernel run on ARM devices. Of course, I got curious and wanted to write a Mirage application for myself...

Introduction

Linux, like many popular operating systems, is a multi-user system. This design dates back to the early days of computing, when a single expensive computer, running a single OS, would be shared between many users. The goal of the kernel is to protect itself from its users, and to protect the users from each other.

Today, computers are cheap and many people own several. Even when a physical computer is shared (e.g. in cloud computing), this is typically done by running multiple virtual machines, each serving a single user. Here, protecting the OS from its (usually single) application is pointless.

Removing the security barrier between the kernel and the application greatly simplifies things; we can run the whole system (kernel + application) as a single, privileged, executable - a unikernel.

And while we're rewriting everything anyway, we might as well replace C with a modern memory safe language, eliminating whole classes of bugs and security vulnerabilities, allowing decent error reporting, and providing structured data types throughout.

In the past, two things have made writing a completely new OS impractical:

  • Legacy applications won't run on it.
  • It probably won't support your hardware.

Virtualisation removes both obstacles: legacy applications can run in their own legacy VMs, and drivers are only needed for the virtual devices - e.g. a single network driver and a single block driver will cover all real network cards and hard drives.

A hello world kernel

The mirage tutorial starts by showing the easy, fully-automated way to build a unikernel. If you want to get started quickly you may prefer to read that and skip this section, but since one of the advantages of unikernels is their relative simplicity, let's do things the "hard" way first to understand how it works behind the scenes.

Here's the normal "hello world" program in OCaml:

hw.ml
let () =
  print_endline "Hello, world!"

To compile and run as a normal application, we'd do:

$ ocamlopt hw.ml -o hw
$ ./hw 
Hello, world!

How can we make a unikernel that does the equivalent? As it turns out, the above code works unmodified (though the Mirage people might frown at you for doing it this way). We compile hw.ml to a hw.native.o file and then link with the unikernel libraries instead of the standard C library:

$ export OPAM_DIR=$(opam config var prefix)
$ export PKG_CONFIG_PATH=$OPAM_DIR/lib/pkgconfig
$ ocamlopt -output-obj -o hw.native.o hw.ml
$ ld -d -static -nostdlib --start-group \
    $(pkg-config --static --libs openlibm libminios-xen) \
    hw.native.o \
    $OPAM_DIR/lib/mirage-xen/libocaml.a \
    $OPAM_DIR/lib/mirage-xen/libxencaml.a \
    --end-group \
    $(gcc -print-libgcc-file-name) \
    -o hw.xen

We now have a kernel image, hw.xen, which can be booted as a VM under the Xen hypervisor (as used by Amazon, Rackspace, etc to host VMs). But first, let's look at the libraries we added:

  • openlibm: a standard maths library. It provides functions such as sin, cos, etc.
  • libminios-xen: provides the architecture-specific boot code, a printk function for debugging, malloc for allocating memory and some low-level functions for talking to Xen.
  • libocaml.a: the OCaml runtime (the garbage collector, etc).
  • libxencaml.a: OCaml bindings for libminios and some boot code.
  • libgcc.a: support functions for code that gcc generates (actually, not needed on x86).

To deploy the new unikernel, we create a Xen configuration file for it (here, I'm giving it 16 MB of RAM):

hw.xl
name = 'hw'
kernel = 'hw.xen'
memory = 16
on_crash = 'preserve'
on_poweroff = 'preserve'

Setting on_crash and on_poweroff to preserve lets us see any output or errors, which would otherwise be missed if the VM exits too quickly.

We can now boot our new VM:

$ xl create -c hw.xl
Xen Minimal OS!
  start_info: 000000000009b000(VA)
    nr_pages: 0x800
  shared_inf: 0x6ee97000(MA)
     pt_base: 000000000009e000(VA)
nr_pt_frames: 0x5
    mfn_list: 0000000000097000(VA)
   mod_start: 0x0(VA)
     mod_len: 0
       flags: 0x0
    cmd_line: 
       stack: 0000000000055e00-0000000000075e00
Mirage: start_kernel
MM: Init
      _text: 0000000000000000(VA)
     _etext: 000000000003452d(VA)
   _erodata: 000000000003c000(VA)
     _edata: 000000000003e4d0(VA)
stack start: 0000000000055e00(VA)
       _end: 0000000000096d64(VA)
  start_pfn: a6
    max_pfn: 800
Mapping memory range 0x400000 - 0x800000
setting 0000000000000000-000000000003c000 readonly
skipped 0000000000001000
MM: Initialise page allocator for a8000(a8000)-800000(800000)
MM: done
Demand map pfns at 801000-2000801000.
Initialising timer interface
Initialising console ... done.
gnttab_table mapped at 0000000000801000.
xencaml: app_main_thread
getenv(OCAMLRUNPARAM) -> null
getenv(CAMLRUNPARAM) -> null
Unsupported function lseek called in Mini-OS kernel
Unsupported function lseek called in Mini-OS kernel
Unsupported function lseek called in Mini-OS kernel
Hello, world!
main returned 0

( Note: I'm testing locally by running Xen under VirtualBox. Not all of Xen's features can be used in this mode, but it works for testing unikernels. I'm also using my Git version of mirage-xen; the official one will display an error after printing the greeting because it expects you to provide a mainloop too. The warnings about lseek are just OCaml trying to find the current file offsets for stdin, stdout and stderr.)

As you can see, the boot process is quite short. Execution begins at _start. Using objdump -d hw.xen, you can see that this just sets up the stack pointer register and calls the C function arch_init:

0000000000000000 <_start>:
       0:   fc                      cld    
       1:   48 8b 25 0f 00 00 00    mov    0xf(%rip),%rsp        # 17 <stack_start>
       8:   48 81 e4 00 00 ff ff    and    $0xffffffffffff0000,%rsp
       f:   48 89 f7                mov    %rsi,%rdi
      12:   e8 e2 bb 00 00          callq  bbf9 <arch_init>

arch_init (in libminios) initialises the traps and FPU and then prints Xen Minimal OS! and information about various addresses. It then calls start_kernel.

start_kernel (in libxencaml) sets up a few more features (events, interrupts, malloc, time-keeping and grant tables), then calls caml_startup.

caml_startup (in libocaml) initialises the garbage collector and calls caml_program, which is our hw.native.o.

We call print_endline, which libxencaml, as a convenience for debugging, forwards to libminios's console_print.

Using Mirage libraries

The above was a bit of a hack, which ended up just using the C console driver in libminios (one of the few things it provides, as it's needed for printk). We can instead use the mirage-console-xen OCaml library, like this:

hw.ml
open Lwt

let main =
  Console.connect "0" >>= function
  | `Error _ -> failwith "Failed to connect to console"
  | `Ok default_console ->
      Console.log_s default_console "Hello, world!"

let () =
  OS.Main.run main

Mirage uses the usual Lwt library for cooperative threading, which I wrote about last year in Asynchronous Python vs OCaml. The >>= operator means to wait for the result, allowing other code to run in the meantime. Everything in Mirage is non-blocking, even looking up the console. OS.Main.run runs the main event loop.

Since we're using libraries, let's switch to ocamlbuild and give the dependencies in the _tags file, as usual for OCaml projects:

true: warn(A), strict_sequence, package(mirage-console-xen)

The only unusual thing we have to do here is tell ocamlbuild not to link in the Unix module when we build hw.native.o:

$ ocamlbuild -lflags -linkpkg,-dontlink,unix -use-ocamlfind hw.native.o

In the same way, we can use other libraries to access raw block devices (mirage-block-xen), timers (mirage-clock-xen) and network interfaces (mirage-net-xen). Other (non-Xen-specific) OCaml libraries can then be used on top of these low-level drivers. For example, fat-filesystem can provide a filesystem on a block device, while tcpip provides an OCaml TCP/IP stack on a network interface.

The mirage-unix libraries

You may have noticed that the Xen driver libraries we used above ended in -xen. In fact, each of these is just an implementation of some generic interface provided by Mirage. For example, mirage/types defines the abstract CONSOLE interface as:

module type CONSOLE = sig
  (** Text console input/output operations. *)

  type error = [
    | `Invalid_console of string
  ]
  (** The type representing possible errors when attaching a console. *)

  include DEVICE with
    type error := error

  val write : t -> string -> int -> int -> int
  (** [write t buf off len] writes up to [len] chars of [String.sub buf
      off len] to the console [t] and returns the number of bytes
      written. Raises {!Invalid_argument} if [len > buf - off]. *)

  val write_all : t -> string -> int -> int -> unit io
  (** [write_all t buf off len] is a thread that writes [String.sub buf
      off len] to the console [t] and returns when done. Raises
      {!Invalid_argument} if [len > buf - off]. *)

  val log : t -> string -> unit
  (** [log str] writes as much characters of [str] that can be written
      in one write operation to the console [t], then writes
      "\r\n" to it. *)

  val log_s : t -> string -> unit io
  (** [log_s str] is a thread that writes [str ^ "\r\n"] in the
      console [t]. *)

end

By linking against the -unix versions of libraries rather than the -xen ones, we can compile our code as an ordinary Unix program and run it directly. This makes testing and debugging very easy.

To make sure our code is generic enough to do this, we can wrap it in a functor that takes any console module as an input:

unikernel.ml
module Main (C : V1_LWT.CONSOLE) = struct
  let start c =
    C.log_s c "Hello, world!"
end

The code that provides a Xen or Unix console and calls this goes in main.ml:

main.ml
open Lwt

let console =
  Console.connect "0" >>= function
  | `Error _ -> failwith "Failed to connect to console"
  | `Ok c -> return c

module U = Unikernel.Main(Console)

let () =
  OS.Main.run (console >>= U.start)
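
The functor pattern can be tried without any Mirage libraries at all. The following is a minimal sketch using only the OCaml standard library: the CONSOLE signature here is a cut-down stand-in for V1_LWT.CONSOLE (no Lwt, no connect), and Stdout_console and Buffer_console are hypothetical implementations invented for the example, playing the roles of the -xen and -unix libraries.

```ocaml
(* A cut-down console interface. The real V1_LWT.CONSOLE is richer
   and its log_s returns an Lwt thread. *)
module type CONSOLE = sig
  type t
  val log : t -> string -> unit
end

(* Hypothetical "real" console: writes to stdout. *)
module Stdout_console = struct
  type t = unit
  let log () msg = print_endline msg
end

(* Hypothetical test console: records the output in a buffer. *)
module Buffer_console = struct
  type t = Buffer.t
  let log buf msg =
    Buffer.add_string buf msg;
    Buffer.add_char buf '\n'
end

(* The application only sees the CONSOLE signature. *)
module Main (C : CONSOLE) = struct
  let start c = C.log c "Hello, world!"
end

module Real = Main (Stdout_console)
module Test = Main (Buffer_console)

let () = Real.start ()

(* The same application code, run against the test console. *)
let captured =
  let buf = Buffer.create 16 in
  Test.start buf;
  Buffer.contents buf
```

Because Main never names a concrete console module, swapping the implementation is a one-line change at the point where the functor is applied.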

The mirage tool

With the platform-specific code isolated in main.ml, we can now use the mirage command-line tool to generate it automatically for the target platform. mirage takes a config.ml configuration file and generates Makefile and main.ml based on the current platform and the arguments passed.

config.ml
open Mirage

let main = foreign "Unikernel.Main" (console @-> job)

let () =
  register "hw" [
    main $ default_console
  ]

$ mirage configure --unix
$ make
$ ./mir-hw 
Hello, world!

I won't describe this in detail because at this point we've reached the start of the official tutorial, and you can read that instead.

Test case

Because 0install is decentralised, it doesn't need a single centrally-managed repository (or several incompatible repositories, each trying to package every program, as is common with Linux distributions). In 0install, it's possible for every developer to run their own repository, containing just their software, with cross-repository dependencies handled automatically. But just because it's possible doesn't mean we have to go to that extreme: having medium sized repositories each managed by a team of people can be very convenient, especially where package maintainers come and go.

The general pattern for a group repository is to have a public server that accepts new package uploads from developers, and a private (firewalled) server with the repository's GPG key, which downloads from the public one.

Debian uses an anonymous FTP server for its incoming queue, polling it with a cron job. This turns out to be surprisingly complicated. You need to handle incomplete uploads (not processing them until they're done, or deleting them eventually if they never complete), allow contributors to overwrite or delete their own partial uploads (Debian allows you to upload a GPG-signed command file, which provides some control), etc, as well as keep the service fully patched. Also, the cron system can be annoying: if the package contains a mistake then it will be several minutes before it discovers this and emails the packager.

Perhaps there are some decent systems out there to handle all this, but it seemed like a good opportunity to try making a unikernel.

A particularly nice feature of this test-case is that it doesn't matter too much if it fails: the repository itself will check the developer's signature on the files, so an attacker can't compromise the repository by breaking into the queue; everything in the queue is intended to become public, so we need not worry much about confidentiality; lost uploads can be easily resubmitted; and if it goes down for a bit, it just means that new software can't be added to the repository. So, there's nothing critical about this service, which is reassuring.

Storage

The merge-queues library builds a queue abstraction on top of Irmin, a Git-inspired storage system for Mirage. But my needs are simple, and I wanted to test the more primitive libraries first, so I decided to build my queue directly on a plain filesystem. This was the first interface I came up with:

upload_queue.mli
module type FS = V1_LWT.FS with
  type page_aligned_buffer = Cstruct.t and
  type block_device_error = Fat.Fs.block_error

(** An upload.
 * To avoid loading complete uploads into RAM, we stream them
 * between the network and the disk. *)
type item = {
  size : int64;
  data : string Lwt_stream.t;
}

type add_error = [`Wrong_size of int64 | `Unknown of exn]

module Make : functor (F : FS) -> sig
  (** An upload queue. *)
  type t

  (** Create a new queue, backed by a filesystem. *)
  val create : F.t -> t Lwt.t

  module Upload : sig
    (** Add an upload to the queue.
     * The upload is added only once the end of the stream is
     * reached, and only if the total size matches the size
     * in the record.
     * To cancel an add, just terminate the stream. *)
    val add :
      t -> item -> [ `Ok of unit | `Error of add_error ] Lwt.t
  end

  module Download : sig
    (** Interface for the repository software to fetch items
     * from the queue. Only one client may use this interface
     * at a time, or things will go wrong. *)

    (** Return a fresh stream for the item at the head of the
     * queue, without removing it. After downloading it
     * successfully, the client should call [delete]. If the
     * queue is empty, this blocks until an item is available. *)
    val peek : t -> item Lwt.t

    (** Delete the item previously retrieved by [peek].
     * If the previous item has already been deleted, this does
     * nothing, even if there are more items in the queue. *)
    val delete : t -> unit Lwt.t
  end
end

Our unikernel.ml will use this to make a queue, backed by a filesystem. Uploaders' HTTP POSTs will be routed to Upload.add, while the repository's GET and DELETE invocations go to the Download submodule. delete is a separate operation because we want the repository to confirm that it got the item successfully before we delete it, in case of network errors.

Ideally, we might require that the DELETE comes over the same HTTP connection as the GET just in case we accidentally run two instances of the repository software, but that's unlikely and it's convenient to test using separate curl invocations.

We're using another functor here, Upload_queue.Make, so that our queue will work over any filesystem. In theory, we can configure our unikernel with a FAT filesystem on a block device when running under Xen, while using a regular directory when running under Linux (e.g. for testing).

But it doesn't work. You can see at the top that I had to restrict Mirage's abstract FS type in two ways:

  • The read and write functions in FS pass the data using the abstract page_aligned_buffer type. Since we need to do something with the data, this isn't good enough. I therefore declare that this must be a Cstruct.t (basically, an array of bytes). This is actually OK; mirage-fs-unix also uses this type.

  • One of the possible error codes from FS is the abstract type FS.block_device_error, and I can't see any way to turn one of these into a string using the FS interface. I therefore require a filesystem implementation that defines it to be Fat.Fs.block_error. Obviously, this means we now only support the FAT filesystem.

This doesn't prevent us from running as a normal process, because we can ask for a Unix "block" device (actually, just a plain disk.img file) and pass that to the Fat module, but it would be nice to have the option of using a real directory.

I asked about this on the mailing list - Mirage questions from writing a REST service - and it looks like the FS type will change soon.

Implementation

For the curious, this initial implementation is in upload_queue.ml.

Internally, the module creates an in-memory queue to keep track of successful uploads. Uploads are streamed to the disk and when an upload completes with the declared size, the filename is added to the queue. If the upload ends with the wrong size (probably because the connection was lost), the file is deleted.

But what if our VM gets rebooted? We need to scan the file system at start up and work out which uploads are complete and which should be deleted. My first thought was to name the files NUMBER.part during the upload and rename on success. However, the FS interface currently lacks a rename method. Instead, I write an N byte to the start of each file and set it to Y on success. That works, but renaming would be nicer!
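
The flag-byte scheme can be illustrated in isolation. This is only a sketch under stated assumptions: a Hashtbl of Bytes stands in for the filesystem, and begin_upload, commit and recover are hypothetical names for the example, not functions from upload_queue.ml.

```ocaml
(* In-memory stand-in for the filesystem: file name -> contents. *)
let files : (string, Bytes.t) Hashtbl.t = Hashtbl.create 8

(* Start an upload: the first byte is the flag, initially 'N'. *)
let begin_upload name payload =
  let b = Bytes.create (1 + String.length payload) in
  Bytes.set b 0 'N';
  Bytes.blit_string payload 0 b 1 (String.length payload);
  Hashtbl.replace files name b

(* Mark an upload as complete by flipping the flag to 'Y'. *)
let commit name =
  Bytes.set (Hashtbl.find files name) 0 'Y'

(* At boot: delete anything not marked 'Y' and return the names of
   the complete uploads, sorted. *)
let recover () =
  let all = Hashtbl.fold (fun name b acc -> (name, b) :: acc) files [] in
  let complete, partial =
    List.partition
      (fun (_, b) -> Bytes.length b > 0 && Bytes.get b 0 = 'Y') all in
  List.iter (fun (name, _) -> Hashtbl.remove files name) partial;
  List.sort compare (List.map fst complete)

let () =
  begin_upload "1" "finished upload";
  commit "1";
  begin_upload "2" "interrupted upl"   (* never committed *)

(* Simulate the scan that runs after a reboot. *)
let survivors = recover ()
```

The flag flip is a single-byte write, so a crash leaves the file either clearly partial or clearly complete.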

For downloading, the peek function returns the item at the head of the queue. If the queue is empty, it waits until something arrives. The repository just makes a GET request - if something is available then it returns immediately, otherwise the connection stays open until some data is ready, allowing the repository to respond immediately to new uploads.

Unit-testing the storage system

Because our unikernel can run as a process, testing is easy even if you don't have a local Xen deployment. A set of unit-tests test the upload queue module just as for any other program, and the service can be run as a normal process, listening on a normal TCP socket. A slight annoyance here is that the generated Makefile doesn't include any rules to build the tests so you have to add them manually, and if you regenerate the Makefile then it loses the new rule.

As you might expect from such a new system, testing uncovered several problems. The first (minor) problem is that when the disk becomes full, the unhelpful error reported by the filesystem is Failure("Unknown error: Failure(\"fault\")").

( I asked about this on the mailing list - Error handling in Mirage - and there seems to be agreement that error handling should change. )

A more serious problem was that deleting files corrupted the FAT directory index. I downloaded the FAT library and added a unit-test for delete, which made it easy to track the problem down (despite my lack of knowledge of FAT). Here's the code for marking a directory entry as deleted in the FAT library:

    let b = Cstruct.sub block offset sizeof in
    let delta = Cstruct.create sizeof in
    begin match unmarshal b with
      | Lfn lfn ->
        let lfn' = { lfn with lfn_deleted = true } in
        marshal delta (Lfn lfn')
      | Dos dos ->
        let dos' = { dos with deleted = true } in
        marshal b (Dos dos')
      | End -> assert false
    end;
    Update.from_cstruct (Int64.of_int offset) delta :: acc

It's supposed to take an entry, unmarshal it into an OCaml structure, set the deleted flag, and marshal the result into a new delta structure. These deltas are returned and applied to the device. The bug is a simple typo: Lfn (long filename) entries update correctly, but for old Dos ones it writes the new block to the input, not to delta. The fix was simple enough (I also refactored it slightly to encourage the correct behaviour in future):

    let b = Cstruct.sub block offset sizeof in
    let delta = Cstruct.create sizeof in
    marshal delta begin match unmarshal b with
      | Lfn lfn -> Lfn { lfn with lfn_deleted = true }
      | Dos dos -> Dos { dos with deleted = true }
      | End -> assert false
    end;
    Update.from_cstruct (Int64.of_int offset) delta :: acc
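
To see why the refactored shape is harder to get wrong, here is a toy model of the two versions using only the standard library. Everything in it is an assumption made for illustration: the one-byte encoding, the entry type and the delete_buggy and delete_fixed names are invented, and Bytes stands in for Cstruct.

```ocaml
(* The bool is the deleted flag. *)
type entry = Lfn of bool | Dos of bool

(* Toy one-byte encoding: a tag character, upper-case if deleted. *)
let marshal buf = function
  | Lfn d -> Bytes.set buf 0 (if d then 'L' else 'l')
  | Dos d -> Bytes.set buf 0 (if d then 'D' else 'd')

let unmarshal buf =
  match Bytes.get buf 0 with
  | 'l' -> Lfn false
  | 'L' -> Lfn true
  | 'd' -> Dos false
  | 'D' -> Dos true
  | _ -> invalid_arg "unmarshal"

(* Buggy shape: the Dos branch marshals into the input [b], so the
   returned delta is never filled in. *)
let delete_buggy b =
  let delta = Bytes.make 1 '\000' in
  begin match unmarshal b with
    | Lfn _ -> marshal delta (Lfn true)
    | Dos _ -> marshal b (Dos true)
  end;
  delta

(* Fixed shape: a single [marshal delta] wraps the whole match, so
   no branch can pick the wrong destination. *)
let delete_fixed b =
  let delta = Bytes.make 1 '\000' in
  marshal delta begin match unmarshal b with
    | Lfn _ -> Lfn true
    | Dos _ -> Dos true
  end;
  delta

let buggy_delta = delete_buggy (Bytes.of_string "d")
let fixed_delta = delete_fixed (Bytes.of_string "d")
```

In the buggy version the delta for a Dos entry stays all zeros, and that zeroed delta is what would get applied to the device.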

This demonstrates both the good and the bad of Mirage: the bug was easy to find and fix, using regular debugging tools. I'm sure fixing a filesystem corruption bug in the Linux kernel would have been vastly more difficult. On the other hand, Linux is rather well tested, whereas I appear to be the first person ever to try deleting a file in Mirage!

The HTTP server

This turned out to be quite simple. Here's the unikernel's start function:

unikernel.ml
module Main (C : V1_LWT.CONSOLE)
            (F : Upload_queue.FS)
            (H : Cohttp_lwt.Server) = struct
  module Q = Upload_queue.Make(F)
  [...]
  let start c fs http =
    Log.write := C.log_s c;
    Log.info "starting queue service" >>= fun () ->

    Q.create fs >>= fun q ->

    let callback _conn_id request body =
      match Uri.path request.H.Request.uri with
      | "/uploader" -> handle_uploader q request body
      | "/downloader" -> handle_downloader q request
      | path ->
          H.respond_error
            ~status:`Bad_request
            ~body:(Printf.sprintf "Bad path '%s'\n" path)
            () in

    let conn_closed _conn_id () =
      Log.info "connection closed" |> ignore in

    http { H.
      callback;
      conn_closed
    }
end

Here, our functor is extended to take a filesystem (using the restricted type required by our Upload_queue, as noted above) and an HTTP server module as arguments.

The HTTP server calls our callback each time it receives a request, and this dispatches /uploader requests to handle_uploader and /downloader ones to handle_downloader. These are also very simple, e.g.

  let get q =
    Q.Download.peek q >>= fun {Upload_queue.size; data} ->
    let body = Cohttp_lwt_body.of_stream data in
    let headers = Cohttp.Header.init_with
      "Content-Length" (Int64.to_string size) in
    (* Adding a content-length loses the transfer-encoding
     * for some reason, so add it back: *)
    let headers = Cohttp.Header.add headers
      "transfer-encoding" "chunked" in
    H.respond ~headers ~status:`OK ~body ()

  let handle_downloader q request =
    match H.Request.meth request with
    | `GET -> get q
    | `DELETE -> delete q
    | `HEAD | `PUT | `POST
    | `OPTIONS | `PATCH -> unsupported_method

The other methods (put and delete) are similar.

Buffered reads

Running as a --unix process, I initially got a download speed of 17.2 KB/s, which was rather disappointing. Especially as Apache on the same machine gets 615 MB/s!

Increasing the size of the chunks I was reading from the Fat filesystem (a disk.img file) from 512 bytes to 1MB, I was able to increase this to 2.83 MB/s, and removing the O_DIRECT flag from mirage-block-unix, download speed increased to 15 MB/s (so this is with Linux caching the data in RAM).

To check the filesystem was the problem, I removed the F.read call (so it would return uninitialised data instead of the actual file contents). It then managed a very respectable 514 MB/s. Nothing wrong with the HTTP code then.

Streaming uploads

It all worked nicely running as a Unix process, so the next step was to deploy on Xen. I was hoping that most of the bugs would already have been found during the Unix testing, but in fact there were more lurking.

It worked for very small files, but when uploading larger files it quickly ran out of memory on my 64-bit x86 test system. I also tried it on my 32-bit CubieTruck ARM board, but that failed even sooner, with Invalid_argument("String.create") (on 32-bit platforms, OCaml strings are limited to 16 MB).

In both cases, the problem was that the cohttp library tried to read the entire upload in one go. I found the read function in Transfer_io:

let read ~len ic =
  (* TODO functorise string to a bigbuffer *)
  match len with
  |0 -> return Done
  |len ->
    read_exactly ic len >>= function
    |None -> return Done
    |Some buf -> return (Final_chunk buf)

I changed it to use read rather than read_exactly (read returns whatever data is available, waiting only if there isn't any at all):

let read ~remaining ic =
  (* TODO functorise string to a bigbuffer *)
  match !remaining with
  |0 -> return Done
  |len ->
    read ic len >>= fun buf ->
    remaining := !remaining - String.length buf;
    if !remaining = 0 then return (Final_chunk buf)
    else return (Chunk buf)

I also had to change the signature to take a mutable reference (remaining) for the remaining data, otherwise it has no way to know when it's done (patch).
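
The effect of the change can be reproduced without Lwt or cohttp. In this sketch the ic record and read_some are hypothetical stand-ins for a socket that hands over at most a few bytes per call; only the body of read follows the patched function above.

```ocaml
type chunk = Chunk of string | Final_chunk of string | Done

(* Hypothetical input channel: a string plus a read position. *)
type ic = { data : string; mutable pos : int }

(* Deliver at most [max_chunk] bytes per call, like a socket that
   returns whatever happens to be available. *)
let max_chunk = 4

let read_some ic len =
  let n = min (min len max_chunk) (String.length ic.data - ic.pos) in
  let s = String.sub ic.data ic.pos n in
  ic.pos <- ic.pos + n;
  s

(* Same shape as the patched function, minus Lwt. *)
let read ~remaining ic =
  match !remaining with
  | 0 -> Done
  | len ->
    let buf = read_some ic len in
    remaining := !remaining - String.length buf;
    if !remaining = 0 then Final_chunk buf else Chunk buf

(* Drain a 10-byte body and record the kind of each chunk. *)
let chunks =
  let ic = { data = "0123456789"; pos = 0 } in
  let remaining = ref 10 in
  let rec loop acc =
    match read ~remaining ic with
    | Done -> List.rev acc
    | Chunk s -> loop (("chunk", s) :: acc)
    | Final_chunk s -> loop (("final", s) :: acc)
  in
  loop []
```

No single string ever holds more than max_chunk bytes, which is the property that avoids both the out-of-memory failure and the 32-bit string limit. The sketch assumes the body really is as long as declared; a real reader also has to cope with the connection closing early.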

Buffered writes

With the uploads now split into chunks, upload speed with --unix was 178 KB/s. Batching up the chunks (which were generally 4 KB each) into a 64 KB buffer increased the speed to 2083 KB/s. With a 1 MB buffer, I got 6386 KB/s.

Here's the code I used:

let page_buffer = Io_page.get 256 |> Io_page.to_cstruct in

(* Set the first byte to N to indicate that we're not done yet.
 * If we reboot while this flag is set, the partial upload will
 * be deleted. *)
let page_buffer_used = ref 1 in
Cstruct.set_char page_buffer 0 'N';
let file_offset = ref 1 in

let flush_page_buffer () =
  Log.info "Flushing %d bytes to disk" !page_buffer_used >>= fun () ->
  let buffered_data = Cstruct.sub page_buffer 0 !page_buffer_used in
  F.write q.fs name !file_offset buffered_data >>|= fun () ->
  file_offset := !file_offset + !page_buffer_used;
  page_buffer_used := 0;
  return () in

let rec add_data src i =
  let src_remaining = String.length src - i in
  if src_remaining = 0 then return ()
  else (
    let page_buffer_free = Cstruct.len page_buffer - !page_buffer_used in
    let chunk_size = min page_buffer_free src_remaining in
    Cstruct.blit_from_string src i page_buffer !page_buffer_used chunk_size;
    page_buffer_used := !page_buffer_used + chunk_size;
    lwt () =
      if page_buffer_free = chunk_size then flush_page_buffer ()
      else return () in
    add_data src (i + chunk_size)
  ) in

data |> Lwt_stream.iter_s (fun data -> add_data data 0) >>=
flush_page_buffer

Asking on the mailing list confirmed that Fat is not well optimised. This isn't actually a problem for my service, since it's still faster than my Internet connection, but there's clearly more work needed here.
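
The batching logic in the code above is independent of Mirage, so it can be tried on its own. This is a simplified sketch, not the service's code: it uses a tiny 8-byte Bytes buffer so the flushes are easy to follow, drops the Lwt threading and the leading flag byte, and records each "disk write" in a list (flushed) instead of calling F.write.

```ocaml
let buffer_size = 8                      (* tiny, to make flushes visible *)
let buffer = Bytes.create buffer_size
let used = ref 0
let flushed : string list ref = ref []   (* stand-in for the disk *)

(* Write out whatever is buffered, then mark the buffer empty. *)
let flush () =
  if !used > 0 then begin
    flushed := Bytes.sub_string buffer 0 !used :: !flushed;
    used := 0
  end

(* Copy [src] from offset [i] into the buffer, flushing when full. *)
let rec add_data src i =
  let src_remaining = String.length src - i in
  if src_remaining > 0 then begin
    let free = buffer_size - !used in
    let n = min free src_remaining in
    Bytes.blit_string src i buffer !used n;
    used := !used + n;
    if n = free then flush ();
    add_data src (i + n)
  end

let () =
  (* Four incoming chunks of uneven size, then a final flush. *)
  List.iter (fun s -> add_data s 0) ["abc"; "defgh"; "ijklmnopqr"; "s"];
  flush ()

(* The writes that reached the "disk", in order. *)
let writes = List.rev !flushed
```

Four incoming chunks become three writes, two of them full-sized; with 4 KB chunks and a 1 MB buffer the reduction in write calls is far larger.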

Upload speed on Xen

Testing on my little CubieTruck board, I then got:

  • Upload speed: 74 KB/s
  • Download speed: 1.6 KB/s

Hmm. To get a feel for what the board is capable of, I ran nc -l -p 8080 < /dev/zero on the board and nc cubietruck 8080 | pv > /dev/null on my laptop, getting 29 MB/s.

Still, my unikernel is running as a guest, meaning it has the overhead of using the virtual network interface (it has to pass the data to dom0, which then sends it over the real interface). So I installed a Linux guest and tried from there. 47.2 MB/s. Interesting. I have no idea why it's faster than dom0!

I loaded up Wireshark to see what was happening with the unikernel transfers. The upload transfer mostly went fast, but stalled in the middle for 15 seconds and then for 12 seconds at the end. Wireshark showed that the unikernel was ack'ing the packets but reducing the TCP window size, indicating that the packets weren't being processed by the application code. The delays corresponded to the times when we were flushing the data to the SD card, which makes sense. So, this looks like another filesystem problem (we should be able to write to the SD card much faster than this).

TCP retransmissions

For the download, Wireshark showed that many of the packets had incorrect TCP checksums and were having to be retransmitted. I was already familiar with this bug from a previous mailing list discussion: wireshark capture of failed download from mirage-www on ARM. That turned out to be a Linux bug - the privileged dom0 code responsible for sending our virtual network packets to the real network becomes confused if two packets occupy the same physical page in memory.

Here's what happens:

  1. We read 1 MB of data from the disk and send it to the HTTP layer as the next chunk.
  2. Chunked.write does the HTTP chunking and sends it to the TCP/IP channel.
  3. Channel.write_string writes the HTTP output into pages (aligned 4K blocks of memory).
  4. Pcb.writefn then determines that each page is too big for a TCP packet and splits each one into smaller chunks, sharing the single underlying page:
  let rec writefn pcb wfn data =
    let len = Cstruct.len data in
    match write_available pcb with
    | 0 ->
      write_wait_for pcb 1 >>
      writefn pcb wfn data
    | av_len when av_len < len -> 
      let first_bit = Cstruct.sub data 0 av_len in
      let remaing_bit = Cstruct.sub data av_len (len - av_len) in
      writefn pcb wfn first_bit  >>
      writefn pcb wfn remaing_bit
    | av_len -> 
      wfn [data]

My original fix changed mirage-net-xen to wait until the first buffer had been read before sending the second one. That fixed the retransmissions, but all the waiting meant I still only got 56 KB/s. Instead, I changed writefn to copy remaining_bit into a new IO page, and with that I got 495 KB/s.
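
The difference between sharing a page and copying the tail can be shown with a small model. This is a sketch only: the view record is a hypothetical stand-in for Cstruct.t, Bytes stands in for an IO page, and split_shared and split_copy are names invented for the example rather than anything in the tcpip library.

```ocaml
(* A view onto part of a page, standing in for a Cstruct.t. *)
type view = { page : Bytes.t; off : int; len : int }

let sub v off len = { page = v.page; off = v.off + off; len }

(* Sharing split: both halves point into the same page. *)
let split_shared v n = (sub v 0 n, sub v n (v.len - n))

(* Copying split: the tail is moved to a freshly allocated page, so
   the two packets never occupy the same page. *)
let split_copy v n =
  let tail_len = v.len - n in
  let fresh = Bytes.create tail_len in
  Bytes.blit v.page (v.off + n) fresh 0 tail_len;
  (sub v 0 n, { page = fresh; off = 0; len = tail_len })

let contents v = Bytes.sub_string v.page v.off v.len

let whole = { page = Bytes.of_string "headertail"; off = 0; len = 10 }
let (s1, s2) = split_shared whole 6
let (c1, c2) = split_copy whole 6
```

Both splits yield the same bytes; the copying one trades an extra blit for the guarantee that the halves live in different pages.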

Replacing the filesystem read with a simple String.create of the same length, I got 3.9 MB/s, showing that the FAT filesystem was once again the limiting factor.

Adding a block cache

I tried adding a block cache layer between mirage-block-xen and fat-filesystem, like this:

module Main (C : V1_LWT.CONSOLE)
            (B : V1_LWT.BLOCK)
            (H : Cohttp_lwt.Server) = struct
  module BC = Block_cache.Make(B)
  module F = Fat.Fs.Make(BC)(Io_page)
  module Q = Upload_queue.Make(F)

  let mem_cache_size = 1024 * 1024  (* 1 MB *)

  let start c b http =
    Log.write := C.log_s c;
    Log.info "start in queue service" >>= fun () ->

    BC.connect (b, mem_cache_size) >>= function
    | `Error _ -> failwith "BC.connect"
    | `Ok bc ->
    F.connect bc >>= function
    | `Error _ -> failwith "F.connect"
    | `Ok fs ->
    Q.create fs >>= fun q ->
...

With this in place, upload speed remains at 76 KB/s, but the download speed increases to 1 MB/s (for a 20 MB file, which therefore doesn't fit in the cache). This suggests that the FAT filesystem is reading the same disk sectors many times. Enlarging the memory cache to cover the whole file, the download speed only increases to 1.3 MB/s, so the FAT code must be doing some inefficient calculations too.
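
For readers wondering what such a layer might look like, here is a rough, synchronous sketch of a read cache written as a functor over a block device. It is not the Block_cache module used above: the BLOCK signature here is a hypothetical simplification (the real V1_LWT.BLOCK interface is Lwt-based and richer), and the eviction policy (drop everything when full) is just the simplest thing that works.

```ocaml
(* Hypothetical, synchronous stand-in for a block device signature. *)
module type BLOCK = sig
  type t
  val sector_size : int
  val read : t -> int -> Bytes.t   (* read one sector by number *)
end

module Cache (B : BLOCK) = struct
  type t = {
    dev : B.t;
    max_sectors : int;
    table : (int, Bytes.t) Hashtbl.t;
  }

  let connect dev ~size_bytes =
    let max_sectors = max 1 (size_bytes / B.sector_size) in
    { dev; max_sectors; table = Hashtbl.create max_sectors }

  (* Return the sector from memory if present, else read and remember it. *)
  let read t sector =
    match Hashtbl.find_opt t.table sector with
    | Some data -> data
    | None ->
      let data = B.read t.dev sector in
      (* Crude eviction: drop the whole cache when it is full. *)
      if Hashtbl.length t.table >= t.max_sectors then Hashtbl.reset t.table;
      Hashtbl.replace t.table sector data;
      data
end

(* Fake device that counts how often it is actually read. *)
module Counting = struct
  type t = { mutable reads : int }
  let sector_size = 512
  let read t sector =
    t.reads <- t.reads + 1;
    Bytes.make sector_size (Char.chr (sector land 0xff))
end

module C = Cache (Counting)

let () =
  let dev = { Counting.reads = 0 } in
  let cache = C.connect dev ~size_bytes:(1024 * 1024) in
  for _i = 1 to 100 do ignore (C.read cache 7) done;
  (* 100 reads of the same sector reach the device only once. *)
  assert (dev.Counting.reads = 1)
```

A cache like this only helps when the same sectors are requested repeatedly, which is why the speed-up above points at redundant reads in the filesystem layer.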

Replacing FAT

Since most of my problems seemed to be coming from using FAT, I decided to try a new approach. I removed all the FAT code and the block cache and changed upload_queue.ml to write directly to the block device. With that (no caching), I get:

Upload speed      2.27 MB/s
Download speed    2.46 MB/s

That's not too bad. It's faster than my Internet connection, which means that the unikernel is no longer the limiting factor.

Here's the new version: upload_queue.ml. The big simplification comes from knowing that the queue will spend most of its time empty (another good reason to use a small VM for it). The code has a next_free_sector which it advances every time an upload starts. When the queue becomes empty and there are no uploads in progress this variable is reset back to sector 1 (sector 0 holds the index). This does mean that we may report disk full errors to uploaders even when there is free space on the disk, but this won't happen in typical usage because the repository downloads things as soon as they're uploaded (if it does happen, it just means uploaders have to wait a couple of minutes until the repository empties the queue).
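
The allocation policy can be sketched roughly as follows. This is an illustration of the idea, not the code in upload_queue.ml: apart from next_free_sector, every name is hypothetical, there is no disk I/O, and it assumes the size of an upload is known when it starts.

```ocaml
(* Sketch of the "advance, then reset when idle" policy. *)
type queue = {
  total_sectors : int;
  mutable next_free_sector : int;    (* sector 0 holds the index *)
  mutable uploads_in_progress : int;
  mutable queued_items : int;
}

let create ~total_sectors =
  { total_sectors; next_free_sector = 1;
    uploads_in_progress = 0; queued_items = 0 }

(* Reserve [sectors] contiguous sectors for a new upload. *)
let start_upload q ~sectors =
  if q.next_free_sector + sectors > q.total_sectors then Error `Disk_full
  else begin
    let first = q.next_free_sector in
    q.next_free_sector <- first + sectors;
    q.uploads_in_progress <- q.uploads_in_progress + 1;
    Ok first
  end

(* Space is only reclaimed once the queue is completely idle. *)
let maybe_reset q =
  if q.queued_items = 0 && q.uploads_in_progress = 0 then
    q.next_free_sector <- 1

let finish_upload q ~success =
  q.uploads_in_progress <- q.uploads_in_progress - 1;
  if success then q.queued_items <- q.queued_items + 1;
  maybe_reset q

(* Called when the repository has downloaded the item at the head. *)
let remove_head q =
  q.queued_items <- q.queued_items - 1;
  maybe_reset q

let () =
  let q = create ~total_sectors:100 in
  assert (start_upload q ~sectors:60 = Ok 1);
  (* Free space exists in principle, but not past next_free_sector. *)
  assert (start_upload q ~sectors:60 = Error `Disk_full);
  finish_upload q ~success:true;
  remove_head q;
  assert (q.next_free_sector = 1)
```

The second start_upload shows the trade-off mentioned above: the allocator can refuse an upload until the queue drains, in exchange for never having to track free extents.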

Managing the block device manually brought a few more advantages over FAT:

  1. No need to generate random file names for the uploads.
  2. No need to delete incomplete uploads (we only write the file's index entry to disk on success).
  3. The system should recover automatically from filesystem corruption because invalid entries can be detected reliably at boot time and discarded.
  4. Disk full errors are reported correctly.
  5. The queue ordering isn't lost on reboot.

Conclusions

Modern operating systems are often extremely complex, but much of this is historical baggage which isn't needed on a modern system where you're running a single application as a VM under a hypervisor. Mirage allows you to create very small VMs which contain almost no C code. These VMs should be easier to write, more reliable and more secure.

Creating a bootable OCaml kernel is surprisingly easy, and from there adding support for extra devices is just a matter of pulling in the appropriate libraries. By programming against generic interfaces, you can create code that runs under Linux/Unix/OS X or as a virtual machine under Xen, and switch between configurations using the mirage tool.
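
As a toy illustration of programming against a generic interface, the sketch below writes the application once as a functor and instantiates it with two different backends. The CONSOLE signature and both backend modules are hypothetical simplifications; Mirage's real signatures are Lwt-based and the mirage tool does the wiring for you.

```ocaml
(* Hypothetical, simplified console signature. *)
module type CONSOLE = sig
  type t
  val log : t -> string -> unit
end

(* The application is written once, against the signature only. *)
module Main (C : CONSOLE) = struct
  let start c = C.log c "start in queue service"
end

(* Backend 1: print to stdout, as a Unix build might. *)
module Stdout_console = struct
  type t = unit
  let log () msg = print_endline msg
end

(* Backend 2: collect messages in memory, which is handy for tests. *)
module Buffer_console = struct
  type t = Buffer.t
  let log b msg = Buffer.add_string b msg; Buffer.add_char b '\n'
end

module Unix_app = Main (Stdout_console)
module Test_app = Main (Buffer_console)

let () =
  Unix_app.start ();
  let b = Buffer.create 64 in
  Test_app.start b;
  assert (Buffer.contents b = "start in queue service\n")
```

Nothing in Main mentions a concrete backend, so swapping one for another is a change to the functor application rather than to the application code.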

Mirage is still very young, and I found many rough edges while writing my queuing service for 0install:

  • While Linux provides fast, reliable filesystems as standard, Mirage currently only provides a basic FAT implementation.
  • Linux provides caching as standard, while you have to implement this yourself on Mirage.
  • Error reporting should be a big improvement over C's error codes, but getting friendly error messages from Mirage is currently difficult.
  • The system has clearly been designed for high performance (the APIs generally write to user-provided buffers to avoid copying, much like C libraries do), but many areas have not yet been optimised.
  • Buffers often have extra requirements (e.g. must be page-aligned, a single page, immutable, etc) which are not currently captured in the type system, and this can lead to run-time errors which would ideally have been detected at compile time.

However, there is a huge amount of work happening on Mirage right now and it looks like all of these problems are being worked on. If you're interested in low-level OS programming and don't want to mess about with C, Mirage is a lot of fun, and it can be useful for practical tasks already with a bit of effort.

There are still many areas I need to find out more about. In particular, I'd like to use the new pure-OCaml TLS stack to secure the system, and to try Irmin (Git-like distributed, branchable storage) to provide the queue instead of writing it myself. I hope to try those soon...