Category: ocaml | Thomas Leonard's blog

OCaml 5 performance part 2

2024-07-22T11:00:00+00:00

The last post looked at using various tools to understand why an OCaml 5 program was waiting a long time for IO. In this post, I'll be trying out some tools to investigate a compute-intensive program that uses multiple CPUs.

Table of Contents

The problem
ThreadSanitizer
perf
mpstat
offcputime
The OCaml garbage collector
statmemprof
magic-trace
Tuning GC parameters
Simplifying further
perf sched
olly
magic-trace on the simple allocator
perf annotate
perf c2c
perf stat
Conclusions

Further discussion about this post can be found on discuss.ocaml.org.

The problem

OCaml 4 allowed running multiple "system threads", but only one can have the OCaml runtime lock, so only one can be running OCaml code at a time. OCaml 5 allows running multiple "domains", all of which can be running OCaml code at the same time (each domain can also have multiple system threads; only one system thread can be running OCaml code per domain).

The ocaml-ci service provides CI for many OCaml programs, and its first step when testing a commit is to run a solver to select compatible versions for its dependencies. Running a solve typically only takes about a second, but it has to do it for each possible test platform, which includes versions of the OCaml compiler from 4.02 to 4.14 and 5.0 to 5.2, multiple architectures (32-bit and 64-bit x86, 32-bit and 64-bit ARM, PPC64 and s390x), operating systems (Alpine, Debian, Fedora, FreeBSD, macOS, OpenSUSE and Ubuntu, in multiple versions), etc. In total, this currently does 132 solver runs per commit being tested (which seems too high to me, but let's ignore that for now).

The solves are done by the solver-service, which runs on a couple of ARM machines with 160 cores each. The old OCaml 4 version used to work by spawning lots of sub-processes, but when OCaml 5 came out, I ported it to use a single process with multiple domains. That removed the need for lots of communication logic, and allowed sharing common data such as the package definitions. The code got a lot shorter and simpler, and I'm told it's been much more reliable too.

But the performance was surprisingly bad. Here's a graph showing how the number of solves per second scales with the number of CPUs (workers) being used:

Processes scaling better than domains

The "Processes" line shows performance when forking multiple processes to do the work, which looks pretty good. The "Domains" line shows what happens if you instead spawn domains inside a single process.

Note: The original service used many libraries (a mix of Eio and Lwt ones), but to make investigation easier I simplified it by removing most of them. The simplified version doesn't use Eio or Lwt; it just spawns some domains/processes and has each of them do the same solve in a loop a fixed number of times.

ThreadSanitizer

When converting a single-domain OCaml 4 program to use multiple cores it's easy to introduce races. OCaml has ThreadSanitizer (TSan) support which can detect these. To use it, install an OCaml compiler with the tsan option:

$ opam switch create 5.2.0-tsan ocaml-variants.5.2.0+options ocaml-option-tsan

Things run a lot slower and require more memory with this compiler, but it's good to check:

$ ./_build/default/stress/stress.exe --internal-workers=2
[...]
WARNING: ThreadSanitizer: data race (pid=133127)
  Write of size 8 at 0x7ff2b7814d38 by thread T4 (mutexes: write M88):
    #0 camlOpam_0install__Model.group_ors_1288 lib/model.ml:70 (stress.exe+0x1d2bba)
    #1 camlOpam_0install__Model.group_ors_1288 lib/model.ml:120 (stress.exe+0x1d2b47)
    ...

  Previous write of size 8 at 0x7ff2b7814d38 by thread T1 (mutexes: write M83):
    #0 camlOpam_0install__Model.group_ors_1288 lib/model.ml:70 (stress.exe+0x1d2bba)
    #1 camlOpam_0install__Model.group_ors_1288 lib/model.ml:120 (stress.exe+0x1d2b47)
    ...

  Mutex M88 (0x558368b95358) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1295 (libtsan.so.2+0x50468)
    #1 caml_plat_mutex_init runtime/platform.c:57 (stress.exe+0x4763b2)
    #2 caml_init_domains runtime/domain.c:943 (stress.exe+0x44ebfe)
    ...

  Mutex M83 (0x558368b95240) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1295 (libtsan.so.2+0x50468)
    #1 caml_plat_mutex_init runtime/platform.c:57 (stress.exe+0x4763b2)
    #2 caml_init_domains runtime/domain.c:943 (stress.exe+0x44ebfe)
    ...

  Thread T4 (tid=133132, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
    #1 caml_domain_spawn runtime/domain.c:1265 (stress.exe+0x4504c4)
    ...

  Thread T1 (tid=133129, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
    #1 caml_domain_spawn runtime/domain.c:1265 (stress.exe+0x4504c4)
    ...

SUMMARY: ThreadSanitizer: data race lib/model.ml:70 in camlOpam_0install__Model.group_ors_1288

The two mutexes mentioned in the output, M83 and M88, are the domain_lock, used to ensure only one sys-thread runs at a time in each domain. In this program we only have one sys-thread per domain and so can ignore them.

The output reveals that the solver used a global variable to generate unique IDs:

let fresh_id =
  let i = ref 0 in
  fun () ->
    incr i;           (* model.ml:70 *)
    !i

With that fixed, TSan finds no further problems (in this simplified version). This gives us good confidence that there isn't any shared state: TSan would report use of shared state not protected by a mutex, and since the program was written for OCaml 4 it won't be using any mutexes.

That's good, because if one thread writes to a location that another reads then that requires coordination between CPUs, which is relatively slow (though we could still experience slow-downs due to false sharing, where two separate mutable items end up in the same cache line). However, while important for correctness, it didn't make any noticeable difference to the benchmark results.
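The post doesn't show how the counter was fixed. One possible approach, sketched here as an assumption rather than the change actually made to the solver, is to keep the counter in an Atomic.t so that concurrent increments from different domains can't be lost:

```ocaml
(* Hypothetical replacement for the racy counter; not necessarily the
   fix that was applied to the solver. *)
let fresh_id =
  let i = Atomic.make 0 in
  (* fetch_and_add returns the old value, so add 1 to match [incr i; !i]. *)
  fun () -> Atomic.fetch_and_add i 1 + 1

let () =
  (* Two domains each take 1000 IDs; every ID should be distinct. *)
  let worker () = List.init 1000 (fun _ -> fresh_id ()) in
  let d1 = Domain.spawn worker in
  let d2 = Domain.spawn worker in
  let ids = Domain.join d1 @ Domain.join d2 in
  assert (List.length (List.sort_uniq compare ids) = 2000)
```

A shared atomic still moves a cache line between CPUs on every call, so giving each solve its own private counter would probably be the better choice where IDs only need to be unique within one solve.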

perf

perf is the obvious tool to use when facing CPU performance problems. perf record -g PROG takes samples of the program's stack regularly, so that functions that run a lot or for a long time will appear often. perf report provides a UI to explore the results:

$ perf report
  Children      Self  Command     Shared Object      Symbol
+   59.81%     0.00%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.do_solve_2283
+   59.44%     0.00%  stress.exe  stress.exe         [.] Opam_0install.Solver.solve_1428
+   59.25%     0.00%  stress.exe  stress.exe         [.] Dune.exe.Domain_worker.solve_951
+   58.88%     0.00%  stress.exe  stress.exe         [.] Dune.exe.Stress.run_worker_332
+   58.18%     0.00%  stress.exe  stress.exe         [.] Stdlib.Domain.body_735
+   57.91%     0.00%  stress.exe  stress.exe         [.] caml_start_program
+   34.39%     0.69%  stress.exe  stress.exe         [.] Stdlib.List.iter_366
+   34.39%     0.03%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.lookup_845
+   34.39%     0.09%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.process_dep_2024
+   33.14%     0.03%  stress.exe  stress.exe         [.] Zeroinstall_solver.Sat.run_solver_1446
+   27.28%     0.00%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.build_problem_2092
+   26.27%     0.02%  stress.exe  stress.exe         [.] caml_call_gc

Looks like we're spending most of our time solving, as expected. But this can be misleading. Because perf only records stack traces when the code is running, it doesn't report any time the process spent sleeping.

$ /usr/bin/time ./_build/default/stress/stress.exe --count=10 --internal-workers=7
73.08user 0.61system 0:12.65elapsed 582%CPU (0avgtext+0avgdata 596608maxresident)k

With 7 workers, we'd expect to see 700%CPU, but we only see 582%.
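The %CPU figure printed by time is CPU time (user plus system) divided by elapsed time. As a quick check of the numbers above (the per-worker average is derived here, not measured), that works out to roughly 83% per worker:

```ocaml
(* CPU seconds consumed per second of wall-clock time, as a percentage. *)
let cpu_percent ~user ~system ~elapsed =
  100. *. (user +. system) /. elapsed

let () =
  let workers = 7. in
  let total = cpu_percent ~user:73.08 ~system:0.61 ~elapsed:12.65 in
  Printf.printf "total: %.1f%%, average per worker: %.1f%%\n"
    total (total /. workers)
```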

mpstat

mpstat can show a per-CPU breakdown. Here are a couple of one second intervals on my machine while the solver was running:

$ mpstat --dec=0 -P ALL 1
16:24:39     CPU    %usr   %sys %iowait    %irq   %soft  %steal   %idle
16:24:40     all      78      1       2       1       0       0      18
16:24:40       0      19      1       0       1       0       1      78
16:24:40       1      88      1       0       1       0       0      10
16:24:40       2      88      1       0       1       0       0      10
16:24:40       3      88      0       0       0       0       1      11
16:24:40       4      89      1       0       0       0       0      10
16:24:40       5      90      0       0       1       0       0       9
16:24:40       6      79      1       0       1       1       1      17
16:24:40       7      86      0      12       1       1       0       0

16:24:40     CPU    %usr   %sys %iowait    %irq   %soft  %steal   %idle
16:24:41     all      80      1       2       1       0       0      17
16:24:41       0      85      0      12       1       0       1       1
16:24:41       1      91      1       0       1       0       0       7
16:24:41       2      90      0       0       1       1       0       8
16:24:41       3      89      1       0       1       0       0       9
16:24:41       4      67      1       0       1       0       0      31
16:24:41       5      52      1       0       0       0       1      46
16:24:41       6      76      1       0       1       0       0      22
16:24:41       7      90      1       0       0       0       0       9

Note: I removed some columns with all zero values to save space.

We might expect to see 7 CPUs running at 100% and one idle CPU, but in fact they're all moderately busy. On the other hand, none of them spent more than 91% of its time running the solver code.

offcputime

offcputime will show why a process wasn't using a CPU (it's like offwaketime, which we saw earlier, but doesn't record the waker). Here I'm using pidstat to see all running threads and then examining one of the workers, to avoid the problem we saw last time where the diagram included multiple threads:

$ pidstat 1 -t
...
^C
Average:      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000     78304         -  550.50    9.41    0.00    0.00  559.90     -  stress.exe
Average:     1000         -     78305   91.09    1.49    0.00    0.00   92.57     -  |__stress.exe
Average:     1000         -     78307    8.42    0.99    0.00    0.00    9.41     -  |__stress.exe
Average:     1000         -     78308   90.59    1.49    0.00    0.00   92.08     -  |__stress.exe
Average:     1000         -     78310   90.59    1.49    0.00    0.00   92.08     -  |__stress.exe
Average:     1000         -     78312   91.09    1.49    0.00    0.00   92.57     -  |__stress.exe
Average:     1000         -     78314   89.11    1.49    0.00    0.00   90.59     -  |__stress.exe
Average:     1000         -     78316   89.60    1.98    0.00    0.00   91.58     -  |__stress.exe

$ sudo offcputime-bpfcc -f -t 78310 > off-cpu

Note: The ARM machine's kernel was too old to run offcputime, so I ran this on my machine instead, with one main domain and six workers. As I needed good stacks for C functions too, I ran stress.exe in an Ubuntu 24.04 docker container, as recent versions of Ubuntu compile with frame pointers by default.

The raw output was very noisy, showing it waiting in many different places. Looking at a few, it was clear it was mostly the GC (which can run from almost anywhere). The output is just a text file with one line per stack-trace, and a bit of sed cleaned it up:

$ sed -E 's/stress.exe;.*;(caml_call_gc|caml_handle_gc_interrupt|caml_poll_gc_work|asm_sysvec_apic_timer_interrupt|asm_sysvec_reschedule_ipi);/stress.exe;\1;/' off-cpu > off-cpu-gc
$ flamegraph.pl --colors=blue off-cpu-gc > off-cpu-gc.svg

That removes the part of the stack-trace before any of various interrupt-type functions that can be called from anywhere. The graph is blue to indicate that it shows time when the process wasn't running.

Time spent off-CPU

There are rather a lot of traces where we missed the user stack. However, the results seem clear enough: when our worker is waiting, it's in the garbage collector, calling caml_plat_spin_wait. This is used to sleep when a spin-lock has been spinning for too long (after 1000 iterations).

The OCaml garbage collector

OCaml has a major heap for long-lived values, plus one fixed-size minor heap for each domain. New allocations are made sequentially on the allocating domain's minor heap (which is very fast, just adjusting a pointer by the size required).

When the minor heap is full the program performs a minor GC, moving any values that are still reachable to the major heap and leaving the minor heap empty.

Garbage collection of the major heap is done in small slices so that the application doesn't pause for long, and domains can do marking and sweeping work without needing to coordinate (except at the very end of a major cycle, when they briefly synchronise to agree a new cycle is starting).

However, as minor GCs move values that other domains may be using, they do require all domains to stop.
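To get a feel for how often minor GCs happen, the standard Gc module exposes a counter. This is a minimal single-domain sketch (the function name and the allocation count are made up for illustration): it allocates many small blocks and reports how many minor collections ran in the meantime.

```ocaml
(* Allocate [n] small blocks and return how many minor GCs ran meanwhile. *)
let minor_gcs_during n =
  let before = (Gc.quick_stat ()).Gc.minor_collections in
  for _i = 1 to n do
    (* opaque_identity stops the compiler removing the allocation. *)
    ignore (Sys.opaque_identity (ref ()))
  done;
  (Gc.quick_stat ()).Gc.minor_collections - before

let () =
  let n = 10_000_000 in
  Printf.printf "%d allocations triggered %d minor GCs\n"
    n (minor_gcs_during n)
```

Each ref () takes two words, so with the default minor heap one would expect a collection roughly every hundred thousand or so allocations. With several domains, every one of those collections requires all domains to stop.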

Although the simplified test program doesn't use Eio, we can still use eio-trace to record GC events (we just don't see any fibers). Here's a screenshot of the solver running with 24 domains on the ARM machine, showing it performing GC work (not all domains are visible in the picture):

GC work shown in eio-trace

The orange/red parts show when the GC is running and the yellow regions show when the domain is waiting for other domains. The thick columns with yellow edges are minor GCs, while the thin (almost invisible) red columns without any yellow between them are major slices. The second minor GC from the left took longer than usual because the third domain from the top took a while to respond. It also didn't do a major slice before that; perhaps it was busy doing something, or maybe Linux scheduled a different process to run then.

Traces recorded by eio-trace can also be viewed in Perfetto, which shows the nesting better. Here's a close-up of a single minor GC, corresponding to the bottom two domains from the second column from the left:

Close-up in Perfetto

The domain triggering the GC (the bottom one here) enters a "stw_leader" (stop-the-world) phase and waits for the other domains to stop.
One by one, the other domains stop and enter "stw_api_barrier" until all domains have stopped.
All domains perform a minor GC, clearing their minor heaps.
They then enter a "minor_leave_barrier" phase, waiting until all domains have finished.
Each domain returns to running application code.

We can now see why the solver spends so much time sleeping; when a domain performs a minor GC, it spends most of the time waiting for other domains.

(The above is a slight simplification; domains may do some work on the major GC while waiting.)
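The cost of this kind of synchronisation can be illustrated with a toy model. The sketch below is not how the OCaml runtime implements its barriers (those are written in C and spin before sleeping); it uses the standard Mutex and Condition modules, and all names are hypothetical. Each domain does a different amount of work per round and then waits at a barrier, so the fast domains sit idle until the slowest one arrives:

```ocaml
(* A toy reusable barrier: [await] returns once [n] domains have called it. *)
type barrier = {
  mutex : Mutex.t;
  cond : Condition.t;
  n : int;
  mutable arrived : int;
  mutable generation : int;
}

let make_barrier n =
  { mutex = Mutex.create (); cond = Condition.create (); n;
    arrived = 0; generation = 0 }

let await b =
  Mutex.lock b.mutex;
  let gen = b.generation in
  b.arrived <- b.arrived + 1;
  if b.arrived = b.n then begin
    (* Last to arrive: release everyone and reset for the next round. *)
    b.arrived <- 0;
    b.generation <- gen + 1;
    Condition.broadcast b.cond
  end else begin
    while b.generation = gen do Condition.wait b.cond b.mutex done
  end;
  Mutex.unlock b.mutex

(* Returns true if no domain ever got past the barrier before every
   domain had finished its work for that round. *)
let run_rounds ~domains ~rounds =
  let b = make_barrier domains in
  let finished = Atomic.make 0 in
  let worker id () =
    let ok = ref true in
    for round = 1 to rounds do
      (* Simulated application work; domains with higher ids are slower. *)
      for _i = 1 to id * 10_000 do
        ignore (Sys.opaque_identity (ref ()))
      done;
      Atomic.incr finished;
      await b;
      if Atomic.get finished < round * domains then ok := false
    done;
    !ok
  in
  List.init domains (fun id -> Domain.spawn (worker (id + 1)))
  |> List.map Domain.join
  |> List.for_all Fun.id

let () = assert (run_rounds ~domains:4 ~rounds:50)
```

In this model the time per round is set by the slowest domain, which is the same effect the traces show for minor GCs.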

statmemprof

One obvious solution to GC slowness is to produce less garbage in the first place. To do that, we need to find out where the most costly allocations are coming from. Tracing every memory allocation tends to make programs unusably slow, so OCaml instead provides a statistical memory profiler.

It was temporarily removed in OCaml 5 because it needed updating for the new multicore GC, but has recently been brought back and will be in OCaml 5.3. There's a backport to 5.2, but I couldn't get it to work, so I just removed the domains stuff from the test and did a single-domain run on OCaml 4.14. You need the memtrace library to collect samples and memtrace_viewer to view them:

$ opam install memtrace memtrace_viewer

Put this at the start of the program to enable it:

let () = Memtrace.trace_if_requested ~context:"solver-test" ()

Then running with MEMTRACE set records a trace:

$ MEMTRACE=solver.ctf ./stress.exe --count=10
Solved warm-up request in: 1.99s
Running another 10 * 1 solves...

$ memtrace-viewer solver.ctf
Processing solver.ctf...
Serving http://localhost:8080/

The memtrace viewer UI

The flame graph in the middle shows functions scaled by the amount of memory they allocated. Initially it showed two groups, one for the warm-up request and one for the 10 runs. To simplify the display, I used the filter panel (on the left) to show only allocations after the 2 second warm-up. We can immediately see that OpamVersionCompare.compare is the source of most memory use.

Focusing on that function shows that it performed 54.1% of all allocations. The display now shows allocations performed within it above it (in green), and all the places it's called from in blue below:

The compare function is expensive!

The bulk of the allocations are coming from this loop:

(* [skip_while_from i f w m] yields the index of the leftmost character
 * in the string [s], starting from [i], and ending at [m], that does
 * not satisfy the predicate [f], or [length w] if no such index exists.  *)
let skip_while_from i f w m =
  let rec loop i =
    if i = m then i
    else if f w.[i] then loop (i + 1) else i
  in loop i

let skip_zeros x xi xl = skip_while_from xi (fun c -> c = '0') x xl

It's used when processing a version like 1.2.3 to skip any leading "0" characters (so that would compare equal to 1.02.3). The loop function refers to other variables (such as f) from its context, and so OCaml allocates a closure on the heap to hold these variables. Even though these allocations are small, we have to do it for every component of every version. And we compare versions a lot: for every version of a package that says it requires e.g. libfoo { >= "1.2" }, we have to check the formula against every version of libfoo.

The solution is rather simple (and shorter than the original!):

let rec skip_while_from i f w m =
  if i = m then i
  else if f w.[i] then skip_while_from (i + 1) f w m else i
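The difference between the two versions can be checked without a profiler by comparing Gc.minor_words before and after a batch of calls. This is a small sketch; the _old/_new suffixes, is_zero and words_allocated are names introduced here for illustration, and the exact figures will depend on the compiler version and flags (flambda may optimise differently):

```ocaml
(* Original shape: [loop] captures [f], [w] and [m] from its context. *)
let skip_while_from_old i f w m =
  let rec loop i =
    if i = m then i
    else if f w.[i] then loop (i + 1) else i
  in loop i

(* Rewritten shape: everything is passed as an argument. *)
let rec skip_while_from_new i f w m =
  if i = m then i
  else if f w.[i] then skip_while_from_new (i + 1) f w m else i

let is_zero c = c = '0'

(* Minor-heap words allocated by [calls] calls of [skip]. *)
let words_allocated skip calls =
  let s = "0012" in
  let before = Gc.minor_words () in
  for _i = 1 to calls do
    ignore (Sys.opaque_identity (skip 0 is_zero s (String.length s)))
  done;
  Gc.minor_words () -. before

let () =
  let calls = 100_000 in
  Printf.printf "old: %.0f words, new: %.0f words\n"
    (words_allocated skip_while_from_old calls)
    (words_allocated skip_while_from_new calls)
```

With a non-flambda compiler one would expect the old version to report several words per call and the new one to report close to zero.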

Removing the other allocations from compare too reduces total memory allocations from 21.8G to 9.6G! The processes benchmark got about 14% faster, while the domains one was 23% faster:

Effect of reducing allocations. Old values are shown in grey.

A nice optimisation, but using domains is still nowhere close to even the original version with separate processes.

magic-trace

The traces above show the solver taking a long time for all domains to enter the stw_api_barrier phase. What was the slow domain doing to cause that? magic-trace lets us tell it when to save the ring buffer and we can use this to get detailed information. Tracing multiple threads with magic-trace doesn't seem to work well (each thread gets a very small buffer, they don't stop at quite the same time, and triggers don't work) so I find it's better to trace just one thread.

I modified the OCaml runtime so that the leader (the domain requesting the GC) records the time. As each domain enters stw_api_barrier it checks how late it is and calls a function to print a warning if it's above a threshold. Then I attached magic-trace to one of the worker threads and told it to save a sample when that function got called:

A domain being slow to join a minor GC

In the example above, magic-trace saved about 7ms of the history of a domain up to the point where it entered stw_api_barrier. The first few ms show the solver working normally. Then it needs to do a minor GC and tries to become the leader. But another domain has the lock and so it spins, calling handle_incoming 293,711 times in a loop for 2.5ms.

I had a look at the code in the OCaml runtime. When a domain wants to perform a minor GC, the steps are:

Acquire all_domains_lock.
Populate the stw_request global.
Interrupt all domains.
Release all_domains_lock.
Wait for all domains to get the interrupt.
Mark self as ready, allowing GC work to start.
Do minor GC.
The last domain to finish its minor GC signals all_domains_cond and everyone resumes.

I added some extra event reporting to the GC, showing when a domain is trying to perform a GC (try), when the leader is signalling other domains (signal), and when a domain is sleeping waiting for something (sleep). Here's what that looks like (in some places):

One sleeping domain delays all the others

The top domain finished its minor collection quickly (as it's mostly idle and had nothing to do), and started waiting for the other domains to finish. For some reason, this sleep call took 3ms to run.
The other domains resume work. One by one, they fill their minor heaps and try to start a GC.
They can't start a new GC, as the old one hasn't completely finished yet, so they spin.
Eventually the top domain wakes up and finishes the previous STW section.
One of the other domains immediately starts a new minor GC and the pattern repeats.

These try events seem useful; the program is spending much more time stuck in GC than the original traces indicated!

One obvious improvement here would be for idle domains to opt out of GC. Another would be to tell the kernel when to wake instead of using sleeps — and I see there's a PR already: OS-based Synchronisation for Stop-the-World Sections.

Another possibility would be to let domains perform minor GCs independently. The OCaml developers did make a version that worked that way, but it requires changes to all C code that uses the OCaml APIs, since a value in another domain's minor heap might move while it's running.

Finally, I wonder if the code could be simplified a bit using a compare-and-set instead of taking a lock to become leader. That would eliminate the try state, where a domain knows another domain is the leader, but doesn't know what it wants to do. It's also strange that there's a state where the top domain has finished its critical section and allowed the other domains to resume, but is not quite finished enough to let a new GC start.
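To make the compare-and-set idea concrete, here is a minimal OCaml sketch. The runtime itself is written in C and nothing below is taken from it; leader, try_become_leader and release_leader are hypothetical names. A single atomic cell records the current leader, and a domain either wins the election immediately or learns that someone else is already leading:

```ocaml
(* -1 means "no leader"; otherwise the id of the leading domain. *)
let no_leader = -1
let leader = Atomic.make no_leader

(* Returns true if [id] became the leader. Never blocks or spins. *)
let try_become_leader id = Atomic.compare_and_set leader no_leader id

(* Only the current leader may step down. *)
let release_leader id =
  if not (Atomic.compare_and_set leader id no_leader) then
    invalid_arg "release_leader: not the leader"

let () =
  let contenders = 4 in
  let attempt id () = try_become_leader id in
  let results =
    List.init contenders (fun id -> Domain.spawn (attempt id))
    |> List.map Domain.join
  in
  (* Exactly one contender should have won. *)
  let winners = List.length (List.filter Fun.id results) in
  assert (winners = 1);
  Printf.printf "domain %d leads\n" (Atomic.get leader)
```

A domain that loses the election knows at once that a stop-the-world request is on its way, so in principle it could go straight to handling that request instead of spinning on a lock.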

We can work around this problem by having the main domain do work too. That could be a problem for interactive applications (where the main domain is running the UI and needs to respond fast), but it should be OK for the solver service. This was about 15% faster on my machine, but appeared to have no effect on the ARM server. Lesson: get traces on the target machine!

Tuning GC parameters

Another way to reduce the synchronisation overhead of minor GCs is to make them less frequent. We can do that by increasing the size of the minor heap, doing a few long GCs rather than many short ones. The size is controlled by the setting e.g. OCAMLRUNPARAM=s=8192k. On my machine, this actually makes things slower, but it's about 18% faster on the ARM server with 80 domains.
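The same setting can also be requested from inside the program with the standard Gc module, which is handy for experiments. A minimal sketch follows (set_minor_heap_words is a name made up here); note that the k suffix in OCAMLRUNPARAM means multiples of 1024 words, not bytes. Whether a change made this way also applies to domains other than the calling one may depend on the OCaml version, so OCAMLRUNPARAM is probably the safer choice for benchmarking:

```ocaml
(* Ask the runtime for a minor heap of [words] words. *)
let set_minor_heap_words words =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = words }

let () =
  (* Equivalent in size to OCAMLRUNPARAM=s=8192k. *)
  set_minor_heap_words (8192 * 1024);
  Printf.printf "minor heap is now %d words\n"
    (Gc.get ()).Gc.minor_heap_size
```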

Here are the first few domains (from a total of 24) on the ARM server with different minor heap sizes (both are showing 1s of execution):

The default minor heap size (256k words)

With a larger minor heap (8192k words)

Note that the major slices also get fewer and larger, as they happen half way between minor slices.

Also, there's still a lot of variation between the time each domain spends doing GC (despite the fact that they're all running exactly the same task), so they still end up waiting a lot.

Simplifying further

This is all still pretty odd, though. We're getting small performance increases, but still nothing like when forking. Can the test-case be simplified further? Yes, it turns out! This simple function takes much longer to run when using domains, compared to forking!

let run_worker n =
  for _i = 1 to n * 10000000 do
    ignore (Sys.opaque_identity (ref ()))
  done

ref () allocates a small block (2 words, including the header) on the minor heap. opaque_identity is to make sure the compiler doesn't optimise this pointless allocation away.
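For reference, a harness for the domains case could look something like the sketch below. It is an assumption about the shape of the test rather than the actual benchmark code: run_domains is a made-up name, the loop count is reduced to keep the example quick, and the processes variant (which would need Unix.fork) is omitted.

```ocaml
let run_worker n =
  (* Smaller multiplier than in the post, so the example finishes fast. *)
  for _i = 1 to n * 1_000_000 do
    ignore (Sys.opaque_identity (ref ()))
  done

(* Run [workers] copies of the loop in parallel domains and wait for them. *)
let run_domains ~workers n =
  List.init workers (fun _ -> Domain.spawn (fun () -> run_worker n))
  |> List.iter Domain.join

let () =
  (* A [ref ()] block has one field; the header makes it 2 words. *)
  assert (Obj.size (Obj.repr (ref ())) = 1);
  run_domains ~workers:4 1
```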

Time to run the loop on the 160-core ARM server (lower is better)

Here's what I would expect here:

The domains all start to fill their minor heaps. One fills it and triggers a minor GC.
The triggering domain sets an indicator in each domain saying a GC is due. None of the domains is sleeping, so the OS isn't involved in any wake-ups here.
The other domains check the indicator on their next allocation, which happens immediately since that's all they're doing.
The GCs all proceed quickly, since there's nothing to scan and nothing to promote (except possibly the current single allocation).
They all resume quickly and continue.

So ideally the lines would be flat. In practice, we may hit physical limits due to memory bandwidth, CPU temperature or kernel limitations; I assume this is why the "Processes" time starts to rise eventually. But it looks like this minor slow-down causes knock-on effects in the "Domains" case.

If I remove the allocation, then the domains and processes versions take the same amount of time.

perf sched

perf sched record records kernel scheduling events, allowing it to show what is running on each CPU at all times. perf sched timehist displays a report:

$ sudo perf sched record -k CLOCK_MONOTONIC
^C

$ sudo perf sched timehist
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  185296.715345 [0000]  sway[175042]                        1.694      0.025      0.775 
  185296.716024 [0002]  crosvm_vcpu2[178276/178217]         0.012      0.000      2.957 
  185296.717031 [0003]  main.exe[196519]                    0.006      0.000      4.004 
  185296.717044 [0003]  rcu_preempt[18]                     4.004      0.015      0.012 
  185296.717260 [0001]  main.exe[196526]                    1.760      0.000      2.633 
  185296.717455 [0001]  crosvm_vcpu1[193502/193445]        63.809      0.015      0.194 
  ...

The first line here shows that sway needed to wait for 1.694 ms for some reason (possibly a sleep), and then once it was due to resume, had to wait a further 0.025 ms for CPU 0 to be free. It then ran for 0.775 ms. I decided to use perf sched to find out what the system was doing when a domain failed to respond quickly.

To make the output easier to read, I hacked eio-trace to display it on the traces. perf script -g python will generate a skeleton Python script that can format all the events found in the perf.data file, and I used that to convert the output to CSV. To correlate OCaml domains with Linux threads, I also modified OCaml to report the thread ID (TID) for each new domain (it was previously reporting the PID instead for some reason).

Here's a trace of the simple allocator from the previous section:

eio-trace with perf sched data

Note: the colour of stw_api_barrier has changed: previously eio-trace coloured it yellow to indicate sleeping, but now we have the individual sleep events we can see exactly which part of it was sleeping.

The horizontal green bars show when each domain was running on the CPU. Here, we see that most of the domains ran until they called sleep. When the sleep timeout expires, the thread is ready to run again and goes on the run-queue. Time spent waiting on the queue is shown with a black bar.

When switching to or from another process, the process name is shown. Here we can see that crosvm_vcpu6 interrupted one of our domains, making it late to respond to the GC request.

Here we see another odd feature of the protocol: even though the late domain was the last to be ready, it wasn't able to start its GC even then, because only the leader is allowed to say when everyone is ready. Several domains wake after the late one is ready and have to go back to sleep again.

The diagram also shows when Linux migrated our OCaml domains between CPUs. For example:

The bottom domain was initially running on CPU 0.
After sleeping briefly, it spent a while waiting to resume and Linux moved it to CPU 6 (the leader domain, which was idle then).
Once there, the bottom domain slept briefly again, and again was slow to wake, getting moved to CPU 7.

Here's another example:

Two domains on the same CPU

The bottom domain's sleep finished a while ago, and it's been stuck on the queue because it's on the same CPU as another domain.
All the other domains are spinning, trying to become the leader for the next minor GC.
Eventually, Linux preempts the 5th domain from the top to run the bottom domain (the vertical green line indicates a switch between domains in the same process).
The bottom domain finishes the previous minor GC, allowing the 3rd from top to start a new one.
The new GC is delayed because the 5th domain is now waiting while the bottom domain spins.
Eventually the bottom domain sleeps, allowing 5 to join and the GC starts.

I tried using the processor package to pin each domain to a different CPU. That cleaned up the traces a fair bit, but didn't make much difference to the runtime on my machine.

I also tried using chrt to run the program as a high-priority "real-time" task, which also didn't seem to help. I wrote a bpftrace script to report if one of our domains was ready to resume and the scheduler instead ran something else. That showed various things. Often Linux was migrating something else out of the way and we had to wait for that, but there were also some kernel tasks that seemed to be even higher priority, such as GPU drivers or uring workers. I suspect to make this work you'd need to set the affinity of all the other processes to keep them away from the cores being used (but that wouldn't work in this example because I'm using all of them!). Come to think of it, running a CPU intensive task on every CPU at realtime priority was a dumb idea; had it worked I wouldn't have been able to do anything else with the computer!

olly

Exploring the scheduler behaviour was interesting, and might be needed for latency-sensitive tasks, but how often do migrations and delays really cause trouble? The slow GCs are interesting, but there are also sections like this where everything is going smoothly, and minor GCs take less than 4 microseconds:

GCs going well

olly can be used to get summary statistics:

$ olly gc-stats './_build/default/stress/stress.exe --count=6 --internal-workers=24'
...
Solved 144 requests in 25.44s (0.18s/iter) (5.66 solves/s)

Execution times:
Wall time (s):	28.17
CPU time (s):	1.66
GC time (s):	169.88
GC overhead (% of CPU time):	10223.84%

GC time per domain (s):
Domain0: 	0.47
Domain1: 	9.34
Domain2: 	6.90
Domain3: 	6.97
Domain4: 	6.68
Domain5: 	6.85
Domain6: 	6.59
...

10223.84% GC overhead sounds like a lot but I think this is misleading, for a few reasons:

The CPU time looks wrong. time reports about 6 minutes which sounds more likely.
GC time (as we've seen) includes time spent sleeping, while CPU time doesn't.
It doesn't include time spent trying to become a GC leader.

To double-check, I modified eio-trace to report GC statistics for a saved trace:

Solved 144 requests in 26.84s (0.19s/iter) (5.36 solves/s)
...

$ eio-trace gc-stats trace.fxt
./trace.fxt:

Ring  GC/s     App/s    Total/s   %GC
  0   10.255   19.376   29.631    34.61
  1    7.986   10.201   18.186    43.91
  2    8.195   10.648   18.843    43.49
  3    9.521   14.398   23.919    39.81
  4    9.775   16.537   26.311    37.15
  5    8.084   10.635   18.719    43.19
  6    7.977   10.356   18.333    43.51
...
 24    7.920   10.802   18.722    42.30

All  213.332  308.578  521.910    40.88

Note: all times are wall-clock and so include time spent blocking.

It ran slightly slower under eio-trace, perhaps because recording a trace file is more work than maintaining some counters, but it's similar. So this indicates that with 24 domains GC is taking about 40% of the total time (including time spent sleeping).

But something doesn't add up, on my machine at least:

With processes, the simple allocator test's main process spends 2% of its time in GC and takes 2.4s to run.
With domains, the main domain spends 20% of its time in GC and takes 8.2s.

Even if that 20% were removed completely, it should only save 20% of the 8.2s. So with domains, the code must be running more slowly even when it's not in the GC.
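The arithmetic behind that claim, as a small sketch using only the figures quoted above (the function name is just for illustration):

```ocaml
(* Time spent outside the GC, given the total run time and the GC fraction. *)
let non_gc_time ~total ~gc_fraction = total *. (1. -. gc_fraction)

(* Processes: 2.4s total with 2% in GC. Domains: 8.2s total with 20% in GC. *)
let processes = non_gc_time ~total:2.4 ~gc_fraction:0.02
let domains = non_gc_time ~total:8.2 ~gc_fraction:0.20

let () =
  Printf.printf "non-GC time: processes %.2fs, domains %.2fs (%.1fx slower)\n"
    processes domains (domains /. processes)
```

So even with GC time excluded, the main domain takes about 6.6s against about 2.4s for the main process.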

magic-trace on the simple allocator

I tried running magic-trace to see what it was doing outside of the GC. Since it wasn't calling any functions, it didn't show anything, but we can fix that:

let foo () =
  for _i = 1 to 100 do
    ignore (Sys.opaque_identity (ref ()))
  done
[@@inline never] [@@local never] [@@specialise never]

let run_worker n =
  for _i = 1 to n * 100000 do
    foo ()
  done

Here we do blocks of 100 allocations in a function called foo. The annotations are to ensure the compiler doesn't inline it. The trace was surprisingly variable!

magic-trace of foo between GCs

I see times for foo ranging from 50ns to around 750ns!

Note: the extra foo call above was probably due to a missed end event somewhere.

perf annotate

I ran perf record on the simplified version:

let foo () =
  for _i = 1 to 100 do
    ignore (Sys.opaque_identity (ref ()))
  done

Here the code is simple enough that we don't need stack-traces (so no -g):

$ sudo perf record ./_build/default/main.exe
$ sudo perf annotate

       │    camlDune__exe__Main.foo_273():
       │      mov  $0x3,%eax
  0.04 │      cmp  $0xc9,%rax
       │    ↓ jg   39
  7.34 │ d:   sub  $0x10,%r15
 13.37 │      cmp  (%r14),%r15
  0.09 │    ↓ jb   3f
  0.21 │16:   lea  0x8(%r15),%rbx
 70.26 │      movq $0x400,-0x8(%rbx)
  6.66 │      movq $0x1,(%rbx)
  0.73 │      mov  %rax,%rbx
  0.00 │      add  $0x2,%rax
  0.01 │      cmp  $0xc9,%rbx
  0.66 │    ↑ jne  d
  0.28 │39:   mov  $0x1,%eax
  0.34 │    ← ret
  0.00 │3f: → call caml_call_gc
       │    ↑ jmp  16

The code starts by (pointlessly) checking if 1 > 100 in case it can skip the whole loop. After being disappointed, it:

Decreases %r15 (young_ptr) by 0x10 (two words).
Checks if that's now below young_limit, calling caml_call_gc if so to clear the minor heap.
Writes 0x400 to the first newly-allocated word (the block header, indicating 1 word of data).
Writes 1 to the second word, which represents ().
Increments the loop counter and loops, unless we're at the end.
Returns ().

Looks like we spent most of the time (77%) writing the block, which makes sense. Reading young_limit took 13% of the time, which seems reasonable too. If there was contention between domains, we'd expect to see it here.

The output looked similar whether using domains or processes.
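For anyone who wants to check the constants in that listing, here is a toy model of the allocation sequence. It is only a sketch: the field names mirror the runtime's young_ptr and young_limit, but the types and functions are made up for illustration and addresses are plain integers.

```ocaml
let word_size = 8   (* bytes per word on a 64-bit target *)

(* OCaml represents the integer n as 2n+1. *)
let tagged n = (n lsl 1) lor 1

(* A block header stores the size in words above bit 10, with the
   colour bits and the 8-bit tag below it. *)
let header ~wosize ~tag = (wosize lsl 10) lor tag

type minor_heap = {
  mutable young_ptr : int;   (* address of the most recent allocation *)
  young_limit : int;         (* going below this means calling into the GC *)
}

exception Need_minor_gc

(* Allocate a one-field block: move young_ptr down by two words
   (header + field), then compare it against young_limit. *)
let alloc_ref heap =
  heap.young_ptr <- heap.young_ptr - 2 * word_size;
  if heap.young_ptr < heap.young_limit then raise Need_minor_gc;
  heap.young_ptr + word_size   (* the value points just past the header *)

let () =
  Printf.printf "1 -> 0x%x, 100 -> 0x%x, header -> 0x%x, () -> %d\n"
    (tagged 1) (tagged 100) (header ~wosize:1 ~tag:0) (tagged 0)
```

The tagged forms of 1 and 100 are the 0x3 and 0xc9 used for the loop counter, the header for a one-word block is the 0x400 store, and () is written as 1.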

perf c2c

To double-check, I also tried perf c2c. This reports on cache-to-cache transfers, where two CPUs are accessing the same memory, which requires the processors to communicate and is therefore relatively slow.

$ sudo perf c2c record
^C

$ sudo perf c2c report
  Load Operations                   :      11898
  Load L1D hit                      :       4140
  Load L2D hit                      :         93
  Load LLC hit                      :       3750
  Load Local HITM                   :        251
  Store Operations                  :     116386
  Store L1D Hit                     :     104763
  Store L1D Miss                    :      11622
...
# ----- HITM -----  ------- Store Refs ------  ------- CL --------                      ---------- cycles ----------    Total       cpu                                    Shared                       
# RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A    Off  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol    Object      Source:Line  Node
...
      7        0        7        4        0        0      0x7f90b4002b80
  ----------------------------------------------------------------------
    0.00%  100.00%    0.00%    0.00%    0.00%    0x0     0       1            0x44a704         0       144       107        8         1  [.] Dune.exe.Main.foo_273       main.exe  main.ml:7        0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x0     0       1            0x4ba7b9         0         0         0        1         1  [.] caml_interrupt_all_signal_  main.exe  domain.c:318     0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x0     0       1            0x4ba7e2         0         0       323       49         1  [.] caml_reset_young_limit      main.exe  domain.c:1658    0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x8     0       1            0x4ce94d         0         0         0        1         1  [.] caml_empty_minor_heap_prom  main.exe  minor_gc.c:622   0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x8     0       1            0x4ceed2         0         0         0        1         1  [.] caml_alloc_small_dispatch   main.exe  minor_gc.c:874   0

This shows a list of cache lines (memory addresses) and how often we loaded from a modified address. There's a lot of information here and I don't understand most of it. But I think the above is saying that address 0x7f90b4002b80 (young_limit, at offset 0) was accessed by these places across domains:

main.ml:7 (ref ()) checks against young_limit to see if we need to call into the GC.
domain.c:318 sets the limit to UINTNAT_MAX to signal that another domain wants a GC.
domain.c:1658 sets it back to young_trigger after being signalled.

The same cacheline was also accessed at offset 8, which contains young_ptr (address of last allocation):

minor_gc.c:622 sets young_ptr to young_end after a GC.
minor_gc.c:874 adjusts young_ptr to re-do the allocation that triggered the GC.

This indicates false sharing: young_ptr only gets accessed from one domain but it's in the same cache line as young_limit.

The main thing is that the counts are all very low, indicating that this doesn't happen often.
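To make the false-sharing point concrete, this sketch splits an address into its cache line and the offset within it. It assumes 64-byte cache lines (typical for x86-64, but not something the report states) and takes young_ptr to be 8 bytes after young_limit, as the offsets above suggest:

```ocaml
(* Assumption: 64-byte cache lines. *)
let cache_line_size = 64

(* Base address of the cache line containing [addr], and the offset within it. *)
let cache_line addr = addr land (lnot (cache_line_size - 1))
let offset_in_line addr = addr land (cache_line_size - 1)

let young_limit_addr = 0x7f90b4002b80          (* cache line address from the report *)
let young_ptr_addr = young_limit_addr + 8      (* offset 0x8 in the same report *)

let () =
  Printf.printf "same cache line: %b (offsets %d and %d)\n"
    (cache_line young_limit_addr = cache_line young_ptr_addr)
    (offset_in_line young_limit_addr)
    (offset_in_line young_ptr_addr)
```

Any write to one of the two fields by another domain invalidates the whole line, including the field that only its owner ever touches.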

I tried adding an incr x on a global variable in the loop, and got some more operations reported. But using Atomic.incr massively increased the number of records:

	Original	incr	Atomic.incr
Load Operations	11,898	25,860	2,658,364
Load L1D hit	4,140	15,181	326,236
Load L2D hit	93	163	295
Load LLC hit	3,750	3,173	2,321,704
Load Local HITM	251	299	2,317,885
Store Operations	116,386	462,162	3,909,500
Store L1D Hit	104,763	389,492	3,908,947
Store L1D Miss	11,622	72,667	550

See C2C - False Sharing Detection in Linux Perf for more information about all this.
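The two variants of that experiment might look something like the following. This is a guess at the shape of the test rather than the code that was actually run, and the names and iteration count are made up. It needs OCaml 5 for Domain:

```ocaml
let iterations = 1_000_000

let plain = ref 0              (* ordinary mutable cell: concurrent increments can be lost *)
let atomic = Atomic.make 0     (* atomic cell: every increment is counted *)

let bump_plain () = for _ = 1 to iterations do incr plain done
let bump_atomic () = for _ = 1 to iterations do Atomic.incr atomic done

(* Run [f] in two domains at once and wait for both to finish. *)
let in_two_domains f =
  let d1 = Domain.spawn f and d2 = Domain.spawn f in
  Domain.join d1;
  Domain.join d2

let () =
  in_two_domains bump_plain;
  in_two_domains bump_atomic;
  Printf.printf "plain:  %d (racy, may be less than %d)\n" !plain (2 * iterations);
  Printf.printf "atomic: %d\n" (Atomic.get atomic)
```

Each Atomic.incr is a read-modify-write that needs exclusive access to the cache line, so the line has to move between cores whenever another domain touched it last, which would fit the jump in Local HITM in the table.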

perf stat

perf stat shows statistics about a process. I ran it with -I 1000 to collect one-second samples. Here are two samples from the test case on my machine, one when it was running processes and one while it was using domains:

$ perf stat -I 1000

# Processes
      8,032.71 msec cpu-clock         #    8.033 CPUs utilized
         2,475      context-switches  #  308.115 /sec
            51      cpu-migrations    #    6.349 /sec
            44      page-faults       #    5.478 /sec
35,268,665,452      cycles            #    4.391 GHz
48,673,075,188      instructions      #    1.38  insn per cycle
 9,815,905,270      branches          #    1.222 G/sec
    48,986,037      branch-misses     #    0.50% of all branches

# Domains
      8,008.11 msec cpu-clock         #    8.008 CPUs utilized
        10,970      context-switches  #    1.370 K/sec
           133      cpu-migrations    #   16.608 /sec
           232      page-faults       #   28.971 /sec
34,606,498,021      cycles            #    4.321 GHz
25,120,741,129      instructions      #    0.73  insn per cycle
 5,028,578,807      branches          #  627.936 M/sec
    24,402,161      branch-misses     #    0.49% of all branches

We're doing a lot more context switches with domains, as expected due to the sleeps, and we're executing many fewer instructions, which isn't surprising. Reporting the counts for individual CPUs gets more interesting though:

$ sudo perf stat -I 1000 -e instructions -Aa
# Processes
     1.000409485 CPU0        5,106,261,160      instructions
     1.000409485 CPU1        2,746,012,554      instructions
     1.000409485 CPU2       14,235,084,764      instructions
     1.000409485 CPU3        7,545,940,906      instructions
     1.000409485 CPU4        2,605,655,333      instructions
     1.000409485 CPU5        6,023,131,238      instructions
     1.000409485 CPU6        2,860,656,865      instructions
     1.000409485 CPU7        8,195,416,048      instructions
     2.001406580 CPU0        5,674,686,033      instructions
     2.001406580 CPU1        2,774,756,912      instructions
     2.001406580 CPU2       12,231,014,682      instructions
     2.001406580 CPU3        8,292,824,909      instructions
     2.001406580 CPU4        2,592,461,540      instructions
     2.001406580 CPU5        7,182,922,668      instructions
     2.001406580 CPU6        2,742,731,223      instructions
     2.001406580 CPU7        7,219,186,119      instructions
     3.002394302 CPU0        4,676,179,731      instructions
     3.002394302 CPU1        2,773,345,921      instructions
     3.002394302 CPU2       13,236,080,365      instructions
     3.002394302 CPU3        5,142,640,767      instructions
     3.002394302 CPU4        2,580,401,766      instructions
     3.002394302 CPU5       13,600,129,246      instructions
     3.002394302 CPU6        2,667,830,277      instructions
     3.002394302 CPU7        4,908,168,984      instructions

$ sudo perf stat -I 1000 -e instructions -Aa
# Domains
     1.002680009 CPU0        3,134,933,139      instructions
     1.002680009 CPU1        3,140,191,650      instructions
     1.002680009 CPU2        3,155,579,241      instructions
     1.002680009 CPU3        3,059,035,269      instructions
     1.002680009 CPU4        3,102,718,089      instructions
     1.002680009 CPU5        3,027,660,263      instructions
     1.002680009 CPU6        3,167,151,483      instructions
     1.002680009 CPU7        3,214,267,081      instructions
     2.003692744 CPU0        3,009,806,420      instructions
     2.003692744 CPU1        3,015,194,636      instructions
     2.003692744 CPU2        3,093,562,866      instructions
     2.003692744 CPU3        3,005,546,617      instructions
     2.003692744 CPU4        3,067,126,726      instructions
     2.003692744 CPU5        3,042,259,123      instructions
     2.003692744 CPU6        3,073,514,980      instructions
     2.003692744 CPU7        3,158,786,841      instructions
     3.004694851 CPU0        3,069,604,047      instructions
     3.004694851 CPU1        3,063,976,761      instructions
     3.004694851 CPU2        3,116,761,158      instructions
     3.004694851 CPU3        3,045,677,304      instructions
     3.004694851 CPU4        3,101,053,228      instructions
     3.004694851 CPU5        2,973,005,489      instructions
     3.004694851 CPU6        3,109,177,113      instructions
     3.004694851 CPU7        3,158,349,130      instructions

In the domains case all CPUs are doing roughly the same amount of work. But when running separate processes the CPUs differ wildly! Over the last 1-second interval, for example, CPU5 executed 5.3 times as many instructions as CPU4. And indeed, some of the test processes are finishing much sooner than the others, even though they all do the same work.
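A quick way to put a number on "differ wildly" is the ratio between the busiest and least busy CPU. This sketch uses the instruction counts from the last interval of each run above:

```ocaml
(* Instructions per CPU (CPU0 to CPU7) in the last one-second interval. *)
let processes =
  [ 4_676_179_731; 2_773_345_921; 13_236_080_365; 5_142_640_767;
    2_580_401_766; 13_600_129_246; 2_667_830_277; 4_908_168_984 ]

let domains =
  [ 3_069_604_047; 3_063_976_761; 3_116_761_158; 3_045_677_304;
    3_101_053_228; 2_973_005_489; 3_109_177_113; 3_158_349_130 ]

(* Ratio between the highest and lowest count in the list. *)
let spread counts =
  let hi = List.fold_left max min_int counts in
  let lo = List.fold_left min max_int counts in
  float_of_int hi /. float_of_int lo

let () =
  Printf.printf "processes: %.2fx, domains: %.2fx\n"
    (spread processes) (spread domains)
```

For the processes run the spread is about 5.3, while for domains the busiest CPU executes only about 6% more instructions than the quietest.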

Setting /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference to performance didn't make it faster, but setting it to power (power-saving mode) did make the processes benchmark much slower, while having little effect on the domains case!

So I think what's happening here with separate processes is that the CPU is boosting the performance of one or two cores at a time, allowing them to make lots of progress.

But with domains this doesn't happen, either because no domain runs long enough before sleeping to trigger the boost, or because as soon as it does it needs to stop and wait for the other domains for a GC and loses it.

Conclusions

The main profiling and tracing tools used were:

perf to take samples of CPU use, find hot functions and hot instructions within them, record process scheduling, look at hardware counters, and find sources of cache contention.
statmemprof to find the source of allocations.
eio-trace to visualise GC events and as a generic canvas for custom visualisations.
magic-trace to see very detailed traces of recent activity when something goes wrong.
olly to report on GC statistics.
bpftrace for quick experiments about kernel behaviour.
offcputime to see why a process is sleeping.

I think OCaml 5's runtime events tracing was the star of the show here, making it much easier to see what was going on with GC, especially in combination with perf sched. statmemprof is also an essential tool for OCaml, and I'll be very glad to get it back with OCaml 5.3. I think I need to investigate perf more; I'd never used many of these features before. Though it is important to use it with offcputime etc to check you're not missing samples due to sleeping.

Unlike the previous post's example, where the cause was pretty obvious and led to a massive easy speed-up, this one took a lot of investigation and revealed several problems, none of which seem very easy to fix. I'm also a lot less confident that I really understand what's happening here, but here is a summary of my current guess:

OCaml applications typically allocate lots of short-lived values.
With a single domain this isn't much of a problem; minor GCs are fast. With multiple domains however we have to wait for every domain to enter the GC, and then wait again for them all to exit.
This can be very fast (4 microseconds or so per GC), but if one domain is late due to OS scheduling then it can be much longer (several ms in some cases).
When a domain needs to wait for another it spins for a bit and then sleeps. If the other domain runs on the same CPU then spinning delays it from running. On the other hand, sleeping introduces longer delays and can cause the CPU to slow down.
Idle domains are currently expensive. An idle domain requires a syscall to wake it, and often causes all the other domains to sleep waiting for it. When the idle domain does wake, it still can't start the GC and has to wait again for the leader.
If the leader gets suspended while holding the lock, all the other domains will spin waiting for it (without ever sleeping). This time isn't accounted for in the GC events reported by OCaml 5.2.

Since the sleeping mechanism will be changing in OCaml 5.3, it would probably be worthwhile checking how that performs too. I think there are some opportunities to improve the GC, such as letting idle domains opt out of GC after one collection, and it looks like there are opportunities to reduce the amount of synchronisation done (e.g. by letting late arrivers start the GC without having to wait for the leader, or using a lock-free algorithm for becoming leader).

For the solver, it would be good to try experimenting with CPU affinity to keep a subset of the 160 cores reserved for the solver. Increasing the minor heap size and doing work in the main domain should also reduce the overhead of GC, and improving the version compare function in the opam library would greatly reduce the need for it. And if my goal was really to make it fast (rather than to improve multicore OCaml and its tooling) then I'd probably switch it back to using processes.
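As a minimal sketch of the minor heap suggestion, the size can be raised from within the program using the standard Gc module. The 1M-word value is only an illustrative assumption, not a tuned setting, and how the change applies to other domains should be checked against the runtime version in use:

```ocaml
(* Assumption: 1M words (8 MiB on a 64-bit system) is purely illustrative. *)
let minor_heap_words = 1 lsl 20

let () =
  (* Changing minor_heap_size triggers a minor collection. *)
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = minor_heap_words };
  Printf.printf "minor heap: %d words\n" (Gc.get ()).Gc.minor_heap_size
```

The same setting can be given without recompiling via the s parameter of OCAMLRUNPARAM (e.g. OCAMLRUNPARAM=s=1M).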

Finally, it was really useful that both of these blog posts examined performance regressions, so I knew it must be possible to go faster. Without a good idea of how fast something should be, it's easy to give up too early.

Anyway, I hope you found some useful new tool in these posts!

OCaml 5 performance problems

2024-07-22T10:00:00+00:00

Linux and OCaml provide a huge range of tools for investigating performance problems. In this post I try using some of them to understand a network performance problem. In part 2, I'll investigate a problem in a CPU-intensive multicore program.

Table of Contents

The problem
time
eio-trace
strace
bpftrace
tcpdump
ss
offwaketime
magic-trace
Summary script
Fixing it
Conclusions

The problem

While porting capnp-rpc from Lwt to Eio, to take advantage of OCaml 5's new effects system, I tried running the benchmark to see if it got any faster:

$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44933.359573 # The old Lwt version
echo_bench.exe: [INFO] rate = 511.963565   # The (buggy) Eio version

The benchmark records the number of echo RPCs per second. Clearly, something is very wrong here! In fact, the new version was so slow I had to reduce the number of iterations so it would finish.

time

The old time command can immediately give us a hint:

$ /usr/bin/time ./echo_bench.exe
1.85user 0.42system 0:02.31elapsed 98%CPU  # Lwt
0.16user 0.05system 0:01.95elapsed 11%CPU  # Eio (buggy)

(many shells provide their own time built-in with different output formats; I'm using /usr/bin/time here)

time's output shows time spent in user-mode (running the application's code on the CPU), time spent in the kernel, and the total wall-clock time. Both versions ran for around 2 seconds (doing a different number of iterations), but the Lwt version was using the CPU 98% of the time, while the Eio version was mostly sleeping.
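The %CPU column is just the user and system time as a fraction of the elapsed time, which is easy to check against the output above (the function name is only for illustration):

```ocaml
(* Percentage of wall-clock time spent on the CPU: (user + system) / elapsed. *)
let cpu_percent ~user ~system ~elapsed = 100. *. (user +. system) /. elapsed

let () =
  Printf.printf "Lwt: %.0f%%\n" (cpu_percent ~user:1.85 ~system:0.42 ~elapsed:2.31);
  Printf.printf "Eio: %.0f%%\n" (cpu_percent ~user:0.16 ~system:0.05 ~elapsed:1.95)
```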

eio-trace

eio-trace can be used to see what an Eio program is doing. Tracing is always available (you don't need to recompile the program to get it).

$ eio-trace run -- ./echo_bench.exe

eio-trace run runs the command and displays the trace in a window. You can also use eio-trace record to save a trace and examine it later.

Trace of slow benchmark (12 concurrent requests)

The benchmark runs 12 test clients at once, making it a bit noisy. To simplify things, I set it to run only one client:

Trace of slow benchmark (one request at a time)

I've zoomed the image to show the first four iterations. The first is so quick it's not really visible, but the next three take about 40ms each. The yellow regions labelled "suspend-domain" show when the program is sleeping, waiting for an event from Linux. Each horizontal bar is a fiber (a light-weight thread). From top to bottom they are:

Three rows for the test client:
- The main application fiber performing the RPC call (mostly awaiting responses).
- The network's write fiber, sending outgoing messages (mostly waiting for something to send).
- The network's read fiber, reading incoming messages (mostly waiting for something to read).
Four rows for the server:
- A loop accepting new incoming TCP connections.
- A short-lived fiber that accepts the new connection, then short-lived fibers each handling one request.
- The server's network write fiber.
- The server's network read fiber.
One fiber owned by Eio itself (used to wake up the event loop in some situations).

This trace immediately raises a couple of questions:

Why is there a 40ms delay in each iteration of the test loop?
Why does the program briefly wake up in the middle of the first delay, do nothing, and return to sleep? (notice the extra "suspend-domain" at the top)

Zooming in on a section between the delays, let's see what it's doing when it's not sleeping:

Zoomed in on the active part

After a 40ms delay, the server's read fiber receives the next request (the running fiber is shown in green). The read fiber spawns a fiber to handle the request, which finishes quickly, starts the next read, and then the write fiber transmits the reply.

The client's read fiber gets the reply, the write fiber outputs a message, then the application fiber runs and another message is sent. The server reads something (presumably the first message, though it happens after the client had sent both), then there is another long 40ms delay, then (far off the right of the image) the pattern repeats.

To get more context in the trace, I configured the logging library to write the (existing) debug-level log messages to the trace buffer too:

With log messages

Log messages tend to be a bit long for the trace display, so they overlap and you have to zoom right in to read them, but they do help navigate. With this, I can see that the first client write is "Send finish" and the second is "Calling Echo.ping".

Looks like we're not buffering the output, so it's doing two separate writes rather than combining them. That's a little inefficient, and if you've done much network programming, you also probably already know why this might cause a 40ms delay, but let's pretend we don't know so we can play with a few more tools...

strace

strace can be used to trace interactions between applications and the Linux kernel (-tt -T shows when each call was started and how long it took):

$ strace -tt -T ./echo_bench.exe
...
11:38:58.079200 write(2, "echo_bench.exe: [INFO] Accepting"..., 73) = 73 <0.000008>
11:38:58.079253 io_uring_enter(4, 4, 0, 0, NULL, 8) = 4 <0.000032>
11:38:58.079341 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000020>
11:38:58.079408 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000021>
11:38:58.079471 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000018>
11:38:58.079525 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000019>
11:38:58.079580 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000013>
11:38:58.079611 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000009>
11:38:58.079637 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffc1661a480, 24) = -1 ETIME (Timer expired) <0.018913>
11:38:58.098669 futex(0x5584542b767c, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000105>
11:38:58.098889 futex(0x5584542b7690, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000059>
11:38:58.098976 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.021355>

On Linux, Eio defaults to using the io_uring mechanism for submitting work to the kernel. io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 means we asked to submit 2 new operations to the ring on FD 4, and the kernel accepted them.

The call at 11:38:58.079637 timed out after 19ms. It then woke up some futexes and then waited again, getting woken up after a further 21ms (for a total of 40ms).
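When there are many such lines, the durations can be pulled out and added up mechanically. Here is a small sketch (the helper name is made up) that extracts the trailing duration strace -T adds to each line and sums the two blocking calls above:

```ocaml
(* Extract the duration that strace -T appends to a line, e.g. "<0.018913>". *)
let duration line =
  match String.rindex_opt line '<', String.rindex_opt line '>' with
  | Some i, Some j when i < j ->
    float_of_string_opt (String.sub line (i + 1) (j - i - 1))
  | _ -> None

(* The two blocking io_uring_enter calls from the trace above. *)
let blocking_calls = [
  "io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffc1661a480, 24) = -1 ETIME (Timer expired) <0.018913>";
  "io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.021355>";
]

(* Sum the durations, treating lines without one as zero. *)
let total_blocked =
  List.fold_left
    (fun acc line -> acc +. Option.value (duration line) ~default:0.)
    0. blocking_calls

let () = Printf.printf "time blocked: %.1fms\n" (total_blocked *. 1000.)
```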

Futexes are used to coordinate between system threads. strace -f will follow all spawned threads (and processes), not just the main one:

$ strace -T -f ./echo_bench.exe
...
[pid 48451] newfstatat(AT_FDCWD, "/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=40, ...}, 0) = 0 <0.000011>
...
[pid 48451] futex(0x561def43296c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY 
...
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffe1d5d1c90, 24) = -1 ETIME (Timer expired) <0.018899>
[pid 48449] futex(0x561def43296c, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000106>
[pid 48451] <... futex resumed>)        = 0 <0.019981>
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 
...
[pid 48451] exit(0)                     = ?
[pid 48451] +++ exited with 0 +++
[pid 48449] <... io_uring_enter resumed>) = 0 <0.021205>
...

The benchmark connects to "127.0.0.1" and Eio uses getaddrinfo to look up addresses (we can't use uring for this). Since getaddrinfo can block for a long time, Eio creates a new system thread (pid 48451) to handle it (we can guess this thread is doing name resolution because we see it read resolv.conf).

As creating system threads is a little slow, Eio keeps the thread around for a bit after it finishes in case it's needed again. The timeout is when Eio decides that the thread isn't needed any longer and asks it to exit. So this isn't relevant to our problem (and only happens on the first 40ms delay, since we don't look up any further addresses).

However, strace doesn't tell us what the uring operations were, or their return values. One option is to switch to the posix backend (which is the default on Unix systems other than Linux). In fact, it's a good idea with any performance problem to check if it still happens with a different backend:

$ EIO_BACKEND=posix strace -T -tt ./echo_bench.exe
...
11:53:52.935976 writev(7, [{iov_base="\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0"..., iov_len=40}], 1) = 40 <0.000170>
11:53:52.936308 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, {tv_sec=0, tv_nsec=0}, NULL, 8) = 1 ([{fd=8, revents=POLLIN}], left {tv_sec=0, tv_nsec=0}) <0.000044>
11:53:52.936500 writev(7, [{iov_base="\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0"..., iov_len=136}], 1) = 136 <0.000055>
11:53:52.936831 readv(8, [{iov_base="\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0"..., iov_len=4096}], 1) = 40 <0.000056>
11:53:52.937516 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, NULL, NULL, 8) = 1 ([{fd=8, revents=POLLIN}]) <0.038972>
11:53:52.977751 readv(8, [{iov_base="\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0"..., iov_len=4096}], 1) = 136 <0.000398>

(to reduce clutter, I removed calls that returned EAGAIN and ppoll calls that returned 0 ready descriptors)

The problem still occurs, and now we can see the two writes:

The client writes 40 bytes to its end of the socket (FD 7), after which the server's end (FD 8) is ready for reading (revents=POLLIN).
The client then writes another 136 bytes.
The server reads 40 bytes and then uses ppoll to await further data.
After 39ms, ppoll says FD 8 is now ready, and the server reads the other 136 bytes.

bpftrace

Alternatively, we can trace uring operations using bpftrace. bpftrace is a little scripting language similar to awk, except that instead of editing a stream of characters, it live-patches the running Linux kernel. Apparently this is safe to run in production (and I haven't managed to crash my kernel with it yet).

Here is a list of uring tracepoints we can probe:

$ sudo bpftrace -l 'tracepoint:io_uring:*'
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_cqe_overflow
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_defer
tracepoint:io_uring:io_uring_fail_link
tracepoint:io_uring:io_uring_file_get
tracepoint:io_uring:io_uring_link
tracepoint:io_uring:io_uring_local_work_run
tracepoint:io_uring:io_uring_poll_arm
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_req_failed
tracepoint:io_uring:io_uring_short_write
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_task_add
tracepoint:io_uring:io_uring_task_work_run

io_uring_complete looks promising:

$ sudo bpftrace -vl tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_complete
    void * ctx
    void * req
    u64 user_data
    int res
    unsigned cflags
    u64 extra1
    u64 extra2

Here's a script to print out the time, process, operation name and result for each completion:

uringtrace.bt

BEGIN {
  @op[IORING_OP_NOP] = "NOP";
  @op[IORING_OP_READV] = "READV";
  ...
}

tracepoint:io_uring:io_uring_complete {
  $req = (struct io_kiocb *) args->req;
  printf("%dms: %s: %s %d\n",
    elapsed / 1e6,
    comm,
    @op[$req->opcode],
    args->res);
}

END {
  clear(@op);
}

$ sudo bpftrace uringtrace.bt
Attaching 3 probes...
...
1743ms: echo_bench.exe: WRITE_FIXED 40
1743ms: echo_bench.exe: READV 40
1743ms: echo_bench.exe: WRITE_FIXED 136
1783ms: echo_bench.exe: READV 136

In this output, the order is slightly different: we see the server's read get the 40 bytes before the client sends the rest, but we still see the 40ms delay between the completion of the second write and the corresponding read. The change in order is because we're seeing when the kernel knew the read was complete, not when the application found out about it.

tcpdump

An obvious step with any networking problem is to look at the packets going over the network. tcpdump can be used to capture packets, which can be displayed on the console or in a GUI with wireshark.

$ sudo tcpdump -n -ttttt -i lo
...
...041330 IP ...37640 > ...7000: Flags [P.], ..., length 40
...081975 IP ...7000 > ...37640: Flags [.], ..., length 0
...082005 IP ...37640 > ...7000: Flags [P.], ..., length 136
...082071 IP ...7000 > ...37640: Flags [.], ..., length 0

Here we see the client (on port 37640) sending 40 bytes to the server (port 7000), and the server replying with an ACK (with no payload) 40ms later. After getting the ACK, the client socket sends the remaining 136 bytes.

Here we can see that while the application made the two writes in quick succession, TCP waited before sending the second one. Searching for "delayed ack 40ms" will turn up an explanation.

ss

ss displays socket statistics. ss -tin shows all TCP sockets (-t) with internals (-i):

$ ss -tin 'sport = 7000 or dport = 7000'
State   Recv-Q   Send-Q  Local Address:Port  Peer Address:Port
ESTAB   0        0       127.0.0.1:7000      127.0.0.1:56224
 ato:40 lastsnd:34 lastrcv:34 lastack:34
ESTAB   0        176     127.0.0.1:56224     127.0.0.1:7000
 ato:40 lastsnd:34 lastrcv:34 lastack:34 unacked:1 notsent:136

There's a lot of output here; I've removed the irrelevant bits. ato:40 says there's a 40ms timeout for "delay ack mode". lastsnd, etc, say that nothing had happened for 34ms when this information was collected. unacked and notsent aren't documented in the man-page, but I guess it means that the client (now port 56224) is waiting for 1 packet to be ack'd and has 136 bytes waiting until then.

The client socket still has both messages (176 bytes total) in its queue; it can't forget about the first message until the server confirms receiving it, since the client might need to send it again if it got lost.

This doesn't quite lead us to the solution, though.

offwaketime

offwaketime records why a program stopped using the CPU, and what caused it to resume:

$ sudo offwaketime-bpfcc -f -p (pgrep echo_bench.exe) > wakes
$ flamegraph.pl --colors=chain wakes > wakes.svg

Time spent suspended along with wakeup reason

offwaketime records a stack-trace when a process is suspended (shown at the bottom and going up) and pairs it with the stack-trace of the thread that caused it to be resumed (shown above it and going down).

The taller column on the right shows Eio being woken up due to TCP data being received from the network, confirming that it was the TCP ACK that got things going again.

The shorter column on the left was unexpected, and the [UNKNOWN] in the stack is annoying (probably C code compiled without frame pointers). gdb gets a better stack trace. It turned out to be OCaml's tick thread, which wakes every 50ms to prevent one sys-thread from hogging the CPU:

$ strace -T -e pselect6 -p (pgrep echo_bench.exe) -f
strace: Process 20162 attached with 2 threads
...
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) <0.050441>
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) <0.050318>

Having multiple threads shown on the same diagram is a bit confusing. I should probably have used -t to focus only on the main one.

Also, note that when using profiling tools that record the OCaml stack, it's useful to compile with frame pointers enabled. To install e.g. OCaml 5.2.0 with frame pointers enabled, use:

$ opam switch create 5.2.0-fp ocaml-variants.5.2.0+options ocaml-option-fp

magic-trace

magic-trace allows capturing a short trace of everything the CPUs were doing just before some event. It uses Intel Processor Trace to have the CPU record all control flow changes (calls, branches, etc) to a ring-buffer, with fairly low overhead (2% to 10%, due to extra memory bandwidth needed). When something interesting happens, we save the buffer and use it to reconstruct the recent history.

Normally we'd need to set up a trigger to grab the buffer at the right moment, but since this program is mostly idle it doesn't record much and I just attached at a random point and immediately pressed Ctrl-C to grab a snapshot and detach:

$ sudo magic-trace attach -multi-thread -trace-include-kernel \
    -p (pgrep echo_bench.exe)
[ Attached. Press Ctrl-C to stop recording. ]
^C

As before, we see 40ms periods of waiting, with bursts of activity between them:

Magic trace showing 40ms delays

The output is a bit messed up because magic-trace doesn't understand that there are multiple OCaml fibers here, each with their own stack. It also doesn't seem to know that exceptions unwind the stack.

In each 40ms column, Eio_posix.Flow.single_read (3rd line from top) tried to do a read with readv, which got EAGAIN and called Sched.next to switch to the next fiber. Since there was nothing left to run, the Eio scheduler called ppoll. Linux didn't have anything ready for this process, and called the schedule kernel function to switch to another process.

I recorded an eio-trace at the same time, to see the bigger picture. Here's the eio-trace zoomed in to show the two client writes (just before the 40ms wait), with the relevant bits of the magic-trace stack pasted below them:

Zoomed in on the two client writes, showing eio-trace and magic-trace output together

We can see the OCaml code calling writev, entering the kernel, tcp_write_xmit being called to handle it, writing the IP packet to the network and then, because this is the loopback interface, the network receive logic handling the packet too. The second call is much shorter; tcp_write_xmit returns quickly without sending anything.

Note: I used the eio_posix backend here so it's easier to correlate the kernel operations to the application calls (uring queues them up and runs them later). The uring-trace project should make this easier in future, but doesn't integrate with eio-trace yet.

Zooming in further, it's easy to see the difference between the two calls to tcp_write_xmit:

The start of the first tcp_write_xmit and the whole of the second

Looking at the source for tcp_write_xmit, we finally find the magic word "nagle"!

if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
			     (tcp_skb_is_last(sk, skb) ?
			      nonagle : TCP_NAGLE_PUSH))))
	break;

Summary script

Having identified a load of interesting events I wrote summary-posix.bt, a bpftrace script to summarise them. This includes log messages written by the application (by tracing write calls to stderr), reads and writes on the sockets, and various probed kernel functions seen in the magic-trace output and when reading the kernel source.

The output is specialised to this application (for example, TCP segments sent to port 7000 are displayed as "to server", while others are "to client"). I think this is a useful way to double-check my understanding, and any fix:

$ sudo bpftrace summary-posix.bt
[...]
844ms: server: got ping request; sending reply
844ms: server reads from socket (EAGAIN)
844ms: server: writev(96 bytes)
844ms:   tcp_write_xmit (to client, nagle-on, packets_out=0)
844ms:   tcp_v4_send_check: sending 96 bytes to client
844ms: tcp_v4_rcv: got 96 bytes
844ms:   timer_start (tcp_delack_timer, 40 ms)
844ms: client reads 96 bytes from socket
844ms: client: enqueue finish message
844ms: client: enqueue ping call
844ms: client reads from socket (EAGAIN)
844ms: client: writev(40 bytes)
844ms:   tcp_write_xmit (to server, nagle-on, packets_out=0)
844ms:   tcp_v4_send_check: sending 40 bytes to server
845ms: tcp_v4_rcv: got 40 bytes
845ms:   timer_start (tcp_delack_timer, 40 ms)
845ms: client: writev(136 bytes)
845ms:   tcp_write_xmit (to server, nagle-on, packets_out=1)
845ms: server reads 40 bytes from socket
845ms: server reads from socket (EAGAIN)
885ms: tcp_delack_timer_handler (ACK to client)
885ms:   tcp_v4_send_check: sending 0 bytes to client
885ms: tcp_delack_timer_handler (ACK to server)
885ms: tcp_v4_rcv: got 0 bytes
885ms:   tcp_write_xmit (to server, nagle-on, packets_out=0)
885ms:   tcp_v4_send_check: sending 136 bytes to server

The server replies to a ping request, sending a 96 byte reply. Nagle is on, but nothing is awaiting an ACK (packets_out=0) so it gets sent immediately.
The client receives the data. It starts a 40ms timer to send an ACK for it.
The client enqueues a "finish" message, followed by another "ping" request.
The client's write fiber sends the 40 byte "finish" message. Nothing is awaiting an ACK (packets_out=0) so the kernel sends it immediately.
The client sends the 136 byte ping request. As the last message hasn't been ACK'd, it isn't sent yet.
The server receives the 40 byte finish message.
40ms pass. The server's delayed ACK timer fires and it sends the ACK to the client.
The client's delayed ACK timer fires, but there's nothing to do (it sent the ACK with the "finish").
The client socket gets the ACK for its "finish" message and sends the delayed ping request.
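The send decisions in this trace can be reproduced with a very small model. This is only a sketch of the rule as described above, not the kernel's actual logic (which lives in tcp_nagle_test and handles more cases); the record type, the function name and the MSS value are all made up for illustration:

```ocaml
(* Simplified model of the Nagle check. All names here are hypothetical. *)
type conn_state = {
  nodelay : bool;     (* true if TCP_NODELAY is set on the socket *)
  mss : int;          (* maximum segment size in bytes (arbitrary here) *)
  packets_out : int;  (* segments sent but not yet ACK'd *)
}

(* A segment may go out immediately if Nagle is disabled, if it fills a
   whole segment, or if nothing is currently waiting for an ACK. *)
let may_send_now st ~len =
  st.nodelay || len >= st.mss || st.packets_out = 0

let () =
  let idle = { nodelay = false; mss = 1460; packets_out = 0 } in
  (* The 40 byte "finish" message: nothing in flight. *)
  Printf.printf "finish sent now: %b\n" (may_send_now idle ~len:40);
  (* The 136 byte ping: the "finish" has not been ACK'd yet. *)
  let busy = { idle with packets_out = 1 } in
  Printf.printf "ping sent now: %b\n" (may_send_now busy ~len:136)
```

In this model the second write is held back until the ACK arrives, and the ACK is itself delayed by the peer's 40ms timer, which matches the stall in the log.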

Fixing it

The problem seemed clear: while porting from Lwt to Eio I'd lost the output buffering. So I looked at the Lwt code to see how it did it and... it doesn't! So how was it working?

As I did with Eio, I set the Lwt benchmark's concurrency to 1 to simplify it for tracing, and discovered that Lwt with 1 client thread has exactly the same problem as the Eio version. Well, that's embarrassing! But why is Lwt fast with 12 client threads?

With only minor changes (e.g. write vs writev), the summary script above also worked for tracing the Lwt version. With 1 or 2 client threads, Lwt is slow, but with 3 it's fairly fast. The delay only happens if the client sends a "finish" message when the server has no replies queued up (otherwise the finish message unblocks the replies, which carry the ACK to the client immediately). So, it works mostly by fluke! Lwt just happens to schedule the threads in such a way that Nagle's algorithm mostly doesn't trigger with 12 concurrent requests.

Anyway, adding buffering to the Eio version fixed the problem:

Before After (same scale)

An interesting thing to notice here is that not only did the long delay go away, but the CPU operations while it was active were faster too! I think the reason is that the CPU goes into power-saving mode during the long delays. cpupower monitor shows my CPUs running at around 1 GHz with the old code and around 4.7 GHz when running the new version.

Here are the results for the fixed version:

$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44425.962625 # The old Lwt version
echo_bench.exe: [INFO] rate = 59653.451934 # The fixed Eio version

60k RPC requests per second doesn't seem that impressive, but at least it's faster than the old version, which is good enough for now! There's clearly scope for improvement here (for example, the buffering I added is quite inefficient, making two extra copies of every message, as the framing library copies it from a cstruct to a string, and then I have to copy the string back to a cstruct for the kernel).
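The idea behind the buffering is to hand the kernel everything that is queued in one write, so that a small message is never left waiting behind an un-ACK'd one. Here is a minimal sketch of that idea; it is not the code from the actual fix, and flush_queue and its write argument (standing in for a single writev call) are hypothetical:

```ocaml
(* Hypothetical helper: combine all queued messages into a single write. *)
let flush_queue ~write (queue : string Queue.t) =
  if not (Queue.is_empty queue) then begin
    let buf = Buffer.create 256 in
    Queue.iter (Buffer.add_string buf) queue;
    Queue.clear queue;
    write (Buffer.contents buf)
  end

let () =
  let queue = Queue.create () in
  Queue.add (String.make 40 'f') queue;    (* stands in for "finish" *)
  Queue.add (String.make 136 'p') queue;   (* stands in for the ping *)
  flush_queue queue ~write:(fun data ->
    Printf.printf "one write of %d bytes\n" (String.length data))
```

Like the real fix, this version copies every message once more than strictly necessary.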

Conclusions

There are lots of great tools available to help understand why something is running slowly (or misbehaving), and since programmers usually don't have much time for profiling, a little investigation will often turn up something interesting! Even when things are working correctly, these tools are a good way to learn more about how things work.

time will quickly tell you if the program is taking lots of time in application code, in the kernel, or just sleeping. If the problem is sleeping, offcputime and offwaketime can tell you why it was waiting and what woke it in the end. My own eio-trace tool will give a quick visual overview of what an Eio application is doing. strace is great for tracing interactions between applications and the kernel, but it doesn't help much when the application is using uring. To fix that, you can either switch to the eio_posix backend or use bpftrace with the uring tracepoints. tcpdump, wireshark and ss are all useful to examine network problems specifically.

I've found bpftrace to be really useful for all kinds of tasks. Being able to write quick one-liners or short scripts gives it great flexibility. Since the scripts run in the kernel you can also filter and aggregate data efficiently without having to pass it all to userspace, and you can examine any kernel data structures. We didn't need that here because the program was running so slowly, but it's great for many problems. In addition to using well-defined tracepoints, it can also probe any (non-inlined) function in the kernel or the application. I also think using it to create a "summary script" to confirm a problem and its solution seems useful, though this is the first time I've tried doing that.

magic-trace is great for getting really detailed function-by-function tracing through the application and kernel. Its ability to report the last few ms of activity after you notice a problem is extremely useful (though not needed in this example). It would be really useful if you could trigger magic-trace from a bpftrace script, but I didn't see a way to do that.

However, it was surprisingly difficult to get any of the tools to point directly at the combination of Nagle's algorithm with delayed ACKs as the cause of this common problem!

This post was mainly focused on what was happening in the kernel. In part 2, I'll investigate a CPU-intensive problem instead.

Lambda Capabilities

2023-04-26T10:00:00+00:00

"Is this software safe?" is a question software engineers should be able to answer, but doing so can be difficult. Capabilities offer an elegant solution, but seem to be little known among functional programmers. This post is an introduction to capabilities in the context of ordinary programming (using plain functions, in the style of the lambda calculus).

Even if you're not interested in security, capabilities provide a useful way to understand programs; when trying to track down buggy behaviour, it's very useful to know that some component couldn't have been the problem.

Table of Contents

The Problem
Option 1: Security as a separate concern
Option 2: Purity
Option 3: Capabilities
Practical considerations
Conclusions

(This post also appeared on Reddit, Hacker News and Lobsters.)

The Problem

We have some application (for example, a web-server) that we want to run. The application is many thousands of lines long and depends on dozens of third-party libraries, which get updated on a regular basis. I would like to be able to check, quickly and easily, that the application cannot do any of these things:

Delete my files.
Append a line to my ~/.ssh/authorized_keys file.
Act as a relay, allowing remote machines to attack other computers on my local network.
Send telemetry to a third-party.
Anything else bad that I forget to think about.

For example, here are some of the OCaml packages I use just to generate this blog:

Dependency graph for this blog

Having to read every line of every version of each of these packages in order to decide whether it's safe to generate the blog clearly isn't practical.

I'll start by looking at traditional solutions to this problem, using e.g. containers or VMs, and then show how to do better using capabilities.

Option 1: Security as a separate concern

A common approach to access control treats securing software as a separate activity to writing it. Programmers write (insecure) software, and a security team writes a policy saying what it can do. Examples include firewalls, containers, virtual machines, seccomp policies, SELinux and AppArmor.

The great advantage of these schemes is that security can be applied after the software is written, treating it as a black box. However, it comes with many problems:

Confused deputy problem

Some actions are OK for one use but not for another.

For example, if the client of a web-server requests https://example.com/../../etc/httpd/server-key.pem then we don't want the server to read this file and send it to them. But the server does need to read this file for other reasons, so the policy must allow it.

Coarse-grained controls

All the modules making up the program are treated the same way, even though you probably trust some more than others.

For example, we might trust the TLS implementation with the server's private key, but not the templating engine, and I know the modules I wrote myself are not malicious.

Even well-typed programs go wrong

Programming in a language with static types is supposed to ensure that if the program compiles then it won't crash. But the security policy can cause the program to fail even though it passed the compiler's checks.

For example, the server might sometimes need to send an email notification. If it didn't do that while the security policy was being written, then that will be blocked. Or perhaps the web-server didn't even have a notification system when the policy was written, but has since been updated.

Policy language limitations

The security configuration is written in a new language, which must be learned. It's usually not worth learning this just for one program, so the people who write the program struggle to write the policy. Also, the policy language often cannot express the desired policy, since it may depend on concepts unique to the program (e.g. controlling access based on a web-app user's ID, rather than local Unix user ID).

All of the above problems stem from trying to separate security from the code. If the code were fully correct, we wouldn't need the security layer. Checking that code is fully correct is hard, but maybe there are easy ways to check automatically that it does at least satisfy our security requirements...

Option 2: Purity

One way to prevent programs from performing unwanted actions is to prevent all actions. In pure functional languages, such as Haskell, the only way to interact with the outside world is to return the action you want to perform from main. For example:

f :: Int -> String
f x = ...

main :: IO ()
main = putStr (f 42)

Even if we don't look at the code of f, we can be sure it only returns a String and performs no other actions (assuming Safe Haskell is being used). Assuming we trust putStr, we can be sure this program will only output a string to stdout and not perform any other actions.

However, writing only pure code is quite limiting. Also, we still need to audit all IO code.

Option 3: Capabilities

Consider this code (written in a small OCaml-like functional language, where ref n allocates a new memory location initially containing n, and !x reads the current value of x):

let f a = ...

let () =
  let x = ref 5 in
  let y = ref 10 in
  f x;
  assert (!y = 10)

Can we be sure that the assert won't fail, without knowing the definition of f? Assuming the language doesn't provide unsafe backdoors (such as OCaml's Obj.magic), we can. f x cannot change y, because f x does not have access to y.

So here is an access control system, built in to the lambda calculus itself! At first glance this might not look very promising. For example, while f doesn't have access to y, it does have access to any global variables defined before f. It also, typically, has access to the file-system and network, which are effectively globals too.

To make this useful, we ban global variables. Then any top-level function like f can only access things passed to it explicitly as arguments. Avoiding global variables is usually considered good practice, and some systems ban them for other reasons anyway (for example, Rust doesn't allow global mutable state as it wouldn't be able to prevent races accessing it from multiple threads).

Returning to the Haskell example above (but now in OCaml syntax), it looks like this in our capability system:

let f x = ...

let main ch = output_string ch (f 42)

Since f is a top-level function, we know it does not close over any mutable state, and our 42 argument is pure data. Therefore, the call f 42 does not have access to, and therefore cannot affect, any pre-existing state (including the filesystem). Internally, it can use mutation (creating arrays, etc), but it has nowhere to store any mutable values and so they will get GC'd after it returns. f therefore appears as a pure function, and calling it multiple times will always give the same result, just as in the Haskell version.
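To make that concrete, here is one possible body for f (the article leaves it elided, so this definition is purely an illustration). It mutates an array and a ref-cell internally, but both are created inside the call and nothing mutable escapes:

```ocaml
(* An illustrative f: sums the squares of 1..n using local mutable state. *)
let f n =
  let squares = Array.init n (fun i -> (i + 1) * (i + 1)) in
  let total = ref 0 in
  Array.iter (fun s -> total := !total + s) squares;
  string_of_int !total

let main ch = output_string ch (f 42)
```

From the outside, f 42 always returns the same string, however many times it is called.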

output_string is also a top-level function, closing over no mutable state. However, the function resulting from evaluating output_string ch is not top-level, and without knowing anything more about it we should assume it has full access to the output channel ch.

If main is invoked with standard output as its argument, it may output a message to it, but cannot affect other pre-existing state.

In this way, we can reason about the pure parts of our code as easily as with Haskell, but we can also reason about the parts with side-effects. Haskell's purity is just a special case of a more general rule: the effects of a (top-level) function are bounded by its arguments.

Attenuation

So far, we've been thinking about what values are reachable through other values. For example, the set of ref-cells that can be modified by f x is bounded by the union of the set of ref cells reachable from the closure f with the set of ref cells reachable from x.

One powerful aspect of capabilities is that we can use functions to implement whatever access controls we want. For example, let's say we only want f to be able to set the ref-cell, but not read it. We can just pass it a suitable function:

let x = ref 0 in
let set v =
  x := v
in
f set

Or perhaps we only want to allow inserting positive integers:

let set v =
  if v > 0 then set v
  else invalid_arg "Positive values only!"

Or we can allow access to be revoked:

let r = ref (Some set) in
let set v =
  match !r with
  | Some fn -> fn v
  | None -> invalid_arg "Access revoked!"
in
...
r := None		(* Revoke *)

Or we could limit the number of times it can be used:

let used = ref 0 in
let set v =
  if !used < 3 then (incr used; set v)
  else invalid_arg "Quota exceeded"

Or log each time it is used, tagged with a label that's meaningful to us (e.g. the function to which we granted access):

let log = ref [] in
let set name v =
  let msg = sprintf "%S set it to %d" name v in
  log := msg :: !log;
  set v
in
f (set "f");
g (set "g")

Or all of the above.
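As a rough sketch of how the wrappers compose, here they are as stand-alone functions that can be stacked. The helper names (positive_only, revocable, logged) are invented for this example:

```ocaml
(* Hypothetical helpers; each takes a setter and returns a weaker one. *)
let positive_only set v =
  if v > 0 then set v
  else invalid_arg "Positive values only!"

(* Returns the wrapped setter together with a function that revokes it. *)
let revocable set =
  let r = ref (Some set) in
  let set v =
    match !r with
    | Some fn -> fn v
    | None -> invalid_arg "Access revoked!"
  in
  (set, fun () -> r := None)

(* Records each use in [log], tagged with [name], before forwarding it. *)
let logged log name set v =
  log := Printf.sprintf "%S set it to %d" name v :: !log;
  set v

let () =
  let x = ref 0 in
  let log = ref [] in
  let set, revoke = revocable (positive_only (fun v -> x := v)) in
  let f set = set 7 in                 (* stands in for untrusted code *)
  f (logged log "f" set);
  revoke ();
  List.iter print_endline !log;
  Printf.printf "x = %d\n" !x
```

Note that f only ever sees the outermost function; it cannot unwrap it to reach x or the inner setters.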

In these examples, our function f never got direct access (permission) to x, yet was still able to affect it. Therefore, in capability systems people often talk about "authority" rather than permission. Roughly speaking, the authority of a subject is the set of actions that the subject could cause to happen, now or in the future, on currently-existing resources. Since it's only things that might happen, and we don't want to read all the code to find out exactly what it might do, we're usually only interested in getting an upper-bound on a subject's authority, to show that it can't do something.

The examples here all used a single function. We may want to allow multiple operations on a single value (e.g. getting and setting a ref-cell), and the usual techniques are available for doing that (e.g. having the function take the operation as its first argument, or collecting separate functions together in a record, module or object).
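For instance, using a record to group the operations might look like this. The cell type and the two helper functions are hypothetical names chosen for this sketch:

```ocaml
(* A ref-cell exposed as a record of operations. *)
type cell = {
  get : unit -> int;
  set : int -> unit;
}

let cell_of_ref r = {
  get = (fun () -> !r);
  set = (fun v -> r := v);
}

(* Attenuate: keep [get], but replace [set] with one that always fails. *)
let read_only c = { c with set = (fun _ -> invalid_arg "Read-only!") }

let () =
  let x = ref 1 in
  let full = cell_of_ref x in
  let ro = read_only full in
  full.set 2;
  Printf.printf "read-only view sees %d\n" (ro.get ())
```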

Web-server example

Let's look at a more realistic example. Here's a simple web-server (we are defining the main function, which takes two arguments):

let main net htdocs =
  ...

To use it, we pass it access to some network (net) and a directory tree with the content (htdocs). Immediately we can see that this server does not access any part of the file-system outside of htdocs, but that it may use the network. Here's a picture of the situation:

Initial reference graph

Notes on reading the diagram:

The diagram shows a model of the reference graph, where each node represents some value (function, record, tuple, etc) or aggregated group of values.
An arrow from A to B indicates the possibility that some value in the group A holds a reference to some value in the group B.
The model is typically an over-approximation, so the lack of an arrow from A to B means that no such reference exists, while the presence of an arrow just means we haven't ruled it out.
Orange nodes here represent OCaml values.
White boxes are directories. They include all contained files and subdirectories, except those shown separately. I've pulled out htdocs so we can see that app doesn't have access to the rest of home. Just for emphasis, I also show .ssh separately. I'm assuming here that a directory doesn't give access to its parent, so htdocs can only be used to read files within that sub-tree.
net represents the network and everything else connected to it.
In most operating systems, directories exist in the kernel's address space, and so you cannot have a direct reference to them. That's not a problem, but for now you may find it easier to imagine a system where the kernel and applications are all a single program, in a single programming language.
This diagram represents the state at a particular moment in time (when starting the application). We could also calculate and show all the references that might ever come to exist, given what we know about the behaviour of app and net. Since we don't yet know anything about either, we would have to assume that app might give net access to htdocs and to itself.

So, the diagram above shows the application app has been given references to net and to htdocs as arguments.

Looking at our checklist from the start:

It can't delete all my files, but it might delete the ones in htdocs.
It can't edit ~/.ssh/authorized_keys.
It might act as a relay, allowing remote machines to attack other computers on my local network.
It might send telemetry to a third-party.

We can read the body of the function to learn more:

let main net htdocs =
  let socket = Net.listen net (`Tcp 8080) in
  let handler = static_files htdocs in
  Http.serve socket handler

Note: Net.listen net is typical OCaml style for performing the listen operation on net. We could also have used a record and written net.listen instead, which may look more familiar to some readers.

Here's an updated diagram, showing the moment when Http.serve is called. The app group has been opened to show socket and handler separately:

After reading the code of main

We can see that the code in the HTTP library can only access the network via socket, and can only access htdocs by using handler. Assuming Net.listen is trust-worthy (we'll normally trust the platform's networking layer), it's clear that the application doesn't make out-bound connections, since net is used only to create a listening socket.

To know what the application might do to htdocs, we only have to read the definition of static_files:

let static_files dir request =
  Path.load (dir / request.path)

Now we can see that the application doesn't change any files; it only uses htdocs to read them.

Finally, expanding Http.serve:

let serve socket handle_request =
  while true do
    let conn = Net.accept socket in
    handle_connection conn handle_request
  done

We see that handle_connection has no way to share telemetry information between connections, given that handle_request never stores anything.

We can tell these things after only looking at the code for a few seconds, even though dozens of libraries are being used. In particular, we didn't have to read handle_connection or any of the HTTP parsing logic.

Now let's enable TLS. For this, we will require a configuration directory containing the server's key:

let main ~tls_config net htdocs =
  let socket = Net.listen net (`Tcp 8443) in
  let tls_socket = Tls.wrap ~tls_config socket in
  let handler = static_files htdocs in
  Http.serve tls_socket handler

OCaml syntax note: I used ~ to make tls_config a named argument; we wouldn't want to get this directory confused with htdocs!

We can see that only the TLS library gets access to the key. The HTTP library interacts only with the TLS socket, which presumably does not reveal it.

Updated graph showing TLS

Notice too how this fixes the problem we had with our original policy enforcement system. There, an attacker could request https://example.com/../tls_config/server.key and the HTTP server might send the key. But here, the handler cannot do that even if it wants to. When handler loads a file, it does so via htdocs, which does not have access to tls_config.
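Here is a toy model of why the handler can't escape, assuming (as in the diagrams) that a directory capability gives no access to its parent. The file-system is just an in-memory list, and dir, open_dir and this version of static_files are invented for the sketch; a real implementation would need help from the OS and would also have to deal with symlinks:

```ocaml
(* A directory capability: the only operation is loading a relative path. *)
type dir = { load : string -> string option }

(* [fs] is a toy file-system: a list of (absolute path, contents) pairs. *)
let open_dir fs root =
  let load rel =
    let parts = String.split_on_char '/' rel in
    (* Refuse ".." and empty components (which includes absolute paths). *)
    if List.mem ".." parts || List.mem "" parts then None
    else List.assoc_opt (root ^ "/" ^ rel) fs
  in
  { load }

let static_files dir path = dir.load path

let () =
  let fs = [
    "/srv/www/index.html", "<h1>Hello</h1>";
    "/srv/tls_config/server.key", "SECRET";
  ] in
  let htdocs = open_dir fs "/srv/www" in
  let show path =
    match static_files htdocs path with
    | Some data -> Printf.printf "%s: %s\n" path data
    | None -> Printf.printf "%s: not found\n" path
  in
  show "index.html";
  show "../tls_config/server.key"
```

The handler never has to validate the path itself; the restriction comes from the value it was given.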

The above server has pretty good security properties, even though we didn't make any special effort to write secure code. Security-conscious programmers will try to wrap powerful capabilities (like net) with less powerful ones (like socket) as early as possible, making the code easier to understand. A programmer uninterested in readability is likely to mix in more irrelevant code you have to skip through, but even so it shouldn't take too long to track down where things like net and htdocs end up. And even if they spread them throughout their entire application, at least you avoid having to read all the libraries too!

By contrast, consider a more traditional (non-capability) style. We start with:

let main htdocs = ...

Here, htdocs would be a plain string rather than a reference to a directory, and the network would be reached through a global. We can't tell anything about what this server could do from looking at this one line, and even if we expand it, we won't be able to tell what all the functions it calls do, either. We will end up having to follow every function call recursively through all of the server's dependencies, and our analysis will be out of date as soon as any of them changes.

Use at different scales

We've seen that we can create an over-approximation of the reference graph by looking at just a small part of the code, and then get a closer bound on the possible effects as needed by expanding groups of values until we can prove the desired property. For example, to prove that the application didn't modify htdocs, we followed htdocs by expanding main and then static_files.

Within a single process, a capability is a reference (pointer) to another value in the process's memory. However, the diagrams also included arrows (capabilities) to things outside of the process, such as directories. We can regard these as references to privileged proxy functions in the process that make calls to the OS kernel, or (at a higher level of abstraction) we can consider them to be capabilities to the external resources themselves.

It is possible to build capability operating systems (in fact, this was the first use of capabilities). Just as we needed to ban global variables to make a safe programming language, we need to ban global namespaces to make a capability operating system. For example, on FreeBSD this is done (on a per-process basis) by invoking the cap_enter system call.

We can zoom out even further, and consider a network of computers. Here, an arrow between machines represents some kind of (unforgeable) network address or connection. At the IP level, any process can connect to any address, but a capability system can be implemented on top. CapTP (the Capability Transport Protocol) was an early system for this, but Cap'n Proto (Capabilities and Protocols) is the modern way to do it.

So, thinking in terms of capabilities, we can zoom out to look at the security properties of the whole network, yet still be able to expand groups as needed right down to the level of individual closures in a process.

Key points

Library code can be imported and called without it getting access to any pre-existing state, except that given to it explicitly. There is no "ambient authority" available to the library.
A function's side-effects are bounded by its arguments. We can understand (get a bound on) the behaviour of a function call just by looking at it.
If a has access to b and to c, then a can introduce them (e.g. by performing the function call b c). Note that there is no capability equivalent to making something "world readable"; to perform an introduction, you need access to both the resource being granted and to the recipient ("only connectivity begets connectivity").
Instead of passing the name of a resource, we pass a capability reference (pointer) to it, thereby proving that we have access to it and sharing that access ("no designation without authority").
The caller of a function decides what it should access, and can provide restricted access by wrapping another capability, or substituting something else entirely.

I am sometimes unable to install a messaging app on my phone because it requires me to grant it access to my address book. A capability system should never say "This application requires access to the address book. Continue?"; it should say "This application requires access to an address book; which would you like to use?".
A capability must behave the same way regardless of who uses it. When we do f x, f can perform exactly the same operations on x that we can.

It is tempting to add a traditional policy language alongside capabilities for "extra security", saying e.g. "f cannot write to x, even if it has a reference to it". However, apart from being complicated and annoying, this creates an incentive for f to smuggle x to another context with more powers. This is the root cause of many real-world attacks, such as click-jacking or cross-site request forgery, where a URL permits an attack if a victim visits it, but not if the attacker does. One of the great benefits of capability systems is that you don't need to worry that someone is trying to trick you into doing something that you can do but they can't, because your ability to access the resource they give you comes entirely from them in the first place.

All of the above follow naturally from using functions in the usual way, while avoiding global variables.
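The address-book point, for example, needs nothing more than an ordinary function argument. In this sketch (all names are made up) the app asks for an address book, and the caller decides which one it gets:

```ocaml
(* An address book is just a value offering some operations. *)
type address_book = { contacts : unit -> string list }

(* The app can only use whichever book its caller passes in. *)
let messaging_app book =
  Printf.sprintf "You have %d contacts" (List.length (book.contacts ()))

let () =
  let real_book = { contacts = (fun () -> ["alice"; "bob"]) } in
  let empty_book = { contacts = (fun () -> []) } in
  print_endline (messaging_app real_book);
  print_endline (messaging_app empty_book)
```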

Practical considerations

The above discussion argues that capabilities would have been a good way to build systems in an ideal world. But given that most current operating systems and programming languages have not been designed this way, how useful is this approach? I'm currently working on Eio, an IO library for OCaml, and using these principles to guide the design. Here are a few thoughts about applying capabilities to a real system.

Plumbing capabilities everywhere

A lot of people worry about cluttering up their code by having to pass things explicitly everywhere. This is actually not much of a problem, for a couple of reasons:

We already do this with most things anyway. If your program uses a database, you probably establish a connection to it at the start and pass the connection around as needed. You probably also pass around open file handles, configuration settings, HTTP connection pools, arrays, queues, ref-cells, etc. Handling "the file-system" and "the network" the same way as everything else isn't a big deal.
You can often bundle up a capability with something else. For example, a web-server will likely let the user decide which directory to serve, so you're already passing around a pathname argument. Passing a path capability instead is no extra work.

Consider a request handler that takes the address of a Redis server:

Http.serve socket (handle_request redis_url)

It might seem that by using capabilities we'd need to pass the network in here too:

Http.serve socket (handle_request net redis_url)

This is both messy and unnecessary. Instead, handle_request can take a function for connecting to Redis:

Http.serve socket (handle_request redis)

Then there is only one argument to pass around again. Instead of writing the connection logic in handle_request, we write the same logic outside and just pass in the function. And now someone looking at the code can see "the handler can connect to Redis", rather than the less precise "the handler accesses the network". Of course, if Redis required more than one configuration setting then you'd probably already be doing it this way.
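A small sketch of that last arrangement, with toy stand-ins for the network and for Redis (the net type, its connect field and the URL are assumptions made for the example, not a real client API):

```ocaml
(* Toy network: connecting to a URL yields a function that sends a command
   and returns the reply. *)
type net = { connect : string -> (string -> string) }

(* The handler only knows how to get a Redis connection. It never sees
   [net] or the URL. *)
let handle_request redis request =
  let send = redis () in
  send ("GET " ^ request)

let () =
  let net = { connect = (fun url cmd -> Printf.sprintf "[%s] %s" url cmd) } in
  let redis_url = "redis://localhost:6379" in
  (* The connection logic lives outside the handler. *)
  let redis () = net.connect redis_url in
  print_endline (handle_request redis "/index")
```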

The main problematic case is providing defaults. For example, a TLS library might allow us to specify the location of the system's certificate store, but it would like to provide a default (e.g. /etc/ssl/certs/). This is particularly important if the default location varies by platform. If the TLS library decides the location, then we must give it (read-only at least) access to the whole system! We may just decide to trust the library, or we might separate out the default paths into a trusted package.

Levels of support

Ideally, our programming language would provide a secure implementation of capabilities that we could depend on. That would allow running untrusted code safely and protect us from compromised packages. However, converting a non-capability language to a capability-secure one isn't easy, and isn't likely to happen any time soon for OCaml (but see Emily for an old proof-of-concept).

Even without that, though, capabilities help to protect non-malicious code from malicious inputs. For example, the request handler above forgot to sanitise the URL path from the remote client, but it still can't access anything outside of htdocs.

And even if we don't care about security at all, capabilities make it easy to see what a program does; they make it easy to test programs by replacing OS resources with mocks; and preventing access to globals helps to avoid race conditions, since two functions that access the same resource must be explicitly introduced.

Running on a traditional OS

A capability OS would let us run a program's main function and provide the capabilities it wanted directly, but most systems don't work like that. Instead, each program requires a small trusted entrypoint that has the full privileges of the process. In Eio, an application will typically start something like this:

Eio_main.run @@ fun env ->
let net = Eio.Stdenv.net env in
let fs = Eio.Stdenv.fs env in
Eio.Path.with_open_dir (fs / "/srv/www") @@ fun htdocs ->
main net htdocs

Eio_main.run starts the Eio event loop and then runs the callback. The env argument gives full access to the process's environment. Here, the callback extracts network and filesystem access from this, gets access to just "/srv/www" from fs, and then calls the main function as before.

Note that Eio_main.run itself is not a capability-safe function (it magics up env from nothing). A capability-enforcing compiler would flag this bit up as needing to be audited manually.

Use with existing security mechanisms

Maybe you're not convinced by all this capability stuff. Traditional security systems are more widely available, better tested, and approved by your employer, and you want to use those instead. Still, to write the policy, you're going to need a list of resources the program might access. Looking at the above code, we can immediately see that the policy need allow access only to the "/srv/www" directory, and so we could call e.g. unveil here. And if main later changes to use TLS, the type-checker will let us know to update this code to provide the TLS configuration and we'll know to update the policy at the same time.

If you want to drop privileges, such a program also makes it easy to see when it's safe to do that. For example, looking at main we can see that net is never used after creating the socket, so we don't need the bind system call after that, and we never need connect. We know, for instance, that this program isn't hiding an XML parser that needs to download schema files to validate documents.

Thread-local storage

In addition to global and local variables, systems often allow us to attach data to threads as a sort of middle ground. This could allow unexpected interactions. For example:

let x = ref 0 in
f x;
g ()

Here, we'd expect that g doesn't have access to x, but f could pass it using thread-local storage. To prevent that, Eio instead provides Fiber.with_binding, which runs a function with a binding but then puts things back how they were before returning, so f can't make changes that are still active when g runs.

This also allows people who don't want capabilities to disable the whole system easily:

let everything = Fiber.create_key ()

let f () =
  let env = Option.get (Fiber.get everything) in
  ...

let main env =
  Fiber.with_binding everything env f

It looks like f () doesn't have access to anything, but in fact it can recover env and get access to everything! However, anyone trying to understand the code will start following env from the main entrypoint and will then see that it got put in fiber-local storage. They then at least know that they must read all the code to understand anything about what it can do.

More usefully, this mechanism allows us to make just a few things ambiently available. For example, we don't want to have to plumb stderr through to a function every time we want to do some printf debugging, so it makes sense to provide a tracing function this way (and Eio does this by default). Tracing allows all components to write debug messages, but it doesn't let them read them. Therefore, it doesn't provide a way for components to communicate with each other.
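The "puts things back how they were" behaviour of Fiber.with_binding can be modelled in a few lines of plain OCaml. This is only a toy with one global slot per key, not how Eio implements fiber-local storage, and the names `create_key`, `get` and `with_binding` here are local definitions rather than Eio's.

```ocaml
(* A toy model of scoped bindings: one slot per key, restored on exit. *)
type 'a key = 'a option ref

let create_key () : 'a key = ref None

let get (k : 'a key) : 'a option = !k

(* Run [fn] with [k] bound to [v], then restore the previous binding,
   even if [fn] raises. *)
let with_binding (k : 'a key) (v : 'a) (fn : unit -> 'b) : 'b =
  let saved = !k in
  k := Some v;
  Fun.protect ~finally:(fun () -> k := saved) fn

let secret : int key = create_key ()

(* [f] sees the binding only while it runs inside [with_binding]. *)
let f () = get secret

(* [g] runs afterwards and sees whatever was there before. *)
let g () = get secret

let () =
  let during = with_binding secret 42 f in
  let after = g () in
  assert (during = Some 42);
  assert (after = None)
```

Because the old value is restored in a `finally` handler, nothing that `f` does through this interface can leave a binding behind for `g`.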

It might be tempting to use Fiber.with_binding to restrict access to part of a program (e.g. giving an HTTP server network access this way), but note that this is a non-capability way to do things, and suffers the same problems as traditional security systems, separating designation from authority. In particular, supposedly sandboxed code in other parts of the application can try to escape by tricking the HTTP server part into running a callback function for them. But fiber-local storage is fine for things to which you don't care to restrict access.

Symlinks

Symlinks are a bit of a pain! If I have a capability reference to a directory, it's useful to know that I can only access things beneath that directory. But the directory may contain a symlink that points elsewhere.

One option would be to say that a symlink is a capability itself, but this means that you could only create symlinks to things you can access yourself, and this is quite a restriction. For example, you might be forbidden from extracting a tarball because tar didn't have permission to access the target of a symlink it wanted to create.

The other option is to say that symlinks are just strings, and it's up to the user to interpret them. This is the approach FreeBSD uses. When you use a system call like openat, you pass a capability to a base directory and a string path relative to that. In the case of our web-server, we'd use a capability for htdocs, but use strings to reference things inside it, allowing the server to follow symlinks within that sub-tree, but not outside.

The main problem is that it makes the API a bit confusing. Consider:

save_to (htdocs / "uploads")

It might look like save_to is only getting access to the "uploads" directory, but in Eio it actually gets access to the whole of htdocs. If you want to restrict access, you have to do that explicitly (as we did when creating htdocs from fs).

The advantage, however, is that we don't break software that relies on symlinks. Also, restricting access is quite expensive on some systems (FreeBSD has the handy O_BENEATH open flag, and Linux has RESOLVE_BENEATH, but not all systems provide this), so it might not be a good default. I'm not completely satisfied with the current API, though.
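To make the "capability plus string" idea concrete, here is a small model of it. It is not Eio's implementation: the `dir` and `path` types and the `resolve` function are invented for this sketch, and it ignores symlinks entirely, only showing that appending a name extends the string part while the capability part stays the same.

```ocaml
(* Toy model: a directory capability is an absolute root (as segments);
   a path is that capability plus a relative string. *)
type dir = string list
type path = dir * string

(* Appending only extends the string; the capability is unchanged. *)
let ( / ) ((root, rel) : path) (name : string) : path =
  (root, if rel = "" then name else rel ^ "/" ^ name)

(* Resolve the string against the root, rejecting anything that escapes. *)
let resolve ((root, rel) : path) : string list option =
  let step acc seg =
    match acc, seg with
    | None, _ -> None
    | Some stack, ("" | ".") -> Some stack
    | Some [], ".." -> None                 (* would leave the root *)
    | Some (_ :: up), ".." -> Some up
    | Some stack, name -> Some (name :: stack)
  in
  String.split_on_char '/' rel
  |> List.fold_left step (Some [])
  |> Option.map (fun stack -> root @ List.rev stack)

let htdocs : path = (["srv"; "www"], "")
let uploads = htdocs / "uploads"

(* [uploads] still carries the whole of [htdocs] as its capability. *)
let sibling = resolve (uploads / "../index.html")
let escape = resolve (uploads / "../../etc/passwd")
```

In this model `sibling` resolves to a file directly under the root, while `escape` is rejected, which mirrors the point that passing `htdocs / "uploads"` does not by itself confine the receiver to "uploads".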

Time and randomness

It is also possible to use capabilities to restrict access to time and randomness. The security benefits here are less clear. Tracking access to time can be useful in preventing side-channel attacks that depend on measuring time accurately, but controlling access to randomness makes it difficult to e.g. randomise hash functions to help prevent denial-of-service attacks.

However, controlling access to these does have the advantage of making code deterministic by default, which is a great benefit, especially for expect-style testing. Your top level test function is called with no arguments, and therefore has no access to non-determinism, instead creating deterministic mocks to use with the code under test. You can then just record a good trace of a test's operations and check that it doesn't change.
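A minimal sketch of what that looks like, using plain functions as the capabilities. The `clock` and `random` types, `make_token` and the mocks are all hypothetical names for this example and are not Eio's API.

```ocaml
(* Hypothetical capability types: a clock returns the current time and a
   random source returns an int below the given bound. *)
type clock = unit -> float
type random = int -> int

(* Code under test receives its sources of non-determinism as arguments. *)
let make_token ~(clock : clock) ~(random : random) (user : string) : string =
  Printf.sprintf "%s-%.0f-%04d" user (clock ()) (random 10000)

(* A mock clock that starts at 1000 and advances one second per call. *)
let mock_clock () : clock =
  let now = ref 1000.0 in
  fun () ->
    let t = !now in
    now := t +. 1.0;
    t

(* A mock random source that replays a fixed list, then returns 0. *)
let mock_random (values : int list) : random =
  let remaining = ref values in
  fun bound ->
    match !remaining with
    | [] -> 0
    | x :: rest -> remaining := rest; x mod bound

(* The test takes no arguments, so it has no access to real time or
   randomness and must build its own deterministic versions. *)
let run_test () =
  let clock = mock_clock () and random = mock_random [7; 42] in
  let a = make_token ~clock ~random "alice" in
  let b = make_token ~clock ~random "bob" in
  [a; b]
```

Since `run_test` produces the same list on every run, its output can be recorded once and compared on later runs.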

Power boxes

Interactive applications that load and save files present a small problem: since the user might load or save anywhere, it seems they need access to the whole file-system. The solution is a "powerbox". The powerbox has access to the file-system and the rest of the application only has access to the powerbox. When the application wants to save a file, it asks the powerbox, which pops up a GUI asking the user to choose the location. Then it opens the file and passes that back to the application.
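Here is one way the powerbox pattern could be sketched, with a hash table standing in for the file-system and a callback standing in for the GUI file chooser. Every name here (`fs`, `file`, `powerbox`, `choose_path`) is an assumption made for the example.

```ocaml
(* A toy file-system: paths mapped to contents. Only the powerbox holds it. *)
type fs = (string, string) Hashtbl.t

(* What the application gets back: the ability to write one file. *)
type file = { write : string -> unit }

(* The powerbox closes over the file-system and a way to ask the user. *)
type powerbox = { save_as : unit -> file option }

let make_powerbox (fs : fs) ~(choose_path : unit -> string option) : powerbox =
  { save_as = (fun () ->
      match choose_path () with
      | None -> None                      (* the user cancelled *)
      | Some path ->
        Some { write = (fun data -> Hashtbl.replace fs path data) }) }

(* The application only ever sees the powerbox, never [fs]. *)
let app (pb : powerbox) (document : string) : bool =
  match pb.save_as () with
  | None -> false
  | Some file -> file.write document; true

let fs : fs = Hashtbl.create 8
let pb = make_powerbox fs ~choose_path:(fun () -> Some "/home/user/notes.txt")
let saved = app pb "hello"
```

The application can write only to the location the user picked, because that is the only `file` value it is ever handed.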

Conclusions

Currently-popular security mechanisms are complex and have many shortcomings. Yet, the lambda calculus already contains an excellent security mechanism, and making use of it requires little more than avoiding global variables.

This is known as "capability-based security". The word "capabilities" has also been used for several unrelated concepts (such as "POSIX capabilities"), and for clarity much of the community rebranded a while back as "Object Capabilities", but this can make it seem irrelevant to functional programmers. In fact, I wrote this blog post because several OCaml programmers have asked me what the point of capabilities is. I was expecting it to be quite short (basically: applying functions to arguments good, global variables bad), but it's got quite long; it seems there is a fair bit that follows from this simple idea!

Instead of seeing security as an extra layer that runs separately from the code and tries to guess what it meant to do, capabilities fit naturally into the language. The key difference with traditional security is that the ability to do something depends on the reference used to do it, not on the identity of the caller. This way of thinking about security works not only for controlling access to resources within a single program, but also for controlling interactions between processes running on a machine, and between machines on a network. We can group together resources and zoom out to see the overall picture, or expand groups to zoom in and get a closer bound on the behaviour.

Even ignoring security, a key question is: what can a function do? Should a function call be able to do anything at all that the process can do, or should its behaviour be bounded in some way that is obvious just by looking at it? If we say that you must read the source code of a function to see what it does, then this applies recursively: we must also read all the functions that it calls, and so on. To understand the main function, we end up having to read the code of every library it uses!

If you want to read more, the What Are Capabilities? blog post provides a good overview; Part II of Robust Composition contains a longer explanation; Capability Myths Demolished does a good job of enumerating security properties provided by capabilities; my own SERSCIS Access Modeller paper shows how to analyse systems where some components have unknown behaviour; and, for historical interest, see Dennis and Van Horn's 1966 Programming Semantics for Multiprogrammed Computations, which introduced the idea.

Isolating Xwayland in a VM

2021-10-30T10:00:00+00:00

In my last post, Qubes-lite with KVM and Wayland, I described setting up a Qubes-inspired Linux system that runs applications in virtual machines. A Wayland proxy running in each VM connects its applications to the host Wayland compositor over virtwl, allowing them to appear on the desktop alongside normal host applications. In this post, I extend this to support X11 applications using Xwayland.

Table of Contents

Overview
Introduction to X11
Running Xwayland
The X11 protocol
Initialising the window manager
Windows
Performance
Pointer events
Keyboard events
Pointer cursor
Selections
Drag-and-drop
Bonus features
Conclusions

(This post also appeared on Hacker News.)

Overview

A graphical desktop typically allows running multiple applications on a single display (e.g. by showing each application in a separate window). Client applications connect to a server process (usually on the same machine) and ask it to display their windows.

Until recently, this service was an X server, and applications would communicate with it using the X11 protocol. However, on newer systems the display is managed by a Wayland compositor, using the Wayland protocol.

Many older applications haven't been updated yet. Xwayland can be used to allow unmodified X11 applications to run in a Wayland desktop environment. However, setting this up wasn't as easy as I'd hoped. Ideally, Xwayland would completely isolate the Wayland compositor from needing to know anything about X11:

Fantasy Xwayland architecture

However, it doesn't work like this. Xwayland handles X11 drawing operations, but it doesn't handle lots of other details, including window management (e.g. telling the Wayland compositor what the window title should be), copy-and-paste, and selections. Instead, the Wayland compositor is supposed to connect back to Xwayland over the X11 protocol and act as an X11 window manager to provide the missing features:

Actual Xwayland architecture

This is a problem for several reasons:

It means that every Wayland compositor has to implement not only the new Wayland protocol, but also the old X11 protocol.
The compositor is part of the trusted computing base (it sees all your keystrokes and window contents) and this adds a whole load of legacy code that you'd need to audit to have confidence in it.
It doesn't work when running applications in VMs, because each VM needs its own Xwayland service and existing compositors can only manage one.

Because Wayland (unlike X11) doesn't allow applications to mess with other applications' windows, we can't have a third-party application act as the X11 window manager. It wouldn't have any way to ask the compositor to put Xwayland's surfaces into a window frame, because Xwayland is a separate application.

There is another way to do it, however. As I mentioned in the last post, I already had to write a Wayland proxy (wayland-proxy-virtwl) to run in each VM and relay Wayland messages over virtwl, so I decided to extend it to handle Xwayland too. As a bonus, the proxy can also be used even without VMs, avoiding the need for any X11 support in Wayland compositors at all. In fact, I found that doing this avoided several bugs in Sway's built-in Xwayland support.

Sommelier already has support for this, but it doesn't work for the applications I want to use. For example, popup menus appear in the center of the screen, text selections don't work, and it generally crashes after a few seconds (often with the error xdg_surface has never been configured). So instead I'd been using ssh -Y vm from the host to forward X11 connections to the host's Xwayland, managed by Sway. That works, but it's not at all secure.

Introduction to X11

Unlike Wayland, where applications are mostly unaware of each other, X is much more collaborative. The X server maintains a tree of windows (rectangles) and the applications manipulate it. The root of the tree is called the root window and fills the screen. You can see the tree using the xwininfo command, like this:

$ xwininfo -tree -root

xwininfo: Window id: 0x47 (the root window) (has no name)

  Root window id: 0x47 (the root window) (has no name)
  Parent window id: 0x0 (none)
     9 children:
     0x800112 "~/Projects/wayland/wayland-proxy-virtwl": ("ROX-Filer" "ROX-Filer")  2184x2076+0+0  +0+0
        1 child:
        0x800113 (has no name): ()  1x1+-1+-1  +-1+-1
     0x800123 (has no name): ()  1x1+-1+-1  +-1+-1
     0x800003 "ROX-Filer": ()  10x10+-100+-100  +-100+-100
     0x800001 "ROX-Filer": ("ROX-Filer" "ROX-Filer")  10x10+10+10  +10+10
        1 child:
        0x800002 (has no name): ()  1x1+-1+-1  +9+9
     0x600002 "main.ml (~/Projects/wayland/wayland-proxy-virtwl) - GVIM1": ("gvim" "Gvim")  1648x1012+0+0  +0+0
        1 child:
        0x600003 (has no name): ()  1x1+-1+-1  +-1+-1
     0x600007 (has no name): ()  1x1+-1+-1  +-1+-1
     0x600001 "Vim": ("gvim" "Gvim")  10x10+10+10  +10+10
     0x200002 (has no name): ()  1x1+0+0  +0+0
     0x200001 (has no name): ()  1x1+0+0  +0+0

This tree shows the windows of two X11 applications, ROX-Filer and GVim, as well as various invisible utility windows (mostly 1x1 or 10x10 pixels in size).

Applications can create, move, resize and destroy windows, draw into them, and request events from them. The X server also allows arbitrary data to be attached to windows in properties. You can see a window's properties with xprop. Here are some of the properties on the GVim window:

$ xprop -id 0x600002
WM_HINTS(WM_HINTS):
		Client accepts input or input focus: True
		Initial state is Normal State.
		window id # of group leader: 0x600001
_NET_WM_WINDOW_TYPE(ATOM) = _NET_WM_WINDOW_TYPE_NORMAL
WM_NORMAL_HINTS(WM_SIZE_HINTS):
		program specified minimum size: 188 by 59
		program specified base size: 188 by 59
		window gravity: NorthWest
WM_CLASS(STRING) = "gvim", "Gvim"
WM_NAME(STRING) = "main.ml (~/Projects/wayland/wayland-proxy-virtwl) - GVIM1"
...

The X server itself doesn't know anything about e.g. window title bars. Instead, a window manager process connects and handles that. A window manager is just another X11 application. It asks to be notified when an application tries to show ("map") a window inside the root, and when that happens it typically creates a slightly larger window (with room for the title bar, etc) and moves the other application's window inside that.

This design gives X a lot of flexibility. All kinds of window managers have been implemented, without needing to change the X server itself. However, it is very bad for security. For example:

Open an xterm.
Use xwininfo to find its window ID (you need the nested child window, not the top-level one).
Run xev -id 0x80001b -event keyboard in another window (using the ID you got above).
Use sudo or similar inside xterm and enter a password.

As you type the password into xterm, you should see the characters being captured by xev. An X application can easily spy on another application, send it synthetic events, etc.

Running Xwayland

Xwayland is a version of the xorg X server that treats Wayland as its display hardware. If you run it as e.g. Xwayland :1 then it opens a single Wayland window corresponding to the X root window, and you can use it as a nested desktop. This isn't very useful, because these windows don't fit in with the rest of your desktop. Instead, it is normally used in rootless mode, where each child of the X root window may have its own Wayland window.

$ WAYLAND_DEBUG=1 Xwayland :1 -rootless
[3991465.523]  -> wl_display@1.get_registry(new id wl_registry@2)
[3991465.531]  -> wl_display@1.sync(new id wl_callback@3)
...

When run this way, however, no windows actually appear. If we run DISPLAY=:1 xterm then we see Xwayland creating some buffers, but no surfaces:

[4076460.506]  -> wl_shm@4.create_pool(new id wl_shm_pool@15, fd 9, 540)
[4076460.520]  -> wl_shm_pool@15.create_buffer(new id wl_buffer@24, 0, 9, 15, 36, 0)
[4076460.526]  -> wl_shm_pool@15.destroy()
...

We need to run Xwayland as Xwayland :1 -rootless -wm FD, where FD is a socket we will use to speak the X11 protocol and act as a window manager.

It's a little hard to find information about Xwayland's rootless mode, because "rootless" has two separate common meanings in xorg:

Running xorg without root privileges.
Using xorg's miext/rootless extension to display application windows on some other desktop.

After a while, it became clear that Xwayland's rootless mode isn't either of these, but a third xorg feature also called "rootless".

The X11 protocol

libxcb provides C bindings to the X11 protocol, but I wanted to program in OCaml. Luckily, the X11 protocol is well documented, and generating the messages directly didn't look any harder than binding libxcb, so I wrote a little OCaml library to do this (ocaml-x11).

At first, I hard-coded the messages. For example, here's the code to delete a property on a window:

module Delete = struct
  [%%cstruct
    type req = {
      window : uint32_t;
      property : uint32_t;
    } [@@little_endian]
  ]

  let send t window property =
    Request.send_only t ~major:19 sizeof_req @@ fun r ->
    set_req_window r window;
    set_req_property r property
end

I'm using the cstruct syntax extension to let me define the exact layout of the message body. Here, it generates sizeof_req, set_req_window and set_req_property automatically.
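For comparison, here is roughly what ends up on the wire for this request, built by hand with the standard library's Bytes module. It assumes that Request.send_only adds the usual 4-byte X11 request header (opcode, one unused byte, and the length in 4-byte units) in front of the body; the `delete_property` and `hex` helpers are just for this sketch and are not part of ocaml-x11.

```ocaml
(* Build the 12-byte DeleteProperty request for a little-endian connection:
   1-byte major opcode, 1 unused byte, 2-byte length in 4-byte units,
   then the 4-byte window ID and the 4-byte property atom. *)
let delete_property ~(window : int32) ~(property : int32) : Bytes.t =
  let b = Bytes.make 12 '\000' in
  Bytes.set_uint8 b 0 19;             (* major opcode, as in the code above *)
  Bytes.set_uint16_le b 2 3;          (* 12 bytes = 3 four-byte units *)
  Bytes.set_int32_le b 4 window;
  Bytes.set_int32_le b 8 property;
  b

(* Render the bytes as space-separated hex for inspection. *)
let hex (b : Bytes.t) : string =
  String.concat " "
    (List.map (fun c -> Printf.sprintf "%02x" (Char.code c))
       (List.of_seq (Bytes.to_seq b)))

(* Example values: the GVim window from earlier and atom 39 (WM_NAME). *)
let example = hex (delete_property ~window:0x600002l ~property:39l)

let () = print_endline example
```

The cstruct version describes only the two body fields, which is why it has nothing corresponding to the first four bytes here.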

After a bit, I discovered that there are XML files in xcbproto describing the X11 protocol. This provides a Python library for parsing the XML, which you can use by writing a Python script for your language of choice. For example, this glorious 3394 line Python script generates the C bindings. After studying this script carefully, I decided that hard-coding everything wasn't so bad after all.

I ended up having to implement more messages than I expected, including some surprising ones like OpenFont (see x11.mli for the final list). My implementation came to 1754 lines of OCaml, which is quite a bit shorter than the Python generator script, so I guess I still came out ahead!

In the X11 protocol, client applications send requests and the server sends replies, errors and events. Most requests don't produce replies, but can produce errors. Replies and errors are returned in order, so if you see a response to a later request, you know all previous ones succeeded. If you care about whether a request succeeded, you may need to send a dummy message that generates a reply after it. Since message sequence numbers are 16-bit, after sending 0xffff consecutive requests without replies, you should send a dummy one with a reply to resynchronise (but window management involves lots of round-trips, so this isn't likely to be a problem for us). Events can be sent by the server at any time.
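The bookkeeping this implies can be sketched as a queue of outstanding requests. This is an illustration rather than the code in ocaml-x11: the `tracker` type and its functions are invented here, requests are numbered from 1, and it assumes that fewer than 0x10000 requests are outstanding and that every response refers to a request still in the queue.

```ocaml
type outcome = Ok_assumed | Replied | Failed

type tracker = {
  mutable next_seq : int;               (* full-width count of requests sent *)
  pending : (int * string) Queue.t;     (* awaiting an outcome, oldest first *)
  mutable results : (string * outcome) list;
}

let create () = { next_seq = 1; pending = Queue.create (); results = [] }

let send t name =
  Queue.add (t.next_seq, name) t.pending;
  t.next_seq <- t.next_seq + 1

(* Handle a reply or error carrying the 16-bit sequence number [wire_seq]. *)
let response t ~wire_seq ~is_error =
  let rec drain () =
    match Queue.peek_opt t.pending with
    | Some (seq, name) when (seq land 0xffff) <> wire_seq ->
      (* An earlier request: the server got past it without complaint. *)
      ignore (Queue.pop t.pending);
      t.results <- (name, Ok_assumed) :: t.results;
      drain ()
    | Some (_, name) ->
      ignore (Queue.pop t.pending);
      t.results <- (name, if is_error then Failed else Replied) :: t.results
    | None -> ()
  in
  drain ()

(* Two requests with no reply, then a dummy one that does reply. *)
let t = create ()
let () =
  send t "MapWindow";
  send t "ChangeProperty";
  send t "GetInputFocus";
  response t ~wire_seq:3 ~is_error:false
```

After the single reply arrives, all three requests are resolved: the first two are known to have succeeded only because a response to a later request was seen.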

Unlike Wayland, which is very regular, X11 has various quirks. For example, every event has a sequence number at offset 2, except for KeymapNotify.

Initialising the window manager

Using Xwayland -wm FD actually prevents any client applications from connecting at all at first, because Xwayland then waits for the window manager to be ready before accepting any client connections.

To fix that, we need to claim ownership of the WM_S0 selection. A "selection" is something that can be owned by only one application at a time. Selections were originally used to track ownership of the currently-selected text, and later also used for the clipboard. WM_S0 means "Window Manager for Screen 0" (Xwayland only has one screen).

(* Become the window manager. This allows other clients to connect. *)
let* wm_sn = intern t ~only_if_exists:false ("WM_S" ^ string_of_int i) in
X11.Selection.set_owner x11 ~owner:(Some root) ~timestamp:`CurrentTime wm_sn

Instead of passing things like WM_S0 as strings in each request, X11 requires us to first intern the string. This returns a unique 32-bit ID for it, which we use in future messages. Because intern may require a round-trip to the server, it returns a promise, and so we use let* instead of let to wait for that to resolve before continuing. let* is defined in the Lwt.Syntax module, as an alternative to the more traditional >>= notation.

This lets our clients connect. However, Xwayland still isn't creating any Wayland surfaces. By reading the Sommelier code and stepping through Xwayland with a debugger, I found that I needed to enable the Composite extension.

Composite was originally intended to speed up redraw operations, by having the server keep a copy of every top-level window's pixels (even when obscured), so that when you move a window it can draw it right away without asking the application for help. The application's drawing operations go to the window's buffer, and then the buffer is copied to the screen, either automatically by the X server or manually by the window manager. Xwayland reuses this mechanism, by turning each window buffer into a Wayland surface. We just need to turn that on:

let* composite = X11.Composite.init x11 in
let* () = X11.Composite.redirect_subwindows composite ~window:root ~update:`Manual in

This says that every child of the root window should use this system. Finally, we see Xwayland creating Wayland surfaces:

-> wl_compositor@5.create_surface id:+28

Now we just need to make them appear on the screen!

Windows

As usual for Wayland, we need to create a role object and attach it to the surface. This tells Wayland whether the surface is a window or a dialog, for example, and lets us set the title, etc.

But first we have a problem: we need to know which X11 window corresponds to each Wayland surface. For example, we need the title, which is stored in a property on the X11 window. Xwayland does this by sending the new window a ClientMessage event of type WL_SURFACE_ID containing the Wayland ID. We don't get this message by default, but it seems that selecting SubstructureRedirect on the root does the trick.

SubstructureRedirect is used by window managers to intercept attempts by other applications to change the children of the root window. When an application asks the server to e.g. map a window, the server just forwards the request to the window manager. Operations performed by the window manager itself do not get redirected, so it can just perform the same request the client wanted, or make any changes it requires.

In our case, we don't actually need to modify the request, so we just re-perform the original map operation:

let event_handler = object (_ : X11.Event.handler)
  method map_request ~window = X11.Window.map x11 window

  method client_message ~window ~ty body =
      if ty = wl_surface_id then (
        let wayland_id = Cstruct.LE.get_uint32 body 0 in
        Log.info (fun f -> f "X window %a corresponds to Wayland surface %ld" X11.Window.pp window wayland_id);
        pair_when_ready ~x11 t window wayland_id
      )

Having two separate connections to Xwayland is quite annoying, because messages can arrive in any order. We might get the X11 ClientMessage first and need to wait for the Wayland create_surface, or we might get the create_surface first and need to wait for the ClientMessage.

An added complication is that not all Wayland surfaces correspond to X11 windows. For example, Xwayland also creates surfaces representing cursor shapes, and these don't have X11 windows. However, when we get the ClientMessage we can be sure that a Wayland message is on the way, so I just pause the X11 event handling until that has arrived:

(* We got an X11 message saying X11 [window] corresponds to Wayland surface [wayland_id].
   Turn [wayland_id] into an xdg_surface. If we haven't seen that surface yet, wait until it appears
   on the Wayland socket. *)
let rec pair_when_ready ~x11 t window wayland_id =
  match Hashtbl.find_opt t.unpaired wayland_id with
  | None ->
    Log.info (fun f -> f "Unknown Wayland object %ld; waiting for surface to be created..." wayland_id);
    let* () = Lwt_condition.wait t.unpaired_added in
    pair_when_ready ~x11 t window wayland_id
  | Some { client_surface = _; host_surface; set_configured } ->
    Log.info (fun f -> f "Setting up Wayland surface %ld using X11 window %a" wayland_id X11.Xid.pp window);
    Hashtbl.remove t.unpaired wayland_id;
    Lwt.async (fun () -> pair t ~set_configured ~host_surface window);
    Lwt.return_unit

Another complication is that Wayland doesn't allow you to attach a buffer to a surface until the window has been "configured". Doing so is a protocol error, and Sway will disconnect us if we try! But Xwayland likes to attach the buffer immediately after creating the surface.

To avoid this, I use a queue:

Xwayland asks to create a surface.
We forward this to Sway, add its ID to the unpaired map, and create a queue for further events.
Xwayland asks us to attach a buffer, etc. We just queue these up.
We get the ClientMessage over the X11 connection and create a role for the new surface.
Sway sends us a configure event, confirming it's ready for the buffer.
We forward the queued events.

However, this creates a new problem: if the surface isn't a window then the events will be queued forever. To fix that, when we get a create_surface we also do a round-trip on the X11 connection. If the window is still unpaired when that returns then we know that no ClientMessage is coming, and we flush the queue.
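The queueing scheme can be modelled as a two-state machine per surface. This is a simplified sketch with events as strings; the `surface` type and the `on_event` and `flush` names are made up here and do not come from wayland-proxy-virtwl.

```ocaml
type state =
  | Queued of string Queue.t     (* waiting: events are held back *)
  | Flowing                      (* events are forwarded immediately *)

type surface = {
  mutable state : state;
  mutable forwarded : string list;   (* what the compositor has seen, newest first *)
}

let create_surface () = { state = Queued (Queue.create ()); forwarded = [] }

let forward s ev = s.forwarded <- ev :: s.forwarded

(* An event from Xwayland for this surface. *)
let on_event s ev =
  match s.state with
  | Queued q -> Queue.add ev q
  | Flowing -> forward s ev

(* Called when the compositor sends configure, or when the X11 round-trip
   returns and the surface is still unpaired. *)
let flush s =
  match s.state with
  | Flowing -> ()
  | Queued q ->
    s.state <- Flowing;
    Queue.iter (forward s) q

let s = create_surface ()
let () =
  on_event s "attach";
  on_event s "commit";
  flush s;
  on_event s "damage"
```

Both exits from the waiting state go through the same `flush`, so a surface that never gets a ClientMessage ends up forwarding its events just like a configured window does.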

X applications like to create dummy windows for various purposes (e.g. receiving clipboard data), and we need to avoid showing those. They're normally set as override_redirect so the window manager doesn't handle them, but Xwayland redirects them anyway (it needs to because otherwise e.g. tooltips wouldn't appear at all). I'm trying various heuristics to detect this, e.g. that override redirect windows with a size of 1x1 shouldn't be shown.

If Sway asks us to close a window, we need to relay that to the X application using the WM_DELETE_WINDOW protocol, if it supports that:

let toplevel = Xdg_surface.get_toplevel xdg_surface @@ object
    inherit [_] Xdg_toplevel.v1

    method on_close _ =
      Lwt.async (fun () ->
          let* x11 = t.x11 in
          let* wm_protocols = X11.Atom.intern x11 "WM_PROTOCOLS"
          and* wm_delete_window = X11.Atom.intern x11 "WM_DELETE_WINDOW" in
          let* protocols = X11.Property.get_atoms x11 window wm_protocols in
          if List.mem wm_delete_window protocols then (
            let data = Cstruct.create 8 in
            Cstruct.LE.set_uint32 data 0 (wm_delete_window :> int32);
            Cstruct.LE.set_uint32 data 4 0l;
            X11.Window.send_client_message x11 window ~fmt:32 ~propagate:false ~event_mask:0l ~ty:wm_protocols data;
          ) else (
            X11.Window.destroy x11 window
          )
        )
  end

Wayland defaults to using client-side decorations (where the application draws its own window decorations). X doesn't do that, so we need to turn it off (if the Wayland compositor supports the decoration manager extension):

t.decor_mgr |> Option.iter (fun decor_mgr ->
    let decor = Xdg_decor_mgr.get_toplevel_decoration decor_mgr ~toplevel @@ object
        inherit [_] Xdg_decoration.v1
        method on_configure _ ~mode:_ = ()
      end
    in
    Xdg_decoration.set_mode decor ~mode:Xdg_decoration.Mode.Server_side
  )

Dialog boxes are more of a problem. Wayland requires every dialog box to have a parent window, but X11 doesn't. To handle that, the proxy tracks the last window the user interacted with and uses that as a fallback parent if an X11 window with type _NET_WM_WINDOW_TYPE_DIALOG is created without setting WM_TRANSIENT_FOR. That could be a problem if the application closes that window, but it seems to work.
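The fallback rule is small enough to write down as a function. This is a sketch of the rule as described above, not the proxy's actual code; the record types and the `parent_for` name are assumptions.

```ocaml
type window = { id : int; is_dialog : bool; transient_for : int option }

(* Remembers the last window the user interacted with. *)
type focus = { mutable last_active : int option }

let on_interaction focus w = focus.last_active <- Some w.id

(* Choose the Wayland parent for a new X11 window. *)
let parent_for focus w =
  match w.transient_for with
  | Some _ as parent -> parent                    (* the application told us *)
  | None when w.is_dialog -> focus.last_active    (* fall back to a guess *)
  | None -> None                                  (* ordinary top-level window *)

let focus = { last_active = None }
let editor = { id = 0x600002; is_dialog = false; transient_for = None }
let orphan_dialog = { id = 0x600010; is_dialog = true; transient_for = None }
let () = on_interaction focus editor
let chosen = parent_for focus orphan_dialog
```

Here `chosen` is the editor window, since it was the last one used and the dialog did not name a parent of its own.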

Performance

I noticed a strange problem: scrolling around in GVim had long pauses once a second or so, corresponding to OCaml GC runs. This was surprising, as OCaml has a fast incremental garbage collector, and is normally not a problem for interactive programs. Besides, I'd been using the proxy with the (Wayland) Firefox and xfce4-terminal applications for 6 months without any similar problem.

Using perf showed that Linux was spending a huge amount of time in release_pages. The problem is that Xwayland was sharing lots of short-lived memory pools with the proxy. Each time it shares a pool, we have to ask the VM host for a chunk of memory of the same size. We map both pools into our address space and then copy each frame across (this is needed because we can't export guest memory to the host).

Normally, an application shares a single pool and just refers to regions within it, so we just map once at startup and unmap at exit. But Xwayland was creating, sharing and discarding around 100 pools per second while scrolling in GVim! Because these pools take up a lot of RAM, OCaml was (correctly) running the GC very fast, freeing them in batches of 100 or so each second.

First, I tried adding a cache of host memory, but that only solved half the problem: freeing the client pool was still slow.

Another option is to unmap the pools as soon as we get the destroy message, to spread the work out. Annoyingly, OCaml's standard library doesn't let you free memory-mapped memory explicitly (see the Add BigArray.Genarray.free PR for the current status), but adding this myself with a bit of C code would have been easy enough. We only touch the memory in one place (for the copy), so manually checking it hadn't been freed would have been pretty safe.

Then I noticed something interesting about the repeated log entries, which mostly looked like this:

-> wl_shm@4.create_pool id:+26 fd:(fd) size:8368360
-> wl_shm_pool@26.create_buffer id:+28 offset:0 width:2090 height:1001 stride:8360 format:1
-> wl_shm_pool@26.destroy 
<- wl_display@1.delete_id id:26
-> wl_buffer@28.destroy 
<- wl_display@1.delete_id id:28

Xwayland creates a pool, allocates a buffer within it, destroys the pool (so it can't create more buffers), and then deletes the buffer. But it never uses the buffer for anything!
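
The numbers in that log are consistent with a single full-window buffer filling the whole pool. A quick check of the arithmetic, assuming format 1 is the 32-bit XRGB8888 format (the helper names are made up for this example):

```ocaml
let bytes_per_pixel = 4   (* assumption: wl_shm format 1 is XRGB8888 *)

(* Smallest stride (bytes per row) that can hold [width] pixels. *)
let min_stride ~width = width * bytes_per_pixel

let buffer_bytes ~stride ~height = stride * height

(* Does a buffer at [offset] fit inside a pool of [pool_size] bytes? *)
let fits_in_pool ~pool_size ~offset ~stride ~height =
  offset >= 0 && offset + buffer_bytes ~stride ~height <= pool_size

let () =
  assert (min_stride ~width:2090 = 8360);
  assert (buffer_bytes ~stride:8360 ~height:1001 = 8368360);
  assert (fits_in_pool ~pool_size:8368360 ~offset:0 ~stride:8360 ~height:1001)
```

So each of these unused pools is about 8 MB, which explains why creating around 100 of them per second kept the GC so busy.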

So the solution was simple: I just made the host buffer allocation and the mapping operations lazy. We force the mapping if a pool's buffer is ever attached to a surface, but if not we just close the FD and forget about it. It would be more efficient if Xwayland only shared the pools when needed, though.
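
A minimal sketch of the idea using OCaml's Lazy module. This is not the proxy's actual code: map_pool is a hypothetical stand-in for the host allocation and memory mapping, and here it just allocates some bytes and counts how often it was called.

```ocaml
(* Hypothetical stand-in for the expensive work: here "mapping" a pool
   just allocates some bytes and records that it happened. *)
let mappings_performed = ref 0

let map_pool ~size =
  incr mappings_performed;
  Bytes.create size

type pool = {
  size : int;
  mapping : Bytes.t Lazy.t;   (* not computed until first forced *)
}

let create_pool ~size = { size; mapping = lazy (map_pool ~size) }

(* Called only when one of the pool's buffers is attached to a surface. *)
let attach pool = Lazy.force pool.mapping

(* On destroy: if this is false, there is nothing to unmap. *)
let was_mapped pool = Lazy.is_val pool.mapping

let () =
  let unused = create_pool ~size:8368360 in
  let used = create_pool ~size:4096 in
  ignore (attach used : Bytes.t);
  ignore (attach used : Bytes.t);   (* forcing twice still maps only once *)
  assert (!mappings_performed = 1);
  assert (was_mapped used);
  assert (not (was_mapped unused))
```

A pool that is created and destroyed without ever being attached costs nothing beyond the record itself.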

Pointer events

Wayland delivers pointer events relative to a surface, so we simply forward these on to Xwayland unmodified and everything just works.

I'm kidding - this was the hardest bit! When Xwayland gets a pointer event on a window, it doesn't send it directly to that window. Instead, it converts the location to screen coordinates and then pushes the event through the old X event handling mechanism, which looks at the X11 window stack to decide where to send it.

However, the X11 window stack (which we saw earlier with xwininfo -tree -root) doesn't correspond to the Wayland window layout at all. In fact, Wayland doesn't provide us any way to know where our windows are, or how they are stacked.

Sway seems to handle this via a backdoor: X11 applications do get access to location information even though native Wayland clients don't. This is one of the reasons I want to get X11 support out of the compositor - I want to make sure X11 apps don't have any special access. Sommelier has a solution though: when the pointer enters a window we raise it to the top of the X11 stack. Since it's the topmost window, it will get the events.

Unfortunately, the raise request goes over the X11 connection while the pointer events go over the Wayland one. We need to make sure that they arrive in the right order. If the computer is running normally, this isn't much of a problem, but if it's swapping or otherwise struggling it could result in events going to the wrong place (I temporarily added a 2-second delay to test this). This is what I ended up with:

Get a wayland pointer enter event from Sway.
Pause event delivery from Sway.
Flush any pending Wayland events we previously sent to Xwayland by doing a round-trip on the Wayland connection.
Send a raise on the X11 connection.
Do a round-trip on the X11 connection to ensure the raise has completed.
Forward the enter event on the Wayland connection.
Unpause the event stream from Sway.

At first I tried queuing up just the pointer events, but that doesn't work because e.g. keyboard events need to be synchronised with pointer events. Otherwise, if you e.g. Shift-click on something then the click gets delayed but the Shift doesn't and it can do the wrong thing. Also, Xwayland might ask Sway to destroy the window while we're entering it, and Sway might confirm the deletion. Pausing the whole event stream from Sway fixes all these problems.
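
The pausing logic can be sketched like this. It is a synchronous toy model, not the proxy's code: the event type and function names are invented for the example, and sync stands in for the two round-trips and the raise.

```ocaml
type event = Enter of string | Key of string | Button of int

type relay = {
  mutable paused : bool;
  pending : event Queue.t;          (* events held back while paused *)
  mutable delivered : event list;   (* what the client has seen, newest first *)
}

let create () = { paused = false; pending = Queue.create (); delivered = [] }

let deliver t ev = t.delivered <- ev :: t.delivered

(* Every event from the compositor goes through here, whatever its type. *)
let on_event t ev =
  if t.paused then Queue.add ev t.pending else deliver t ev

let pause t = t.paused <- true

let unpause t =
  t.paused <- false;
  Queue.iter (deliver t) t.pending;   (* release held events in order *)
  Queue.clear t.pending

(* [sync] stands in for: flush Wayland, send the raise, round-trip on X11. *)
let handle_enter t ~sync window =
  pause t;
  sync ();
  deliver t (Enter window);
  unpause t

let () =
  let t = create () in
  (* While we are waiting for the round-trips, the user Shift-clicks. *)
  let sync () = on_event t (Key "Shift"); on_event t (Button 1) in
  handle_enter t ~sync "gvim";
  assert (List.rev t.delivered = [Enter "gvim"; Key "Shift"; Button 1])
```

Because the key and button events share one queue, the Shift and the click stay together and both arrive after the enter.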

The next problem was how to do the two round-trips. For X11 we just send an InternAtom request after the raise and wait to get a reply to that. Wayland provides the wl_display.sync method to clients, but we're acting as a Wayland server to Xwayland, not a client. I remembered that Wayland's xdg-shell extension provides a ping from the server to the client (the compositor can use this to detect when an application is not responding). Unfortunately, Xwayland has no reason to use this extension because it doesn't deal with window roles. Luckily, it uses it anyway (it does need it for non-rootless mode and doesn't bother to check).

wl_display.sync works by creating a fresh callback object, but xdg-shell's ping just sends a pong event to a fixed object, so we also need a queue to keep track of pings in flight so we don't get confused between our pings and any pings we're relaying for Sway. Also, xdg-shell's ping requires a serial number and we don't have one. But since Xwayland is the only app this needs to support, and it doesn't look at that, I cheat and just send zero.
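
One way to keep track of pings in flight is a FIFO queue, sketched below. It assumes pongs come back in the same order as the pings were sent; the type and function names are hypothetical, and sending is simulated by appending to a list.

```ocaml
type ping =
  | Ours of (unit -> unit)   (* our own sync point: run this on pong *)
  | Relayed of int32         (* the compositor's ping: forward the pong *)

let in_flight : ping Queue.t = Queue.create ()

(* Simulated wire traffic, newest first. *)
let pings_sent_to_client : int32 list ref = ref []
let pongs_sent_to_compositor : int32 list ref = ref []

let send_ping serial = pings_sent_to_client := serial :: !pings_sent_to_client

let sync_with_client callback =
  Queue.add (Ours callback) in_flight;
  send_ping 0l               (* no real serial available, so send zero *)

let relay_ping serial =
  Queue.add (Relayed serial) in_flight;
  send_ping serial

(* The head of the queue tells us whose ping this pong answers. *)
let on_pong () =
  match Queue.take_opt in_flight with
  | None -> ()               (* unexpected pong: ignore it *)
  | Some (Ours callback) -> callback ()
  | Some (Relayed serial) ->
    pongs_sent_to_compositor := serial :: !pongs_sent_to_compositor

let () =
  let synced = ref false in
  relay_ping 42l;
  sync_with_client (fun () -> synced := true);
  on_pong ();                (* answers the relayed ping *)
  assert (!pongs_sent_to_compositor = [42l] && not !synced);
  on_pong ();                (* answers our own ping *)
  assert !synced
```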

And that's how to get pointer events to go to the right window with Xwayland.

Keyboard events

A very similar problem exists with the keyboard. When Wayland says the focus has entered a window we need to send a SetInputFocus over the X11 connection and then send the keyboard events over the Wayland one, requiring another two round-trips to synchronise the two connections.

Pointer cursor

Some applications set their own pointer shape, which works fine. But others rely on the default and for some reason you get no cursor at all in that case. To fix it, you need to set a cursor on the root window, which applications will then inherit by default. Unlike Wayland, where every application provides its own cursor bitmaps, X very sensibly provides a standard set of cursors, in a font called cursor (this is why I had to implement OpenFont). As cursors have two colours and a mask, each cursor is two glyphs: even numbered glyphs are the image and the following glyph is its mask:

(* Load the default cursor image *)
let* cursor_font = X11.Font.open_font x11 "cursor" in
let* default_cursor = X11.Font.create_glyph_cursor x11
    ~source_font:cursor_font ~mask_font:cursor_font
    ~source_char:68 ~mask_char:69
    ~bg:(0xffff, 0xffff, 0xffff)
    ~fg:(0, 0, 0)
in
X11.Window.create_attributes ~cursor:default_cursor ()
|> X11.Window.change_attributes x11 root

Selections

The next job was to get copying text between X and Wayland working.

In X11:

When you select something, the application takes ownership of the PRIMARY selection.
When you click the middle button or press Shift-Insert, the application requests PRIMARY.
When you press Ctrl-C, the application takes ownership of the CLIPBOARD selection.
When you press Ctrl-V it requests CLIPBOARD.

It's quite neat that adding support for a Windows-style clipboard didn't require changing the X server at all. Good forward-thinking design there.

In Wayland, things are not so simple. I have so far found no fewer than four separate Wayland protocols for copying text:

gtk_primary_selection supports copying the primary selection, but not the clipboard.
wp_primary_selection_unstable_v1 is identical to gtk_primary_selection except that it renames everything.
wl_data_device_manager supports clipboard transfers but not the primary selection.
zwlr_data_control_manager_v1 supports both, but it's for a "privileged client" to be a clipboard manager.

gtk_primary_selection and wl_data_device_manager both say they're stable, while the other two are unstable. However, Sway dropped support for gtk_primary_selection a while ago, breaking many applications (luckily, I had a handy Wayland proxy and was able to add some adaptor code to route gtk_primary_selection messages to the new "unstable" protocol).

For this project, I went with wp_primary_selection_unstable_v1 and wl_data_device_manager. On the Wayland side, everything has to be written twice for the two protocols, which are almost-but-not-quite the same. In particular, wl_data_device_manager also has a load of drag-and-drop stuff you need to ignore.

For each selection (PRIMARY or CLIPBOARD), we can be in one of two states:

An X11 client owns the selection (and we own the Wayland selection).
A Wayland client owns the selection (and we own the X11 selection).

When we own a selection we proxy requests for it to the matching selection on the other protocol.

At startup, we take ownership of the X11 selection, since there are no X11 apps running yet.
When we lose the X11 selection it means that an X11 client now owns it and we take the Wayland selection.
When we lose the Wayland selection it means that a Wayland client now owns it and we take the X11 selection.
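
These rules amount to a two-state machine per selection. A small sketch, with type and function names invented for the example:

```ocaml
type side = X11 | Wayland

let other = function X11 -> Wayland | Wayland -> X11

(* [real_owner] is the side whose client really owns the selection;
   the proxy holds the selection on the other side on its behalf. *)
type state = { real_owner : side }

let proxy_owns s = other s.real_owner

(* At startup there are no X11 clients, so the proxy owns the X11 selection. *)
let initial = { real_owner = Wayland }

(* The proxy has just lost the selection on [side], so a client there took it.
   The proxy must now claim the selection on [proxy_owns] of the new state. *)
let on_lost side = { real_owner = side }

let () =
  assert (proxy_owns initial = X11);
  let s = on_lost X11 in        (* an X11 client selected some text *)
  assert (proxy_owns s = Wayland);
  let s = on_lost Wayland in    (* then a Wayland client copied something *)
  assert (proxy_owns s = X11)
```

One such state would be kept for PRIMARY and another for CLIPBOARD.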

One good thing about the Wayland protocols is that you send the data by writing it to a normal Unix pipe. For X11, we need to write the data to a property on the requesting application's window and then notify it about the data. And we may need to split it into multiple chunks if there's a lot of data to transfer.

A strange problem I had was that, while pasting into GVim worked fine, xterm would segfault shortly after trying to paste into it. This turned out to be a bug in the way I was sending the notifications. If an X11 application requests the special TEXT target, it means that the sender should choose the exact format. You write the property with the chosen type (e.g. UTF8_STRING), but you must still send the notification with the target TEXT. xterm is a C application (thankfully no longer set-uid!) and seems to have a use-after-free bug in the timeout code.
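
The rule for the TEXT target can be written down as a small function. This is only an illustration of the rule (the record and function names are made up, and UTF8_STRING is just the example type from above):

```ocaml
type reply = {
  property_type : string;   (* type written on the requestor's property *)
  notify_target : string;   (* target named in the notification event *)
}

let reply_for ~requested_target =
  match requested_target with
  | "TEXT" ->
    (* We choose the concrete type, but must echo the requested target. *)
    { property_type = "UTF8_STRING"; notify_target = "TEXT" }
  | t -> { property_type = t; notify_target = t }

let () =
  let r = reply_for ~requested_target:"TEXT" in
  assert (r.property_type = "UTF8_STRING");
  assert (r.notify_target = "TEXT")
```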

Drag-and-drop

Sadly, I wasn't able to get this working at all. X itself doesn't know anything about drag-and-drop and instead applications look at the window tree to decide where the user dropped things. This doesn't work with the proxy, because Wayland doesn't tell us where the windows really are on the screen.

Even without any VMs or proxies, drag-and-drop from X applications to Wayland ones doesn't work, because the X app can't see the Wayland window and the drop lands on the X window below (if any).

Bonus features

In the last post, I mentioned several other problems, which have also now been solved by the proxy:

HiDPI works

Wayland's support for high resolution screens is a bit strange. I would have thought that applications really only need to know two things:

The size in pixels of the window.
The size in pixels you want some standard thing (e.g. a normal-sized letter M).

Some systems instead provide the size of the window and the DPI (dots-per-inch), but this doesn't work well. For example, a mobile phone might be high DPI but still want small text because you hold it close to your face, while a display board will have very low DPI but want large text.

Wayland instead redefines the idea of a pixel to be a group of pixels corresponding to a single pixel on a typical 1990s display. So if you set your scale factor to 2 then 1 Wayland pixel is a 2x2 grid of physical pixels. If you have a 1000x1000 pixel window, Wayland will tell the application it is 500x500 but suggest a scale factor of 2. If the application supports HiDPI mode, it will double all the numbers and render a 1000x1000 image and things work correctly. If not, it will render a 500x500 pixel image and the compositor will scale it up.
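
The arithmetic from that example, as a short sketch (the function names are invented for illustration):

```ocaml
(* Size in "Wayland pixels" reported for a window covering
   [w] x [h] physical pixels. *)
let reported_size ~scale (w, h) = (w / scale, h / scale)

(* Size of the image the application actually renders. *)
let rendered_size ~scale ~hidpi (w, h) =
  if hidpi then (w * scale, h * scale) else (w, h)

let () =
  let scale = 2 in
  let reported = reported_size ~scale (1000, 1000) in
  assert (reported = (500, 500));
  (* A HiDPI-aware application renders at full resolution... *)
  assert (rendered_size ~scale ~hidpi:true reported = (1000, 1000));
  (* ...and any other renders small and gets scaled up by the compositor. *)
  assert (rendered_size ~scale ~hidpi:false reported = (500, 500))
```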

Since Xwayland doesn't support this, it just draws everything too small and Sway scales it up, creating a blurry and unusable mess. This might be made worse by subpixel rendering, which doesn't cope well with being scaled.

With the proxy, the solution is simple enough: when talking to Xwayland we just scale everything back up to the real dimensions, scaling all coordinates as we relay them:

let scale_to_client t (x, y) =
  x * t.config.xunscale,
  y * t.config.xunscale

method on_configure _ ~width ~height ~states:_ =
  let width = Int32.to_int width in
  let height = Int32.to_int height in
  if width > 0 && height > 0 then (
    Lwt.async (fun () ->
        let (width, height) = scale_to_client t (width, height) in
        X11.Window.configure x11 window ~width ~height ~border_width:0
      )
  )

This will tend to make things sharp but too small, but X applications already have their own ways to handle high resolution screens. For example, you can set Xft.dpi to make all the fonts bigger. I run this proxy like this, which works for me:

wayland-proxy-virtwl --x-display=0 --xrdb Xft.dpi:150 --x-unscale=2

However, there is a problem. The Wayland specification says:

The new size of the surface is calculated based on the buffer size transformed by the inverse buffer_transform and the inverse buffer_scale. This means that at commit time the supplied buffer size must be an integer multiple of the buffer_scale. If that's not the case, an invalid_size error is sent.

Let's say we have an X11 image viewer that wants to show a 1001-pixel-high image in a 1001-pixel-high window. This isn't allowed by the spec, which can only handle even-sized windows when the scale factor is 2. Regular Wayland applications already have to deal with that somehow, but for X11 applications it becomes our problem.

I tried rounding down, but that has a bad side-effect: if GTK asks for a 1001-pixel high menu and gets a 1000 pixel allocation, it switches to squashed mode and draws two big bumper arrows at the top and bottom of the menu which you must use to scroll it. It looks very silly.

I also tried rounding up, but tooltips look bad with any rounding. Either one border is missing, or it's double thickness. Luckily, it seems that Sway doesn't actually enforce the rule about surfaces being a multiple of the scale factor. So, I just let the application attach a buffer of whatever size it likes to the surface and it seems to work!
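
For reference, the rule from the specification and the two rounding options tried above look like this when written out (a sketch with made-up helper names):

```ocaml
(* The spec's rule: both dimensions must be a multiple of the scale. *)
let valid_buffer_size ~scale (w, h) = w mod scale = 0 && h mod scale = 0

let round_down ~scale n = n - (n mod scale)
let round_up ~scale n = round_down ~scale (n + scale - 1)

let () =
  let scale = 2 in
  assert (not (valid_buffer_size ~scale (1001, 1001)));
  assert (round_down ~scale 1001 = 1000);   (* menu loses a pixel *)
  assert (round_up ~scale 1001 = 1002);     (* tooltip gains a pixel *)
  assert (round_up ~scale 1000 = 1000)
```

Either rounding changes the size by one physical pixel, which is enough to trigger the menu and tooltip problems described.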

The only problem I had was that when using unscaling, the mouse pointer in GVim would get lost. Vim hides it when you start typing, but it's supposed to come back when you move the mouse. The problem seems to be that it hides it by creating a 1x1 pixel cursor. Sway decides this isn't worth showing (maybe because it's 0x0 in Wayland-pixels?), and sends Xwayland a leave event saying the cursor is no longer on the screen. Then when Vim sets the cursor back, Xwayland doesn't bother updating it, since it's not on screen!

The solution was to stop applying unscaling to cursors. They look better doubled in size, anyway. True, this does mean that the sharpness of the cursor changes as you move between windows, but you're unlikely to notice this due to the far more jarring effect of Wayland cursors also changing size and shape at the same time.

Ring-buffer logging

Even without a proxy to complicate things, Wayland applications often have problems. To make investigating this easier, I added a ring-buffer log feature. When on, the proxy keeps the last 512K or so of log messages in memory, and will dump them out on demand.
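
A bounded log like this can be sketched with a queue of messages and a byte budget. This is an assumption about one possible design, not a description of how the proxy stores its log:

```ocaml
type ring = {
  capacity : int;                 (* maximum total bytes of messages kept *)
  messages : string Queue.t;      (* oldest message at the front *)
  mutable used : int;             (* bytes currently held *)
}

let create capacity = { capacity; messages = Queue.create (); used = 0 }

let log t msg =
  Queue.add msg t.messages;
  t.used <- t.used + String.length msg;
  (* Drop the oldest messages until we are back within budget. *)
  while t.used > t.capacity && not (Queue.is_empty t.messages) do
    t.used <- t.used - String.length (Queue.take t.messages)
  done

(* Return the retained messages, oldest first. *)
let dump t = List.of_seq (Queue.to_seq t.messages)

let () =
  let t = create 10 in
  List.iter (log t) ["aaaa"; "bbbb"; "cccc"];
  assert (dump t = ["bbbb"; "cccc"]);
  assert (t.used = 8)
```

Logging stays cheap because old entries are discarded as new ones arrive, and nothing is written to disk until a dump is requested.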

To use it, you run the proxy with e.g. -v --log-ring-path ~/wayland.log. When something odd happens (e.g. an application crashes, or opens its menus in the wrong place) you can dump out the ring buffer and see what just happened with:

echo dump-log > /run/user/1000/wayland-1-ctl

I also added some filtering options (e.g. --log-suppress motion,shm) to suppress certain classes of noisy messages.

Vim windows open correctly

One annoyance with Sway is that Vim's window always appears blank (even when running on the host, without any proxy). You have to resize it before you can see the text.

My proxy initially suffered from the same problem, although only intermittently. It turned out to be because Vim sends a ConfigureRequest with its desired size and then waits for the confirmation message. Since Sway is a tiling window manager, it ignores the new size and no event is generated. In this case, an X11 window manager is supposed to send a synthetic ConfigureNotify, so I just got the proxy to do that and the problem disappeared (I confirmed this by adding a sleep to Vim's gui_mch_update).

By the way, the GVim start-up code is quite interesting. The code path to opening the window goes through three separate functions which each define a static int recursive = 0 and then proceed to behave differently depending on how many times they've been reentered - see gui_init for an example!

Copy-and-paste without ^M characters

The other major annoyance with Sway is that copy-and-paste doesn't work correctly (Sway bug #1839). Using the proxy avoids that problem completely.

Conclusions

I'm not sure how I feel about this project. It ended up taking a lot longer than I expected, and I could probably have ported several X11 applications to Wayland in the same time. On the other hand, I now have working X support in the VMs with no need for ssh -Y from the host, plus support for HiDPI in Wayland, mouse cursors that are large enough to see easily, windows that open reliably, text pasting that works, and I can get logs whenever something misbehaves.

In fact, I'm now also running an instance of the proxy directly on the host to get the same benefits for host X11 applications. Setting this up is actually a bit tricky: you want to start Sway with DISPLAY=:0 so that every application it spawns knows it has an X11 display, but if you set that then Sway thinks you want it to run nested inside an X window provided by the proxy, which doesn't end well (or, indeed, at all).

Having all the legacy X11 support in a separate binary should make it much easier to write new Wayland compositors, which might be handy if I ever get some time to try that. It also avoids having many thousands of lines of legacy C code in the highly-trusted compositor code.

If Wayland had an official protocol for letting applications know the window layout then I could make drag-and-drop between X11 applications within the same VM work, but it still wouldn't work between VMs or to Wayland applications, so it's probably not worth it.

Having two separate connections to Xwayland creates a lot of unnecessary race conditions. A simple solution might be a Wayland extension that allows the Wayland server to say "please read N bytes from the X11 socket now", and likewise in the other direction. Then messages would always arrive in the order in which they were sent.

The code is all available at https://github.com/talex5/wayland-proxy-virtwl if you want to try it. It works with the applications I use when running under Sway, but will probably require some tweaking for other programs or compositors. Here's a screenshot of my desktop using it:

Screenshot of my desktop

The windows with [dev] in the title are from my Debian VM, while [com] is a SpectrumOS VM I use for email, etc. Gitk, GVim and ROX-Filer are X11 applications using Xwayland, while Firefox and xfce4-terminal are using plain Wayland proxying.

Qubes-lite with KVM and Wayland

2021-03-07T15:00:00+00:00

I've been running QubesOS as my main desktop since 2015. It provides good security, by running applications in different Xen VMs. However, it is also quite slow and has some hardware problems. I've recently been trying out NixOS, KVM, Wayland and SpectrumOS, and attempting to create something similar with more modern/compatible/faster technology.

This post gives my initial impressions of these tools and describes my current setup.

Table of Contents

QubesOS
NixOS
Why use virtual machines?
SpectrumOS
Wayland
Future work

(this post also appeared on Hacker News and Lobsters)

QubesOS

QubesOS aims to provide "a reasonably secure operating system". It does this by running multiple virtual machines under the Xen hypervisor. Each VM's windows have a different colour and tag, but they appear together as a single desktop. The VMs I run include:

com for email and similar (the only VM that sees my email password).
dev for software development.
shopping (the only VM that sees my card number).
personal (with no Internet access)
untrusted (general browsing)

The desktop environment itself is another Linux VM (dom0), used for managing the other VMs. Most of the VMs are running Fedora (the default for Qubes), although I run Debian in dev. There are also a couple of system VMs; one for dealing with the network hardware, and one providing a firewall between the VMs.

You can run qvm-copy in a VM to copy a file to another VM. dom0 pops up a dialog box asking which VM should receive the file, and it arrives there as ~/QubesIncoming/$source_vm/$file. You can also press Ctrl-Shift-C to copy a VM's clipboard to the global clipboard, and then press Ctrl-Shift-V in a window of the target VM to copy to that VM's clipboard, ready for pasting into an application.

I think Qubes does a very good job at providing a secure environment.

However, it has poor hardware compatibility and it feels sluggish, even on a powerful machine. I bought a new machine a while ago and found that the motherboard only provided a single video output, limited to 30Hz. This meant I had to buy a discrete graphics card. With the card enabled, the machine fails to resume from suspend, and locks up from time to time (it's completely stable with the card removed or disabled). I spent some time trying to understand the driver code, but I didn't know enough about graphics, the Linux kernel, PCI suspend, or Xen to fix it.

I was also having some other problems with QubesOS:

Graphics performance is terrible (especially on a 4k monitor). Qubes disables graphics acceleration in VMs for security reasons, but it was slow even for software rendering.
It recently started freezing for a couple of seconds from time to time - annoying when you're trying to type.
It uses LVM thin-pools for VM storage, which I don't understand, and which sometimes need repairing (haven't lost any data, though).
dom0 is out-of-date and generally not usable. This is intentional (you should be using VMs), but my security needs aren't that high and it would be nice to be able to do video conferencing these days. Also, being able to print over USB and use bluetooth would be handy.

Anyway, I decided it was time to try something new. Linux now has its own built-in hypervisor (KVM), and I thought that would probably work better with my hardware. I was also keen to try out Wayland, which is built around shared-memory and I thought it might therefore work better with VMs. How easy would it be to recreate a Qubes-like environment directly on Linux?

NixOS

I've been meaning to try NixOS properly for some time. Ever since I started using Linux, its package management has struck me as absurd. On Debian, Fedora, etc, installing a package means letting it put files wherever it likes, which effectively gives the package author root on your system. Not a good base for sandboxing!

Also, they make it difficult to try out 3rd-party software, or to test newer versions of just some packages.

In 2003 I created 0install to address these problems, and Nix has very similar goals. I thought Nix was a few years younger, but looking at its Git history the first commit was on Mar 12, 2003. I announced the first preview of 0install just two days later, so both projects must have started writing code within a few days of each other!

NixOS is made up of quite a few components. Here is what I've learned so far:

nix-store

The store holds the files of all the programs, and is the central component of the system. Each version of a package goes in its own directory (or file), at /nix/store/$HASH. You can add data to the store directly, like this:

$ echo hello > file

$ nix-store --add-fixed sha256 file
/nix/store/1vap48aqggkk52ijn2prxzxv7cnzvs0w-file

$ cat /nix/store/1vap48aqggkk52ijn2prxzxv7cnzvs0w-file
hello

Here, the store location is calculated from the hash of the contents of the file we added (as with 0install store add or git hash-object).

However, you can also add things to the store by asking Nix to run a build script. For example, to compile some source code:

You add the source code and some build instructions (a "derivation" file) to the store.
You ask the store to build the derivation. It runs your build script in a container sandbox.
The results are added to the store, using the hash of the build instructions (not the hash of the result) as the directory name.

If a package in the store depends on another one (at build time or run time), it just refers to it by its full path. For example, a bash script in the store will start something like:

#! /nix/store/vnyfysaya7sblgdyvqjkrjbrb0cy11jf-bash-4.4-p23/bin/bash
...

If two users want to use the same build instructions, the second one will see that the hash already exists and can just reuse that. This allows users to compile software from source and share the resulting binaries, without having to trust each other.
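
A toy model of this sharing, keyed on the hash of the build instructions. Everything here is a simplification for illustration: it uses MD5 from OCaml's Digest module as a stand-in for Nix's real hashing scheme, and the store is just an in-memory table.

```ocaml
(* Toy store: path -> build result. *)
let store : (string, string) Hashtbl.t = Hashtbl.create 16
let builds_run = ref 0

(* The path depends only on the instructions, not on the build output.
   MD5 is a stand-in here; real Nix store paths are computed differently. *)
let path_of_instructions ~name instructions =
  Printf.sprintf "/nix/store/%s-%s"
    (Digest.to_hex (Digest.string instructions)) name

let build ~name ~instructions ~run =
  let path = path_of_instructions ~name instructions in
  (match Hashtbl.find_opt store path with
   | Some _ -> ()                        (* already built: just reuse it *)
   | None ->
     incr builds_run;
     Hashtbl.replace store path (run ()));
  path

let () =
  let run () = "compiled output" in
  let alice = build ~name:"foo" ~instructions:"cc -o foo foo.c" ~run in
  let bob = build ~name:"foo" ~instructions:"cc -o foo foo.c" ~run in
  assert (alice = bob);
  assert (!builds_run = 1)
```

The second user never supplies a binary; they only supply instructions, which is why neither has to trust the other's build output.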

Ideally, builds should be reproducible. To encourage this, builds which use the hash of the build instructions for the result path are built in a sandbox without network access. So, you can't submit a build job like "Download and compile whatever is the latest version of Vim". But you can discover the latest version yourself and then submit two separate jobs to the store:

"Download Vim 8.2, with hash XXX" (a fixed-output job, which therefore has network access)
"Build Vim from hash XXX"

You can run nix-collect-garbage to delete everything from the store that isn't reachable via the symlinks under /nix/var/nix/gcroots/. Users can put symlinks to things they care about keeping in /nix/var/nix/gcroots/per-user/$USER/.
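
Conceptually this is an ordinary reachability computation over the references between store paths. A sketch, using short made-up names in place of real store paths:

```ocaml
module S = Set.Make (String)

(* Everything reachable from [roots], following [references]. *)
let reachable ~references roots =
  let rec visit seen path =
    if S.mem path seen then seen
    else List.fold_left visit (S.add path seen) (references path)
  in
  List.fold_left visit S.empty roots

(* The paths in [all] that would be deleted. *)
let collect_garbage ~references ~all roots =
  let live = reachable ~references roots in
  List.filter (fun p -> not (S.mem p live)) all

let () =
  let store = [
    "firefox", ["glibc"; "gtk"];
    "gtk", ["glibc"];
    "glibc", [];
    "old-vim", ["glibc"];
  ] in
  let references p = try List.assoc p store with Not_found -> [] in
  let dead = collect_garbage ~references ~all:(List.map fst store) ["firefox"] in
  assert (dead = ["old-vim"])
```

Here only firefox is a root, so glibc and gtk survive through it while old-vim is collected.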

By default, the store is also configured with a trusted binary cache service, and will try to download build results from there instead of compiling locally when possible.

nix-instantiate

Writing derivation files by hand is tedious, so Nix provides a templating language to create them easily. The Nix language is dynamically typed and based around maps/dictionaries (which it confusingly refers to as "sets"). nix-instantiate file.nix will generate a derivation from file.nix and add it to the store.

A Nix file looks like this:

derivation { system = "x86_64-linux"; builder = ./myfile; name = "foo"; }

Running nix-instantiate on this will:

Add myfile to the store.
Add the generated foo.drv to the store, including the full store path of myfile.

nixpkgs

Writing Nix expressions for every package you want would also be tedious. The nixpkgs Git repository contains a Nix expression that evaluates to a set of derivations, one for each package in the distribution. It also contains a library of useful helper functions for packages (e.g. it knows how to handle GNU autoconf packages automatically).

Rather than evaluating the whole lot, you use -A to ask for a single package. For example, you can use nix-instantiate ./nixpkgs/default.nix -A firefox to generate a derivation for Firefox.

nix-build is a quick way to create a derivation with nix-instantiate and build it with nix-store. It will also create a ./result symlink pointing to its path in the store, as well as registering ./result with the garbage collector under /nix/var/nix/gcroots/auto/. For example, to build and run Firefox:

nix-build ./nixpkgs/default.nix -A firefox
./result/bin/firefox

If you use nixpkgs without making any changes, it will be able to download a pre-built binary from the cache service.

nix-env

Keeping track of all these symlinks would be tedious too, but you can collect them all together by making a package that depends on every application you want. Its build script will produce a bin directory full of symlinks to the applications. Then you could just point your $PATH variable at that bin directory in the store.

To make updating easier, you will actually add ~/.nix-profile/bin/ to $PATH and update .nix-profile to point at the latest build of your environment package.

This is essentially what nix-env does, except with yet more symlinks to allow for switching between multiple profiles, and to allow rolling back to previous environments if something goes wrong.
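
The rollback idea can be modelled as a list of environments plus a pointer to the active one. This is a simplified sketch with invented names; the real tool does it with symlinks on disk:

```ocaml
type profile = {
  generations : string list;   (* environments, oldest first *)
  current : int;               (* index of the active generation *)
}

let init env = { generations = [env]; current = 0 }

(* Installing or upgrading adds a new generation and makes it active. *)
let switch p env =
  { generations = p.generations @ [env]; current = List.length p.generations }

(* Rolling back just moves the pointer; nothing is deleted. *)
let rollback p = { p with current = max 0 (p.current - 1) }

let active p = List.nth p.generations p.current

let () =
  let p = init "env-1" in
  let p = switch p "env-2-with-firefox" in
  assert (active p = "env-2-with-firefox");
  let p = rollback p in
  assert (active p = "env-1");
  assert (List.length p.generations = 2)
```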

For example, to install Firefox so you can run it via $PATH:

nix-env -i firefox

NixOS

Finally, just as nix-env can create a user environment with bin, man, etc, a similar process can create a root filesystem for a Linux distribution.

nixos-rebuild reads the /etc/nixos/configuration.nix configuration file, generates a system environment, and then updates grub and the /run/current-system symlink to point to it.

In fact, it also lists previous versions of the system environment in the grub file, so if you mess up the configuration you can just choose an earlier one from the boot menu to return to that version.

Installing NixOS

To install NixOS you boot one of the live images at https://nixos.org. Which you use only affects the installation UI, not the system you end up with.

The manual walks you through the installation process, showing how to partition the disk, format and mount the partitions, and how to edit the configuration file. I like this style of installation, where it teaches you things instead of just doing it for you. Most of the effort in switching to a new system is learning about it, so I'd rather spend 3 hours learning stuff following an installation guide than use a 15-minute single-click installer that teaches me nothing.

The configuration file (/etc/nixos/configuration.nix) is just another Nix expression. Most things are set to off by default (I approve), but can be changed easily. For example, if you want sound support you change that setting to sound.enable = true, and if you also want to use PulseAudio then you set hardware.pulseaudio.enable = true too.

Every system service supported by NixOS is controlled from here, with all kinds of options, from programs.vim.defaultEditor = true (so you don't get trapped in nano) to services.factorio.autosave-interval. Use man configuration.nix to see the available settings.

NixOS defaults to an X11 desktop, but I wanted to try Wayland (and Sway). Based on the NixOS wiki instructions, I used this:

  programs.sway = {
    enable = true;
    wrapperFeatures.gtk = true; # so that gtk works properly
    extraSessionCommands = "export MOZ_ENABLE_WAYLAND=1";
    extraPackages = with pkgs; [
      swaylock
      swayidle
      xwayland
      wl-clipboard
      mako
      alacritty
      dmenu
    ];
  };

The xwayland bit is important; without that you can't run any X11 applications.

My only complaint with the NixOS installation instructions is that following them will leave you with an unencrypted system, which isn't very useful. When partitioning, you have to skip ahead to the LUKS section of the manual, which just gives some options but no firm advice. I created two primary partitions: a 1G unencrypted /boot, and a LUKS partition for the rest of the disk. Then I created an LVM volume group from the /dev/mapper/crypted device and added the other partitions in that.

Once the partitions are mounted and the configuration file is complete, nixos-install downloads everything and configures grub. Then you reboot into the new system.

Once running the new system you can make further edits to the configuration file there in the same way, and use nixos-rebuild switch to generate a new system. It seems to be pretty good at updating the running system to the new settings, so you don't normally need to reboot after making changes.

The big mistake I made was forgetting to add /boot to fstab. When I ran nixos-rebuild it put all the grub configuration on the encrypted partition, rendering the system unbootable. I fixed that with chattr +i /boot on the unmounted partition. That way, trying to rebuild with /boot unmounted will just give an error message.

Thoughts on NixOS

I've been using the system for a few weeks now and I've had no problems with Nix so far. Nix has been fast and reliable and there were fairly up-to-date packages for everything I wanted (I'm using the stable release). There is a lot to learn, but plenty of documentation.

When I wanted a newer package (socat with vsock support, only just released) I just told Nix to install it from the latest Git checkout of nixpkgs. Unlike on Debian and similar systems, doing this doesn't interfere with any other packages (such as forcing a system-wide upgrade of libc).

I think Nix does download more data than most other systems, but networks are fast enough now that it doesn't seem to matter. For example, let's say you're running Python 3.9.0 and you want to update to 3.9.1:

With Debian: apt-get upgrade downloads the new version, which gets unpacked over the old one. As the files are unpacked, the system moves through an exciting series of intermediate states no-one has thought about. Running programs may crash as they find their library versions changing under them (though it's usually OK). Only root can update software.
With 0install: 0install update downloads the new version, unpacking it to a new directory. Running programs continue to use the old version. When a new program is started, 0install notices the update and runs the solver again. If the program is compatible with the new Python then it uses that. If not, it continues with the old one. You can run any previous version if there is a problem.
With Nix: nix-env -u downloads the new version, unpacking it to a new directory. It also downloads (or rebuilds) every package depending on Python, creating new directories for each of them. It then creates a new environment with symlinks to the latest version of everything. Running programs continue to use the old version. Starting a new program will use the new version. You can revert the whole environment back to the previous version if there is a problem.
With Docker: docker pull downloads the new version of a single application, downloading most or all of the application's packages, whether Python related or not. Existing containers continue running with the old version. New containers will default to using the new version. You can specify which version to use when starting a program. Other applications continue using the old version of Python until their authors update them (you must update each application individually, rather than just updating Python itself).

The main problem with NixOS is that it's quite different to other Linux systems, so there's a lot to relearn. Also, existing knowledge about how to edit fstab, sudoers, etc, isn't so useful, as you have to provide all configuration in Nix syntax. However, having a single (fairly sane) syntax for everything is a nice bonus, and being able to generate things using the templating language is useful. For example, for my network setup I use a bunch of tap devices (one for each of my VMs). It was easy to write a little Nix function (mktap) to generate them all from a simple list. Here's that section of my configuration.nix:

  networking = {
    useDHCP = false;
    interfaces =
      let mktap = ip: {
          virtual = true;
          virtualOwner = "tal";
          ipv4.addresses = [
            { address = ip; prefixLength = 31; }
          ];
        };
      in
      {
        eno2.useDHCP = true;
        wlo1.useDHCP = true;
        tapdev = mktap "10.0.0.2";
        tapcom = mktap "10.0.0.4";
        tapshopping = mktap "10.0.0.6";
        tapbanking = mktap "10.0.0.8";
        tapuntrusted = mktap "10.0.0.10";
      };
    nat = {
      enable = true;
      externalInterface = "eno2";
      internalIPs = [ "10.0.0.0/8" ];
    };
  };

Overall, I'm very happy with NixOS so far.

Why use virtual machines?

With NixOS I had a nice host environment, but after using Qubes I wanted to run my applications in VMs.

The basic problem is that Linux is the only thing that knows how to drive all the hardware, but Linux security is not ideal. There are several problems:

Linux is written in C. This makes security bugs rather common and, more importantly, means that a bug in one part of the code can impact any other part of the code. Nothing is secure unless everything is secure.
Linux has a rather large API (hundreds of syscalls).
The Linux (Unix) design predates the Internet, and security has been somewhat bolted on afterwards.

For example, imagine that we want to run a program with access to the network, but not to the graphical display. We can create a new Linux container for it using bubblewrap, like this:

$ ls -l /run/user/1000/wayland-0 /tmp/.X11-unix/X0
srwxr-xr-x 1 tal users 0 Feb 18 16:41 /run/user/1000/wayland-0
srwxr-xr-x 1 tal users 0 Feb 18 16:41 /tmp/.X11-unix/X0

$ bwrap \
    --ro-bind / / \
    --dev /dev \
    --tmpfs /home/tal \
    --tmpfs /run/user \
    --tmpfs /tmp \
    --unshare-all --share-net \
    bash

$ ls -l /run/user/1000/wayland-0 /tmp/.X11-unix/X0
ls: cannot access '/run/user/1000/wayland-0': No such file or directory
ls: cannot access '/tmp/.X11-unix/X0': No such file or directory

The container has an empty home directory, empty /tmp, and no access to the display sockets. If we run Firefox in this environment then... it opens its window just fine! How? strace shows what happened:

connect(4, {sa_family=AF_UNIX, sun_path="/run/user/1000/wayland-0"}, 27) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path=@"/tmp/.X11-unix/X0"}, 20) = 0

After failing to connect to Wayland, it then tried using X11 (via Xwayland) instead. Why did that work? If the first byte of the socket pathname is \0 then Linux instead interprets it as an "abstract" socket address, not subject to the usual filesystem permission rules.
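Here's a minimal Python sketch of the abstract-socket behaviour (Linux only). The socket name is made up for the example. Binding to a name that starts with a NUL byte creates no file anywhere, so replacing /tmp with a tmpfs does nothing to hide it. As far as I know, abstract names are scoped to the network namespace instead, which would explain why --share-net left the X socket reachable.

```python
import os
import socket


def abstract_socket_roundtrip(message=b"hello"):
    # A leading NUL byte selects Linux's abstract namespace.
    # The rest of the name is hypothetical, made unique per process.
    name = "\0example-abstract-%d" % os.getpid()
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(name)      # no file is created on disk
        server.listen(1)
        client.connect(name)   # no directory permissions are consulted
        conn, _ = server.accept()
        with conn:
            client.sendall(message)
            return conn.recv(len(message))
    finally:
        # Closing the sockets releases the abstract name automatically.
        client.close()
        server.close()


print(abstract_socket_roundtrip())
```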

Trying to anticipate these kinds of special cases is just too much work. Linux really wants everything on by default, and you have to find and disable every feature individually. By contrast, virtual machines tend to have integrations with the host off by default. They also tend to have much smaller APIs (e.g. just reading and writing disk blocks or network frames), with the rich Unix API entirely inside the VM, provided by a separate instance of Linux.

SpectrumOS

I was able to set up a qemu guest and restore my dev Qubes VM in that, but it didn't integrate nicely with the rest of the desktop. Installing ssh allowed me to connect in with ssh -Y dev, allowing apps in the VM to open an X connection to Xwayland on the host. That was somewhat usable, but still a bit slower than Qubes had been (which was already a bit too slow).

Searching for a way to forward the Wayland connection directly, I came across the SpectrumOS project. SpectrumOS aims to use one virtual machine per application, using shared directories so that VM files are stored on the host, simplifying management. It uses crosvm from the ChromiumOS project instead of qemu, because it has a driver that allows forwarding Wayland connections (and also because it's written in Rust rather than C). The project's single developer is currently taking a break from the project, and says "I'm currently working towards a proof of concept".

However, there is some useful stuff in the SpectrumOS repository (which is a fork of nixpkgs). In particular, it contains:

A version of Linux with the virtwl kernel module, which connects to crosvm's Wayland driver.
A package for sommelier, which connects applications to virtwl.
A Nix expression to build a root filesystem for the VM.

Building that, I was able to run the project's demo, which runs the Wayfire compositor inside the VM, appearing in a window on the host. Dragging the nested window around, the pixels flowed smoothly across my screen in exactly the way that pixels on QubesOS don't.

This was encouraging, but I didn't want to run a nested window manager. I tried running Firefox directly (without Wayfire), but it complained that sommelier didn't provide a new enough version of something, and running weston-terminal immediately segfaulted sommelier.

Why do we need the sommelier process anyway? The problem is that, while virtwl mostly proxies Wayland messages directly, it can't send arbitrary FDs to the host. For example, if you want to forward a writable stream from an application to virtwl you must first create a pipe from the host using a special virtwl ioctl, then read from that and copy the data to the application's regular Linux pipe.

With help from the mailing list, I managed to get it somewhat usable:

I enabled VIRTIO_FS, allowing me to mount a host directory into the VM (for sharing files).
I created some tap devices (as mentioned above) to get guest networking going.
Adding ext4 to the kernel image allowed me to mount the VM's LVM partition.
Setting FONTCONFIG_FILE got some usable fonts (otherwise, there was no monospace font for the terminal).
I hacked sommelier to claim it supported the latest protocols, which got Firefox running.
Configuring sommelier for Xwayland let X applications run.
I replaced the non-interactive bash shell with fish so I could edit commands.
I ran (while true; do socat vsock-listen:5000 exec:dash; done) at the end of the VM's boot script. Then I could start e.g. the VM's Firefox with echo 'firefox&' | socat stdin vsock-connect:7:5000 on the host, allowing me to add launchers for guest applications.

Making changes to the root filesystem was fairly easy once I'd read the Nix manuals. To add an application (e.g. libreoffice), you import it at the start of rootfs/default.nix and add it to the path variable. The Nix expression gets the transitive dependencies of path from the Nix store and packs them into a squashfs image.

True, my squashfs image is getting a bit big. Maybe I should instead make a minimal squashfs boot image, plus a shared directory of hard links to the required files. That would allow sharing the data with the host. I could also just share the whole /nix/store directory, if I wanted to make all host software available to guests.

I made another Nix script to add various VM boot commands to my host environment. For example, running qvm-start-shopping boots my shopping VM using crosvm, with the appropriate LVM data partition, network settings, and shared host directory.

I think, ideally, this would be a systemd socket-activated user service rather than a shell script. Then attempting to run Firefox by sending a command to the VM socket would cause systemd to boot the VM (if not already running). For now, I boot each VM manually in a terminal and then press Win-Shift-2 to banish it to workspace 2, with all the other VM root consoles.

The virtwl Wayland forwarding feels pretty fast (much faster than Qubes' X graphics).

Wayland

I now had a mostly functional Qubes-like environment, running most of my applications in VMs, with their windows appearing on the host desktop like any other application. However, I also had some problems:

A stated goal of Wayland is "every frame is perfect". However, applications generally seemed to open at the wrong size and then jump to their correct size, which was a bit jarring.
Vim opened its window with the scrollbar at the far left of the window, making the text invisible until you resized the window.
Wayland is supposed to have better support for high-DPI displays. However, this doesn't work with Xwayland, which turns everything blurry, and the recommended work-around is to use a scale-factor of 1 and configure each application to use bigger fonts. This is easy enough with X applications (e.g. set ft.dpi: 150 with xrdb), but Wayland apps must be configured individually.
Wayland doesn't have cursor themes and you have to configure every application individually to use a larger cursor too.
Copying text didn't seem to work reliably. Sometimes there would be a long delay, after which the text might or might not appear. More often, it would just paste something completely different and unexpected. Even when it did paste the right text, it would often have ^M characters inserted into it.

I decided it was time to learn more about Wayland. I discovered wayland-book.com, which does a good job of introducing it (though the book is only half finished at the moment).

Protocol

One very nice feature of Wayland is that you can run any Wayland application with WAYLAND_DEBUG=1 and it will display a fairly readable trace of all the Wayland messages it sends and receives. Let's look at a simple application that just connects to the server (compositor) and opens a window:

$ WAYLAND_DEBUG=1 test.exe
-> wl_display@1.get_registry registry:+2
-> wl_display@1.sync callback:+3

The client connects to the server's socket at /run/user/1000/wayland-0 and sends two messages to object 1 (of type wl_display), which is the only object available in a new connection. The get_registry request asks the server to add the registry to the conversation and call it object 2. The sync request just asks the server to confirm it got it, using a new callback object (with ID 3).

Both clients and servers can add objects to the conversation. To avoid numbering conflicts, clients assign low numbers and servers pick high ones.

On the wire, each message gives the object ID, the operation ID, the length in bytes, and then the arguments. Objects are thought of as being at the server, so the client sends request messages to objects, while the server emits event messages from objects. At the wire level there's no difference though.
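To make the header layout concrete, here's a small Python sketch that encodes the two requests from the trace above. It assumes the usual layout: a 32-bit object ID, then a 32-bit word holding the total message size in the upper 16 bits and the opcode in the lower 16, all in host byte order. The opcodes (sync is 0, get_registry is 1) are taken from wayland.xml; the helper names are my own.

```python
import struct

HEADER = struct.Struct("=II")  # two 32-bit words, host byte order


def encode_message(object_id, opcode, args=b""):
    # The size covers the 8-byte header plus the arguments.
    size = HEADER.size + len(args)
    return HEADER.pack(object_id, (size << 16) | opcode) + args


def decode_header(data):
    # Returns (object ID, opcode, total size in bytes).
    object_id, word = HEADER.unpack_from(data)
    return object_id, word & 0xFFFF, word >> 16


# -> wl_display@1.get_registry registry:+2
get_registry = encode_message(1, 1, struct.pack("=I", 2))
# -> wl_display@1.sync callback:+3
sync = encode_message(1, 0, struct.pack("=I", 3))
```

Each of these messages is 12 bytes long: 8 bytes of header plus one 32-bit argument giving the ID of the new object.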

When the server gets the get_registry request it adds the registry, which immediately emits one event for each available service, giving the maximum supported version. The client receives these messages, followed by the callback notification from the sync message:

<- wl_registry@2.global name:0 interface:"wl_compositor" version:4
<- wl_registry@2.global name:1 interface:"wl_subcompositor" version:1
<- wl_registry@2.global name:2 interface:"wl_shm" version:1
<- wl_registry@2.global name:3 interface:"xdg_wm_base" version:1
<- wl_registry@2.global name:4 interface:"wl_output" version:2
<- wl_registry@2.global name:5 interface:"wl_data_device_manager" version:3
<- wl_registry@2.global name:6 interface:"zxdg_output_manager_v1" version:3
<- wl_registry@2.global name:7 interface:"gtk_primary_selection_device_manager" version:1
<- wl_registry@2.global name:8 interface:"wl_seat" version:5
<- wl_callback@3.done callback_data:1129040

The callback tells the client it has seen all the available services, and so it now picks the ones it wants. It has to choose a version no higher than the one offered by the server. Protocols starting with wl_ are from the core Wayland protocol; the others are extensions. The leading z in zxdg_output_manager_v1 indicates that the protocol is "unstable" (under development).
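The version choice can be sketched in a few lines of Python. The table of versions the client supports is hypothetical; the rule it illustrates is just that the client binds the lower of its own maximum and the version the server advertised.

```python
# Hypothetical: the highest version of each interface this client implements.
CLIENT_SUPPORTED = {"wl_compositor": 4, "wl_shm": 1, "xdg_wm_base": 1}


def choose_bindings(advertised, supported=CLIENT_SUPPORTED):
    """Pick (name, version) for each interface the client wants.

    `advertised` is a list of (name, interface, version) tuples, one per
    wl_registry.global event.
    """
    chosen = {}
    for name, interface, server_version in advertised:
        if interface in supported:
            # Never ask for more than the server offered.
            version = min(server_version, supported[interface])
            chosen[interface] = (name, version)
    return chosen


advertised = [
    (0, "wl_compositor", 4),
    (1, "wl_subcompositor", 1),
    (2, "wl_shm", 1),
    (3, "xdg_wm_base", 1),
    (8, "wl_seat", 5),
]
bindings = choose_bindings(advertised)
```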

The protocols are defined in various XML files, which are scattered over the web. The core protocol is defined in wayland.xml. These XML files can be used to generate typed bindings for your programming language of choice.

Here, the application picks wl_compositor (for managing drawing surfaces), wl_shm (for sharing memory with the server), and xdg_wm_base (for desktop windows).

-> wl_registry@2.bind name:0 id:+4(wl_compositor:v4)
-> wl_registry@2.bind name:2 id:+5(wl_shm:v1)
-> wl_registry@2.bind name:3 id:+6(xdg_wm_base:v1)

The bind message is unusual in that the client gives the interface and version of the object it is creating. For other messages, both sides know the type from the schema, and the version is always the same as the parent object. Because the client chose the new IDs, it doesn't need to wait for the server; it continues by using the new objects to create a top-level window:

-> wl_compositor@4.create_surface id:+7
-> xdg_wm_base@6.get_xdg_surface id:+8 surface:7
-> xdg_surface@8.get_toplevel id:+9
-> xdg_toplevel@9.set_title title:"example app"
-> wl_surface@7.commit

This API is pretty strange. The core Wayland protocol says how to make generic drawing surfaces, but not how to make windows, so the application is using the xdg_wm_base extension to do that. Logically, there's only one object here (a toplevel window), but it ends up making three separate Wayland objects representing the different aspects of it.

The commit tells the server that the client has finished setting up the window and the server should now do something with it.

The above was all in response to the callback firing. The client now processes the last message in that batch, which is the server destroying the callback:

<- wl_display@1.delete_id id:3

Object destruction is a bit strange in Wayland. Normally, clients ask for things to be destroyed (by sending a "destructor" message) and the server confirms by sending delete_id from object 1. But this isn't symmetrical: there is no standard way for a client to confirm deletion when the server calls a destructor (such as the callback's done), so these have to be handled on a case-by-case basis. Since callbacks don't accept any messages, there is no need for the client to confirm that it got the done message and the server just sends a delete message immediately.
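One consequence of delete_id is that the client can reuse the ID afterwards. Here's a rough Python sketch of a client-side allocator. The class is my own illustration and not necessarily how libwayland does it; the only facts it relies on are that object 1 is always wl_display and that IDs from 0xff000000 upwards are reserved for the server.

```python
class ClientIds:
    SERVER_BASE = 0xFF000000  # IDs from here up are allocated by the server

    def __init__(self):
        self.next_id = 2  # 1 is always wl_display
        self.free = []    # IDs the server has confirmed as deleted

    def allocate(self):
        # Prefer an ID the server has already released.
        if self.free:
            return self.free.pop()
        if self.next_id >= self.SERVER_BASE:
            raise RuntimeError("client ID space exhausted")
        new_id = self.next_id
        self.next_id += 1
        return new_id

    def on_delete_id(self, object_id):
        # Called when wl_display@1.delete_id arrives.
        self.free.append(object_id)


ids = ClientIds()
first_eight = [ids.allocate() for _ in range(8)]  # registry, callback, ...
ids.on_delete_id(3)                               # the callback is gone
reused = ids.allocate()
fresh = ids.allocate()
```

Replaying the IDs from the trace in this post, the first eight objects get IDs 2 to 9, the next allocation after the callback is deleted reuses 3, and the one after that gets 10.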

The client now waits for the server to respond to all the messages it sent about the new window, and gets a bunch of replies:

<- wl_shm@5.format format:0
<- wl_shm@5.format format:1
<- wl_shm@5.format format:875709016
<- wl_shm@5.format format:875708993
<- xdg_wm_base@6.ping serial:1129043
-> xdg_wm_base@6.pong serial:1129043
<- xdg_toplevel@9.configure width:0 height:0 states:""
<- xdg_surface@8.configure serial:1129042
-> xdg_surface@8.ack_configure serial:1129042

It gets some messages telling it what pixel formats are supported, a ping message (which the server sends from time to time to check the client is still alive), and a configure message giving the size for the new window. Oddly, Sway has set the size to 0x0, which means the client should choose whatever size it likes.
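The two large format numbers look odd, but they can be decoded. Apart from the two legacy codes 0 (argb8888) and 1 (xrgb8888), wl_shm formats use the same fourcc codes as the DRM subsystem, which pack four ASCII characters into a little-endian 32-bit integer. A quick Python sketch (the function name is made up):

```python
import struct

# The two wl_shm codes that are not fourcc values.
LEGACY = {0: "argb8888", 1: "xrgb8888"}


def format_name(code):
    if code in LEGACY:
        return LEGACY[code]
    # A fourcc is four ASCII bytes stored least-significant byte first.
    return struct.pack("<I", code).decode("ascii")


names = [format_name(c) for c in (0, 1, 875709016, 875708993)]
```

For the trace above this gives XB24 and AB24, which are the DRM names for XBGR8888 and ABGR8888.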

The client picks a suitable default size, allocates some shared memory (by opening a tmpfs file and immediately unlinking it), shares the file descriptor with the server (create_pool), and then carves out a portion of the memory to use as a buffer for the pixel data:

-> wl_shm@5.create_pool id:+3 fd:(fd) size:1228800
-> wl_shm_pool@3.create_buffer id:+10 offset:0 width:640 height:480 stride:2560 format:1
-> wl_shm_pool@3.destroy
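The numbers in that trace follow directly from the window size. Format 1 (xrgb8888) uses 4 bytes per pixel, so a 640x480 buffer has a stride of 2560 bytes and needs 1228800 bytes in total. Here's a Python sketch of the client side. It uses tempfile.TemporaryFile as a stand-in for the unlinked tmpfs file; a real client would more likely create it under XDG_RUNTIME_DIR or use memfd_create.

```python
import mmap
import tempfile

BYTES_PER_PIXEL = 4  # format 1 (xrgb8888) is 32 bits per pixel


def buffer_layout(width, height):
    # Returns (stride, total size), both in bytes.
    stride = width * BYTES_PER_PIXEL
    return stride, stride * height


def make_pool(size):
    # On POSIX, TemporaryFile unlinks the file as soon as it is created,
    # so only the open file descriptor keeps it alive.
    f = tempfile.TemporaryFile()
    f.truncate(size)
    return f, mmap.mmap(f.fileno(), size)


stride, size = buffer_layout(640, 480)
pool_file, pixels = make_pool(size)
# Draw directly into the shared memory: fill the first row with white.
pixels[0:stride] = b"\xff" * stride
```

The file descriptor (pool_file.fileno() here) is what would be passed to the server in create_pool.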

In this case it used the whole memory region. It could also have allocated two buffers for double-buffering. The client then draws whatever it wants into the buffer (mapping the file into its memory and writing to it directly), attaches the buffer to the window's surface, marks the whole area as "damaged" (in need of being redrawn) and calls commit, telling the server the surface is ready for display:

-> wl_surface@7.attach buffer:10 x:0 y:0
-> wl_surface@7.damage x:0 y:0 width:2147483647 height:2147483647
-> wl_surface@7.commit

At this point the window appears on the screen! The server lets the client know it has finished with the buffer and the client destroys it:

<- wl_display@1.delete_id id:3
<- wl_buffer@10.release
-> wl_buffer@10.destroy

Although the window is visible, the content is the wrong size. Sway now suddenly remembers that it's a tiling window manager. It sends another configure event with the correct size, causing the client to allocate a fresh memory pool of the correct size, allocate a fresh buffer from it, redraw everything at the new size, and tell the server to draw it.

<- xdg_toplevel@9.configure width:1534 height:1029 states:""
...

This process of telling the client to pick a size and then overruling it explains why Firefox draws itself incorrectly at first and then flickers into position a moment later. It probably also explains why Vim tries to open a 0x0 window.

Copying text

A bit of searching revealed that the ^M problem is a known Sway bug.

However, the main reason copying text wasn't working turned out to be a limitation in the design of the core wl_data_device_manager protocol. The normal way to copy text on X11 is to select the text you want to copy, then click the middle mouse button where you want it (or press Shift-Insert).

X also supports a clipboard mechanism, where you select text, then press Ctrl-C, then click at the destination, then press Ctrl-V. The original Wayland protocol only supports the clipboard system, not the selection, and so Wayland compositors have added selection support through extensions. Sommelier didn't proxy these extensions, leading to failure when copying in or out of VMs.

I also found that the reason weston-terminal wouldn't start was because I didn't have anything in my clipboard, and sommelier was trying to dereference a null pointer.

One problem with the Wayland protocol is that it's very hard to proxy. Although the wire protocol gives the length in bytes of each message, it doesn't say how many file descriptors it has. This means that you can't just pass through messages you don't understand, because you don't know which FDs go with which message. Also, the wire protocol doesn't give types for FDs (nor does the schema), which is a problem for anything that needs to proxy across a VM boundary or over a network.
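A short Python sketch shows why the FD count can't be recovered from the bytes. It sends a create_pool request like the one in the earlier trace over a Unix socket pair, with a pipe standing in for the shared memory file. The file descriptor travels as ancillary data (SCM_RIGHTS) alongside the message, and the 16 bytes of the message itself contain no trace of it. This needs Python 3.9 or later for socket.send_fds, and the function name is my own.

```python
import os
import socket
import struct


def send_create_pool():
    app, proxy = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for the shared memory file
    os.write(write_end, b"pixels")
    os.close(write_end)

    # -> wl_shm@5.create_pool id:+3 fd:(fd) size:1228800
    # Header (object 5, size 16, opcode 0), then only the two non-FD arguments.
    message = struct.pack("=IIIi", 5, (16 << 16) | 0, 3, 1228800)
    socket.send_fds(app, [message], [read_end])
    os.close(read_end)  # the copy in flight keeps the pipe open

    data, fds, _flags, _addr = socket.recv_fds(proxy, 4096, 8)
    _object_id, word, _new_id, _size = struct.unpack("=IIIi", data)
    declared_size = word >> 16
    payload = os.read(fds[0], 100)

    for fd in fds:
        os.close(fd)
    app.close()
    proxy.close()
    return declared_size, len(data), len(fds), payload


result = send_create_pool()
```

The header says the message is 16 bytes long and that is exactly what arrives. Only the schema tells the receiver that one of the queued FDs belongs to this message.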

This all meant that VMs could only use protocols explicitly supported by sommelier, and sommelier limited the version too. Supporting extra extensions or new versions therefore means writing (and debugging) loads of C++ code.

I didn't have time to write and debug C++ code for every missing Wayland protocol, so I took a short-cut: I wrote my own Wayland library, ocaml-wayland, and then used that to write my own version of sommelier. With that, adding support for copying text was fairly easy.

For each Wayland interface we need to handle each incoming message from the client and forward it to the host, and also forward each message from the host to the client. Here's the code to handle the "selection" event in OCaml, which we receive from the host and send to the client (c):

method on_selection _ offer = C.Wl_data_device.selection c (Option.map to_client offer)

The host passes us an "offer" argument, which is a previously-created host offer object. We look up the corresponding client object with to_client and pass that as the argument to the client.

For comparison, here's sommelier's equivalent to this line of code, in C++:

static void sl_data_device_selection(void* data,
                                     struct wl_data_device* data_device,
                                     struct wl_data_offer* data_offer) {
  struct sl_host_data_device* host = static_cast<sl_host_data_device*>(
      wl_data_device_get_user_data(data_device));
  struct sl_host_data_offer* host_data_offer =
      static_cast<sl_host_data_offer*>(wl_data_offer_get_user_data(data_offer));

  wl_data_device_send_selection(host->resource, host_data_offer->resource);
}

I think this is a great demonstration of the difference between "type safety" and "type ceremony". The C++ code is covered in types, making the code very hard to read, yet it crashes at runtime because it fails to consider that data_offer can be NULL.

By contrast, the OCaml version has no type annotations, but the compiler would reject the code if I forgot to handle this (with Option.map).

Security

According to the GNOME wiki, the original justification for not supporting selection copies was "security concerns with unexpected data stealing if the mere act of selecting a text fragment makes it available to all running applications". The implication is that applications stealing data instead from the clipboard is OK, and that you should therefore never put anything confidential on the clipboard.

This seemed a bit odd, so I read the security section of the Wayland specification to learn more about its security model. That section of the specification is fairly short, so I'll reproduce it here in full:

Security and Authentication

mostly about access to underlying buffers, need new drm auth mechanism (the grant-to ioctl idea), need to check the cmd stream?

getting the server socket depends on the compositor type, could be a system wide name, through fd passing on the session dbus. or the client is forked by the compositor and the fd is already opened.

It looks like implementations have to figure things out for themselves.

The main advantage of Wayland over X11 here is that Wayland mostly isolates applications from each other. In X11 applications collaborate together to manage a tree of windows, and any application can access any window. In the Wayland protocol, each application's connection only includes that application's objects. Applications only get events relevant to their own windows (for example, you only get pointer motion events while the pointer is over your window). Communication between applications (e.g. copy-and-paste or drag-and-drop) is all handled through the compositor.

Also, to request the contents of the clipboard you need to quote the serial number of the mouse click or key press that triggered it. If it's too far in the past, the compositor can ignore the request.

I've also heard people say that security is the reason you can't take screenshots with Wayland. However, Sway lets you take screenshots, and this worked even from inside a VM through virtwl. I didn't add screenshot support to the proxy, because I don't want VMs to be able to take screenshots, but the proxy isn't a security tool (it runs inside the VM, which isn't trusted).

Clearly, the way to fix this was with a new compositor. One that would offer a different Wayland socket to each VM, tag the windows with the VM name, colour the frames, confirm copies across VM boundaries, and work with Vim. Luckily, I already had a handy pure-OCaml Wayland protocol library available. Unluckily, at this point I ran out of holiday.

Future work

There are quite a few things left to do here:

One problem with virtwl is that, while we can receive shared memory FDs from the host, we can't export guest memory to the host. This is unfortunate, because in Wayland the shared memory for window contents is allocated by the application from guest memory, and the proxy therefore has to copy each frame. If the host provided the memory to the guest, this wouldn't be needed. There is a wl_drm protocol for allocating video memory, which might help here, but I don't know how that works and, like many Wayland specifications, it seems to be in the process of being replaced by something else. Also, if we're going to copy the memory, we should at least only copy the damaged region, not the whole thing. I only got this code working just far enough to run the Wayland applications I use (mainly Firefox and Evince).
I'm still using ssh to proxy X11 connections (mainly for Vim and gitk). I'd prefer to run Xwayland in the VM, but it seems you need to provide a bit of extra support for that, which I haven't implemented yet. Sommelier can do this, but then copying doesn't work.
The host Wayland compositor needs to be aware of VMs, so it can colour the titles appropriately and limit access to privileged operations.
For the full Qubes experience, the network card should be handled by a VM, with another VM managing the firewall. Perhaps the Mirage unikernel firewall could be made to work on KVM too. I'm not sure how guest-to-guest communication works with KVM.

However, because the host NixOS environment is a fully-working Linux system, I can always trade off some security to get things working (e.g. by doing video conferencing directly on the host).

I hope the SpectrumOS project will resume at some point, or that Qubes will find a solution to its hardware compatibility and performance problems.