Even if you're not interested in security, capabilities provide a useful way to understand programs; when trying to track down buggy behaviour, it's very useful to know that some component couldn't have been the problem.
We have some application (for example, a web-server) that we want to run. The application is many thousands of lines long and depends on dozens of third-party libraries, which get updated on a regular basis. I would like to be able to check, quickly and easily, that the application cannot do any of these things:
- modify my ~/.ssh/authorized_keys file
For example, here are some of the OCaml packages I use just to generate this blog:
Having to read every line of every version of each of these packages in order to decide whether it's safe to generate the blog clearly isn't practical.
I'll start by looking at traditional solutions to this problem, using e.g. containers or VMs, and then show how to do better using capabilities.
A common approach to access control treats securing software as a separate activity to writing it. Programmers write (insecure) software, and a security team writes a policy saying what it can do. Examples include firewalls, containers, virtual machines, seccomp policies, SELinux and AppArmor.
The great advantage of these schemes is that security can be applied after the software is written, treating it as a black box. However, it comes with many problems:
Some actions are OK for one use but not for another.
For example, if the client of a web-server requests https://example.com/../../etc/httpd/server-key.pem
then we don't want the server to read this file and send it to them.
But the server does need to read this file for other reasons, so the policy must allow it.
All the modules making up the program are treated the same way, even though you probably trust some more than others.
For example, we might trust the TLS implementation with the server's private key, but not the templating engine, and I know the modules I wrote myself are not malicious.
Programming in a language with static types is supposed to ensure that if the program compiles then it won't crash. But the security policy can cause the program to fail even though it passed the compiler's checks.
For example, the server might sometimes need to send an email notification. If it didn't do that while the security policy was being written, then that will be blocked. Or perhaps the web-server didn't even have a notification system when the policy was written, but has since been updated.
The security configuration is written in a new language, which must be learned. It's usually not worth learning this just for one program, so the people who write the program struggle to write the policy. Also, the policy language often cannot express the desired policy, since it may depend on concepts unique to the program (e.g. controlling access based on a web-app user's ID, rather than local Unix user ID).
All of the above problems stem from trying to separate security from the code. If the code were fully correct, we wouldn't need the security layer. Checking that code is fully correct is hard, but maybe there are easy ways to check automatically that it does at least satisfy our security requirements...
One way to prevent programs from performing unwanted actions is to prevent all actions.
In pure functional languages, such as Haskell, the only way to interact with the outside world is to return the action you want to perform from main. For example:
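A sketch of such a program (the body of f here is just an illustration; only its type matters for the argument below):

```haskell
f :: Int -> String
f x = show (x * 2)   -- a pure function: it can only compute a String

main :: IO ()
main = putStr (f 42)
```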
Even if we don't look at the code of f, we can be sure it only returns a String and performs no other actions (assuming Safe Haskell is being used). Assuming we trust putStr, we can be sure this program will only output a string to stdout and not perform any other actions.
However, writing only pure code is quite limiting. Also, we still need to audit all IO code.
Consider this code (written in a small OCaml-like functional language, where ref n allocates a new memory location initially containing n, and !x reads the current value of x):
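The shape of the code is something like this (the particular f shown is illustrative; the point is that we don't need to read it):

```ocaml
let f = fun (r : int ref) -> r := !r + 1 in  (* some function we haven't read *)
let x = ref 0 in
let y = ref 5 in
f x;              (* f may do anything it likes with x... *)
assert (!y = 5)   (* ...but can it change y? *)
```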
Can we be sure that the assert won't fail, without knowing the definition of f? Assuming the language doesn't provide unsafe backdoors (such as OCaml's Obj.magic), we can: f x cannot change y, because f x does not have access to y.
So here is an access control system, built in to the lambda calculus itself!
At first glance this might not look very promising.
For example, while f doesn't have access to y, it does have access to any global variables defined before f. It also, typically, has access to the file-system and network, which are effectively globals too.
To make this useful, we ban global variables.
Then any top-level function like f can only access things passed to it explicitly as arguments.
Avoiding global variables is usually considered good practice, and some systems ban them for other reasons anyway (for example, Rust doesn't allow global mutable state, as it wouldn't be able to prevent races when multiple threads access it).
Returning to the Haskell example above (but now in OCaml syntax), it looks like this in our capability system:
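A sketch in OCaml (again, the body of f is just an illustration):

```ocaml
let f (x : int) : string = string_of_int (x * 2)  (* top-level and pure *)

let main ch = output_string ch (f 42)
```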
Since f is a top-level function, we know it does not close over any mutable state, and our 42 argument is pure data. Therefore, the call f 42 does not have access to, and therefore cannot affect, any pre-existing state (including the filesystem). Internally, it can use mutation (creating arrays, etc.), but it has nowhere to store any mutable values, so they will be GC'd after it returns. f therefore appears as a pure function, and calling it multiple times will always give the same result, just as in the Haskell version.
output_string is also a top-level function, closing over no mutable state. However, the function resulting from evaluating output_string ch is not top-level, and without knowing anything more about it we should assume it has full access to the output channel ch. If main is invoked with standard output as its argument, it may output a message to it, but cannot affect other pre-existing state.
In this way, we can reason about the pure parts of our code as easily as with Haskell, but we can also reason about the parts with side-effects. Haskell's purity is just a special case of a more general rule: the effects of a (top-level) function are bounded by its arguments.
So far, we've been thinking about what values are reachable through other values.
For example, the set of ref-cells that can be modified by f x is bounded by the union of the set of ref-cells reachable from the closure f with the set of ref-cells reachable from x.
One powerful aspect of capabilities is that we can use functions to implement whatever access controls we want.
For example, let's say we only want f to be able to set the ref-cell, but not read it. We can just pass it a suitable function:
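For instance (a sketch, with f left abstract):

```ocaml
let x = ref 0 in
let set v = x := v in   (* a write-only view of x *)
f set;
!x
```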
Or perhaps we only want to allow inserting positive integers:
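Something like:

```ocaml
let set v =
  if v > 0 then x := v
  else invalid_arg "only positive values allowed"
```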
Or we can allow access to be revoked:
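A sketch of revocation, via an extra level of indirection:

```ocaml
let target = ref (Some x) in
let set v =
  match !target with
  | Some x -> x := v
  | None -> failwith "access revoked"
in
let revoke () = target := None in
f set   (* after revoke (), set no longer affects x *)
```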
Or we could limit the number of times it can be used:
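For example, allowing at most three writes:

```ocaml
let remaining = ref 3 in
let set v =
  if !remaining > 0 then (decr remaining; x := v)
  else failwith "no uses left"
```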
Or log each time it is used, tagged with a label that's meaningful to us (e.g. the function to which we granted access):
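One possible shape for the logging wrapper:

```ocaml
let logged label set = fun v ->
  (* record who we granted the capability to, and how they used it *)
  Printf.printf "[%s] setting value to %d\n" label v;
  set v
in
let set v = x := v in
f (logged "f" set)
```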
Or all of the above.
In these examples, our function f never got direct access (permission) to x, yet was still able to affect it.
Therefore, in capability systems people often talk about "authority" rather than permission.
Roughly speaking, the authority of a subject is the set of actions that the subject could cause to happen,
now or in the future, on currently-existing resources.
Since it's only things that might happen, and we don't want to read all the code to find out exactly what
it might do, we're usually only interested in getting an upper-bound on a subject's authority,
to show that it can't do something.
The examples here all used a single function. We may want to allow multiple operations on a single value (e.g. getting and setting a ref-cell), and the usual techniques are available for doing that (e.g. having the function take the operation as its first argument, or collecting separate functions together in a record, module or object).
Let's look at a more realistic example.
Here's a simple web-server (we are defining the main function, which takes two arguments):
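Something like this, with the body elided for now:

```ocaml
let main net htdocs = ...
```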
To use it, we pass it access to some network (net) and a directory tree with the content (htdocs). Immediately we can see that this server does not access any part of the file-system outside of htdocs, but that it may use the network. Here's a picture of the situation:
Notes on reading the diagram:
htdocs is shown as part of home, so we can see that app doesn't have access to the rest of home. Just for emphasis, I also show .ssh separately.
I'm assuming here that a directory doesn't give access to its parent, so htdocs can only be used to read files within that sub-tree.
net represents the network and everything else connected to it.
The uncertain arrows are between app and net. Since we don't yet know anything about either, we would have to assume that app might give net access to htdocs and to itself.
So, the diagram above shows the application app has been given references to net and to htdocs as arguments.
Looking at our checklist from the start: the server can only read files under htdocs, so in particular it cannot touch ~/.ssh/authorized_keys.
We can read the body of the function to learn more:
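The body might look like this (the Net and Http modules and the port number are illustrative):

```ocaml
let main net htdocs =
  let socket = Net.listen net (`Tcp 8080) in
  let handler = static_files htdocs in
  Http.serve socket handler
```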
Note: Net.listen net is typical OCaml style for performing the listen operation on net. We could also have used a record and written net.listen instead, which may look more familiar to some readers.
Here's an updated diagram, showing the moment when Http.serve is called. The app group has been opened to show socket and handler separately:
We can see that the code in the HTTP library can only access the network via socket, and can only access htdocs by using handler. Assuming Net.listen is trustworthy (we'll normally trust the platform's networking layer), it's clear that the application doesn't make out-bound connections, since net is used only to create a listening socket.
To know what the application might do to htdocs, we only have to read the definition of static_files:
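A sketch (the Dir.load call and the request's path field are illustrative names):

```ocaml
let static_files dir = fun request ->
  Dir.load dir request.path   (* read-only use of dir *)
```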
Now we can see that the application doesn't change any files; it only uses htdocs to read them. Finally, expanding Http.serve:
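Its shape might be (Net.accept and handle_connection are illustrative):

```ocaml
let serve socket handler =
  while true do
    let conn = Net.accept socket in
    handle_connection conn handler
  done
```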
We see that handle_connection has no way to share telemetry information between connections, given that handle_request never stores anything.
We can tell these things after only looking at the code for a few seconds, even though dozens of libraries are being used.
In particular, we didn't have to read handle_connection or any of the HTTP parsing logic.
Now let's enable TLS. For this, we will require a configuration directory containing the server's key:
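A sketch of the TLS-enabled version (Tls.wrap and the port are illustrative):

```ocaml
let main net htdocs ~tls_config =
  let socket = Net.listen net (`Tcp 8443) in
  let tls_socket = Tls.wrap ~tls_config socket in  (* only Tls sees the key *)
  let handler = static_files htdocs in
  Http.serve tls_socket handler
```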
OCaml syntax note: I used ~ to make tls_config a named argument; we wouldn't want to get this directory confused with htdocs!
We can see that only the TLS library gets access to the key. The HTTP library interacts only with the TLS socket, which presumably does not reveal it.
Notice too how this fixes the problem we had with our original policy enforcement system.
There, an attacker could request https://example.com/../tls_config/server.key and the HTTP server might send the key. But here, the handler cannot do that even if it wants to. When handler loads a file, it does so via htdocs, which does not have access to tls_config.
The above server has pretty good security properties, even though we didn't make any special effort to write secure code. Security-conscious programmers will try to wrap powerful capabilities (like net) with less powerful ones (like socket) as early as possible, making the code easier to understand. A programmer uninterested in readability is likely to mix in more irrelevant code you have to skip through, but even so it shouldn't take too long to track down where things like net and htdocs end up. And even if they spread them throughout their entire application, at least you avoid having to read all the libraries too!
By contrast, consider a more traditional (non-capability) style. We start with:
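Something like this (serve is hypothetical; it reaches the network through globals):

```ocaml
let main () = serve "/srv/www"
```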
Here, htdocs would be a plain string rather than a reference to a directory, and the network would be reached through a global. We can't tell anything about what this server could do from looking at this one line, and even if we expand it, we won't be able to tell what all the functions it calls do, either. We will end up having to follow every function call recursively through all of the server's dependencies, and our analysis will be out of date as soon as any of them changes.
We've seen that we can create an over-approximation of the reference graph by looking at just a small part of the code,
and then get a closer bound on the possible effects as needed
by expanding groups of values until we can prove the desired property.
For example, to prove that the application didn't modify htdocs, we followed htdocs by expanding main and then static_files.
Within a single process, a capability is a reference (pointer) to another value in the process's memory. However, the diagrams also included arrows (capabilities) to things outside of the process, such as directories. We can regard these as references to privileged proxy functions in the process that make calls to the OS kernel, or (at a higher level of abstraction) we can consider them to be capabilities to the external resources themselves.
It is possible to build capability operating systems (in fact, this was the first use of capabilities). Just as we needed to ban global variables to make a safe programming language, we need to ban global namespaces to make a capability operating system. For example, on FreeBSD this is done (on a per-process basis) by invoking the cap_enter system call.
We can zoom out even further, and consider a network of computers. Here, an arrow between machines represents some kind of (unforgeable) network address or connection. At the IP level, any process can connect to any address, but a capability system can be implemented on top. CapTP (the Capability Transport Protocol) was an early system for this, but Cap'n Proto (Capabilities and Protocols) is the modern way to do it.
So, thinking in terms of capabilities, we can zoom out to look at the security properties of the whole network, yet still be able to expand groups as needed right down to the level of individual closures in a process.
Library code can be imported and called without it getting access to any pre-existing state, except that given to it explicitly. There is no "ambient authority" available to the library.
A function's side-effects are bounded by its arguments. We can understand (get a bound on) the behaviour of a function call just by looking at it.
If a has access to b and to c, then a can introduce them (e.g. by performing the function call b c).
Note that there is no capability equivalent to making something "world readable";
to perform an introduction,
you need access to both the resource being granted and to the recipient ("only connectivity begets connectivity").
Instead of passing the name of a resource, we pass a capability reference (pointer) to it, thereby proving that we have access to it and sharing that access ("no designation without authority").
The caller of a function decides what it should access, and can provide restricted access by wrapping another capability, or substituting something else entirely.
I am sometimes unable to install a messaging app on my phone because it requires me to grant it access to my address book. A capability system should never say "This application requires access to the address book. Continue?"; it should say "This application requires access to an address book; which would you like to use?".
A capability must behave the same way regardless of who uses it.
When we do f x, f can perform exactly the same operations on x that we can.
It is tempting to add a traditional policy language alongside capabilities for "extra security", saying e.g. "f cannot write to x, even if it has a reference to it". However, apart from being complicated and annoying, this creates an incentive for f to smuggle x to another context with more powers. This is the root cause of many real-world attacks, such as click-jacking or cross-site request forgery, where a URL permits an attack if a victim visits it, but not if the attacker does.
One of the great benefits of capability systems is that you don't need to worry that someone is trying to trick you
into doing something that you can do but they can't,
because your ability to access the resource they give you comes entirely from them in the first place.
All of the above follow naturally from using functions in the usual way, while avoiding global variables.
The above discussion argues that capabilities would have been a good way to build systems in an ideal world. But given that most current operating systems and programming languages have not been designed this way, how useful is this approach? I'm currently working on Eio, an IO library for OCaml, and using these principles to guide the design. Here are a few thoughts about applying capabilities to a real system.
A lot of people worry about cluttering up their code by having to pass things explicitly everywhere. This is actually not much of a problem, for a couple of reasons:
We already do this with most things anyway. If your program uses a database, you probably establish a connection to it at the start and pass the connection around as needed. You probably also pass around open file handles, configuration settings, HTTP connection pools, arrays, queues, ref-cells, etc. Handling "the file-system" and "the network" the same way as everything else isn't a big deal.
You can often bundle up a capability with something else. For example, a web-server will likely let the user decide which directory to serve, so you're already passing around a pathname argument. Passing a path capability instead is no extra work.
Consider a request handler that takes the address of a Redis server:
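For instance (a sketch):

```ocaml
let handle_request redis_address request = ...
```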
It might seem that by using capabilities we'd need to pass the network in here too:
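That is, something like:

```ocaml
let handle_request net redis_address request = ...
```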
This is both messy and unnecessary.
Instead, handle_request can take a function for connecting to Redis:
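Roughly:

```ocaml
let handle_request connect_to_redis request = ...
```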
Then there is only one argument to pass around again.
Instead of writing the connection logic in handle_request, we write the same logic outside and just pass in the function. And now someone looking at the code can see "the handler can connect to Redis", rather than the less precise "the handler accesses the network". Of course, if Redis required more than one configuration setting then you'd probably already be doing it this way.
The main problematic case is providing defaults.
For example, a TLS library might allow us to specify the location of the system's certificate store, but it would like to provide a default (e.g. /etc/ssl/certs/).
This is particularly important if the default location varies by platform.
If the TLS library decides the location, then we must give it (read-only at least) access to the whole system!
We may just decide to trust the library, or we might separate out the default paths into a trusted package.
Ideally, our programming language would provide a secure implementation of capabilities that we could depend on. That would allow running untrusted code safely and protect us from compromised packages. However, converting a non-capability language to a capability-secure one isn't easy, and isn't likely to happen any time soon for OCaml (but see Emily for an old proof-of-concept).
Even without that, though, capabilities help to protect non-malicious code from malicious inputs. For example, the request handler above forgot to sanitise the URL path from the remote client, but it still can't access anything outside of htdocs.
And even if we don't care about security at all, capabilities make it easy to see what a program does; they make it easy to test programs by replacing OS resources with mocks; and preventing access to globals helps to avoid race conditions, since two functions that access the same resource must be explicitly introduced.
A capability OS would let us run a program's main function and provide the capabilities it wanted directly, but most systems don't work like that. Instead, each program requires a small trusted entrypoint that has the full privileges of the process.
In Eio, an application will typically start something like this:
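A sketch of such an entrypoint (exact API details vary between Eio versions):

```ocaml
let () =
  Eio_main.run @@ fun env ->
  let net = Eio.Stdenv.net env in
  let htdocs = Eio.Path.(Eio.Stdenv.fs env / "srv/www") in  (* just this sub-tree *)
  main net htdocs
```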
Eio_main.run starts the Eio event loop and then runs the callback. The env argument gives full access to the process's environment. Here, the callback extracts network and filesystem access from this, gets access to just "/srv/www" from fs, and then calls the main function as before. Note that Eio_main.run itself is not a capability-safe function (it magics up env from nothing).
A capability-enforcing compiler would flag this bit up as needing to be audited manually.
Maybe you're not convinced by all this capability stuff.
Traditional security systems are more widely available, better tested, and approved by your employer,
and you want to use that instead.
Still, to write the policy, you're going to need a list of resources the program might access.
Looking at the above code, we can immediately see that the policy need allow access only to the "/srv/www" directory,
and so we could call e.g. unveil here.
And if main later changes to use TLS, the type-checker will let us know to update this code to provide the TLS configuration, and we'll know to update the policy at the same time.
If you want to drop privileges, such a program also makes it easy to see when it's safe to do that. For example, looking at main we can see that net is never used after creating the socket, so we don't need the bind system call after that, and we never need connect.
We know, for instance, that this program isn't hiding an XML parser that needs to download schema files to validate documents.
In addition to global and local variables, systems often allow us to attach data to threads as a sort of middle ground. This could allow unexpected interactions. For example:
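For example (a sketch, with f and g left abstract):

```ocaml
let x = ref 0 in
f x;    (* f might stash x in thread-local storage... *)
g ()    (* ...so can we be sure g can't reach x? *)
```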
Here, we'd expect that g doesn't have access to x, but f could pass it using thread-local storage. To prevent that, Eio instead provides Fiber.with_binding, which runs a function with a binding but then puts things back how they were before returning, so f can't make changes that are still active when g runs.
This also allows people who don't want capabilities to disable the whole system easily:
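A sketch of the escape hatch (env_key is a hypothetical fiber-local key; Fiber API details may differ between versions):

```ocaml
let env_key = Eio.Fiber.create_key ()

let () =
  Eio_main.run @@ fun env ->
  Eio.Fiber.with_binding env_key env @@ fun () ->
  f ()   (* takes no arguments, yet can recover env from env_key *)
```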
It looks like f () doesn't have access to anything, but in fact it can recover env and get access to everything! However, anyone trying to understand the code will start following env from the main entrypoint and will then see that it got put in fiber-local storage. They then at least know that they must read all the code to understand anything about what it can do.
More usefully, this mechanism allows us to make just a few things ambiently available. For example, we don't want to have to plumb stderr through to a function every time we want to do some printf debugging, so it makes sense to provide a tracing function this way (and Eio does this by default).
Tracing allows all components to write debug messages, but it doesn't let them read them.
Therefore, it doesn't provide a way for components to communicate with each other.
It might be tempting to use Fiber.with_binding to restrict access to part of a program (e.g. giving an HTTP server network access this way), but note that this is a non-capability way to do things, and suffers the same problems as traditional security systems, separating designation from authority. In particular, supposedly sandboxed code in other parts of the application can try to escape by tricking the HTTP server part into running a callback function for them. But fiber-local storage is fine for things to which you don't care to restrict access.
Symlinks are a bit of a pain! If I have a capability reference to a directory, it's useful to know that I can only access things beneath that directory. But the directory may contain a symlink that points elsewhere.
One option would be to say that a symlink is a capability itself, but this means that you could only create symlinks to things you can access yourself, and this is quite a restriction. For example, you might be forbidden from extracting a tarball because tar didn't have permission to the target of a symlink it wanted to create.
The other option is to say that symlinks are just strings, and it's up to the user to interpret them.
This is the approach FreeBSD uses. When you use a system call like openat, you pass a capability to a base directory and a string path relative to that. In the case of our web-server, we'd use a capability for htdocs, but use strings to reference things inside it, allowing the server to follow symlinks within that sub-tree, but not outside.
The main problem is that it makes the API a bit confusing. Consider:
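For example (save_to is a hypothetical helper):

```ocaml
save_to Eio.Path.(htdocs / "uploads") data
```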
It might look like save_to is only getting access to the "uploads" directory, but in Eio it actually gets access to the whole of htdocs. If you want to restrict access, you have to do that explicitly (as we did when creating htdocs from fs).
The advantage, however, is that we don't break software that relies on symlinks.
Also, restricting access is quite expensive on some systems (FreeBSD has the handy O_BENEATH open flag, and Linux has RESOLVE_BENEATH, but not all systems provide this), so it might not be a good default.
I'm not completely satisfied with the current API, though.
It is also possible to use capabilities to restrict access to time and randomness. The security benefits here are less clear. Tracking access to time can be useful in preventing side-channel attacks that depend on measuring time accurately, but controlling access to randomness makes it difficult to e.g. randomise hash functions to help prevent denial-of-service attacks.
However, controlling access to these does have the advantage of making code deterministic by default, which is a great benefit, especially for expect-style testing. Your top level test function is called with no arguments, and therefore has no access to non-determinism, instead creating deterministic mocks to use with the code under test. You can then just record a good trace of a test's operations and check that it doesn't change.
Interactive applications that load and save files present a small problem: since the user might load or save anywhere, it seems they need access to the whole file-system. The solution is a "powerbox". The powerbox has access to the file-system and the rest of the application only has access to the powerbox. When the application wants to save a file, it asks the powerbox, which pops up a GUI asking the user to choose the location. Then it opens the file and passes that back to the application.
Currently-popular security mechanisms are complex and have many shortcomings. Yet, the lambda calculus already contains an excellent security mechanism, and making use of it requires little more than avoiding global variables.
This is known as "capability-based security". The word "capabilities" has also been used for several unrelated concepts (such as "POSIX capabilities"), and for clarity much of the community rebranded a while back as "Object Capabilities", but this can make it seem irrelevant to functional programmers. In fact, I wrote this blog post because several OCaml programmers have asked me what the point of capabilities is. I was expecting it to be quite short (basically: applying functions to arguments good, global variables bad), but it's got quite long; it seems there is a fair bit that follows from this simple idea!
Instead of seeing security as an extra layer that runs separately from the code and tries to guess what it meant to do, capabilities fit naturally into the language. The key difference with traditional security is that the ability to do something depends on the reference used to do it, not on the identity of the caller. This way of thinking about security works not only for controlling access to resources within a single program, but also for controlling interactions between processes running on a machine, and between machines on a network. We can group together resources and zoom out to see the overall picture, or expand groups to zoom in and get a closer bound on the behaviour.
Even ignoring security, a key question is: what can a function do?
Should a function call be able to do anything at all that the process can do,
or should its behaviour be bounded in some way that is obvious just by looking at it?
If we say that you must read the source code of a function to see what it does, then this applies recursively:
we must also read all the functions that it calls, and so on.
To understand the main function, we end up having to read the code of every library it uses!
If you want to read more, the What Are Capabilities? blog post provides a good overview; Part II of Robust Composition contains a longer explanation; Capability Myths Demolished does a good job of enumerating security properties provided by capabilities; my own SERSCIS Access Modeller paper shows how to analyse systems where some components have unknown behaviour; and, for historical interest, see Dennis and Van Horn's 1966 Programming Semantics for Multiprogrammed Computations, which introduced the idea.
A graphical desktop typically allows running multiple applications on a single display (e.g. by showing each application in a separate window). Client applications connect to a server process (usually on the same machine) and ask it to display their windows.
Until recently, this service was an X server, and applications would communicate with it using the X11 protocol. However, on newer systems the display is managed by a Wayland compositor, using the Wayland protocol.
Many older applications haven't been updated yet. Xwayland can be used to allow unmodified X11 applications to run in a Wayland desktop environment. However, setting this up wasn't as easy as I'd hoped. Ideally, Xwayland would completely isolate the Wayland compositor from needing to know anything about X11:
However, it doesn't work like this. Xwayland handles X11 drawing operations, but it doesn't handle lots of other details, including window management (e.g. telling the Wayland compositor what the window title should be), copy-and-paste, and selections. Instead, the Wayland compositor is supposed to connect back to Xwayland over the X11 protocol and act as an X11 window manager to provide the missing features:
This is a problem for several reasons:
Because Wayland (unlike X11) doesn't allow applications to mess with other applications' windows, we can't have a third-party application act as the X11 window manager. It wouldn't have any way to ask the compositor to put Xwayland's surfaces into a window frame, because Xwayland is a separate application.
There is another way to do it, however. As I mentioned in the last post, I already had to write a Wayland proxy (wayland-proxy-virtwl) to run in each VM and relay Wayland messages over virtwl, so I decided to extend it to handle Xwayland too. As a bonus, the proxy can also be used even without VMs, avoiding the need for any X11 support in Wayland compositors at all. In fact, I found that doing this avoided several bugs in Sway's built-in Xwayland support.
Sommelier already has support for this, but it doesn't work for the applications I want to use.
For example, popup menus appear in the center of the screen, text selections don't work, and it generally crashes after a few seconds (often with the error xdg_surface has never been configured).
So instead I'd been using ssh -Y vm from the host to forward X11 connections to the host's Xwayland, managed by Sway. That works, but it's not at all secure.
Unlike Wayland, where applications are mostly unaware of each other, X is much more collaborative.
The X server maintains a tree of windows (rectangles) and the applications manipulate it.
The root of the tree is called the root window and fills the screen.
You can see the tree using the xwininfo
command, like this:
$ xwininfo -tree -root
xwininfo: Window id: 0x47 (the root window) (has no name)

  Root window id: 0x47 (the root window) (has no name)
  Parent window id: 0x0 (none)
     9 children:
     0x800112 "~/Projects/wayland/wayland-proxy-virtwl": ("ROX-Filer" "ROX-Filer") 2184x2076+0+0 +0+0
        1 child:
        0x800113 (has no name): () 1x1+-1+-1 +-1+-1
     0x800123 (has no name): () 1x1+-1+-1 +-1+-1
     0x800003 "ROX-Filer": () 10x10+-100+-100 +-100+-100
     0x800001 "ROX-Filer": ("ROX-Filer" "ROX-Filer") 10x10+10+10 +10+10
        1 child:
        0x800002 (has no name): () 1x1+-1+-1 +9+9
     0x600002 "main.ml (~/Projects/wayland/wayland-proxy-virtwl) - GVIM1": ("gvim" "Gvim") 1648x1012+0+0 +0+0
        1 child:
        0x600003 (has no name): () 1x1+-1+-1 +-1+-1
     0x600007 (has no name): () 1x1+-1+-1 +-1+-1
     0x600001 "Vim": ("gvim" "Gvim") 10x10+10+10 +10+10
     0x200002 (has no name): () 1x1+0+0 +0+0
     0x200001 (has no name): () 1x1+0+0 +0+0
This tree shows the windows of two X11 applications, ROX-Filer and GVim, as well as various invisible utility windows (mostly 1x1 or 10x10 pixels in size).
Applications can create, move, resize and destroy windows, draw into them, and request events from them.
The X server also allows arbitrary data to be attached to windows in properties.
You can see a window's properties with xprop
. Here are some of the properties on the GVim window:
$ xprop -id 0x600002
WM_HINTS(WM_HINTS):
Client accepts input or input focus: True
Initial state is Normal State.
window id # of group leader: 0x600001
_NET_WM_WINDOW_TYPE(ATOM) = _NET_WM_WINDOW_TYPE_NORMAL
WM_NORMAL_HINTS(WM_SIZE_HINTS):
program specified minimum size: 188 by 59
program specified base size: 188 by 59
window gravity: NorthWest
WM_CLASS(STRING) = "gvim", "Gvim"
WM_NAME(STRING) = "main.ml (~/Projects/wayland/wayland-proxy-virtwl) - GVIM1"
...
The X server itself doesn't know anything about e.g. window title bars. Instead, a window manager process connects and handles that. A window manager is just another X11 application. It asks to be notified when an application tries to show ("map") a window inside the root, and when that happens it typically creates a slightly larger window (with room for the title bar, etc) and moves the other application's window inside that.
This design gives X a lot of flexibility. All kinds of window managers have been implemented, without needing to change the X server itself. However, it is very bad for security. For example:
1. Run xterm and use xwininfo to find its window ID (you need the nested child window, not the top-level one).
2. Run e.g. xev -id 0x80001b -event keyboard in another window (using the ID you got above).
3. Run sudo or similar inside xterm and enter a password.
As you type the password into xterm, you should see the characters being captured by xev.
An X application can easily spy on another application, send it synthetic events, etc.
Xwayland is a version of the xorg X server that treats Wayland as its display hardware.
If you run it as e.g. Xwayland :1
then it opens a single Wayland window corresponding to the X root window,
and you can use it as a nested desktop.
This isn't very useful, because these windows don't fit in with the rest of your desktop.
Instead, it is normally used in rootless mode, where each child of the X root window may have its own Wayland window.
$ WAYLAND_DEBUG=1 Xwayland :1 -rootless
[3991465.523] -> wl_display@1.get_registry(new id wl_registry@2)
[3991465.531] -> wl_display@1.sync(new id wl_callback@3)
...
When run this way, however, no windows actually appear.
If we run DISPLAY=:1 xterm
then we see Xwayland creating some buffers, but no surfaces:
[4076460.506] -> wl_shm@4.create_pool(new id wl_shm_pool@15, fd 9, 540)
[4076460.520] -> wl_shm_pool@15.create_buffer(new id wl_buffer@24, 0, 9, 15, 36, 0)
[4076460.526] -> wl_shm_pool@15.destroy()
...
We need to run Xwayland as Xwayland :1 -rootless -wm FD
, where FD is a socket we will use to speak the X11 protocol and act as a window manager.
It's a little hard to find information about Xwayland's rootless mode, because "rootless" has two other common meanings in xorg: running the X server without root privileges, and drawing application windows directly onto an existing desktop rather than inside a root window. After a while, it became clear that Xwayland's rootless mode isn't either of these, but a third xorg feature also called "rootless".
libxcb provides C bindings to the X11 protocol, but I wanted to program in OCaml. Luckily, the X11 protocol is well documented, and generating the messages directly didn't look any harder than binding libxcb, so I wrote a little OCaml library to do this (ocaml-x11).
At first, I hard-coded the messages. For example, the code to delete a property on a window just fills in the request's window and property fields at fixed offsets. I used the cstruct syntax extension to define the exact layout of the message body; for this request, it generates sizeof_req, set_req_window and set_req_property automatically.
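For a concrete picture of what hard-coding a message involves, here is a stand-in sketch that marshals a DeleteProperty request using only the standard library (the real code uses the cstruct-generated setters instead, and an X11 client actually negotiates its byte order at connection setup; little-endian is assumed here):

```ocaml
(* Sketch only: build a core-protocol DeleteProperty request by hand.
   Layout: opcode (19), one unused byte, a 16-bit length in 4-byte units,
   then the window and property (atom) IDs. *)
let delete_property ~window ~property =
  let req = Bytes.create 12 in
  Bytes.set_uint8 req 0 19;        (* opcode: DeleteProperty *)
  Bytes.set_uint8 req 1 0;         (* unused *)
  Bytes.set_uint16_le req 2 3;     (* request length: 3 * 4 = 12 bytes *)
  Bytes.set_int32_le req 4 window;
  Bytes.set_int32_le req 8 property;
  req
```

The generated cstruct setters do essentially the same fixed-offset writes, with the layout declared once instead of scattered through the code.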
After a bit, I discovered that there are XML files in xcbproto describing the X11 protocol. The project also provides a Python library for parsing the XML, which you use by writing a Python script for your language of choice. For example, this glorious 3394 line Python script generates the C bindings. After studying this script carefully, I decided that hard-coding everything wasn't so bad after all.
I ended up having to implement more messages than I expected,
including some surprising ones like OpenFont
(see x11.mli for the final list).
My implementation came to 1754 lines of OCaml,
which is quite a bit shorter than the Python generator script,
so I guess I still came out ahead!
In the X11 protocol, client applications send requests and the server sends replies, errors and events. Most requests don't produce replies, but can produce errors. Replies and errors are returned immediately, so if you see a response to a later request, you know all previous ones succeeded. If you care about whether a request succeeded, you may need to send a dummy message that generates a reply after it. Since message sequence numbers are 16-bit, after sending 0xffff consecutive requests without replies, you should send a dummy one with a reply to resynchronise (but window management involves lots of round-trips, so this isn't likely to be a problem for us). Events can be sent by the server at any time.
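The resynchronisation rule can be modelled like this (a simplified sketch, not the library's actual code; the names are illustrative):

```ocaml
(* Sequence numbers are 16-bit, so after 0xffff consecutive requests with
   no reply we must send a dummy request that has a reply (e.g.
   GetInputFocus) to resynchronise.  Track requests sent since the last
   reply and reset the counter when a sync is forced. *)
let max_unsynced = 0xffff

type conn = { mutable unsynced : int }

let send_request t =
  t.unsynced <- t.unsynced + 1;
  if t.unsynced >= max_unsynced then begin
    (* ...the real code would send a request with a reply here and wait... *)
    t.unsynced <- 0
  end

(* Send [n] reply-less requests and report how many are still unsynced. *)
let simulate n =
  let t = { unsynced = 0 } in
  for _ = 1 to n do send_request t done;
  t.unsynced
```

As the text notes, window management involves lots of round-trips anyway, so in practice the proxy is unlikely to get anywhere near the limit.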
Unlike Wayland, which is very regular, X11 has various quirks.
For example, every event has a sequence number at offset 2, except for KeymapNotify
.
Using Xwayland -wm FD
actually prevents any client applications from connecting at all at first,
because Xwayland then waits for the window manager to be ready before accepting any client connections.
To fix that, we need to claim ownership of the WM_S0
selection.
A "selection" is something that can be owned by only one application at a time.
Selections were originally used to track ownership of the currently-selected text, and later also used for the clipboard.
WM_S0
means "Window Manager for Screen 0" (Xwayland only has one screen).
Instead of passing things like WM_S0
as strings in each request, X11 requires us to first intern the string.
This returns a unique 32-bit ID for it, which we use in future messages.
Because intern
may require a round-trip to the server, it returns a promise,
and so we use let*
instead of let
to wait for that to resolve before continuing.
let*
is defined in the Lwt.Syntax
module, as an alternative to the more traditional >>=
notation.
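Putting intern and let* together, the WM_S0 claim has roughly this shape. The following is a self-contained toy: option stands in for Lwt promises, and the atom number and helper names are invented for illustration (they are not the real library's API):

```ocaml
(* [let*] sequences monadic values; with Lwt it waits for a promise.
   Here [option] is the stand-in monad: [let*] short-circuits on [None]. *)
let ( let* ) x f = match x with Some v -> f v | None -> None

(* Pretend [intern]: resolve a string to its unique 32-bit atom ID.
   The ID 275 is made up for this sketch. *)
let intern = function
  | "WM_S0" -> Some 275
  | _ -> None

(* Claim the "Window Manager for Screen 0" selection. *)
let claim name =
  let* atom = intern name in          (* wait for the atom to resolve *)
  Some (`SetSelectionOwner atom)
```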
This lets our clients connect. However, Xwayland still isn't creating any Wayland surfaces. By reading the Sommelier code and stepping through Xwayland with a debugger, I found that I needed to enable the Composite extension.
Composite was originally intended to speed up redraw operations, by having the server keep a copy of every top-level window's pixels (even when obscured), so that when you move a window it can draw it right away without asking the application for help. The application's drawing operations go to the window's buffer, and then the buffer is copied to the screen, either automatically by the X server or manually by the window manager. Xwayland reuses this mechanism, by turning each window buffer into a Wayland surface. We just need to turn that on:
Enabling manual redirection for every child of the root window does the trick. With that, we finally see Xwayland creating Wayland surfaces:
-> wl_compositor@5.create_surface id:+28
Now we just need to make them appear on the screen!
As usual for Wayland, we need to create a role object and attach it to the surface. This tells Wayland whether the surface is a window or a dialog, for example, and lets us set the title, etc.
But first we have a problem: we need to know which X11 window corresponds to each Wayland surface.
For example, we need the title, which is stored in a property on the X11 window.
Xwayland does this by sending the new window a ClientMessage event of type WL_SURFACE_ID
containing the Wayland ID.
We don't get this message by default, but it seems that selecting SubstructureRedirect
on the root does the trick.
SubstructureRedirect
is used by window managers to intercept attempts by other applications to change the children of the root window.
When an application asks the server to e.g. map a window, the server just forwards the request to the window manager.
Operations performed by the window manager itself do not get redirected, so it can just perform the same request the client wanted, or
make any changes it requires.
In our case, we don't actually need to modify the request: when the redirected map request arrives, the proxy simply performs the same map operation itself.
Having two separate connections to Xwayland is quite annoying, because messages can arrive in any order.
We might get the X11 ClientMessage
first and need to wait for the Wayland create_surface
, or we might get the create_surface
first
and need to wait for the ClientMessage
.
An added complication is that not all Wayland surfaces correspond to X11 windows.
For example, Xwayland also creates surfaces representing cursor shapes, and these don't have X11 windows.
However, when we get the ClientMessage
we can be sure that a Wayland message is on the way,
so I just pause the X11 event handling until it has arrived.
Another complication is that Wayland doesn't allow you to attach a buffer to a surface until the window has been "configured". Doing so is a protocol error, and Sway will disconnect us if we try! But Xwayland likes to attach the buffer immediately after creating the surface.
To avoid this, I use a queue:
When Xwayland creates a new surface, add it to an "unpaired" map, and create a queue for further events.
When the ClientMessage arrives over the X11 connection, create a role for the new surface.
When the compositor sends the configure event, confirming it's ready for the buffer, flush the queued events to it.
However, this creates a new problem: if the surface isn't a window then the events will be queued forever.
To fix that, when we get a create_surface
we also do a round-trip on the X11 connection.
If the window is still unpaired when that returns then we know that no ClientMessage
is coming, and we flush the queue.
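The pairing logic above can be modelled like this (a simplified sketch, not the proxy's actual code; real event delivery and role creation are elided):

```ocaml
(* Wayland events for a surface are queued until the X11 ClientMessage
   pairs it with a window; an X11 round-trip flushes surfaces that will
   never be paired (e.g. cursor surfaces). *)
type state =
  | Unpaired of string Queue.t      (* queued Wayland events *)
  | Paired of int                   (* the X11 window ID *)

let surfaces : (int, state) Hashtbl.t = Hashtbl.create 8

let create_surface id =
  Hashtbl.replace surfaces id (Unpaired (Queue.create ()))

let wayland_event id ev =
  match Hashtbl.find_opt surfaces id with
  | Some (Unpaired q) -> Queue.add ev q        (* hold until paired *)
  | Some (Paired _) | None -> ()               (* deliver immediately *)

let client_message ~surface ~window =
  match Hashtbl.find_opt surfaces surface with
  | Some (Unpaired q) ->
    Hashtbl.replace surfaces surface (Paired window);
    Queue.iter (fun _ev -> ()) q               (* deliver queued events *)
  | _ -> ()

(* After an X11 round-trip, anything still unpaired has no window coming. *)
let x11_roundtrip_done () =
  let stale =
    Hashtbl.fold
      (fun id st acc ->
         match st with Unpaired _ -> id :: acc | Paired _ -> acc)
      surfaces []
  in
  List.iter (Hashtbl.remove surfaces) stale
```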
X applications like to create dummy windows for various purposes (e.g. receiving clipboard data),
and we need to avoid showing those.
They're normally set as override_redirect
so the window manager doesn't handle them,
but Xwayland redirects them anyway (it needs to because otherwise e.g. tooltips wouldn't appear at all).
I'm trying various heuristics to detect this, e.g. that override redirect windows with a size of 1x1 shouldn't be shown.
If Sway asks us to close a window, we need to relay that to the X application using the WM_DELETE_WINDOW
protocol,
if it supports that (applications advertise support in their WM_PROTOCOLS property).
Wayland defaults to using client-side decorations (where the application draws its own window decorations). X doesn't do that, so we need to turn it off by requesting server-side decorations for each X11 window, if the Wayland compositor supports the decoration manager extension.
Dialog boxes are more of a problem.
Wayland requires every dialog box to have a parent window, but X11 doesn't.
To handle that, the proxy tracks the last window the user interacted with and uses that as a fallback parent
if an X11 window with type _NET_WM_WINDOW_TYPE_DIALOG
is created without setting WM_TRANSIENT_FOR
.
That could be a problem if the application closes that window, but it seems to work.
I noticed a strange problem: scrolling around in GVim had long pauses once a second or so, corresponding to OCaml GC runs. This was surprising, as OCaml has a fast incremental garbage collector, and is normally not a problem for interactive programs. Besides, I'd been using the proxy with the (Wayland) Firefox and xfce4-terminal applications for 6 months without any similar problem.
Using perf
showed that Linux was spending a huge amount of time in release_pages
.
The problem is that Xwayland was sharing lots of short-lived memory pools with the proxy.
Each time it shares a pool, we have to ask the VM host for a chunk of memory of the same size.
We map both pools into our address space and then copy each frame across
(this is needed because we can't export guest memory to the host).
Normally, an application shares a single pool and just refers to regions within it, so we just map once at startup and unmap at exit. But Xwayland was creating, sharing and discarding around 100 pools per second while scrolling in GVim! Because these pools take up a lot of RAM, OCaml was (correctly) running the GC very fast, freeing them in batches of 100 or so each second.
First, I tried adding a cache of host memory, but that only solved half the problem: freeing the client pool was still slow.
Another option is to unmap the pools as soon as we get the destroy message, to spread the work out. Annoyingly, OCaml's standard library doesn't let you free memory-mapped memory explicitly (see the Add BigArray.Genarray.free PR for the current status), but adding this myself with a bit of C code would have been easy enough. We only touch the memory in one place (for the copy), so manually checking it hadn't been freed would have been pretty safe.
Then I noticed something interesting about the repeated log entries, which mostly looked like this:
-> wl_shm@4.create_pool id:+26 fd:(fd) size:8368360
-> wl_shm_pool@26.create_buffer id:+28 offset:0 width:2090 height:1001 stride:8360 format:1
-> wl_shm_pool@26.destroy
<- wl_display@1.delete_id id:26
-> wl_buffer@28.destroy
<- wl_display@1.delete_id id:28
Xwayland creates a pool, allocates a buffer within it, destroys the pool (so it can't create more buffers), and then deletes the buffer. But it never uses the buffer for anything!
So the solution was simple: I just made the host buffer allocation and the mapping operations lazy. We force the mapping if a pool's buffer is ever attached to a surface, but if not we just close the FD and forget about it. It would be more efficient if Xwayland only shared the pools when needed, though.
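The lazy-mapping idea can be modelled with OCaml's built-in laziness (a simplified sketch; the real proxy maps virtwl host memory rather than allocating bytes, and the counter here exists only to make the behaviour observable):

```ocaml
(* Count how many expensive map operations actually happen. *)
let maps_performed = ref 0

(* The mapping work is suspended in [lazy ...] and only runs if forced. *)
type pool = { data : bytes Lazy.t }

let create_pool size =
  { data = lazy (incr maps_performed; Bytes.create size) }

(* Only attaching a buffer from the pool to a surface forces the mapping. *)
let attach_buffer pool = ignore (Lazy.force pool.data)

(* On destroy, there is only something to unmap if we ever forced it. *)
let destroy_pool pool =
  if Lazy.is_val pool.data then () (* unmap here *)
```

Pools that Xwayland creates and immediately discards never trigger the expensive allocation at all.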
Wayland delivers pointer events relative to a surface, so we simply forward these on to Xwayland unmodified and everything just works.
I'm kidding - this was the hardest bit! When Xwayland gets a pointer event on a window, it doesn't send it directly to that window. Instead, it converts the location to screen coordinates and then pushes the event through the old X event handling mechanism, which looks at the X11 window stack to decide where to send it.
However, the X11 window stack (which we saw earlier with xwininfo -tree -root
) doesn't correspond to the Wayland window layout at all.
In fact, Wayland doesn't provide us any way to know where our windows are, or how they are stacked.
Sway seems to handle this via a backdoor: X11 applications do get access to location information even though native Wayland clients don't. This is one of the reasons I want to get X11 support out of the compositor - I want to make sure X11 apps don't have any special access. Sommelier has a solution though: when the pointer enters a window we raise it to the top of the X11 stack. Since it's the topmost window, it will get the events.
Unfortunately, the raise request goes over the X11 connection while the pointer events go over the Wayland one. We need to make sure that they arrive in the right order. If the computer is running normally, this isn't much of a problem, but if it's swapping or otherwise struggling it could result in events going to the wrong place (I temporarily added a 2-second delay to test this). What I ended up with: when the pointer enters a window, pause the stream of events from Sway, send the raise over the X11 connection, do a round-trip on each connection to be sure everything has been processed, and then resume delivering the Wayland events.
At first I tried queuing up just the pointer events, but that doesn't work because e.g. keyboard events need to be synchronised with pointer events. Otherwise, if you e.g. Shift-click on something then the click gets delayed but the Shift doesn't and it can do the wrong thing. Also, Xwayland might ask Sway to destroy the window while we're entering it, and Sway might confirm the deletion. Pausing the whole event stream from Sway fixes all these problems.
The next problem was how to do the two round-trips.
For X11 we just send an Intern
request after the raise and wait to get a reply to that.
Wayland provides the wl_display.sync
method to clients, but we're acting as a Wayland server to Xwayland,
not a client.
I remembered that Wayland's xdg-shell extension provides a ping from the server to the client
(the compositor can use this to detect when an application is not responding).
Unfortunately, Xwayland has no reason to use this extension because it doesn't deal with window roles.
Luckily, it uses it anyway (it does need it for non-rootless mode and doesn't bother to check).
wl_display.sync
works by creating a fresh callback object, but xdg-shell's ping
just sends a pong
event to a fixed object,
so we also need a queue to keep track of pings in flight so we don't get confused between our pings and any pings we're relaying for Sway.
Also, xdg-shell's ping requires a serial number and we don't have one.
But since Xwayland is the only app this needs to support, and it doesn't look at that, I cheat and just send zero.
And that's how to get pointer events to go to the right window with Xwayland.
A very similar problem exists with the keyboard.
When Wayland says the focus has entered a window
we need to send a SetInputFocus
over the X11 connection
and then send the keyboard events over the Wayland one,
requiring another two round-trips to synchronise the two connections.
Some applications set their own pointer shape, which works fine.
But others rely on the default and for some reason you get no cursor at all in that case.
To fix it, you need to set a cursor on the root window, which applications will then inherit by default.
Unlike Wayland, where every application provides its own cursor bitmaps,
X very sensibly provides a standard set of cursors, in a font called cursor
(this is why I had to implement OpenFont
).
As cursors have two colours and a mask, each cursor is two glyphs: even-numbered glyphs are the image, and the following glyph is its mask.
The next job was to get copying text between X and Wayland working.
In X11:
When you select some text, your application takes ownership of the PRIMARY selection.
When you middle-click in another application, it asks the owner for the contents of PRIMARY.
When you press Ctrl-C, your application takes ownership of the CLIPBOARD selection.
When you paste (e.g. with Ctrl-V), the application asks the owner for the contents of CLIPBOARD.
It's quite neat that adding support for a Windows-style clipboard didn't require changing the X server at all. Good forward-thinking design there.
In Wayland, things are not so simple. I have so far found no less than four separate Wayland protocols for copying text:
gtk_primary_selection
supports copying the primary selection, but not the clipboard.
wp_primary_selection_unstable_v1
is identical to gtk_primary_selection
except that it renames everything.
wl_data_device_manager
supports clipboard transfers but not the primary selection.
zwlr_data_control_manager_v1
supports both, but it's for a "privileged client" to be a clipboard manager.
gtk_primary_selection
and wl_data_device_manager
both say they're stable, while the other two are unstable.
However, Sway dropped support for gtk_primary_selection
a while ago, breaking many applications
(luckily, I had a handy Wayland proxy and was able to add some adaptor code
to route gtk_primary_selection
messages to the new "unstable" protocol).
For this project, I went with wp_primary_selection_unstable_v1
and wl_data_device_manager
.
On the Wayland side, everything has to be written twice for the two protocols, which are almost-but-not-quite the same.
In particular, wl_data_device_manager
also has a load of drag-and-drop stuff you need to ignore.
For each selection (PRIMARY or CLIPBOARD), we can be in one of two states: either an X11 client owns the selection (and we own the corresponding Wayland selection on its behalf), or a Wayland client owns it (and we own the X11 one). When we own a selection, we proxy requests for it to the matching selection on the other protocol.
One good thing about the Wayland protocols is that you send the data by writing it to a normal Unix pipe. For X11, we need to write the data to a property on the requesting application's window and then notify it about the data. And we may need to split it into multiple chunks if there's a lot of data to transfer.
A strange problem I had was that, while pasting into GVim worked fine, xterm would segfault shortly after trying to paste into it.
This turned out to be a bug in the way I was sending the notifications.
If an X11 application requests the special TEXT
target, it means that the sender should choose the exact format.
You write the property with the chosen type (e.g. UTF8_STRING
),
but you must still send the notification with the target TEXT
.
xterm is a C application (thankfully no longer set-uid!) and seems to have a use-after-free bug in the timeout code.
Sadly, I wasn't able to get drag-and-drop working at all. X itself doesn't know anything about drag-and-drop and instead applications look at the window tree to decide where the user dropped things. This doesn't work with the proxy, because Wayland doesn't tell us where the windows really are on the screen.
Even without any VMs or proxies, drag-and-drop from X applications to Wayland ones doesn't work, because the X app can't see the Wayland window and the drop lands on the X window below (if any).
In the last post, I mentioned several other problems, which have also now been solved by the proxy:
Wayland's support for high resolution screens is a bit strange. I would have thought that applications really only need to know two things: the size of their window in pixels, and how large the user wants things like text to appear.
Some systems instead provide the size of the window and the DPI (dots-per-inch), but this doesn't work well. For example, a mobile phone might be high DPI but still want small text because you hold it close to your face, while a display board will have very low DPI but want large text.
Wayland instead redefines the idea of pixel to be a group of pixels corresponding to a single pixel on a typical 1990's display. So if you set your scale factor to 2 then 1 Wayland pixel is a 2x2 grid of physical pixels. If you have a 1000x1000 pixel window, Wayland will tell the application it is 500x500 but suggest a scale factor of 2. If the application supports HiDPI mode, it will double all the numbers and render a 1000x1000 image and things work correctly. If not, it will render a 500x500 pixel image and the compositor will scale it up.
Since Xwayland doesn't support this, it just draws everything too small and Sway scales it up, creating a blurry and unusable mess. This might be made worse by subpixel rendering, which doesn't cope well with being scaled.
With the proxy, the solution is simple enough: when talking to Xwayland we just scale everything back up to the real dimensions, multiplying sizes and coordinates by the scale factor in one direction and dividing them in the other as we relay each message.
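With --x-unscale=2, the relaying amounts to this (a sketch of the arithmetic only; the real proxy rewrites the fields inside each protocol message). Note the integer division, which is exactly where the odd-size rounding problem discussed below comes from:

```ocaml
(* Scale factor from --x-unscale=2: one Wayland unit is a 2x2 pixel grid. *)
let scale = 2

(* Sizes and coordinates from the compositor are scaled up for Xwayland,
   so X applications work in real pixels... *)
let to_xwayland (w, h) = (w * scale, h * scale)

(* ...and geometry coming back from Xwayland is scaled down again. *)
let from_xwayland (x, y) = (x / scale, y / scale)
```

So a surface the compositor reports as 500x500 appears to X applications as a sharp 1000x1000 window, but a 1001-pixel dimension cannot round-trip exactly.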
This will tend to make things sharp but too small, but X applications already have their own ways to handle high resolution screens.
For example, you can set Xft.dpi
to make all the fonts bigger. I run this proxy like this, which works for me:
wayland-proxy-virtwl --x-display=0 --xrdb Xft.dpi:150 --x-unscale=2
However, there is a problem. The Wayland specification says:
The new size of the surface is calculated based on the buffer size transformed by the inverse buffer_transform and the inverse buffer_scale. This means that at commit time the supplied buffer size must be an integer multiple of the buffer_scale. If that's not the case, an invalid_size error is sent.
Let's say we have an X11 image viewer that wants to show a 1001-pixel-high image in a 1001-pixel-high window. This isn't allowed by the spec, which can only handle even-sized windows when the scale factor is 2. Regular Wayland applications already have to deal with that somehow, but for X11 applications it becomes our problem.
I tried rounding down, but that has a bad side-effect: if GTK asks for a 1001-pixel high menu and gets a 1000 pixel allocation, it switches to squashed mode and draws two big bumper arrows at the top and bottom of the menu which you must use to scroll it. It looks very silly.
I also tried rounding up, but tooltips look bad with any rounding. Either one border is missing, or it's double thickness. Luckily, it seems that Sway doesn't actually enforce the rule about surfaces being a multiple of the scale factor. So, I just let the application attach a buffer of whatever size it likes to the surface and it seems to work!
The only problem I had was that when using unscaling, the mouse pointer in GVim would get lost. Vim hides it when you start typing, but it's supposed to come back when you move the mouse. The problem seems to be that it hides it by creating a 1x1 pixel cursor. Sway decides this isn't worth showing (maybe because it's 0x0 in Wayland-pixels?), and sends Xwayland a leave event saying the cursor is no longer on the screen. Then when Vim sets the cursor back, Xwayland doesn't bother updating it, since it's not on screen!
The solution was to stop applying unscaling to cursors. They look better doubled in size, anyway. True, this does mean that the sharpness of the cursor changes as you move between windows, but you're unlikely to notice this due to the far more jarring effect of Wayland cursors also changing size and shape at the same time.
Even without a proxy to complicate things, Wayland applications often have problems. To make investigating this easier, I added a ring-buffer log feature. When on, the proxy keeps the last 512K or so of log messages in memory, and will dump them out on demand.
To use it, you run the proxy with e.g. -v --log-ring-path ~/wayland.log
.
When something odd happens (e.g. an application crashes, or opens its menus in the wrong place) you can
dump out the ring buffer and see what just happened with:
echo dump-log > /run/user/1000/wayland-1-ctl
I also added some filtering options (e.g. --log-suppress motion,shm
) to suppress certain classes of noisy messages.
One annoyance with Sway is that Vim's window always appears blank (even when running on the host, without any proxy). You have to resize it before you can see the text.
My proxy initially suffered from the same problem, although only intermittently.
It turned out to be because Vim sends a ConfigureRequest
with its desired size and then waits for the confirmation message.
Since Sway is a tiling window manager, it ignores the new size and no event is generated.
In this case, an X11 window manager is supposed to send a synthetic ConfigureNotify
,
so I just got the proxy to do that and the problem disappeared
(I confirmed this by adding a sleep to Vim's gui_mch_update
).
By the way, the GVim start-up code is quite interesting.
The code path to opening the window goes though three separate functions which each define a
static int recursive = 0
and then proceed to behave differently depending on how many times they've
been reentered - see gui_init for an example!
The other major annoyance with Sway is that copy-and-paste doesn't work correctly (Sway bug #1839). Using the proxy avoids that problem completely.
I'm not sure how I feel about this project.
It ended up taking a lot longer than I expected, and I could probably have ported several X11 applications to Wayland in the same time.
On the other hand, I now have working X support in the VMs with no need for ssh -Y
from the host, plus support for HiDPI in Wayland, mouse cursors that are large enough to see easily, windows that open reliably, text pasting that works, and I can get logs whenever something misbehaves.
In fact, I'm now also running an instance of the proxy directly on the host to get the same benefits for host X11 applications.
Setting this up is actually a bit tricky:
you want to start Sway with DISPLAY=:0
so that every application it spawns knows it has an X11 display,
but if you set that then Sway thinks you want it to run nested inside an X window provided by the proxy,
which doesn't end well (or, indeed, at all).
Having all the legacy X11 support in a separate binary should make it much easier to write new Wayland compositors, which might be handy if I ever get some time to try that. It also avoids having many thousands of lines of legacy C code in the highly-trusted compositor code.
If Wayland had an official protocol for letting applications know the window layout then I could make drag-and-drop between X11 applications within the same VM work, but it still wouldn't work between VMs or to Wayland applications, so it's probably not worth it.
Having two separate connections to Xwayland creates a lot of unnecessary race conditions. A simple solution might be a Wayland extension that allows the Wayland server to say "please read N bytes from the X11 socket now", and likewise in the other direction. Then messages would always arrive in the order in which they were sent.
The code is all available at https://github.com/talex5/wayland-proxy-virtwl if you want to try it. It works with the applications I use when running under Sway, but will probably require some tweaking for other programs or compositors. Here's a screenshot of my desktop using it:
The windows with [dev]
in the title are from my Debian VM, while [com]
is a SpectrumOS VM I use for email, etc.
Gitk, GVim and ROX-Filer are X11 applications using Xwayland,
while Firefox and xfce4-terminal are using plain Wayland proxying.
This post gives my initial impressions of these tools and describes my current setup.
( this post also appeared on Hacker News and Lobsters )
QubesOS aims to provide "a reasonably secure operating system". It does this by running multiple virtual machines under the Xen hypervisor. Each VM's windows have a different colour and tag, but they appear together as a single desktop. The VMs I run include:
com
for email and similar (the only VM that sees my email password).
dev
for software development.
shopping
(the only VM that sees my card number).
personal
(with no Internet access)
untrusted
(general browsing)
The desktop environment itself is another Linux VM (dom0
), used for managing the other VMs.
Most of the VMs are running Fedora (the default for Qubes), although I run Debian in dev
.
There are also a couple of system VMs; one for dealing with the network hardware,
and one providing a firewall between the VMs.
You can run qvm-copy
in a VM to copy a file to another VM.
dom0
pops up a dialog box asking which VM should receive the file, and it arrives there
as ~/QubesIncoming/$source_vm/$file
.
You can also press Ctrl-Shift-C to copy a VM's clipboard to the global clipboard, and then
press Ctrl-Shift-V in a window of the target VM to copy to that VM's clipboard,
ready for pasting into an application.
I think Qubes does a very good job at providing a secure environment.
However, it has poor hardware compatibility and it feels sluggish, even on a powerful machine. I bought a new machine a while ago and found that the motherboard only provided a single video output, limited to 30Hz. This meant I had to buy a discrete graphics card. With the card enabled, the machine fails to resume from suspend, and locks up from time to time (it's completely stable with the card removed or disabled). I spent some time trying to understand the driver code, but I didn't know enough about graphics, the Linux kernel, PCI suspend, or Xen to fix it.
I was also having some other problems with QubesOS:
Anyway, I decided it was time to try something new. Linux now has its own built-in hypervisor (KVM), and I thought that would probably work better with my hardware. I was also keen to try out Wayland, which is built around shared-memory and I thought it might therefore work better with VMs. How easy would it be to recreate a Qubes-like environment directly on Linux?
I've been meaning to try NixOS properly for some time. Ever since I started using Linux, its package management has struck me as absurd. On Debian, Fedora, etc, installing a package means letting it put files wherever it likes; which effectively gives the package author root on your system. Not a good base for sandboxing!
Also, they make it difficult to try out 3rd-party software, or to test newer versions of just some packages.
In 2003 I created 0install to address these problems, and Nix has very similar goals. I thought Nix was a few years younger, but looking at its Git history the first commit was on Mar 12, 2003. I announced the first preview of 0install just two days later, so both projects must have started writing code within a few days of each other!
NixOS is made up of quite a few components. Here is what I've learned so far:
The store holds the files of all the programs, and is the central component of the system.
Each version of a package goes in its own directory (or file), at /nix/store/$HASH.
You can add data to the store directly, like this:
$ echo hello > file
$ nix-store --add-fixed sha256 file
/nix/store/1vap48aqggkk52ijn2prxzxv7cnzvs0w-file
$ cat /nix/store/1vap48aqggkk52ijn2prxzxv7cnzvs0w-file
hello
Here, the store location is calculated from the hash of the contents of the file we added (as with 0install store add or git hash-object).
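The idea of content-addressing can be sketched in a few lines of Python. This is a toy illustration only: real Nix hashes a serialisation of the whole store object and uses its own base-32 alphabet, so these paths won't match actual Nix store paths.

```python
import hashlib

def store_path(data: bytes, name: str) -> str:
    """Toy content-addressed store path: the location depends only on the
    data (not on who added it or when). Not Nix's real algorithm."""
    digest = hashlib.sha256(data).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}"

p1 = store_path(b"hello\n", "file")
p2 = store_path(b"hello\n", "file")
p3 = store_path(b"goodbye\n", "file")
assert p1 == p2   # same content always lands at the same path
assert p1 != p3   # different content gets a different path
```

Because the path is a pure function of the content, two users adding the same data get the same path, and nothing already in the store can ever be overwritten.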
However, you can also add things to the store by asking Nix to run a build script. For example, to compile some source code:
If a package in the store depends on another one (at build time or run time), it just refers to it by its full path. For example, a bash script in the store will start something like:
#! /nix/store/vnyfysaya7sblgdyvqjkrjbrb0cy11jf-bash-4.4-p23/bin/bash
...
If two users want to use the same build instructions, the second one will see that the hash already exists and can just reuse that. This allows users to compile software from source and share the resulting binaries, without having to trust each other.
Ideally, builds should be reproducible. To encourage this, builds which use the hash of the build instructions for the result path are built in a sandbox without network access. So, you can't submit a build job like "Download and compile whatever is the latest version of Vim". But you can discover the latest version yourself and then submit two separate jobs to the store:
You can run nix-collect-garbage to delete everything from the store that isn't reachable via the symlinks under /nix/var/nix/gcroots/. Users can put symlinks to things they care about keeping in /nix/var/nix/gcroots/per-user/$USER/.
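The garbage collector is essentially a reachability check over the dependency graph. Here's a toy sketch in Python (the store paths and the dependency map are made up; real Nix reads references from its database rather than a dict):

```python
def reachable(roots, deps):
    """Mark everything reachable from the GC roots via the dependency
    graph; anything else is garbage. (Illustrative only.)"""
    live, stack = set(), list(roots)
    while stack:
        path = stack.pop()
        if path not in live:
            live.add(path)
            stack.extend(deps.get(path, []))
    return live

deps = {  # hypothetical store paths and their references
    "/nix/store/aaa-bash": [],
    "/nix/store/bbb-script": ["/nix/store/aaa-bash"],
    "/nix/store/ccc-old-firefox": [],
}
live = reachable(["/nix/store/bbb-script"], deps)
garbage = set(deps) - live
assert garbage == {"/nix/store/ccc-old-firefox"}
```

The script keeps bash alive (it depends on it), while the old Firefox, with no root pointing at it, gets collected.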
By default, the store is also configured with a trusted binary cache service, and will try to download build results from there instead of compiling locally when possible.
Writing derivation files by hand is tedious, so Nix provides a templating language to create them easily.
The Nix language is dynamically typed and based around maps/dictionaries (which it confusingly refers to as "sets").
nix-instantiate file.nix will generate a derivation from file.nix and add it to the store.
A Nix file looks like this:
Running nix-instantiate on this will:
add myfile to the store.
write foo.drv to the store, including the full store path of myfile.
Writing Nix expressions for every package you want would also be tedious. The nixpkgs Git repository contains a Nix expression that evaluates to a set of derivations, one for each package in the distribution. It also contains a library of useful helper functions for packages (e.g. it knows how to handle GNU autoconf packages automatically).
Rather than evaluating the whole lot, you use -A to ask for a single package. For example, you can use nix-instantiate ./nixpkgs/default.nix -A firefox to generate a derivation for Firefox.
nix-build is a quick way to create a derivation with nix-instantiate and build it with nix-store. It will also create a ./result symlink pointing to its path in the store, as well as registering ./result with the garbage collector under /nix/var/nix/gcroots/auto/.
For example, to build and run Firefox:
nix-build ./nixpkgs/default.nix -A firefox
./result/bin/firefox
If you use nixpkgs without making any changes, it will be able to download a pre-built binary from the cache service.
Keeping track of all these symlinks would be tedious too, but you can collect them all together by making a package that depends on every application you want. Its build script will produce a bin directory full of symlinks to the applications. Then you could just point your $PATH variable at that bin directory in the store. To make updating easier, you will actually add ~/.nix-profile/bin/ to $PATH and update .nix-profile to point at the latest build of your environment package.
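A user environment is little more than a directory of symlinks. As a rough illustration (the store path here is made up), the build script amounts to something like:

```python
import os, tempfile

def make_env(bin_dir, apps):
    """Create a bin/ directory of symlinks into the store.
    The store paths are hypothetical, for illustration only."""
    os.makedirs(bin_dir, exist_ok=True)
    for name, target in apps.items():
        os.symlink(target, os.path.join(bin_dir, name))

profile = tempfile.mkdtemp()
make_env(os.path.join(profile, "bin"), {
    "firefox": "/nix/store/abc123-firefox/bin/firefox",  # made-up path
})
link = os.readlink(os.path.join(profile, "bin", "firefox"))
```

Switching profiles is then just repointing one symlink at a different directory of symlinks, which is why rollback is cheap.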
This is essentially what nix-env does, except with yet more symlinks to allow for switching between multiple profiles, and to allow rolling back to previous environments if something goes wrong. For example, to install Firefox so you can run it via $PATH:
nix-env -i firefox
Finally, just as nix-env can create a user environment with bin, man, etc, a similar process can create a root filesystem for a Linux distribution.
nixos-rebuild reads the /etc/nixos/configuration.nix configuration file, generates a system environment, and then updates grub and the /run/current-system symlink to point to it.
In fact, it also lists previous versions of the system environment in the grub file, so if you mess up the configuration you can just choose an earlier one from the boot menu to return to that version.
To install NixOS you boot one of the live images at https://nixos.org. Which you use only affects the installation UI, not the system you end up with.
The manual walks you through the installation process, showing how to partition the disk, format and mount the partitions, and how to edit the configuration file. I like this style of installation, where it teaches you things instead of just doing it for you. Most of the effort in switching to a new system is learning about it, so I'd rather spend 3 hours learning stuff following an installation guide than use a 15-minute single-click installer that teaches me nothing.
The configuration file (/etc/nixos/configuration.nix) is just another Nix expression. Most things are set to off by default (I approve), but can be changed easily. For example, if you want sound support you change that setting to sound.enable = true, and if you also want to use PulseAudio then you set hardware.pulseaudio.enable = true too. Every system service supported by NixOS is controlled from here, with all kinds of options, from programs.vim.defaultEditor = true (so you don't get trapped in nano) to services.factorio.autosave-interval. Use man configuration.nix to see the available settings.
NixOS defaults to an X11 desktop, but I wanted to try Wayland (and Sway). Based on the NixOS wiki instructions, I used this:
The xwayland bit is important; without that you can't run any X11 applications.
My only complaint with the NixOS installation instructions is that following them will leave you with an unencrypted system,
which isn't very useful.
When partitioning, you have to skip ahead to the LUKS section of the manual, which just gives some options but no firm advice.
I created two primary partitions: a 1G unencrypted /boot, and a LUKS partition for the rest of the disk. Then I created an LVM volume group from the /dev/mapper/crypted device and added the other partitions in that.
Once the partitions are mounted and the configuration file is complete, nixos-install downloads everything and configures grub.
Then you reboot into the new system.
Once running the new system you can make further edits to the configuration file there in the same way, and use nixos-rebuild switch to generate a new system. It seems to be pretty good at updating the running system to the new settings, so you don't normally need to reboot after making changes.
The big mistake I made was forgetting to add /boot to fstab. When I ran nixos-rebuild it put all the grub configuration on the encrypted partition, rendering the system unbootable. I fixed that with chattr +i /boot on the unmounted partition. That way, trying to rebuild with /boot unmounted will just give an error message.
I've been using the system for a few weeks now and I've had no problems with Nix so far. Nix has been fast and reliable and there were fairly up-to-date packages for everything I wanted (I'm using the stable release). There is a lot to learn, but plenty of documentation.
When I wanted a newer package (socat with vsock support, only just released) I just told Nix to install it from the latest Git checkout of nixpkgs.
Unlike on Debian and similar systems, doing this doesn't interfere with any other packages (such as forcing a system-wide upgrade of libc).
I think Nix does download more data than most other systems, but networks are fast enough now that it doesn't seem to matter. For example, let's say you're running Python 3.9.0 and you want to update to 3.9.1:
With Debian: apt-get upgrade downloads the new version, which gets unpacked over the old one.
As the files are unpacked, the system moves through an exciting series of intermediate states no-one has thought about.
Running programs may crash as they find their library versions changing under them (though it's usually OK).
Only root can update software.
With 0install: 0install update downloads the new version, unpacking it to a new directory.
Running programs continue to use the old version.
When a new program is started, 0install notices the update and runs the solver again.
If the program is compatible with the new Python then it uses that. If not, it continues with the old one.
You can run any previous version if there is a problem.
With Nix: nix-env -u downloads the new version, unpacking it to a new directory.
It also downloads (or rebuilds) every package depending on Python, creating new directories for each of them.
It then creates a new environment with symlinks to the latest version of everything.
Running programs continue to use the old version.
Starting a new program will use the new version.
You can revert the whole environment back to the previous version if there is a problem.
With Docker: docker pull downloads the new version of a single application, downloading most or all of the application's packages, whether Python related or not.
Existing containers continue running with the old version.
New containers will default to using the new version.
You can specify which version to use when starting a program.
Other applications continue using the old version of Python until their authors update them
(you must update each application individually, rather than just updating Python itself).
The main problem with NixOS is that it's quite different to other Linux systems, so there's a lot to relearn.
Also, existing knowledge about how to edit fstab, sudoers, etc, isn't so useful, as you have to provide all configuration in Nix syntax.
However, having a single (fairly sane) syntax for everything is a nice bonus, and being able to generate things using the templating language is useful.
For example, for my network setup I use a bunch of tap devices (one for each of my VMs). It was easy to write a little Nix function (mktap) to generate them all from a simple list. Here's that section of my configuration.nix:
Overall, I'm very happy with NixOS so far.
With NixOS I had a nice host environment, but after using Qubes I wanted to run my applications in VMs.
The basic problem is that Linux is the only thing that knows how to drive all the hardware, but Linux security is not ideal. There are several problems:
For example, imagine that we want to run a program with access to the network, but not to the graphical display. We can create a new Linux container for it using bubblewrap, like this:
$ ls -l /run/user/1000/wayland-0 /tmp/.X11-unix/X0
srwxr-xr-x 1 tal users 0 Feb 18 16:41 /run/user/1000/wayland-0
srwxr-xr-x 1 tal users 0 Feb 18 16:41 /tmp/.X11-unix/X0
$ bwrap \
--ro-bind / / \
--dev /dev \
--tmpfs /home/tal \
--tmpfs /run/user \
--tmpfs /tmp \
--unshare-all --share-net \
bash
$ ls -l /run/user/1000/wayland-0 /tmp/.X11-unix/X0
ls: cannot access '/run/user/1000/wayland-0': No such file or directory
ls: cannot access '/tmp/.X11-unix/X0': No such file or directory
The container has an empty home directory, an empty /tmp, and no access to the display sockets.
If we run Firefox in this environment then... it opens its window just fine!
How? strace shows what happened:
connect(4, {sa_family=AF_UNIX, sun_path="/run/user/1000/wayland-0"}, 27) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path=@"/tmp/.X11-unix/X0"}, 20) = 0
After failing to connect to Wayland, it then tried using X11 (via Xwayland) instead. Why did that work?
If the first byte of the socket pathname is \0 then Linux instead interprets it as an "abstract" socket address, not subject to the usual filesystem permission rules.
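This is easy to demonstrate in Python (Linux-only). Abstract sockets live in the kernel's network namespace, not the filesystem, so the bwrap sandbox above (which kept the network namespace via --share-net) can't hide them by masking /tmp:

```python
import socket, threading

ADDR = "\0demo-abstract-socket"  # leading NUL byte = abstract namespace

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(ADDR)   # creates no filesystem entry anywhere
server.listen(1)

def serve():
    conn, _ = server.accept()
    conn.sendall(b"hi")
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(ADDR)  # works even if /tmp is replaced with a tmpfs
greeting = client.recv(2)
t.join()
```

This is exactly what Xwayland's X0 socket did: the path-based socket was hidden, but the @-prefixed abstract one was still reachable.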
Trying to anticipate these kinds of special cases is just too much work. Linux really wants everything on by default, and you have to find and disable every feature individually. By contrast, virtual machines tend to have integrations with the host off by default. They also tend to have much smaller APIs (e.g. just reading and writing disk blocks or network frames), with the rich Unix API entirely inside the VM, provided by a separate instance of Linux.
I was able to set up a qemu guest and restore my dev Qubes VM in that, but it didn't integrate nicely with the rest of the desktop. Installing ssh allowed me to connect in with ssh -Y dev, allowing apps in the VM to open an X connection to Xwayland on the host.
That was somewhat usable, but still a bit slower than Qubes had been (which was already a bit too slow).
Searching for a way to forward the Wayland connection directly, I came across the SpectrumOS project. SpectrumOS aims to use one virtual machine per application, using shared directories so that VM files are stored on the host, simplifying management. It uses crosvm from the ChromiumOS project instead of qemu, because it has a driver that allows forwarding Wayland connections (and also because it's written in Rust rather than C). The project's single developer is currently taking a break from the project, and says "I'm currently working towards a proof of concept".
However, there is some useful stuff in the SpectrumOS repository (which is a fork of nixpkgs). In particular, it contains:
the virtwl kernel module, which connects to crosvm's Wayland driver.
a build of sommelier, crosvm's Wayland proxy, which runs over virtwl.
Building that, I was able to run the project's demo, which runs the Wayfire compositor inside the VM, appearing in a window on the host. Dragging the nested window around, the pixels flowed smoothly across my screen in exactly the way that pixels on QubesOS don't.
This was encouraging, but I didn't want to run a nested window manager. I tried running Firefox directly (without Wayfire), but it complained that sommelier didn't provide a new enough version of something, and running weston-terminal immediately segfaulted sommelier.
Why do we need the sommelier process anyway?
The problem is that, while virtwl mostly proxies Wayland messages directly, it can't send arbitrary FDs to the host. For example, if you want to forward a writable stream from an application to virtwl you must first create a pipe from the host using a special virtwl ioctl, then read from that and copy the data to the application's regular Linux pipe.
With help from the mailing list, I managed to get it somewhat usable:
I enabled VIRTIO_FS, allowing me to mount a host directory into the VM (for sharing files).
Setting FONTCONFIG_FILE got some usable fonts (otherwise, there was no monospace font for the terminal).
I replaced the bash shell with fish so I could edit commands.
I added (while true; do socat vsock-listen:5000 exec:dash; done) at the end of the VM's boot script. Then I could start e.g. the VM's Firefox with echo 'firefox&' | socat stdin vsock-connect:7:5000 on the host, allowing me to add launchers for guest applications.
Making changes to the root filesystem was fairly easy once I'd read the Nix manuals.
To add an application (e.g. libreoffice), you import it at the start of rootfs/default.nix and add it to the path variable. The Nix expression gets the transitive dependencies of path from the Nix store and packs them into a squashfs image.
True, my squashfs image is getting a bit big.
Maybe I should instead make a minimal squashfs boot image, plus a shared directory of hard links to the required files.
That would allow sharing the data with the host.
I could also just share the whole /nix/store directory, if I wanted to make all host software available to guests.
I made another Nix script to add various VM boot commands to my host environment.
For example, running qvm-start-shopping boots my shopping VM using crosvm, with the appropriate LVM data partition, network settings, and shared host directory.
I think, ideally, this would be a systemd socket-activated user service rather than a shell script. Then attempting to run Firefox by sending a command to the VM socket would cause systemd to boot the VM (if not already running). For now, I boot each VM manually in a terminal and then press Win-Shift-2 to banish it to workspace 2, with all the other VM root consoles.
The virtwl Wayland forwarding feels pretty fast (much faster than Qubes' X graphics).
I now had a mostly functional Qubes-like environment, running most of my applications in VMs, with their windows appearing on the host desktop like any other application. However, I also had some problems:
X11 apps can be scaled globally (by setting Xft.dpi: 150 with xrdb), but Wayland apps must be configured individually.
I decided it was time to learn more about Wayland. I discovered wayland-book.com, which does a good job of introducing it (though the book is only half finished at the moment).
One very nice feature of Wayland is that you can run any Wayland application with WAYLAND_DEBUG=1 and it will display a fairly readable trace of all the Wayland messages it sends and receives.
Let's look at a simple application that just connects to the server (compositor) and opens a window:
$ WAYLAND_DEBUG=1 test.exe
-> wl_display@1.get_registry registry:+2
-> wl_display@1.sync callback:+3
The client connects to the server's socket at /run/user/1000/wayland-0 and sends two messages to object 1 (of type wl_display), which is the only object available in a new connection. The get_registry request asks the server to add the registry to the conversation and call it object 2. The sync request just asks the server to confirm it got it, using a new callback object (with ID 3).
Both clients and servers can add objects to the conversation. To avoid numbering conflicts, clients assign low numbers and servers pick high ones.
On the wire, each message gives the object ID, the operation ID, the length in bytes, and then the arguments. Objects are thought of as being at the server, so the client sends request messages to objects, while the server emits event messages from objects. At the wire level there's no difference though.
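The framing is simple enough to show in a few lines of Python. This sketch packs the wl_display.get_registry request from the trace above (opcode 1 on object 1, carrying one new-id argument) in host byte order, which is little-endian on typical machines:

```python
import struct

def wayland_header(object_id: int, opcode: int, payload: bytes) -> bytes:
    """Wayland wire framing: a 32-bit object ID, then a 32-bit word with
    the total message size in bytes (including this 8-byte header) in the
    upper 16 bits and the opcode in the lower 16 bits."""
    size = 8 + len(payload)
    return struct.pack("<II", object_id, (size << 16) | opcode) + payload

# wl_display.get_registry: one argument, the new ID (2) for the registry.
msg = wayland_header(1, 1, struct.pack("<I", 2))

obj, word = struct.unpack("<II", msg[:8])
size, opcode = word >> 16, word & 0xFFFF
```

Note that the header tells you how many bytes to skip, but nothing about file descriptors, which matters later when we try to proxy the protocol.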
When the server gets the get_registry request it adds the registry, which immediately emits one event for each available service, giving the maximum supported version. The client receives these messages, followed by the callback notification from the sync message:
<- wl_registry@2.global name:0 interface:"wl_compositor" version:4
<- wl_registry@2.global name:1 interface:"wl_subcompositor" version:1
<- wl_registry@2.global name:2 interface:"wl_shm" version:1
<- wl_registry@2.global name:3 interface:"xdg_wm_base" version:1
<- wl_registry@2.global name:4 interface:"wl_output" version:2
<- wl_registry@2.global name:5 interface:"wl_data_device_manager" version:3
<- wl_registry@2.global name:6 interface:"zxdg_output_manager_v1" version:3
<- wl_registry@2.global name:7 interface:"gtk_primary_selection_device_manager" version:1
<- wl_registry@2.global name:8 interface:"wl_seat" version:5
<- wl_callback@3.done callback_data:1129040
The callback tells the client it has seen all the available services, and so it now picks the ones it wants.
It has to choose a version no higher than the one offered by the server.
Protocols starting with wl_ are from the core Wayland protocol; the others are extensions. The leading z in zxdg_output_manager_v1 indicates that the protocol is "unstable" (under development).
The protocols are defined in various XML files, which are scattered over the web. The core protocol is defined in wayland.xml. These XML files can be used to generate typed bindings for your programming language of choice.
Here, the application picks wl_compositor (for managing drawing surfaces), wl_shm (for sharing memory with the server), and xdg_wm_base (for desktop windows).
-> wl_registry@2.bind name:0 id:+4(wl_compositor:v4)
-> wl_registry@2.bind name:2 id:+5(wl_shm:v1)
-> wl_registry@2.bind name:3 id:+6(xdg_wm_base:v1)
The bind message is unusual in that the client gives the interface and version of the object it is creating. For other messages, both sides know the type from the schema, and the version is always the same as the parent object. Because the client chose the new IDs, it doesn't need to wait for the server; it continues by using the new objects to create a top-level window:
-> wl_compositor@4.create_surface id:+7
-> xdg_wm_base@6.get_xdg_surface id:+8 surface:7
-> xdg_surface@8.get_toplevel id:+9
-> xdg_toplevel@9.set_title title:"example app"
-> wl_surface@7.commit
This API is pretty strange. The core Wayland protocol says how to make generic drawing surfaces, but not how to make windows, so the application is using the xdg_wm_base extension to do that. Logically, there's only one object here (a toplevel window), but it ends up making three separate Wayland objects representing the different aspects of it. The commit tells the server that the client has finished setting up the window and the server should now do something with it.
The above was all in response to the callback firing. The client now processes the last message in that batch, which is the server destroying the callback:
<- wl_display@1.delete_id id:3
Object destruction is a bit strange in Wayland. Normally, clients ask for things to be destroyed (by sending a "destructor" message) and the server confirms by sending delete_id from object 1. But this isn't symmetrical: there is no standard way for a client to confirm deletion when the server calls a destructor (such as the callback's done), so these have to be handled on a case-by-case basis. Since callbacks don't accept any messages, there is no need for the client to confirm that it got the done message, and the server just sends a delete message immediately.
The client now waits for the server to respond to all the messages it sent about the new window, and gets a bunch of replies:
<- wl_shm@5.format format:0
<- wl_shm@5.format format:1
<- wl_shm@5.format format:875709016
<- wl_shm@5.format format:875708993
<- xdg_wm_base@6.ping serial:1129043
-> xdg_wm_base@6.pong serial:1129043
<- xdg_toplevel@9.configure width:0 height:0 states:""
<- xdg_surface@8.configure serial:1129042
-> xdg_surface@8.ack_configure serial:1129042
It gets some messages telling it what pixel formats are supported, a ping message (which the server sends from time to time to check the client is still alive), and a configure message giving the size for the new window. Oddly, Sway has set the size to 0x0, which means the client should choose whatever size it likes.
The client picks a suitable default size, allocates some shared memory (by opening a tmpfs file and immediately unlinking it), shares the file descriptor with the server (create_pool), and then carves out a portion of the memory to use as a buffer for the pixel data:
-> wl_shm@5.create_pool id:+3 fd:(fd) size:1228800
-> wl_shm_pool@3.create_buffer id:+10 offset:0 width:640 height:480 stride:2560 format:1
-> wl_shm_pool@3.destroy
In this case it used the whole memory region. It could also have allocated two buffers for double-buffering.
The client then draws whatever it wants into the buffer (mapping the file into its memory and writing to it directly), attaches the buffer to the window's surface, marks the whole area as "damaged" (in need of being redrawn) and calls commit, telling the server the surface is ready for display:
-> wl_surface@7.attach buffer:10 x:0 y:0
-> wl_surface@7.damage x:0 y:0 width:2147483647 height:2147483647
-> wl_surface@7.commit
At this point the window appears on the screen! The server lets the client know it has finished with the buffer and the client destroys it:
<- wl_display@1.delete_id id:3
<- wl_buffer@10.release
-> wl_buffer@10.destroy
Although the window is visible, the content is the wrong size. Sway now suddenly remembers that it's a tiling window manager. It sends another configure event with the correct size, causing the client to allocate a fresh memory pool of the correct size, allocate a fresh buffer from it, redraw everything at the new size, and tell the server to draw it.
<- xdg_toplevel@9.configure width:1534 height:1029 states:""
...
This process of telling the client to pick a size and then overruling it explains why Firefox draws itself incorrectly at first and then flickers into position a moment later. It probably also explains why Vim tries to open a 0x0 window.
A bit of searching revealed that the ^M problem is a known Sway bug. However, the main reason copying text wasn't working turned out to be a limitation in the design of the core wl_data_device_manager protocol.
The normal way to copy text on X11 is to select the text you want to copy,
then click the middle mouse button where you want it (or press Shift-Insert).
X also supports a clipboard mechanism, where you select text, then press Ctrl-C, then click at the destination, then press Ctrl-V. The original Wayland protocol only supports the clipboard system, not the selection, and so Wayland compositors have added selection support through extensions. Sommelier didn't proxy these extensions, leading to failure when copying in or out of VMs.
I also found that the reason weston-terminal wouldn't start was because I didn't have anything in my clipboard, and sommelier was trying to dereference a null pointer.
One problem with the Wayland protocol is that it's very hard to proxy. Although the wire protocol gives the length in bytes of each message, it doesn't say how many file descriptors it has. This means that you can't just pass through messages you don't understand, because you don't know which FDs go with which message. Also, the wire protocol doesn't give types for FDs (nor does the schema), which is a problem for anything that needs to proxy across a VM boundary or over a network.
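The FDs travel out-of-band as SCM_RIGHTS ancillary data, as this Python sketch over a socketpair shows. Nothing in the byte stream records which message an FD belongs to, which is why a proxy must know every message's signature before it can forward anything:

```python
import array, os, socket

left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()

# Send a 3-byte "message" with one FD attached as ancillary data.
left.sendmsg([b"msg"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                         array.array("i", [r]))])

# The receiver gets the bytes and, separately, a duplicated FD.
data, ancdata, flags, addr = right.recvmsg(3, socket.CMSG_SPACE(4))
received = array.array("i")
for level, ctype, cdata in ancdata:
    if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
        received.frombytes(cdata[:4])

os.write(w, b"ok")
echoed = os.read(received[0], 2)  # the kernel really duplicated the pipe FD
```

The message bytes and the FD arrive together, but the association between them exists only in the protocol schema, not on the wire.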
This all meant that VMs could only use protocols explicitly supported by sommelier, and sommelier limited the version too. Which means that supporting extra extensions or new versions means writing (and debugging) loads of C++ code.
I didn't have time to write and debug C++ code for every missing Wayland protocol, so I took a short-cut: I wrote my own Wayland library, ocaml-wayland, and then used that to write my own version of sommelier. With that, adding support for copying text was fairly easy.
For each Wayland interface we need to handle each incoming message from the client and forward it to the host, and also forward each message from the host to the client. Here's the code to handle the "selection" event in OCaml, which we receive from the host and send to the client (c):
The host passes us an "offer" argument, which is a previously-created host offer object. We look up the corresponding client object with to_client and pass that as the argument to the client.
For comparison, here's sommelier's equivalent to this line of code, in C++:
I think this is a great demonstration of the difference between "type safety" and "type ceremony". The C++ code is covered in types, making the code very hard to read, yet it crashes at runtime because it fails to consider that data_offer can be NULL. By contrast, the OCaml version has no type annotations, but the compiler would reject it if I forgot to handle this case (with Option.map).
According to the GNOME wiki, the original justification for not supporting selection copies was "security concerns with unexpected data stealing if the mere act of selecting a text fragment makes it available to all running applications". The implication is that applications stealing data instead from the clipboard is OK, and that you should therefore never put anything confidential on the clipboard.
This seemed a bit odd, so I read the security section of the Wayland specification to learn more about its security model. That section of the specification is fairly short, so I'll reproduce it here in full:
Security and Authentication
- mostly about access to underlying buffers, need new drm auth mechanism (the grant-to ioctl idea), need to check the cmd stream?
- getting the server socket depends on the compositor type, could be a system wide name, through fd passing on the session dbus. or the client is forked by the compositor and the fd is already opened.
It looks like implementations have to figure things out for themselves.
The main advantage of Wayland over X11 here is that Wayland mostly isolates applications from each other. In X11 applications collaborate together to manage a tree of windows, and any application can access any window. In the Wayland protocol, each application's connection only includes that application's objects. Applications only get events relevant to their own windows (for example, you only get pointer motion events while the pointer is over your window). Communication between applications (e.g. copy-and-paste or drag-and-drop) is all handled though the compositor.
Also, to request the contents of the clipboard you need to quote the serial number of the mouse click or key press that triggered it. If it's too far in the past, the compositor can ignore the request.
I've also heard people say that security is the reason you can't take screenshots with Wayland. However, Sway lets you take screenshots, and this worked even from inside a VM through virtwl. I didn't add screenshot support to the proxy, because I don't want VMs to be able to take screenshots, but the proxy isn't a security tool (it runs inside the VM, which isn't trusted).
Clearly, the way to fix this was with a new compositor. One that would offer a different Wayland socket to each VM, tag the windows with the VM name, colour the frames, confirm copies across VM boundaries, and work with Vim. Luckily, I already had a handy pure-OCaml Wayland protocol library available. Unluckily, at this point I ran out of holiday.
There are quite a few things left to do here:
One problem with virtwl is that, while we can receive shared memory FDs from the host, we can't export guest memory to the host. This is unfortunate, because in Wayland the shared memory for window contents is allocated by the application from guest memory, and the proxy therefore has to copy each frame. If the host provided the memory to the guest, this wouldn't be needed. There is a wl_drm protocol for allocating video memory, which might help here, but I don't know how that works and, like many Wayland specifications, it seems to be in the process of being replaced by something else.
Also, if we're going to copy the memory, we should at least only copy the damaged region, not the whole thing.
I only got this code working just far enough to run the Wayland applications I use (mainly Firefox and Evince).
I'm still using ssh to proxy X11 connections (mainly for Vim and gitk). I'd prefer to run Xwayland in the VM, but it seems you need to provide a bit of extra support for that, which I haven't implemented yet. Sommelier can do this, but then copying doesn't work.
The host Wayland compositor needs to be aware of VMs, so it can colour the titles appropriately and limit access to privileged operations.
For the full Qubes experience, the network card should be handled by a VM, with another VM managing the firewall. Perhaps the Mirage unikernel firewall could be made to work on KVM too. I'm not sure how guest-to-guest communication works with KVM.
However, because the host NixOS environment is a fully-working Linux system, I can always trade off some security to get things working (e.g. by doing video conferencing directly on the host).
I hope the SpectrumOS project will resume at some point, or that Qubes will find a solution to its hardware compatibility and performance problems.
Table of Contents
( this post also appeared on Reddit and Lobsters )
I was asked to build a system for creating CI/CD pipelines. The initial use for it was to build a CI for testing OCaml projects on GitHub (testing each commit against multiple versions of the OCaml compiler and on multiple operating systems). Here's a simple pipeline that gets the Git commit at the head of a branch, builds it, and then runs the tests:
The colour-scheme here is that green boxes are completed, orange ones are in progress and grey means the step can't be started yet.
Here's a slightly more complex example, which also downloads a Docker base image, builds the commit in parallel using two different versions of the OCaml compiler, and then tests the resulting images. Here the red box indicates that this step failed:
A more complex example is testing the project itself and then searching for other projects that depend on it and testing those against the new version too:
Here, the circle means that we should wait for the tests to pass before checking the reverse dependencies.
We could describe these pipelines using YAML or similar, but that would be very limiting. Instead, I decided to use an Embedded Domain Specific Language, so that we can use the host language's features for free (e.g. string manipulation, variables, functions, imports, type-checking, etc).
The most obvious approach is making each box a regular function. Then the first example above could be (here, using OCaml syntax):
let example1 commit =
  let src = fetch commit in
  let image = build src in
  test image
The second could be:
let example2 commit =
  let src = fetch commit in
  let base = docker_pull "ocaml/opam2" in
  let build ocaml_version =
    let dockerfile = make_dockerfile ~base ~ocaml_version in
    build ~dockerfile src in
  let image1 = build "4.07" in
  let image2 = build "4.08" in
  test image1;
  test image2
And the third might look something like this:
let example3 commit =
  let src = fetch commit in
  let image = build src in
  let ok = test image in
  let revdeps = get_revdeps src in
  revdeps |> List.iter (test_revdep ~gate:ok)
However, we'd like to add some extras to the language. For example, we'd like to run steps concurrently: as written, the example2 function above would do the builds one at a time.
The exact extras don't matter too much to this blog post, so for simplicity I'll focus on just running steps concurrently.
Without the extra features, we have functions like this:
val fetch : commit -> source
val build : source -> image
You can read this as "build
is a function that takes a source
value and returns a (Docker) image
".
These functions compose together easily to make a larger function that will fetch a commit and build it:
let fetch_and_build c =
  let src = fetch c in
  build src
We could also shorten this to build (fetch c)
or to fetch c |> build
.
The |>
(pipe) operator in OCaml just calls the function on its right with the argument on its left.
To extend these functions to be concurrent, we can make them return promises, e.g.
val fetch : commit -> source promise
val build : source -> image promise
But now we can't compose them easily using let
(or |>
), because the output type of fetch
doesn't match the input of build
.
However, we can define a similar operation, let*
(or >>=
) that works with promises. It immediately returns a promise for the final
result, and calls the body of the let*
later, when the first promise is fulfilled. Then we have:
let fetch_and_build c =
  let* src = fetch c in
  build src
In other words, by sprinkling a few *
characters around we can turn our plain old pipeline into a new concurrent one!
The rules for when you can compose promise-returning functions using let*
are exactly the same as the rules about when
you can compose regular functions using let
, so writing programs using promises is just as easy as writing regular programs.
Just using let*
doesn't add any concurrency within our pipeline
(it just allows it to execute concurrently with other code).
But we can define extra functions for that, such as all to evaluate every promise in a list at once, or an and* operator to indicate that two things should run in parallel.
As well as handling promises,
we could also define a let*
for functions that might return errors (the body of the let is called only if the first value
is successful), or for live updates (the body is called each time the input changes), or for all of these things together.
This is the basic idea of a monad.
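To make the idea concrete, here is a rough Python sketch of such a promise type (all the names here are invented for illustration, not taken from any real library). The bind method plays the role of let*: it immediately returns a promise for the final result, and runs the body once the input promise is fulfilled.

```python
class Promise:
    """A toy promise that can be pending or fulfilled (illustration only)."""
    def __init__(self):
        self.fulfilled = False
        self.value = None
        self.callbacks = []

    @staticmethod
    def resolved(value):
        p = Promise()
        p.fulfil(value)
        return p

    def fulfil(self, value):
        self.value, self.fulfilled = value, True
        for cb in self.callbacks:
            cb(value)

    def on_fulfil(self, cb):
        # Run cb now if already fulfilled, otherwise remember it for later
        cb(self.value) if self.fulfilled else self.callbacks.append(cb)

    def bind(self, f):
        # Like let*: return a promise for the final result immediately,
        # and call f with the value once this promise is fulfilled.
        result = Promise()
        self.on_fulfil(lambda v: f(v).on_fulfil(result.fulfil))
        return result

# Hypothetical pipeline stages, standing in for the OCaml fetch/build:
def fetch(commit):
    return Promise.resolved(f"src-of-{commit}")

def build(src):
    return Promise.resolved(f"image-of-{src}")

def fetch_and_build(commit):
    return fetch(commit).bind(build)
```

The composition rules are the same as for plain function calls; only the plumbing through bind changes.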
This actually works pretty well. In 2016, I used this approach to make DataKitCI, which was used initially as the CI system for Docker-for-Mac. Later, Anil Madhavapeddy used it to create opam-repo-ci, which is the CI system for opam-repository, OCaml's main package repository. This checks each new PR to see what packages it adds or modifies, tests each one against multiple OCaml compiler versions and Linux distributions (Debian, Ubuntu, Alpine, CentOS, Fedora and OpenSUSE), and then finds all versions of all packages depending on the changed packages and tests those too.
The main problem with using a monad is that we can't statically analyse the pipeline.
Consider the example2
function above. Until we have queried GitHub to get a commit to
test, we cannot run the function and therefore have no idea what it will do.
Once we have commit
we can call example2 commit
,
but until the fetch
and docker_pull
operations complete we cannot evaluate the body of the let*
to find out what the pipeline will do next.
In other words, we can only draw diagrams showing the bits of the pipeline that have already
executed or are currently executing, and we must indicate opportunities for concurrency
manually using and*
.
An arrow makes it possible to analyse pipelines statically. Instead of our monadic functions:
val fetch : commit -> source promise
val build : source -> image promise
we can define an arrow type:
type ('a, 'b) arrow

val fetch : (commit, source) arrow
val build : (source, image) arrow
An ('a, 'b) arrow
is a pipeline that takes an input of type 'a
and produces a result of type 'b
.
If we define type ('a, 'b) arrow = 'a -> 'b promise
then this is the same as the monadic version.
However, we can instead make the arrow
type abstract and extend it to store whatever static information we require.
For example, we could label the arrows:
type ('a, 'b) arrow = {
  f : 'a -> 'b promise;
  label : string;
}
Here, arrow
is a record. f
is the old monadic function and label
is the "static analysis".
Users can't see the internals of the arrow
type, and must build up pipelines using functions provided by the arrow implementation.
There are three basic functions available:
val arr : ('a -> 'b) -> ('a, 'b) arrow
val ( >>> ) : ('a, 'b) arrow -> ('b, 'c) arrow -> ('a, 'c) arrow
val first : ('a, 'b) arrow -> ('a * 'c, 'b * 'c) arrow
arr
takes a pure function and gives the equivalent arrow.
For our promise example, that means the arrow returns a promise that is already fulfilled.
>>>
joins two arrows together.
first
takes an arrow from 'a
to 'b
and makes it work on pairs instead.
The first element of the pair will be processed by the given arrow and
the second component is returned unchanged.
We can have these operations automatically create new arrows with appropriate f
and label
fields.
For example, in a >>> b
, the resulting label field could be the string {a.label} >>> {b.label}
.
This means that we can display the pipeline without having to run it first,
and we could easily replace label
with something more structured if needed.
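As a sketch of how such labelled arrows compose, here is a hypothetical Python version (the Arrow class, arr, first and the stage names are all inventions for illustration; Python's >> stands in for >>>):

```python
class Arrow:
    """An arrow as a record of a function plus a static label (sketch)."""
    def __init__(self, f, label):
        self.f = f
        self.label = label

    def __rshift__(self, other):
        # Compose two arrows; the label is built without running either function
        return Arrow(lambda x: other.f(self.f(x)),
                     f"{self.label} >>> {other.label}")

def arr(f, label="arr"):
    # Lift a pure function into an arrow
    return Arrow(f, label)

def first(a):
    # Run the arrow on the first element of a pair, pass the second through
    return Arrow(lambda p: (a.f(p[0]), p[1]), f"first {a.label}")

# Hypothetical pipeline stages:
fetch = Arrow(lambda commit: f"src-{commit}", "fetch")
build = Arrow(lambda src: f"img-{src}", "build")

pipeline = fetch >> build
```

The key point is that pipeline.label is available without ever calling pipeline.f, which is what lets us draw the diagram before running anything.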
With this our first example changes from:
let example1 commit =
  let* src = fetch commit in
  let* image = build src in
  test image
to
let example1 =
  fetch >>> build >>> test
That seems quite pleasant, although we did have to give up our variable names.
But things start to get complicated with larger examples. For example2, we need to define a few standard combinators for routing values around.
Then, example2 has to be rewritten using these combinators.
We've lost most of the variable names and instead have to use tuples, remembering where our values are.
It's not too bad here with two values,
but it gets very difficult very quickly as more are added and we start nesting tuples.
We also lost the ability to use an optional labelled argument in build ~dockerfile src
and instead need to use a new operation that takes a tuple of the dockerfile and the source.
Imagine that running the tests now requires getting the test cases from the source code.
In the original code, we'd just change test image
to test image ~using:src
.
In the arrow version, we need to duplicate the source before the build step,
run the build with first build_with_dockerfile
,
and make sure the arguments are the right way around for a new test_using
.
I started wondering whether there might be an easier way to achieve the same static analysis that you get with arrows,
but without the point-free syntax, and it seems that there is. Consider the monadic version of example1
.
We had:
val fetch : commit -> source promise
val build : source -> image promise

let example1 commit =
  let* src = fetch commit in
  let* image = build src in
  test image
If you didn't know about monads, there is another way you might try to do this.
Instead of using let*
to wait for the fetch
to complete and then calling build
with
the source, you might define build
and test
to take promises as inputs:
val build : source promise -> image promise
val test : image promise -> unit promise
After all, fetching gives you a source promise
and you want an image promise
, so this seems very natural.
We could even have example1
take a promise of the commit.
Then it looks like this:
let example1 commit =
  let src = fetch commit in
  let image = build src in
  test image
That's good, because it's identical to the simple version we started with. The problem is that it is inefficient:
We call example1 with the promise of the commit (we don't know what it is yet).
It calls fetch, getting back a promise of some source.
It calls build on that, getting a promise of an image.
It calls test on that, getting a promise of the results.
We return the final promise of the test results immediately, but we haven't done any real work yet. Instead, we've built up a long chain of promises, wasting memory.
However, in this situation a static analysis is just what we want: we want to build up in memory some data structure representing the pipeline... and this is exactly what our "inefficient" use of the monad produces!
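Here is a minimal Python sketch of that idea (all names invented for illustration): each "promise" is a node that records which stage produces it and from which inputs, so ordinary-looking code builds a graph we can inspect statically.

```python
class Node:
    """A wrapped value: records which stage produces it and its inputs."""
    def __init__(self, label, deps):
        self.label = label
        self.deps = deps

def primitive(label):
    # A pipeline stage: takes wrapped inputs and returns a wrapped output,
    # recording an edge in the graph instead of doing any real work.
    def stage(*inputs):
        return Node(label, list(inputs))
    return stage

# Hypothetical stages:
fetch = primitive("fetch")
build = primitive("build")
test = primitive("test")

def example1(commit):
    # Looks like ordinary sequential code, but only builds the graph
    src = fetch(commit)
    image = build(src)
    return test(image)

def stages(node):
    # Walk the statically-known structure, inputs first
    for d in node.deps:
        yield from stages(d)
    yield node.label

pipeline = example1(Node("commit", []))
# list(stages(pipeline)) → ["commit", "fetch", "build", "test"]
```

Nothing has executed, yet we can already enumerate (and therefore draw) every stage of the pipeline.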
To make this useful, we need the primitive operations (such as fetch
)
to provide some information (e.g. labels) for the static analysis.
OCaml's let
syntax doesn't provide an obvious place for a label,
but I was able to define an operator (let**
) that returns a function taking a label argument.
It can be used to build primitive operations like this:
let fetch commit =
  "fetch" |>
  let** commit = commit in
  do_fetch commit
So, fetch
takes a promise of a commit, does a monadic bind on it to wait for the actual commit and then proceeds as before,
but it labels the bind as a fetch
operation.
If fetch
took multiple arguments, it could use and*
to wait for all of them in parallel.
In theory, the body of the let**
in fetch
could contain further binds.
In that case, we wouldn't be able to analyse the whole pipeline at the start.
But as long as the primitives wait for all their inputs at the start and don't do any binds internally,
we can discover the whole pipeline statically.
We can choose whether to expose these bind operations to application code or not.
If let*
(or let**
) is exposed, then applications get to use all the expressive power of monads,
but there will be points where we cannot show the whole pipeline until some promise resolves.
If we hide them, then applications can only make static pipelines.
My approach so far has been to use let*
as an escape hatch, so that any required pipeline can be built,
but I later replace any uses of it by more specialised operations. For example, I added:
val list_map : ('a promise -> 'b promise) -> 'a list promise -> 'b list promise
This processes each item in a list that isn't known until runtime.
However, we can still know statically what pipeline we will apply to each item,
even though we don't know what the items themselves are.
list_map
could have been implemented using let*
, but then we wouldn't be able to see the pipeline statically.
Here are the other two examples, using the dart approach:
let example2 commit =
  let src = fetch commit in
  let base = docker_pull "ocaml/opam2" in
  let build ocaml_version =
    let dockerfile =
      let+ base = base in
      make_dockerfile ~base ~ocaml_version
    in
    build ~dockerfile src
  in
  let image1 = build "4.07" in
  let image2 = build "4.08" in
  all [test image1; test image2]
Compared to the original, we have an all
to combine the results, and there's an extra let+ base = base
when calculating the dockerfile.
let+
is just another syntax for map
, used here because I chose not to change the signature of make_dockerfile
.
Alternatively, we could have make_dockerfile
take a promise of the base image and do the map inside it instead.
Because map
takes a pure body (make_dockerfile
just generates a string; there are no promises or errors) it doesn't need its own box
on the diagrams and we don't lose anything by allowing its use.
let example3 commit =
  let src = fetch commit in
  let image = build src in
  let ok = test image in
  let revdeps = get_revdeps src in
  let revdeps = gate revdeps ~on:ok in
  list_iter test_revdep revdeps
This shows another custom operation: gate revdeps ~on:ok
is a promise that only resolves once both revdeps
and ok
have resolved.
This prevents it from testing the library's revdeps until the library's own tests have passed,
even though it could do this in parallel if we wanted it to.
Whereas with a monad we have to enable concurrency explicitly where we want it (using and*
),
with a dart we have to disable concurrency explicitly where we don't want it (using gate
).
I also added a list_iter
convenience function,
and gave it a pretty-printer argument so that we can label the cases in the diagrams once the list inputs are known.
Finally, although I said that you can't use let*
inside a primitive,
you can still use some other monad (that doesn't generate diagrams).
In fact, in the real system I used a separate let>
operator for primitives.
That expects a body using non-diagram-generating promises provided by the underlying promise library,
so you can't use let*
(or let>
) inside the body of a primitive.
Given a "dart" you can create an arrow interface from it easily by defining e.g.
type ('a, 'b) arrow = 'a promise -> 'b promise
Then arr
is just map
and f >>> g
is just fun x -> g (f x)
. first
can be defined easily too, assuming you have some kind
of function for doing two things in parallel (like our and*
above).
So a dart API (even with let*
hidden) is still enough to express any pipeline you can express using an arrow API.
The Haskell Arrow tutorial uses an example where an arrow is a stateful function.
For example, there is a total
arrow that returns the sum of its input and every previous input it has been called with.
e.g. calling it three times with inputs 1 2 3
produces outputs 1 3 6
.
Running a pipeline on a sequence of inputs returns the sequence of outputs.
The tutorial uses total
to define a mean1
function like this:
mean1 = (total &&& (const 1 ^>> total)) >>> arr (uncurry (/))
So this pipeline duplicates each input number,
replaces the second one with 1
,
totals both streams, and then
replaces each pair with its ratio.
Each time you put another number into the pipeline, you get out the average of all values input so far.
The equivalent code using the dart style would be (OCaml uses /.
for floating-point division):
let mean values =
  let t = total values in
  let n = total (map (fun _ -> 1.0) values) in
  map (fun (t, n) -> t /. n) (pair t n)
That seems more readable to me.
We can simplify the code slightly by defining the standard operators let+
(for map
) and and+
(for pair
):
let ( let+ ) x f = map f x
let ( and+ ) a b = pair a b

let mean values =
  let+ t = total values
  and+ n = total (map (fun _ -> 1.0) values) in
  t /. n
This is not a great example of an arrow anyway, because we don't use the output of one stateful function as the input to another, so this is actually just a plain applicative.
We could easily extend the example pipeline with another stateful function though,
perhaps by adding some smoothing.
That would look like mean1 >>> smooth
in the arrow notation,
and values |> mean |> smooth
(or smooth (mean values)
) in the dart notation.
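The behaviour of total and mean can be sketched in Python by treating a whole input sequence as the wrapped value (a rough analogue for illustration, not the Haskell or OCaml code; the smooth stage is an invented example of adding smoothing):

```python
def total(values):
    # Stateful sum: output i is the sum of inputs 0..i
    out, acc = [], 0.0
    for v in values:
        acc += v
        out.append(acc)
    return out

def mean(values):
    # Running mean: ratio of the running total to the running count
    sums = total(values)
    counts = total([1.0 for _ in values])
    return [s / n for s, n in zip(sums, counts)]

def smooth(values, alpha=0.5):
    # A possible extra stage: simple exponential smoothing
    out, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out

# total([1, 2, 3]) → [1.0, 3.0, 6.0]; mean([1, 2, 3]) → [1.0, 1.5, 2.0]
```

Composing stages is just nesting calls, e.g. smooth(mean(values)), matching the dart notation above.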
Note: Haskell does also have an Arrows
syntax extension, which allows the Haskell code to be written as:
1 2 3 4 |
|
That's more similar to the dart notation.
I've put up a library using a slightly extended version of these ideas at ocurrent/ocurrent.
The lib_term
subdirectory is the part relevant to this blog post, with the various combinators described in TERM.
The other directories handle more concrete details, such as integration with the Lwt promise library,
and providing the admin web UI or the Cap'n Proto RPC interface, as well as plugins with primitives
for using Git, GitHub, Docker and Slack.
ocurrent/docker-base-images contains a pipeline that builds Docker base images for OCaml for various Linux distributions, CPU architectures, OCaml compiler versions and configuration options. For example, to test OCaml 4.09 on Debian 10, you can do:
$ docker run --rm -it ocurrent/opam:debian-10-ocaml-4.09
:~$ ocamlopt --version
4.09.0
:~$ opam depext -i utop
[...]
:~$ utop
----+-------------------------------------------------------------+------------------
| Welcome to utop version 2.4.2 (using OCaml version 4.09.0)! |
+-------------------------------------------------------------+
Type #utop_help for help about using utop.
-( 11:50:06 )-< command 0 >-------------------------------------------{ counter: 0 }-
utop #
Here's what the pipeline looks like:
It pulls the latest Git commit of opam-repository each week, then builds base images containing that and the opam package manager for each distribution version, then builds one image for each supported compiler variant. Many of the images are built on multiple architectures (amd64
, arm32
, arm64
and ppc64
) and pushed to a staging area on Docker Hub. Then, the pipeline combines all the hashes to push a multi-arch manifest to Docker Hub. There are also some aliases (e.g. debian
means debian-10-ocaml-4.09
at the moment). Finally, if there is any problem then the pipeline sends the error to a Slack channel.
You might wonder whether we really need a pipeline for this, rather than a simple script run from a cron-job. But having a pipeline allows us to see what the pipeline will do before running it, watch the pipeline's progress, restart failed jobs individually, etc, with almost the same code we would have written anyway.
You can read pipeline.ml if you want to see the full pipeline.
ocurrent/ocaml-ci is an (experimental) GitHub app for testing OCaml projects. The pipeline gets the list of installations of the app, gets the configured repositories for each installation, gets the branches and PRs for each repository, and then tests the head of each one against multiple Linux distributions and OCaml compiler versions. If the project uses ocamlformat, it also checks that the commit is formatted exactly as ocamlformat would do it.
The results are pushed back to GitHub as the commit status, and also recorded in a local index for the web and tty UIs. There's quite a lot of red here mainly because if a project doesn't support a particular version of OCaml then the build is marked as failed and shows up as red in the pipeline, although these failures are filtered out when making the GitHub status report. We probably need a new colour for skipped stages.
It's convenient to write CI/CD pipelines as if they were single-shot scripts that run the steps once, in series, and always succeed, and then with only minor changes have the pipeline run the steps whenever the input changes, in parallel, with logging, error reporting, cancellation and rebuild support.
Using a monad allows any program to be converted easily to have these features, but, as with a regular program, we don't know what the program will do with some data until we run it. In particular, we can only automatically generate diagrams showing steps that have already started.
The traditional way to do static analysis is to use an arrow. This is a little more limited than a monad, because the structure of the pipeline can't change depending on the input data, although we can add limited flexibility such as optional steps or a choice between two branches. However, writing pipelines using arrow notation is difficult because we have to program in a point-free style (without variables).
We can get the same benefits of static analysis by using a monad in an unusual way, here referred to as a "dart". Instead of functions that take plain values and return wrapped values, our functions both take and return wrapped values. This results in a syntax that looks identical to plain programming, but allows static analysis (at the cost of not being able to manipulate the wrapped values directly).
If we hide (or don't use) the monad's let*
(bind) function then the pipelines we
create can always be determined statically. If we use a bind, then there will be holes
in the pipeline that may expand to more pipeline stages as the pipeline runs.
Primitive steps can be created by using a single "labelled bind", where the label provides the static analysis for the atomic component.
I haven't seen this pattern used before (or mentioned in the arrow documentation), and it seems to provide exactly the same benefits as arrows with much less difficulty. If this has a proper name, let me know!
This work was funded by OCaml Labs.
Table of Contents
( this post also appeared on Reddit, Hacker News and Lobsters )
I run QubesOS on my laptop. A QubesOS desktop environment is made up of multiple virtual machines. A privileged VM, called dom0, provides the desktop environment and coordinates the other VMs. dom0 doesn't have network access, so you have to use other VMs for doing actual work. For example, I use one VM for email and another for development work (these are called "application VMs"). There is another VM (called sys-net) that connects to the physical network, and yet another VM (sys-firewall) that connects the application VMs to sys-net.
The default sys-firewall is based on Fedora Linux. A few years ago, I replaced sys-firewall with a MirageOS unikernel. MirageOS is written in OCaml, and has very little C code (unlike Linux). It boots much faster and uses much less RAM than the Fedora-based VM. But recently, a user reported that restarting mirage-firewall was taking a very long time. The problem seemed to be that it was taking several minutes to transfer the information about the network configuration to the firewall. This is sent over vchan. The user reported that stracing the QubesDB process in dom0 revealed that it was sleeping for 10 seconds between sending the records, suggesting that a wakeup event was missing.
The lead developer of QubesOS said:
I'd guess missing evtchn trigger after reading/writing data in vchan.
Perhaps ocaml-vchan, the OCaml implementation of vchan, wasn't implementing the vchan specification correctly? I wanted to check, but there was a problem: there was no vchan specification.
The Xen wiki lists vchan under Xen Document Days/TODO. The initial Git commit on 2011-10-06 said:
libvchan: interdomain communications library
This library implements a bidirectional communication interface between applications in different domains, similar to unix sockets. Data can be sent using the byte-oriented
libvchan_read
/libvchan_write
or the packet-orientedlibvchan_recv
/libvchan_send
.Channel setup is done using a client-server model; domain IDs and a port number must be negotiated prior to initialization. The server allocates memory for the shared pages and determines the sizes of the communication rings (which may span multiple pages, although the default places rings and control within a single page).
With properly sized rings, testing has shown that this interface provides speed comparable to pipes within a single Linux domain; it is significantly faster than network-based communication.
I looked in the xen-devel mailing list around this period in case the reviewers had asked about how it worked.
One reviewer suggested:
Please could you say a few words about the functionality this new library enables and perhaps the design etc? In particular a protocol spec would be useful for anyone who wanted to reimplement for another guest OS etc. [...] I think it would be appropriate to add protocol.txt at the same time as checking in the library.
However, the submitter pointed out that this was unnecessary, saying:
The comments in the shared header file explain the layout of the shared memory regions; any other parts of the protocol are application-defined.
Now, ordinarily, I wouldn't be much interested in spending my free time tracking down race conditions in 3rd-party libraries for the benefit of strangers on the Internet. However, I did want to have another play with TLA...
TLA+ is a language for specifying algorithms. It can be used for many things, but it is particularly designed for stateful parallel algorithms.
I learned about TLA while working at Docker. Docker EE provides software for managing large clusters of machines. It includes various orchestrators (SwarmKit, Kubernetes and Swarm Classic) and a web UI. Ensuring that everything works properly is very important, and to this end a large collection of tests had been produced. Part of my job was to run these tests. You take a test from a list in a web UI and click whatever buttons it tells you to click, wait for some period of time, and then check that what you see matches what the test says you should see. There were a lot of these tests, and they all had to be repeated on every supported platform, and for every release, release candidate or preview release. There was a lot of waiting involved and not much thinking required, so to keep my mind occupied, I started reading the TLA documentation.
I read The TLA+ Hyperbook and Specifying Systems. Both are by Leslie Lamport (the creator of TLA), and are freely available online. They're both very easy to read. The hyperbook introduces the tools right away so you can start playing, while Specifying Systems starts with more theory and discusses the tools later. I think it's worth reading both.
Once Docker EE 2.0 was released, we engineers were allowed to spend a week on whatever fun (Docker-related) project we wanted. I used the time to read the SwarmKit design documents and make a TLA model of that. I felt that using TLA prompted useful discussions with the SwarmKit developers (which can be seen in the pull request comments).
A specification document can answer many different kinds of question about a system. You don't have to answer all of them to have a useful document, but I will try to address each kind for vchan.
In my (limited) experience with TLA, whenever I have reached the end of a specification (whether reading it or writing it), I always find myself thinking "Well, that was obvious. It hardly seems worth writing a spec for that!". You might feel the same after reading this blog post.
To judge whether TLA is useful, I suggest you take a few minutes to look at the code. If you are good at reading C code then you might find, like the Xen reviewers, that it is quite obvious what it does, how it works, and why it is correct. Or, like me, you might find you'd prefer a little help. You might want to jot down some notes about it now, to see whether you learn anything new.
The public/io/libxenvchan.h header file gives the big picture: it defines the shared structures, along with comments explaining how they are used.
You might also like to look at the vchan source code.
Note that the libxenvchan.h
file in this directory includes and extends
the above header file (with the same name).
For this blog post, we will ignore the Xen-specific business of sharing the memory and telling the client where it is, and assume that the client has mapped the memory and is ready to go.
We'll take a first look at TLA concepts and notation using a simplified version of vchan. TLA comes with excellent documentation, so I won't try to make this a full tutorial, but hopefully you will be able to follow the rest of this blog post after reading it. We will just consider a single direction of the channel (e.g. client-to-server) here.
A variable in TLA is just what a programmer expects: something that changes over time.
For example, I'll use Buffer
to represent the data currently being transmitted.
We can also add variables that are just useful for the specification.
I use Sent
to represent everything the sender-side application asked the vchan library to transmit,
and Got
for everything the receiving application has received:
VARIABLES Got, Buffer, Sent
A state in TLA represents a snapshot of the world at some point.
It gives a value for each variable.
For example, { Got: "H", Buffer: "i", Sent: "Hi", ... }
is a state.
The ...
is just a reminder that a state also includes everything else in the world,
not just the variables we care about.
Here are some more states:
| State | Got | Buffer | Sent |
|-------|-----|--------|------|
| s0    |     |        |      |
| s1    |     | H      | H    |
| s2    | H   |        | H    |
| s3    | H   | i      | Hi   |
| s4    | Hi  |        | Hi   |
| s5    | iH  |        | Hi   |
A behaviour is a sequence of states, representing some possible history of the world.
For example, << s0, s1, s2, s3, s4 >>
is a behaviour.
So is << s0, s1, s5 >>
, but not one we want.
The basic idea in TLA is to specify precisely which behaviours we want and which we don't want.
A state expression is an expression that can be evaluated in the context of some state.
For example, this defines Integrity
to be a state expression that is true whenever what we have got
so far matches what we wanted to send:
Take(s, i) == SubSeq(s, 1, i)
Drop(s, i) == SubSeq(s, i + 1, Len(s))

Integrity ==
  /\ Take(Sent, Len(Got)) = Got
  /\ Drop(Sent, Len(Got)) = Buffer
Integrity
is true for all the states above except for s5
.
I added some helper operators Take
and Drop
here.
Sequences in TLA+ can be confusing because they are indexed from 1 rather than from 0,
so it is easy to make off-by-one errors.
These operators just use lengths, which we can all agree on.
In Python syntax, it would be written something like:
def integrity():
    return Sent[:len(Got)] == Got and Sent[len(Got):] == Buffer
A temporal formula is an expression that is evaluated in the context of a complete behaviour. It can use the temporal operators, which include:
[]
(that's supposed to look like a square) : "always"
<>
(that's supposed to look like a diamond) : "eventually"
[] F
is true if the expression F
is true at every point in the behaviour.
<> F
is true if the expression F
is true at any point in the behaviour.
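Over a finite behaviour (a list of states, each a dict of variables), these operators can be sketched in Python like this (an approximation for illustration only: real TLA behaviours are infinite):

```python
def always(behaviour, pred):
    # [] P : P holds in every state of the behaviour
    return all(pred(s) for s in behaviour)

def eventually(behaviour, pred):
    # <> P : P holds in at least one state
    return any(pred(s) for s in behaviour)

# The first few states from the table above, as dicts:
s0 = {"Got": "", "Buffer": "", "Sent": ""}
s1 = {"Got": "", "Buffer": "H", "Sent": "H"}
s2 = {"Got": "H", "Buffer": "", "Sent": "H"}
behaviour = [s0, s1, s2]
```

For example, always holds for "Got is a prefix of Sent" on this behaviour, and eventually holds for "Got is H".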
Messages we send should eventually arrive. Here's one way to express that:
Availability ==
  \A x \in Nat :
    [](Len(Sent) = x => <>(Len(Got) >= x))
TLA syntax is a bit odd. It's rather like LaTeX (which is not surprising: Lamport is also the "La" in LaTeX).
\A
means "for all" (rendered as an upside-down A).
So this says that for every number x
, it is always true that if we have sent x
bytes then
eventually we will have received at least x
bytes.
This pattern of [] (F => <>G)
is common enough that it has a shorter notation of F ~> G
, which
is read as "F (always) leads to G". So, Availability
can also be written as:
Availability ==
  \A x \in Nat :
    Len(Sent) = x ~> Len(Got) >= x
We're only checking the lengths in Availability
, but combined with Integrity
that's enough to ensure
that we eventually receive what we want.
So ideally, we'd like to ensure that every possible behaviour of the vchan library will satisfy
the temporal formula Properties
:
Properties ==
  []Integrity /\ Availability
That /\
is "and" by the way, and \/
is "or".
I did eventually start to be able to tell one from the other, though I still think &&
and ||
would be easier.
In case I forget to explain some syntax, A Summary of TLA lists most of it.
It is hopefully easy to see that Properties
defines properties we want.
A user of vchan would be happy to see that these are things they can rely on.
But they don't provide much help to someone trying to implement vchan.
For that, TLA provides another way to specify behaviours.
An action in TLA is an expression that is evaluated in the context of a pair of states, representing a single atomic step of the system. For example:
Read ==
  /\ Buffer /= << >>
  /\ Got' = Got \o Buffer
  /\ Buffer' = << >>
  /\ UNCHANGED Sent
The Read
action is true of a step if that step transfers all the data from Buffer
to Got
.
Unprimed variables (e.g. Buffer
) refer to the current state and primed ones (e.g. Buffer'
)
refer to the next state.
There's some more strange notation here too: I'm using /\ to form a bulleted list here, rather than as an infix operator.
This is indentation-sensitive. TLA also supports \/
lists in the same way.
\o
is sequence concatenation (+
in Python).
<< >>
is the empty sequence ([ ]
in Python).
UNCHANGED Sent
means Sent' = Sent
.
In Python, it might look like this:
    def read():
        global Got, Buffer
        assert Buffer != []
        Got = Got + Buffer
        Buffer = []
Actions correspond more closely to code than temporal formulas, because they only talk about how the next state is related to the current one.
This action only allows one thing: reading the whole buffer at once. In the C implementation of vchan the receiving application can provide a buffer of any size and the library will read at most enough bytes to fill the buffer. To model that, we will need a slightly more flexible version:
    Read ==
      \E n \in 1..Len(Buffer) :
        /\ Got' = Got \o Take(Buffer, n)
        /\ Buffer' = Drop(Buffer, n)
        /\ UNCHANGED Sent
This says that a step is a Read
step if there is any n
(in the range 1 to the length of the buffer)
such that we transferred n
bytes from the buffer. \E
means "there exists ...".
A Write
action can be defined in a similar way:
    CONSTANT BufferSize

    Write ==
      \E m \in Seq(Byte) \ {<< >>} :
        /\ Len(Buffer) + Len(m) <= BufferSize
        /\ Buffer' = Buffer \o m
        /\ Sent' = Sent \o m
        /\ UNCHANGED Got
A CONSTANT
defines a parameter (input) of the specification
(it's constant in the sense that it doesn't change between states).
A Write
operation adds some message m
to the buffer, and also adds a copy of it to Sent
so we can talk about what the system is doing.
Seq(Byte)
is the set of all possible sequences of bytes,
and \ {<< >>}
just excludes the empty sequence.
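As a sanity check, the Read and Write actions can be mirrored in Python. This is a sketch of the model, not the real vchan code; BUFFER_SIZE stands in for the BufferSize constant:

```python
# A Python analogue of the model's transitions (illustrative only).

BUFFER_SIZE = 2
state = {"Got": [], "Buffer": [], "Sent": []}  # the initial state

def write(m):
    """Write action: only enabled when the whole message fits."""
    assert m != [], "empty messages are excluded"
    assert len(state["Buffer"]) + len(m) <= BUFFER_SIZE, "not enabled"
    state["Buffer"] = state["Buffer"] + m
    state["Sent"] = state["Sent"] + m  # record what was sent

def read(n):
    """Read action: transfer the first n bytes from Buffer to Got."""
    assert 1 <= n <= len(state["Buffer"]), "not enabled"
    state["Got"] = state["Got"] + state["Buffer"][:n]  # Take(Buffer, n)
    state["Buffer"] = state["Buffer"][n:]              # Drop(Buffer, n)

write([1, 2])
read(1)
assert state == {"Got": [1], "Buffer": [2], "Sent": [1, 2]}
read(1)
assert state["Got"] == state["Sent"] == [1, 2]  # Integrity, and all delivered
```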
A step of the combined system is either a Read
step or a Write
step:
    Next ==
      Read \/ Write
We also need to define what a valid starting state for the algorithm looks like:
    Init ==
      /\ Got = << >>
      /\ Buffer = << >>
      /\ Sent = << >>
Finally, we can put all this together to get a temporal formula for the algorithm:
    vars == << Got, Buffer, Sent >>

    Spec ==
      Init /\ [][Next]_vars
Some more notation here:
- [Next]_vars (that's Next in brackets with a subscript vars) means Next \/ UNCHANGED vars.
- Init (a state expression) in a temporal formula means it must be true for the first state of the behaviour.
- [][Action]_vars means that [Action]_vars must be true for each step.

TLA syntax requires the _vars subscript here. This is because other things can be going on in the world besides our algorithm, so it must always be possible to take a step without our algorithm doing anything.
Spec
defines behaviours just like Properties
does,
but in a way that makes it more obvious how to implement the protocol.
Now we have definitions of Spec
and Properties
,
it makes sense to check that every behaviour of Spec
satisfies Properties
.
In Python terms, we want to check that all behaviours b
satisfy this:
    # Either b isn't a behaviour of Spec, or it satisfies Properties:
    not Spec(b) or Properties(b)
i.e. either b
isn't a behaviour that could result from the actions of our algorithm or,
if it is, it satisfies Properties
. In TLA notation, we write this as:
    THEOREM
      Spec => Properties
It's OK if a behaviour is allowed by Properties
but not by Spec
.
For example, the behaviour which goes straight from Got="", Sent=""
to
Got="Hi", Sent="Hi"
in one step meets our requirements, but it's not a
behaviour of Spec
.
The real implementation may itself further restrict Spec
.
For example, consider the behaviour << s0, s1, s2 >>
:
| State | Got | Buffer | Sent |
|-------|-----|--------|------|
| s0    |     | Hi     | Hi   |
| s1    | H   | i      | Hi   |
| s2    | Hi  |        | Hi   |
The sender sends two bytes at once, but the reader reads them one at a time. This is a behaviour of the C implementation, because the reading application can ask the library to read into a 1-byte buffer. However, it is not a behaviour of the OCaml implementation, which gets to choose how much data to return to the application and will return both bytes together.
That's fine.
We just need to show that OCamlImpl => Spec
and Spec => Properties
and we can deduce that
OCamlImpl => Properties
.
This is, of course, the key purpose of a specification:
we only need to check that each implementation implements the specification,
not that each implementation directly provides the desired properties.
It might seem strange that an implementation doesn't have to allow all the specified behaviours.
In fact, even the trivial specification Spec == FALSE
is considered to be a correct implementation of Properties
,
because it has no bad behaviours (no behaviours at all).
But that's OK.
Once the algorithm is running, it must have some behaviour, even if that behaviour is to do nothing.
As the user of the library, you are responsible for checking that you can use it
(e.g. by ensuring that the Init
conditions are met).
An algorithm without any behaviours corresponds to a library you could never use,
not to one that goes wrong once it is running.
Now comes the fun part: we can ask TLC (the TLA model checker) to check that Spec => Properties
.
You do this by asking the toolbox to create a new model (I called mine SpecOK
) and setting Spec
as the
"behaviour spec". It will prompt for a value for BufferSize
. I used 2
.
There will be various things to fix up:

- To evaluate Write, TLC first tries to construct every possible Seq(Byte), which is an infinite set. I defined MSG == Seq(Byte) and changed Write to use MSG. I then added an alternative definition for MSG in the model so that we only send messages of limited length. In fact, my replacement MSG ensures that Sent will always just be an incrementing sequence (<< 1, 2, 3, ... >>). That's enough to check Properties, and much quicker than checking every possible message.
- The set of reachable states is still infinite, because Sent can keep growing, so I added a state constraint to the model: Len(Sent) < 4. This tells TLC to stop considering any execution once this becomes false.
With that, the model runs successfully. This is a nice feature of TLA: instead of changing our specification to make it testable, we keep the specification correct and just override some aspects of it in the model. So, the specification says we can send any message, but the model only checks a few of them.
Now we can add Integrity
as an invariant to check.
That passes, but it's good to double-check by changing the algorithm.
I changed Read
so that it doesn't clear the buffer, using Buffer' = Drop(Buffer, 0)
(with 0
instead of n
).
Then TLC reports a counter-example ("Invariant Integrity is violated"):

1. The sender writes << 1, 2 >> to Buffer.
2. The reader reads one byte, giving Got=1, Buffer=12, Sent=12.
3. Because the buggy Read doesn't clear the buffer, the reader reads the same byte again, giving Got=11, Buffer=12, Sent=12.
Looks like it really was checking what we wanted.
It's good to be careful. If we'd accidentally added Integrity
as a "property" to check rather than
as an "invariant" then it would have interpreted it as a temporal formula and reported success just because
it is true in the initial state.
One really nice feature of TLC is that (unlike a fuzz tester) it does a breadth-first search and therefore
finds minimal counter-examples for invariants.
The example above is therefore the quickest way to violate Integrity
.
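That breadth-first search can be sketched in miniature. This is a hypothetical toy checker, not TLC: states are (Got, Buffer, Sent) tuples, the message stream is a fixed incrementing sequence, and Read has the bug just described (it copies bytes but never clears the buffer):

```python
# A toy breadth-first state-space search in the spirit of TLC.
from collections import deque

BUFFER_SIZE = 2

def next_states(s):
    got, buf, sent = s
    # Buggy Read: take n bytes but leave Buffer unchanged (Drop(Buffer, 0)).
    for n in range(1, len(buf) + 1):
        yield (got + buf[:n], buf, sent)
    # Write: append the next byte of the incrementing message stream.
    if len(sent) < 3 and len(buf) < BUFFER_SIZE:
        nxt = (len(sent) + 1,)
        yield (got, buf + nxt, sent + nxt)

def integrity(s):
    got, _, sent = s
    return sent[:len(got)] == got  # Got must be a prefix of Sent

def find_violation():
    init = ((), (), ())
    seen, todo = {init}, deque([[init]])
    while todo:
        path = todo.popleft()
        for s in next_states(path[-1]):
            if not integrity(s):
                return path + [s]  # breadth-first: first hit is minimal
            if s not in seen:
                seen.add(s)
                todo.append(path + [s])

trace = find_violation()
assert len(trace) == 4         # init, write, read, read again
assert trace[-1][0] == (1, 1)  # Got = <<1, 1>> although only <<1>> was sent
```

Because the queue is explored level by level, the first violation found is guaranteed to be a shortest one.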
Checking Availability
complains because of the use of Nat
(we're asking it to check for every possible
length).
I replaced the Nat
with AvailabilityNat
and overrode that to be 0..4
in the model.
It then complains "Temporal properties were violated" and shows an example where the sender wrote
some data and the reader never read it.
The problem is, [Next]_vars
always allows us to do nothing.
To fix this, we can specify a "weak fairness" constraint.
WF_vars(action) says that we can't just stop forever with action being always possible but never happening.
I updated Spec
to require the Read
action to be fair:
    Spec == Init /\ [][Next]_vars /\ WF_vars(Read)
Again, care is needed here.
If we had specified WF_vars(Next)
then we would be forcing the sender to keep sending forever, which users of vchan are not required to do.
Worse, this would mean that every possible behaviour of the system would result in Sent
growing forever.
Every behaviour would therefore hit our Len(Sent) < 4
constraint and
TLC wouldn't consider it further.
That means that TLC would never check any actual behaviour against Availability
,
and its reports of success would be meaningless!
Changing Read
to require n \in 2..Len(Buffer)
is a quick way to see that TLC is actually checking Availability
.
Here's the complete spec so far: vchan1.pdf (source)
The simple Spec
algorithm above has some limitations.
One obvious simplification is that Buffer
is just the sequence of bytes in transit, whereas in the real system it is a ring buffer, made up of an array of bytes along with the producer and consumer counters.
We could replace it with three separate variables to make that explicit.
However, ring buffers in Xen are well understood and I don't feel that it would make the specification any clearer
to include that.
A more serious problem is that Spec
assumes that there is a way to perform the Read
and Write
operations atomically.
Otherwise the real system would have behaviours not covered by the spec.
To implement the above Spec
correctly, you'd need some kind of lock.
The real vchan protocol is more complicated than Spec
, but avoids the need for a lock.
The real system has more shared state than just Buffer
.
I added extra variables to the spec for each item of shared state in the C code, along with its initial value:
- SenderLive = TRUE (sender sets to FALSE to close connection)
- ReceiverLive = TRUE (receiver sets to FALSE to close connection)
- NotifyWrite = TRUE (receiver wants to be notified of next write)
- DataReadyInt = FALSE (sender has signalled receiver over event channel)
- NotifyRead = FALSE (sender wants to be notified of next read)
- SpaceAvailableInt = FALSE (receiver has notified sender over event channel)
DataReadyInt
represents the state of the receiver's event port.
The sender can make a Xen hypercall to set this and wake (or interrupt) the receiver.
I guess sending these events is somewhat slow,
because the NotifyWrite
system is used to avoid sending events unnecessarily.
Likewise, SpaceAvailableInt
is the sender's event port.
Here is my understanding of the protocol. On the sending side:

1. We check the amount of free space in the buffer.
2. If there isn't enough, we set NotifyRead so the receiver will notify us when there is more, and check again.
3. We write the data to the buffer.
4. If the NotifyWrite flag is set, we clear it and notify the receiver of the write.

On the receiving side:

1. We check the amount of data available in the buffer.
2. If there isn't any, we set NotifyWrite so the sender will notify us when there is, and check again.
3. We read the data from the buffer.
4. If the NotifyRead flag is set, we clear it and notify the sender of the new space.

Either side can close the connection by clearing their "live" flag and signalling the other side. I assumed there is also some process-local way that the close operation can notify its own side if it's currently blocked.
To make expressing this kind of step-by-step algorithm easier, TLA+ provides a programming-language-like syntax called PlusCal. It then translates PlusCal into TLA actions.
Confusingly, there are two different syntaxes for PlusCal: Pascal style and C style. This means that, when you search for examples on the web, there is a 50% chance they won't work because they're using the other flavour. I started with the Pascal one because that was the first example I found, but switched to C-style later because it was more compact.
Here is my attempt at describing the sender algorithm above in PlusCal:
[Code listing: the PlusCal sender process]
The labels (e.g. sender_request_notify:
) represent points in the program where other actions can happen.
Everything between two labels is considered to be atomic.
I checked that every block of code between labels accesses only one shared variable.
This means that the real system can't see any states that we don't consider.
The toolbox doesn't provide any help with this; you just have to check manually.
The sender_ready
label represents a state where the client application hasn't yet decided to send any data.
Its label is tagged with -
to indicate that fairness doesn't apply here, because the protocol doesn't
require applications to keep sending more data forever.
The other steps are fair, because once we've decided to send something we should keep going.
Taking a step from sender_ready
to sender_write
corresponds to the vchan library's write function
being called with some argument m
.
The with (m \in MSG)
says that m
could be any message from the set MSG
.
TLA also contains a CHOOSE
operator that looks like it might do the same thing, but it doesn't.
When you use with
, you are saying that TLC should check all possible messages.
When you use CHOOSE
, you are saying that it doesn't matter which message TLC tries (and it will always try the
same one).
Or, in terms of the specification, a CHOOSE
would say that applications can only ever send one particular message, without telling you what that message is.
In sender_write_data
, we set free := 0
for no obvious reason.
This is just to reduce the number of states that the model checker needs to explore,
since we don't care about its value after this point.
Some of the code is a little awkward because I had to put things in else
branches that would more naturally go after the whole if
block, but the translator wouldn't let me do that.
The use of semi-colons is also a bit confusing: the PlusCal-to-TLA translator requires them after a closing brace in some places, but the PDF generator messes up the indentation if you include them.
Here's how the code block starting at sender_request_notify
gets translated into a TLA action:
[Code listing: the generated TLA action for sender_request_notify]
pc
is a mapping from process ID to the label where that process is currently executing.
So sender_request_notify
can only be performed when the SenderWriteID process is
at the sender_request_notify
label.
Afterwards pc[SenderWriteID]
will either be at sender_write_data
or sender_recheck_len
(if there wasn't enough space for the whole message).
Here's the code for the receiver:
[Code listing: the PlusCal receiver process]
It's quite similar to before.
recv_ready
corresponds to a state where the application hasn't yet called read
.
When it does, we take n
(the maximum number of bytes to read) as an argument and
store it in the local variable want
.
Note: you can use the C library in blocking or non-blocking mode.
In blocking mode, a write
(or read
) waits until data is sent (or received).
In non-blocking mode, it returns a special code to the application indicating that it needs to wait.
The application then does the waiting itself and then calls the library again.
I think the specification above covers both cases, depending on whether you think of
sender_blocked
and recv_await_data
as representing code inside or outside of the library.
We also need a way to close the channel. It wasn't clear to me, from looking at the C headers, when exactly you're allowed to do that. I think that if you had a multi-threaded program and you called the close function while the write function was blocked, it would unblock and return. But if you happened to call it at the wrong time, it would try to use a closed file descriptor and fail (or read from the wrong one). So I guess it's single threaded, and you should use the non-blocking mode if you want to cancel things.
That means that the sender can close only when it is at sender_ready
or sender_blocked
,
and similarly for the receiver.
The situation with the OCaml code is the same, because it is cooperatively threaded and so the close
operation can only be called while blocked or idle.
However, I decided to make the specification more general and allow for closing at any point
by modelling closing as separate processes:
[Code listing: the PlusCal processes for closing each end of the connection]
Again, the processes are "fair" because once we start closing we should finish, but the initial labels are tagged with "-" to disable fairness there: it's OK if you keep a vchan open forever.
There's a slight naming problem here. The PlusCal translator names the actions it generates after the starting state of the action. So sender_open is the action that moves from the sender_open label. That is, the sender_open action actually closes the connection!
Finally, we share the event channel with the buffer going in the other direction, so we might get notifications that are nothing to do with us. To ensure we handle that, I added another process that can send events at any time:
    process (SpuriousInterrupts = SpuriousID)
    {
    spurious:  while (TRUE) {
                 either SpaceAvailableInt := TRUE;
                 or     DataReadyInt := TRUE;
               }
    }
either/or
says that we need to consider both possibilities.
This process isn't marked fair, because we can't rely on these interrupts coming.
But we do have to handle them when they happen.
PlusCal code is written in a specially-formatted comment block, and you have to press Ctrl-T to generate (or update) the TLA translation before running the model checker.
Be aware that the TLA Toolbox is a bit unreliable about keyboard short-cuts. While typing into the editor always works, short-cuts such as Ctrl-S (save) sometimes get disconnected. So you think you're doing "edit/save/translate/save/check" cycles, but really you're just checking some old version over and over again. You can avoid this by always running the model checker with the keyboard shortcut too, since that always seems to fail at the same time as the others. Focussing a different part of the GUI and then clicking back in the editor again fixes everything for a while.
Anyway, running our model on the new spec shows that Integrity
is still OK.
However, the Availability
check fails with the following counter-example:
1. The sender writes << 1 >> to Buffer.
2. Both sides close the connection before the receiver reads the byte, so Len(Got) never reaches 1.
We need to update Availability
to consider the effects of closing connections.
And at this point, I'm very unsure what vchan is intended to do.
We could say:
    Availability ==
      \A x \in Nat :
        Len(Sent) = x ~>
          \/ Len(Got) >= x
          \/ ~SenderLive \/ ~ReceiverLive
That passes. But vchan describes itself as being like a Unix socket. If you write to a Unix socket and then close it, you still expect the data to be delivered. So actually I tried this:
    Availability ==
      \A x \in Nat :
        (Len(Sent) = x /\ SenderLive /\ pc[SenderWriteID] = "sender_ready") ~>
          \/ Len(Got) >= x
          \/ ~ReceiverLive
This says that if a sender write operation completes successfully (we're back at sender_ready
)
and at that point the sender hasn't closed the connection, then the receiver will eventually receive
the data (or close its end).
That is how I would expect it to behave. But TLC reports that the new spec does not satisfy this, giving this example (simplified - there are 16 steps in total):
1. The receiver starts a read and finds that Buffer is empty.
2. The sender writes some data to Buffer and returns to sender_ready.
3. The sender closes the connection.
4. The receiver checks SenderLive, sees that the connection is closed, and gives up without re-checking the buffer.
Is this a bug? Without a specification, it's impossible to say.
Maybe vchan was never intended to ensure delivery once the sender has closed its end.
But this case only happens if you're very unlucky about the scheduling.
If the receiving application calls read
when the sender has closed the connection but there is data
available then the C code does return the data in that case.
It's only if the sender happens to close the connection just after the receiver has checked the buffer and just before it checks the close flag that this happens.
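The unlucky interleaving can be written out as a tiny Python sketch. This is illustrative only; the names are hypothetical, not the C API:

```python
# The race in miniature: the receiver checks the buffer, then the live
# flag, and never re-checks the buffer in between.

state = {"Buffer": [], "SenderLive": True, "Got": []}

# 1. Receiver checks the buffer: it looks empty.
saw_data = len(state["Buffer"]) > 0

# 2. Sender writes a byte and closes the connection in between:
state["Buffer"].append(1)
state["SenderLive"] = False

# 3. Receiver now checks the live flag and gives up:
gave_up = (not saw_data) and (not state["SenderLive"])

assert gave_up
assert state["Buffer"] == [1] and state["Got"] == []  # the byte is lost
```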
It's also easy to fix. I changed the code in the receiver to do a final check on the buffer before giving up:
[Code listing: the receiver's final re-check of the buffer before giving up]
With that change, we can be sure that data sent while the connection is open will always be delivered (provided only that the receiver doesn't close the connection itself). If you spotted this issue yourself while you were reviewing the code earlier, then well done!
Note that when TLC finds a problem with a temporal property (such as Availability
),
it does not necessarily find the shortest example first.
I changed the limit on Sent
to Len(Sent) < 2
and added an action constraint of ~SpuriousInterrupts
to get a simpler example, with only 1 byte being sent and no spurious interrupts.
I noticed a couple of other odd things, which I thought I'd mention.
First, NotifyWrite
is initialised to TRUE
, which seemed unnecessary.
We can initialise it to FALSE
instead and everything still works.
We can even initialise it with NotifyWrite \in {TRUE, FALSE}
to allow either behaviour,
and thus test that old programs that followed the original version of the spec still work
with either behaviour.
That's a nice advantage of using a specification language. Saying "the code is the spec" becomes less useful as you build up more and more versions of the code!
However, because there was no spec before, we can't be sure that existing programs do follow it. And, in fact, I found that QubesDB uses the vchan library in a different and unexpected way. Instead of calling read, and then waiting if libvchan says to, QubesDB blocks first in all cases, and then calls the read function once it gets an event.
We can document that by adding an extra step at the start of ReceiverRead:
[Code listing: the extra initial step added to ReceiverRead, blocking before the first read]
Then TLC shows that NotifyWrite
cannot start as FALSE
.
The second odd thing is that the receiver sets NotifyWrite whenever there isn't enough data available
to fill the application's buffer completely.
But usually when you do a read operation you just provide a buffer large enough for the largest likely message.
It would probably make more sense to set NotifyWrite
only when the buffer is completely empty.
After checking the current version of the algorithm, I changed the specification to allow either behaviour.
At this point, we have specified what vchan should do and how it does it. We have also checked that it does do this, at least for messages up to 3 bytes long with a buffer size of 2. That doesn't sound like much, but we still checked 79,288 distinct states, with behaviours up to 38 steps long. This would be a perfectly reasonable place to declare the specification (and blog post) finished.
However, TLA has some other interesting abilities. In particular, it provides a very interesting technique to help discover why the algorithm works.
We'll start with Integrity
.
We would like to argue as follows:
1. Integrity is true in any initial state (i.e. Init => Integrity).
2. Any Next step preserves Integrity (i.e. Integrity /\ Next => Integrity').
Then it would just be a matter of looking at each possible action that makes up Next and checking that each one individually preserves Integrity.
However, we can't do this with Integrity
because (2) isn't true.
For example, the state { Got: "", Buffer: "21", Sent: "12" }
satisfies Integrity
,
but if we take a read step then the new state won't.
Instead, we have to argue "If we take a Next
step in any reachable state then Integrity'
",
but that's very difficult because how do we know whether a state is reachable without searching them all?
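The counter-example in that paragraph can be made concrete with a small Python check (byte lists standing in for the TLA sequences; the state is the unreachable one above, with Buffer holding "21"):

```python
# Why Integrity alone is not inductive: this state is unreachable in the
# real system, yet satisfies Integrity, and a single Read step from it
# produces a state that does not.

def integrity(s):
    # Got is a prefix of Sent
    return s["Sent"][:len(s["Got"])] == s["Got"]

def read_all(s):
    # The simple Read action: transfer the whole buffer to Got.
    return {"Got": s["Got"] + s["Buffer"], "Buffer": [], "Sent": s["Sent"]}

bad = {"Got": [], "Buffer": [2, 1], "Sent": [1, 2]}
assert integrity(bad)                # Got (empty) is a prefix of Sent
assert not integrity(read_all(bad))  # now Got = [2, 1] but Sent = [1, 2]
```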
So the idea is to make a stronger version of Integrity
, called IntegrityI
, which does what we want.
IntegrityI
is called an inductive invariant.
The first step is fairly obvious - I began with:
    IntegrityI ==
      Sent = Got \o Buffer \o msg
Integrity
just said that Got
is a prefix of Sent
.
This says specifically that the rest is Buffer \o msg
- the data currently being transmitted and the data yet to be transmitted.
As before, we can ask TLC to check Init /\ [][Next]_vars => []IntegrityI, confirming that it is an invariant.
It does that by finding all the Init
states and then taking Next
steps to find all reachable states.
But we can also ask it to check IntegrityI /\ [][Next]_vars => []IntegrityI
.
That is, the same thing but starting from any state matching IntegrityI
instead of Init
.
I created a new model (IntegrityI
) to do that.
It reports a few technical problems at the start because it doesn't know the types of anything.
For example, it can't choose initial values for SenderLive
without knowing that SenderLive
is a boolean.
I added a TypeOK
state expression that gives the expected type of every variable:
[Code listing: TypeOK, giving the expected type of every variable]
We also need to tell it all the possible states of pc (which says which label each process is at):
[Code listing: PCOK, the possible values of pc for each process]
You might imagine that the PlusCal translator would generate that for you, but it doesn't.
We also need to override MESSAGE
with FINITE_MESSAGE(n)
for some n
(I used 2
).
Otherwise, it can't enumerate all possible messages.
Now we have:
    IntegrityI ==
      /\ TypeOK
      /\ PCOK
      /\ Sent = Got \o Buffer \o msg
With that out of the way, TLC starts finding real problems
(that is, examples showing that IntegrityI /\ Next => IntegrityI'
isn't true).
First, recv_read_data
would do an out-of-bounds read if have = 1
and Buffer = << >>
.
Our job is to explain why that isn't a valid state.
We can fix it with an extra constraint:
    IntegrityI ==
      /\ TypeOK
      /\ PCOK
      /\ Sent = Got \o Buffer \o msg
      /\ pc[ReceiverReadID] = "recv_read_data" => have <= Len(Buffer)
(note: that =>
is "implies", while the <=
is "less-than-or-equal-to")
Now it complains that if we do recv_got_len
with Buffer = << >>, have = 1, want = 0
then we end up in recv_read_data
with
Buffer = << >>, have = 1
, and we have to explain why that can't happen and so on.
Because TLC searches breadth-first, the examples it finds never have more than 2 states. You just have to explain why the first state can't happen in the real system. Eventually, you get a big ugly pile of constraints, which you then think about for a bit and simplify. I ended up with:
[Code listing: the final collection of constraints making up IntegrityI]
It's a good idea to check the final IntegrityI
with the original SpecOK
model,
just to check it really is an invariant.
So, in summary, Integrity is always true because:

- Sent is always the concatenation of Got, Buffer and msg. That's fairly obvious, because sender_ready sets msg and appends the same thing to Sent, and the other steps (sender_write_data and recv_read_data) just transfer some bytes from the start of one variable to the end of another.
- Although, like all local information, the receiver's have variable might be out-of-date, there must be at least that much data in the buffer, because the sender process will only have added more, not removed any. This is sufficient to ensure that we never do an out-of-range read.
- Likewise, the sender's free variable is a lower bound on the true amount of free space, because the receiver only ever creates more space. We will therefore never write beyond the free space.
I think this ability to explain why an algorithm works, by being shown examples where the inductive property doesn't hold, is a really nice feature of TLA. Inductive invariants are useful as a first step towards writing a proof, but I think they're valuable even on their own. If you're documenting your own algorithm, this process will get you to explain your own reasons for believing it works (I tried it on a simple algorithm in my own code and it seemed helpful).
Some notes:

- Originally, I had the free and have constraints depending on pc. However, the algorithm sets them to zero when not in use, so it turns out they're always true.
- IntegrityI matches 532,224 states, even with a maximum Sent length of 1, but it passes! There are some games you can play to speed things up; see Using TLC to Check Inductive Invariance for some suggestions (I only discovered that while writing this up).
TLA provides a syntax for writing proofs, and integrates with TLAPS (the TLA+ Proof System) to allow them to be checked automatically.
Proving IntegrityI
is just a matter of showing that Init => IntegrityI
and that it is preserved
by any possible [Next]_vars
step.
To do that, we consider each action of Next
individually, which is long but simple enough.
I was able to prove it, but the recv_read_data
action was a little difficult
because we don't know that want > 0
at that point, so we have to do some extra work
to prove that transferring 0 bytes works, even though the real system never does that.
I therefore added an extra condition to IntegrityI
that want
is non-zero whenever it's in use,
and also conditions about have
and free
being 0 when not in use, for completeness:
[Code listing: IntegrityI extended with the conditions on want, have and free]
Integrity
was quite easy to prove, but I had more trouble trying to explain Availability
.
One way to start would be to add Availability
as a property to check to the IntegrityI
model.
However, it takes a while to check properties as it does them at the end, and the examples
it finds may have several steps (it took 1m15s to find a counter-example for me).
Here's a faster way (37s).
The algorithm will deadlock if both sender and receiver are in their blocked states and neither
interrupt is pending, so I made a new invariant, I
, which says that deadlock can't happen:
    I ==
      /\ IntegrityI
      /\ ~ /\ pc[SenderWriteID] = "sender_blocked"
           /\ pc[ReceiverReadID] = "recv_await_data"
           /\ ~SpaceAvailableInt
           /\ ~DataReadyInt
I discovered some obvious facts about closing the connection.
For example, the SenderLive
flag is set if and only if the sender's close thread hasn't done anything.
I've put them all together in CloseOK
:
[Code listing: CloseOK]
But I had problems with other examples TLC showed me, and I realised that I didn't actually know why this algorithm doesn't deadlock.
Intuitively it seems clear enough: the sender puts data in the buffer when there's space and notifies the receiver, and the receiver reads it and notifies the writer. What could go wrong? But both processes are working with information that can be out-of-date. By the time the sender decides to block because the buffer looked full, the buffer might be empty. And by the time the receiver decides to block because it looked empty, it might be full.
Maybe you already saw why it works from the C code, or the algorithm above, but it took me a while to figure it out! I eventually ended up with an invariant of the form:
    I ==
      /\ IntegrityI
      /\ SendMayBlock    => SpaceWakeupComing
      /\ ReceiveMayBlock => DataWakeupComing
SendMayBlock
is TRUE
if we're in a state that may lead to being blocked without checking the
buffer's free space again. Likewise, ReceiveMayBlock
indicates that the receiver might block.
SpaceWakeupComing
and DataWakeupComing
predict whether we're going to get an interrupt.
The idea is that if we're going to block, we need to be sure we'll be woken up.
It's a bit ugly, though, e.g.
[Code listing: the definition of one of these predicates]
It did pass my model that tested sending one byte, and I decided to try a proof.
Well, it didn't work.
The problem seems to be that DataWakeupComing
and SpaceWakeupComing
are really mutually recursive.
The reader will wake up if the sender wakes it, but the sender might be blocked, or about to block.
That's OK though, as long as the receiver will wake it, which it will do, once the sender wakes it...
You've probably already figured it out, but I thought I'd document my confusion. It occurred to me that although each process might have out-of-date information, that could be fine as long as at any one moment one of them was right. The last process to update the buffer must know how full it is, so one of them must have correct information at any given time, and that should be enough to avoid deadlock.
That didn't work either.
When you're at a proof step and can't see why it's correct, you can ask TLC to show you an example.
e.g. if you're stuck trying to prove that sender_request_notify
preserves I
when the
receiver is at recv_ready
, the buffer is full, and ReceiverLive = FALSE
,
you can ask for an example of that:
    Example ==
      /\ PCOK
      /\ pc[SenderWriteID] = "sender_request_notify"
      /\ pc[ReceiverReadID] = "recv_ready"
      /\ Len(Buffer) = BufferSize
      /\ ReceiverLive = FALSE
      /\ I
You then create a new model that searches Example /\ [][Next]_vars
and tests I
.
As long as Example
has several constraints, you can use a much larger model for this.
I also ask it to check the property [][FALSE]_vars
, which means it will show any step starting from Example
.
It quickly became clear what was wrong: it is quite possible that neither process is up-to-date.
If both processes see the buffer contains X
bytes of data, and the sender sends Y
bytes and the receiver reads Z
bytes, then the sender will think there are X + Y
bytes in the buffer and the receiver will think there are X - Z
bytes, and neither is correct.
My original 1-byte buffer was just too small to find a counter-example.
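The smallest instance of this scenario needs a 2-byte buffer, which is why a 1-byte model couldn't show it. In numbers (an illustrative sketch, with X, Y, Z as in the text):

```python
# Both sides last saw X bytes in the buffer; then the sender added Y and
# the receiver removed Z, each without seeing the other's update.

X, Y, Z = 1, 1, 1              # needs a buffer of at least 2 bytes
actual = X + Y - Z             # what is really in the buffer
sender_thinks = X + Y          # sender saw only its own write
receiver_thinks = X - Z        # receiver saw only its own read

assert actual == 1
assert sender_thinks == 2 and receiver_thinks == 0
assert sender_thinks != actual and receiver_thinks != actual
```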
The real reason why vchan works is actually rather obvious.
I don't know why I didn't see it earlier.
But eventually it occurred to me that I could make use of Got
and Sent
.
I defined WriteLimit
to be the total number of bytes that the sender would write before blocking,
if the receiver never did anything further.
And I defined ReadLimit
to be the total number of bytes that the receiver would read if the sender
never did anything else.
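Under the assumptions of the simple model (byte lists, no stale local copies), these limits can be sketched as small Python functions. The names mirror the spec's, but the real definitions are over the PlusCal state and are more involved:

```python
# ReadLimit/WriteLimit on the simple model's state (illustrative sketch).

BUFFER_SIZE = 2

def read_limit(s):
    # Receiver alone: it can drain only what is already in the buffer.
    return len(s["Got"]) + len(s["Buffer"])

def write_limit(s):
    # Sender alone: it can top the buffer up to BUFFER_SIZE and no further.
    return len(s["Got"]) + BUFFER_SIZE

s = {"Got": [1], "Buffer": [2], "Sent": [1, 2]}
assert read_limit(s) == 2   # everything sent so far can still be read
assert write_limit(s) == 3  # one more byte fits before the sender blocks
```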
Did I define these limits correctly?
It's easy to ask TLC to check some extra properties while it's running.
For example, I used this to check that ReadLimit
behaves sensibly:
[Code listing: sanity-check properties for ReadLimit]
Because ReadLimit
is defined in terms of what it does when no other processes run,
this property should ideally be tested in a model without the fairness conditions
(i.e. just Init /\ [][Next]_vars
).
Otherwise, fairness may force the sender to perform a step.
We still want to allow other steps, though, to show that ReadLimit
is a lower bound.
With this, we can argue that e.g. a 2-byte buffer will eventually transfer 3 bytes:

1. Initially, WriteLimit = 2, so the sender will eventually fill the buffer and block.
2. At that point, ReadLimit = 2, so the receiver will eventually read both bytes and notify the sender.
3. Now WriteLimit = 3, so the sender will eventually write the third byte, making ReadLimit 3, and the receiver will eventually read it too.
By this point, I was learning to be more cautious before trying a proof, so I added some new models to check this idea further. One prevents the sender from ever closing the connection and the other prevents the receiver from ever closing. That reduces the number of states to consider and I was able to check a slightly larger model.
[Code listing: NotifyFlagsCorrect]
If a process is on a path to being blocked then it must have set its notify flag.
NotifyFlagsCorrect says that in that case, the flag is still set, or the interrupt has been sent,
or the other process is just about to trigger the interrupt.
I managed to use that to prove that the sender's steps preserved I
,
but I needed a little extra to finish the receiver proof.
At this point, I finally spotted the obvious invariant (which you, no doubt, saw all along):
whenever NotifyRead
is still set, the sender has accurate information about the buffer.
(listing of 5 lines omitted)
That's pretty obvious, isn't it? The sender checks the buffer after setting the flag, so it must have accurate information at that point. The receiver clears the flag after reading from the buffer (which invalidates the sender's information).
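The five-line listing is omitted above, but from the surrounding text the invariant presumably has roughly this shape. This is my reconstruction, not the original: at this stage "accurate information" was still expressed via WriteLimit, and the name NotifyReadImpliesAccurate is mine.

```tla
(* Reconstruction: while NotifyRead is still set, the sender's view of
   the free space is exact, so it would fill the whole buffer without
   further help from the receiver. *)
NotifyReadImpliesAccurate ==
    NotifyRead => WriteLimit = Len(Got) + BufferSize
```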
Now I had a dilemma.
There was obviously going to be a matching property about NotifyWrite
.
Should I add that, or continue with just this?
I was nearly done, so I continued and finished off the proofs.
With I
proved, I was able to prove some other nice things quite easily:
(listing of 5 lines omitted)
That says that, whenever the sender is idle or blocked, the receiver will read everything sent so far, without any further help from the sender. And:
(listing of 4 lines omitted)
That says that whenever the receiver is blocked, the sender can fill the buffer.
That's pretty nice.
It would be possible to make a vchan system that e.g. could only send 1 byte at a time and still
prove it couldn't deadlock and would always deliver data,
but here we have shown that the algorithm can use the whole buffer.
At least, that's what these theorems say as long as you believe that ReadLimit
and WriteLimit
are defined correctly.
With the proof complete, I then went back and deleted all the stuff about ReadLimit
and WriteLimit
from I
and started again with just the new rules about NotifyRead
and NotifyWrite
.
Instead of using WriteLimit = Len(Got) + BufferSize
to indicate that the sender has accurate information,
I made a new SenderInfoAccurate
that just returns TRUE
whenever the sender will fill the buffer without further help.
That avoids some unnecessary arithmetic, which TLAPS needs a lot of help with.
(listing of 17 lines omitted)
By talking about accuracy instead of the write limit, I was also able to include "Done" in with the other happy cases. Before, that had to be treated as a possible problem because the sender can't use the full buffer when it's Done.
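The 17-line definition of SenderInfoAccurate is omitted above; the shape of the change, as the prose describes it, is roughly this (a sketch only, not the spec's actual case analysis):

```tla
(* Before: accuracy expressed with arithmetic on WriteLimit, which
   TLAPS needs a lot of help with. *)
NotifyRead => WriteLimit = Len(Got) + BufferSize

(* After: accuracy as its own predicate.  SenderInfoAccurate is TRUE
   whenever the sender will fill the buffer without further help, and
   its definition can treat "Done" as just another happy case. *)
NotifyRead => SenderInfoAccurate
```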
With this change, the proof of Spec => []I
became much simpler (384 lines shorter).
And most of the remaining steps were trivial.
The ReadLimit
and WriteLimit
idea still seemed useful, though,
but I found I was able to prove the same things from I
.
e.g. we can still conclude this, even if I
doesn't mention WriteLimit
:
(listing of 4 lines omitted)
That's nice, because it keeps the invariant and its proofs simple, but we still get the same result in the end.
I initially defined WriteLimit
to be the number of bytes the sender could write if
the sending application wanted to send enough data,
but I later changed it to be the actual number of bytes it would write if the application didn't
try to send any more.
This is because otherwise, with packet-based sends
(where we only write when the buffer has enough space for the whole message at once)
WriteLimit
could go down.
e.g. we think we can write another 3 bytes,
but then the application decides to write 10 bytes and now we can't write anything more.
The limit theorems above are useful properties,
but it would be good to have more confidence that ReadLimit
and WriteLimit
are correct.
I was able to prove some useful lemmas here.
First, ReceiverRead
steps don't change ReadLimit
(as long as the receiver hasn't closed
the connection):
(listing of 3 lines omitted)
This gives us a good reason to think that ReadLimit is correct: once the receiver is blocked it can't read anything more, and ReadLimit is defined to be Len(Got) then, so ReadLimit is obviously correct for that case. And since ReceiverRead steps don't change ReadLimit, this shows that ReadLimit is correct in all cases.
e.g. if ReadLimit = 5
and no other processes do anything,
then we will end up in a state with the receiver blocked, and ReadLimit = Len(Got) = 5
and so we really did read a total of 5 bytes.
I was also able to prove that it never decreases (unless the receiver closes the connection):
(listing of 3 lines omitted)
So, if ReadLimit = n
then it will always be at least n
,
and if the receiver ever blocks then it will have read at least n
bytes.
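The two omitted lemmas might be stated along these lines. This is a sketch based only on the prose: the lemma names are mine, and the real spec's hypotheses may differ.

```tla
(* Sketch: a read step leaves the prediction unchanged
   (as long as the receiver hasn't closed the connection)... *)
LEMMA ReadLimitStable ==
    I /\ ReceiverRead /\ ReceiverLive => ReadLimit' = ReadLimit

(* ...and no step at all lowers it while the receiver stays live. *)
LEMMA ReadLimitMonotonic ==
    I /\ [Next]_vars /\ ReceiverLive' => ReadLimit' >= ReadLimit
```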
I was able to prove similar properties about WriteLimit
.
So, I feel reasonably confident that these limit predictions are correct.
Disappointingly, we can't actually prove Availability
using TLAPS,
because currently it understands very little temporal logic (see TLAPS limitations).
However, I could show that the system can't deadlock while there's data to be transmitted:
(listing of 32 lines omitted)
I've included the proof of DeadlockFree1 above. If both processes are blocked, then NotifyRead and NotifyWrite must both be set (because processes don't block without setting them, and if either had been unset then an interrupt would now be pending and we wouldn't be blocked). Because NotifyRead is still set, the sender is correct in thinking that the buffer is still full. Because NotifyWrite is still set, the receiver is correct in thinking that the buffer is still empty. But the buffer can't be both full and empty at once, because BufferSize isn't zero, so we have a contradiction.
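The 32-line theorem and proof are omitted above; the statement is roughly of this form. This is a paraphrase: SenderBlocked and ReceiverBlocked are my placeholder names for predicates over the two processes' control states.

```tla
(* Paraphrase of the deadlock-freedom step: with data still in flight,
   the invariant rules out a state where both processes are blocked. *)
THEOREM DeadlockFree1 ==
    ASSUME I, SenderBlocked, ReceiverBlocked, Len(Got) < Len(Sent)
    PROVE  FALSE
```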
If it doesn't deadlock, then some process must keep getting woken up by interrupts, which means that interrupts keep being sent. We only send interrupts after making progress (writing to the buffer or reading from it), so we must keep making progress. We'll have to content ourselves with that argument.
The toolbox doesn't come with the proof system, so you need to install it separately. The instructions are out-of-date and have a lot of broken links. In May, I turned the steps into a Dockerfile, which got it partly installed, and asked on the TLA group for help, but no-one else seemed to know how to install it either. By looking at the error messages and searching the web for programs with the same names, I finally managed to get it working in December. If you have trouble installing it too, try using my Docker image.
Once installed, you can write a proof in the toolbox and then press Ctrl-G, Ctrl-G to check it. On success, the proof turns green. On failure, the failing step turns red. You can also do the Ctrl-G, Ctrl-G combination on a single step to check just that step. That's useful, because it's pretty slow. It takes more than 10 minutes to check the complete specification.
TLA proofs are done in the mathematical style, which is to write a set of propositions and vaguely suggest that thinking about these will lead you to the proof. This is good for building intuition, but bad for reproducibility. A mathematical proof is considered correct if the reader is convinced by it, which depends on the reader. In this case, the "reader" is a collection of automated theorem-provers with various timeouts. This means that whether a proof is correct or not depends on how fast your computer is, how many programs are currently running, etc. A proof might pass one day and fail the next. Some proof steps consistently pass when you try them individually, but consistently fail when checked as part of the whole proof. If a step fails, you need to break it down into smaller steps.
Sometimes the proof system is very clever, and immediately solves complex steps.
For example, here is the proof that the SenderClose
process (which represents the sender closing the channel),
preserves the invariant I
:
(listing of 28 lines omitted)
A step such as IntegrityI' BY DEF IntegrityI
says
"You can see that IntegrityI
will be true in the next step just by looking at its definition".
So this whole lemma is really just saying "it's obvious".
And TLAPS agrees.
At other times, TLAPS can be maddeningly stupid. And it can't tell you what the problem is - it can only make things go red.
For example, this fails:
(listing of 5 lines omitted)
We're trying to say that pc[2]
is unchanged, given that pc'
is the same as pc
except that we changed pc[1]
.
The problem is that TLA is an untyped language.
Even though we know we did a mapping update to pc
,
that isn't enough (apparently) to conclude that pc
is in fact a mapping.
To fix it, you need:
(listing of 6 lines omitted)
The extra pc \in [Nat -> STRING]
tells TLA the type of the pc
variable.
I found missing type information to be the biggest problem when doing proofs,
because you just automatically assume that the computer will know the types of things.
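From memory, the failing and fixed lemmas look something like this (the listings above are omitted from this copy, so treat the details as illustrative):

```tla
(* Fails: nothing tells the provers that pc is a function,
   so the EXCEPT update says less than you'd expect. *)
LEMMA ASSUME pc' = [pc EXCEPT ![1] = "l2"]
      PROVE  pc'[2] = pc[2]
OBVIOUS

(* Works: the extra assumption acts as a type annotation,
   so updating pc[1] provably leaves pc[2] alone. *)
LEMMA ASSUME pc \in [Nat -> STRING],
             pc' = [pc EXCEPT ![1] = "l2"]
      PROVE  pc'[2] = pc[2]
OBVIOUS
```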
Another example:
(listing of 5 lines omitted)
We're just trying to remove the x + ...
from both sides of the equation.
The problem is, TLA doesn't know that Min(y, 10)
is a number,
so it doesn't know whether the normal laws of addition apply in this case.
It can't tell you that, though - it can only go red.
Here's the solution:
(listing of 5 lines omitted)
The BY DEF Min
tells TLAPS to share the definition of Min
with the solvers.
Then they can see that Min(y, 10)
must be a natural number too and everything works.
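Again the listings are omitted above; reconstructed from the prose, the two versions look roughly like this (illustrative only):

```tla
(* Fails: without Min's definition, the provers can't tell that
   Min(y, 10) is a number, so they won't cancel the x + ... terms. *)
LEMMA ASSUME NEW x \in Nat, NEW y \in Nat,
             x + Min(y, 10) = x + y
      PROVE  Min(y, 10) = y
OBVIOUS

(* Works: sharing the definition lets the solvers see that
   Min(y, 10) is a natural number, so ordinary arithmetic applies. *)
LEMMA ASSUME NEW x \in Nat, NEW y \in Nat,
             x + Min(y, 10) = x + y
      PROVE  Min(y, 10) = y
BY DEF Min
```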
Another annoyance is that sometimes it can't find the right lemma to use, even when you tell it exactly what it needs. Here's an extreme case:
(listing of 29 lines omitted)
TransferFacts
states some useful facts about transferring data between two variables.
You can prove that quite easily.
SameAgain
is identical in every way, and just refers to TransferFacts
for the proof.
But even with only one lemma to consider - one that matches all the assumptions and conclusions perfectly -
none of the solvers could figure this one out!
My eventual solution was to name the bundle of results. This works:
(listing of 28 lines omitted)
Most of the art of using TLAPS is in controlling how much information to share with the provers.
Too little (such as failing to provide the definition of Min
) and they don't have enough information to find the proof.
Too much (such as providing the definition of TransferResults
) and they get overwhelmed and fail to find the proof.
It's all a bit frustrating, but it does work, and being machine checked does give you some confidence that your proofs are actually correct.
Another, perhaps more important, benefit of machine checked proofs is that when you decide to change something in the specification you can just ask it to re-check everything. Go and have a cup of tea, and when you come back it will have highlighted in red any steps that need to be updated. I made a lot of changes, and this worked very well.
The TLAPS philosophy is that
If you are concerned with an algorithm or system, you should not be spending your time proving basic mathematical facts. Instead, you should assert the mathematical theorems you need as assumptions or theorems.
So even if you can't find a formal proof of every step, you can still use TLAPS to break it down into steps that you either can prove, or that you think are obvious enough that they don't require a proof. However, I was able to prove everything I needed for the vchan specification within TLAPS.
I did a little bit of tidying up at the end.
In particular, I removed the want
variable from the specification.
I didn't like it because it doesn't correspond to anything in the OCaml implementation,
and the only place the algorithm uses it is to decide whether to set NotifyWrite
,
which I thought might be wrong anyway.
I changed this:
(listing of 2 lines omitted)
to:
(listing of 6 lines omitted)
That always allows an implementation to set NotifyWrite
if it wants to,
or to skip that step just as long as have > 0
.
That covers the current C behaviour, my proposed C behaviour, and the OCaml implementation.
It also simplifies the invariant, and even made the proofs shorter!
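In PlusCal terms, the change is roughly this. The "before" line is a guess at how want was used; only the have > 0 condition and the either/or behaviour come from the prose, and the label name is my placeholder.

```tla
(* Before (sketch): whether to set NotifyWrite depended on the
   want variable, which has no counterpart in the OCaml code. *)
recv_got_len: if (want > have) NotifyWrite := TRUE;

(* After (sketch): an implementation may always set the flag, or may
   skip that step, but only while there is still data to read. *)
recv_got_len: either NotifyWrite := TRUE;
              or     when have > 0;
```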
I put the final specification online at spec-vchan. I also configured Travis CI to check all the models and verify all the proofs. That's useful because sometimes I'm too impatient to recheck everything on my laptop before pushing updates.
You can generate a PDF version of the specification with make pdfs
.
Expressions there can be a little easier to read because they use proper symbols, but
it also breaks things up into pages, which is highly annoying.
It would be nice if it could omit the proofs too, as they're really only useful if you're trying to edit them.
I'd rather just see the statement of each theorem.
With my new understanding of vchan, I couldn't see anything obvious wrong with the C code (at least, as long as you keep the connection open, which the firewall does).
I then took a look at ocaml-vchan. The first thing I noticed was that someone had commented out all the memory barriers, noting in the Git log that they weren't needed on x86. I am using x86, so that's not it, but I filed a bug about it anyway: Missing memory barriers.
The other strange thing I saw was the behaviour of the read
function.
It claims to implement the Mirage FLOW
interface, which says that read
"blocks until some data is available and returns a fresh buffer containing it".
However, looking at the code, what it actually does is to return a pointer directly into the shared buffer.
It then delays updating the consumer counter until the next call to read.
That's rather dangerous, and I filed another bug about that: Read has very surprising behaviour.
However, when I checked the mirage-qubes
code, it just takes this buffer and makes a copy of it immediately.
So that's not the bug either.
Also, the original bug report mentioned a 10 second timeout, and neither the C implementation nor the OCaml one had any timeouts. Time to look at QubesDB itself.
QubesDB accepts messages from either the guest VM (the firewall) or from local clients connected over Unix domain sockets. The basic structure is:
(listing of 6 lines omitted)
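The six-line listing is omitted above; described as pseudocode, the structure is roughly the following. Only handle_client_data is named in the text; the rest of the shape is my sketch.

```text
\* Sketch of QubesDB's main loop (pseudocode, not the original listing):
while TRUE:
    wait for a vchan event or a local client, with a 10-second timeout
    read any pending messages from the guest VM's vchan and dispatch them
    handle_client_data()   \* forward local clients' requests to the firewall
```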
The suspicion was that QubesDB was missing a vchan event, and only discovering that there was data in the buffer when the 10-second timeout expired. Looking at the code, it does seem to me that there is a possible race condition here:
handle_client_data
sends the data to the firewall using a blocking write.
I don't think this is the cause of the bug though,
because the only messages the firewall might be sending here are QDB_RESP_OK
messages,
and QubesDB just discards such messages.
I managed to reproduce the problem myself,
and saw that in fact QubesDB doesn't make any progress due to the 10 second timeout.
It just tries to go back to sleep for another 10 seconds and then
immediately gets woken up by a message from a local client.
So, it looks like QubesDB is only sending updates every 10 seconds because its client, qubesd
,
is only asking it to send updates every 10 seconds!
And looking at the qubesd
logs, I saw stacktraces about libvirt failing to attach network devices, so
I read the Xen network device attachment specification to check that the firewall implemented that correctly.
I'm kidding, of course. There isn't any such specification. But maybe this blog post will inspire someone to write one...
As users of open source software, we're encouraged to look at the source code and check that it's correct ourselves. But that's pretty difficult without a specification saying what things are supposed to do. Often I deal with this by learning just enough to fix whatever bug I'm working on, but this time I decided to try making a proper specification instead. Making the TLA specification took rather a long time, but it was quite pleasant. Hopefully the next person who needs to know about vchan will appreciate it.
A TLA specification generally defines two sets of behaviours. The first is the set of desirable behaviours (e.g. those where the data is delivered correctly). This definition should clearly explain what users can expect from the system. The second defines the behaviours of a particular algorithm. This definition should make it easy to see how to implement the algorithm. The TLC model checker can check that the algorithm's behaviours are all acceptable, at least within some defined limits.
Writing a specification using the TLA notation forces us to be precise about what we mean. For example, in a prose specification we might say "data sent will eventually arrive", but in an executable TLA specification we're forced to clarify what happens if the connection is closed. I would have expected that if a sender writes some data and then closes the connection then the data would still arrive, but the C implementation of vchan does not always ensure that. The TLC model checker can find a counter-example showing how this can fail in under a minute.
To explain why the algorithm always works, we need to find an inductive invariant.
The TLC model checker can help with this,
by presenting examples of unreachable states that satisfy the invariant but don't preserve it after taking a step.
We must add constraints to explain why these states are invalid.
This was easy for the Integrity
invariant, which explains why we never receive incorrect data, but
I found it much harder to prove that the system cannot deadlock.
I suspect that the original designer of a system would find this step easy, as presumably they already know why it works.
Once we have found an inductive invariant, we can write a formal machine-checked proof that the invariant is always true. Although TLAPS doesn't allow us to prove liveness properties directly, I was able to prove various interesting things about the algorithm: it doesn't deadlock; when the sender is blocked, the receiver can read everything that has been sent; and when the receiver is blocked, the sender can fill the entire buffer.
Writing formal proofs is a little tedious, largely because TLA is an untyped language. However, there is nothing particularly difficult about it, once you know how to work around various limitations of the proof checkers.
You might imagine that TLA would only work on very small programs like libvchan, but this is not the case.
It's just a matter of deciding what to specify in detail.
For example, in this specification I didn't give any details about how ring buffers work,
but instead used a single Buffer
variable to represent them.
For a specification of a larger system using vchan, I would model each channel using just Sent
and Got
and an action that transferred some of the difference on each step.
The TLA Toolbox has some rough edges. The ones I found most troublesome were: the keyboard shortcuts frequently stop working; when a temporal property is violated, it doesn't tell you which one it was; and the model explorer tooltips appear right under the mouse pointer, preventing you from scrolling with the mouse wheel. It also likes to check its "news feed" on a regular basis. It can't seem to do this at the same time as other operations, and if you're in the middle of a particularly complex proof checking operation, it will sometimes suddenly pop up a box suggesting that you cancel your job, so that it can get back to reading the news.
However, it is improving. In the latest versions, when you get a syntax error, it now tells you where in the file the error is. And pressing Delete or Backspace while editing no longer causes it to crash and lose all unsaved data. In general I feel that the TLA Toolbox is quite usable now. If I were designing a new protocol, I would certainly use TLA to help with the design.
TLA does not integrate with any language type systems, so even after you have a specification you still need to check manually that your code matches the spec. It would be nice if you could check this automatically, somehow.
One final problem is that whenever I write a TLA specification, I feel the need to explain first what TLA is. Hopefully it will become more popular and that problem will go away.
Update 2019-01-10: Marek Marczykowski-Górecki told me that the state model for network devices is the same as
the one for block devices, which is documented in the blkif.h
block device header file, and provided libvirt debugging help -
so the bug is now fixed!