<?xml version="1.0" encoding="UTF-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:base="https://roscidus.com/">
  <title>Thomas Leonard's blog</title>
  <link href="https://roscidus.com/blog/atom.xml" rel="self"></link>
  <link href="https://roscidus.com/blog/"></link>
  <updated>2026-01-01T16:00:00+00:00</updated>
  <id>https://roscidus.com/blog/</id>
  <author>
    <name>Thomas Leonard</name>
  </author>
  <entry>
    <title type="html">Proving liveness with TLA</title>
    <link href="https://roscidus.com/blog/blog/2026/01/01/tla-liveness/"></link>
    <updated>2026-01-01T16:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2026/01/01/tla-liveness</id>
    <content type="html"><![CDATA[<p>The TLA Toolbox now has support for proving liveness properties (i.e. that something will eventually happen).
I try it out on the Xen vchan protocol.</p>
<!-- more -->
<p><span class="caption-wrapper center"><img src="/blog/images/tla/liveness.png" title="Working on a liveness proof with the TLA+ Toolbox." class="caption"/><span class="caption-text">Working on a liveness proof with the TLA+ Toolbox.</span></span></p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#background">Background</a>
</li>
<li><a href="#a-simple-channel-specification">A simple channel specification</a>
<ul>
<li><a href="#specification-in-terms-of-actions">Specification in terms of actions</a>
</li>
<li><a href="#invariants">Invariants</a>
</li>
</ul>
</li>
<li><a href="#temporal-logic">Temporal logic</a>
<ul>
<li><a href="#proving-temporal-claims-with-tlaps">Proving temporal claims with TLAPS</a>
</li>
<li><a href="#generalising">Generalising</a>
</li>
<li><a href="#hiding-definitions">Hiding definitions</a>
</li>
</ul>
</li>
<li><a href="#proving-liveness-simple-case">Proving liveness (simple case)</a>
</li>
<li><a href="#multi-step-liveness">Multi-step liveness</a>
</li>
<li><a href="#the-real-protocol">The real protocol</a>
<ul>
<li><a href="#proving-readlimit">Proving ReadLimit</a>
</li>
<li><a href="#proving-availability">Proving Availability</a>
</li>
<li><a href="#proving-writelimit">Proving WriteLimit</a>
</li>
<li><a href="#end-to-end-liveness">End-to-end liveness</a>
</li>
</ul>
</li>
<li><a href="#work-arounds-for-bugs">Work-arounds for bugs</a>
<ul>
<li><a href="#suffices-doesnt-always-generalise">SUFFICES doesn't always generalise</a>
</li>
<li><a href="#case-with-a-temporal-goal">CASE with a temporal goal</a>
</li>
<li><a href="#ptl-and-primes">PTL and primes</a>
</li>
<li><a href="#syntax-matters">Syntax matters</a>
</li>
<li><a href="#other-bugs">Other bugs</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=46471699">Hacker News</a>
and <a href="https://lobste.rs/s/r54non/proving_liveness_with_tla">Lobsters</a>.)</p>
<h2 id="background">Background</h2>
<p>The vchan protocol is used for efficient communication between Xen virtual machines.
In 2018, <a href="https://roscidus.com/blog/blog/2019/01/01/using-tla-plus-to-understand-xen-vchan/">I made a TLA+ specification of vchan</a>:
I created a specification of the protocol from the C code,
used the model checker (TLC) to test that the protocol worked on small models,
and wrote a machine-verified proof of <code>Integrity</code> (that the data received always matched what was sent).
I also outlined a proof of <code>Availability</code> (that data sent will eventually arrive, rather than the system deadlocking or going around in circles). But:</p>
<blockquote>
<p>Disappointingly, we can't actually prove <code>Availability</code> using TLAPS, because
currently it understands very little temporal logic</p>
</blockquote>
<p>However, newer versions of <a href="https://proofs.tlapl.us/doc/web/content/Home.html">TLAPS</a> (the TLA Proof System) have added proper support for temporal logic,
so I decided to take another look.
In this post, I'll start with a simple example of proving a liveness property,
and then look at the real protocol.
If you're not familiar with TLA, you might want to read the earlier post first,
though I'll briefly introduce concepts as we meet them here.</p>
<h2 id="a-simple-channel-specification">A simple channel specification</h2>
<p>We'll start with <a href="https://github.com/talex5/spec-vchan/blob/blog-liveness-1/example.tla">a simple model of a one-way channel</a>.
There is a sender and a receiver, and they have access to a shared buffer of size <code>BufferSize</code>
(which is a non-zero natural number):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="kn">CONSTANT</span> <span class="n">BufferSize</span> 
</span><span class="line"><span class="n">ASSUME</span> <span class="n">BufferSizeType</span> <span class="ni">==</span> <span class="n">BufferSize</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="o">\</span> <span class="ni">{</span><span class="m">0</span><span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The model just keeps track of the total number of bytes sent and received (not the actual data):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="kn">VARIABLES</span> <span class="n">Sent</span><span class="p">,</span> <span class="n">Got</span>
</span><span class="line"><span class="n">vars</span> <span class="ni">==</span> <span class="ni">&lt;&lt;</span> <span class="n">Sent</span><span class="p">,</span> <span class="n">Got</span> <span class="ni">&gt;&gt;</span>     <span class="c">\* For referring to both variables together</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The amount of data currently in the buffer (<code>BufferUsed</code>) is the difference between these counters,
and the free space (<code>BufferFree</code>) is the difference between that and the total buffer size:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">BufferUsed</span> <span class="ni">==</span> <span class="n">Sent</span> <span class="o">-</span> <span class="n">Got</span>
</span><span class="line"><span class="n">BufferFree</span> <span class="ni">==</span> <span class="n">BufferSize</span> <span class="o">-</span> <span class="n">BufferUsed</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The property we are interested in (<code>Liveness</code>) is that data sent eventually arrives:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Liveness</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>    <span class="c">\* For all natural numbers n:</span>
</span><span class="line">    <span class="n">Sent</span> <span class="ni">=</span> <span class="n">n</span> <span class="o">~</span><span class="ni">&gt;</span>     <span class="c">\* if we have sent exactly n bytes, then eventually</span>
</span><span class="line">      <span class="n">Got</span> <span class="o">&gt;=</span> <span class="n">n</span>      <span class="c">\* we&#39;ll have received at least that many.</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="specification-in-terms-of-actions">Specification in terms of actions</h3>
<p><code>Liveness</code> is the property we want, but we also need to say how we plan to achieve this.
In TLA, this is done by describing the initial state and the actions that can be performed
at each step.</p>
<p>Initially, no bytes have been sent or received:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Init</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Got</span> <span class="ni">=</span> <span class="m">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>An <em>action</em> only looks at a single atomic step of the algorithm, and relates the values of the variables
at the start of the step (e.g. <code>Sent</code>) to their values at the end (written primed, e.g. <code>Sent'</code>).</p>
<p>The <code>Send</code> action is true when the sender increases <code>Sent</code>
(by some amount <code>n</code>, limited by the free space in the buffer):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Send</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\E</span> <span class="n">n</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">BufferFree</span> <span class="p">:</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Sent</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Sent</span> <span class="o">+</span> <span class="n">n</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">Got</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>Recv</code> action is true when the receiver reads all of the data currently in the buffer:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Recv</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">BufferUsed</span> <span class="ni">&gt;</span> <span class="m">0</span>             <span class="c">\* Buffer must be non-empty</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Got</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Got</span> <span class="o">+</span> <span class="n">BufferUsed</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">Sent</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Every action of our system is either a <code>Send</code> or a <code>Recv</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Next</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">Send</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">Recv</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Finally, <code>Spec</code> describes the behaviours of the whole system in these terms:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Spec</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Init</span>              <span class="c">\* Init is true initially</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[][</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span>     <span class="c">\* Each step is a Next step, or leaves vars unchanged</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">WF_vars</span><span class="p">(</span><span class="n">Recv</span><span class="p">)</span>     <span class="c">\* The receiver must eventually read, if it can</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="invariants">Invariants</h3>
<p>It's always useful to prove some basic invariants about the system
(in particular, the types of the variables):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">I</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Got</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">Got</span>                    <span class="c">\* We can&#39;t receive more than was sent</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">BufferUsed</span> <span class="s">\in</span> <span class="m">0</span><span class="o">..</span><span class="n">BufferSize</span>   <span class="c">\* The buffer doesn&#39;t overflow</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Before trying to prove anything, it's good to use the model checker (TLC) to test it first.
You'll need to put some limits on the number of bytes sent and checked in the model,
so it isn't a full proof, but it's likely to spot most problems.
In this example, it passes easily.</p>
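<p>One way to impose such limits is with a state constraint and a bounded copy of the property.
This is only a sketch: <code>MaxSent</code>, <code>SentLimit</code> and <code>LivenessBounded</code> are hypothetical names
that are not in the spec, and the bound of 4 is arbitrary.</p>
<pre><code>\* Hypothetical additions for model checking only; not part of the spec.
MaxSent == 4                      \* arbitrary small bound
SentLimit == Sent &lt;= MaxSent      \* TLC doesn't explore past states violating this

\* TLC can't enumerate Nat, so check only the values of n the model can reach:
LivenessBounded ==
  \A n \in 0..MaxSent :
    Sent = n ~&gt; Got &gt;= n
</code></pre>
<p>The TLC configuration would then name <code>SentLimit</code> as a <code>CONSTRAINT</code>, <code>I</code> as an <code>INVARIANT</code>
and <code>LivenessBounded</code> as a <code>PROPERTY</code>, and give <code>BufferSize</code> a small value.
Since <code>Got</code> never exceeds <code>Sent</code>, bounding <code>Sent</code> is enough to keep the state space finite.
State constraints can interact with liveness checking, so treat the liveness result with a little more caution than the invariant one.</p>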
<p>The general idea for proving things in TLA is to get away from temporal logic as quickly as possible
and do most of the proof with simple actions.
The following pattern should work for proving any invariant:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">AlwaysI</span> <span class="ni">==</span> <span class="n">Spec</span> <span class="ni">=&gt;</span> <span class="p">[]</span><span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Init</span> <span class="ni">=&gt;</span> <span class="n">I</span> <span class="n">BY</span> <span class="n">BufferSizeType</span> <span class="n">DEF</span> <span class="n">Init</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">I</span> <span class="o">/\</span> <span class="p">[</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span> <span class="ni">=&gt;</span> <span class="n">I</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">NextPreservesI</span> <span class="n">DEF</span> <span class="n">vars</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span> <span class="n">DEF</span> <span class="n">Spec</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that <code>Spec</code> implies that <code>I</code> is always true because:</p>
<ul>
<li><code>I</code> is true initially.
</li>
<li><code>Next</code> steps preserve <code>I</code> (and so does leaving <code>vars</code> unchanged).
</li>
<li>Therefore <code>Spec</code> ensures <code>I</code> will always be true.
</li>
</ul>
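<p>For reference, the shorthand in the second step unfolds roughly as follows.
The names <code>NextOrStutter</code> and <code>Stutter</code> are made up for illustration and don't appear in the spec.</p>
<pre><code>\* [Next]_vars means: a Next step, or a step that leaves vars unchanged.
NextOrStutter == Next \/ UNCHANGED vars

\* UNCHANGED vars means vars' = vars which, once vars is expanded, amounts to:
Stutter == (Sent' = Sent) /\ (Got' = Got)
</code></pre>
<p>This is why the step cites <code>DEF vars</code>: in the stuttering case the prover needs to see that
neither variable changes, so that <code>I'</code> follows directly from <code>I</code>.
<code>NextPreservesI</code> covers the other case.</p>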
<p>Note: The <code>&lt;1&gt;</code> at the start of each step is the indentation level,
which for some reason TLAPS can't work out by itself.</p>
<p>The final <code>QED BY PTL</code> (&quot;by Propositional Temporal Logic&quot;) is the only step that uses temporal logic.
Everything else is just regular logic.</p>
<p>Now all we need to do is prove that the <code>Next</code> action preserves <code>I</code>, which is straightforward:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">LEMMA</span> <span class="n">NextPreservesI</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">Next</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">I</span><span class="err">&#39;</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="k k-Conditional">CASE</span> <span class="n">Send</span> <span class="n">BY</span> <span class="n">BufferSizeType</span> <span class="n">DEF</span> <span class="n">I</span><span class="p">,</span> <span class="n">Send</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="k k-Conditional">CASE</span> <span class="n">Recv</span> <span class="n">BY</span> <span class="n">BufferSizeType</span> <span class="n">DEF</span> <span class="n">I</span><span class="p">,</span> <span class="n">Recv</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">Next</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>So for proving invariants like <code>I</code> we can mostly ignore temporal logic.
But for proving liveness, we'll need to understand more about it...</p>
<h2 id="temporal-logic">Temporal logic</h2>
<p>Temporal logic is a type of modal logic.
In a modal logic we imagine there are many <em>worlds</em>, and we are in one of them.
Ordinary mathematical statements refer to our world,
so e.g. <code>Sent = 4</code> refers to the value of <code>Sent</code> in our world.
However, by using the modal operators, you can refer to other worlds.
For example, <code>[](Sent = 4)</code> means that <code>Sent = 4</code> is true in all worlds &quot;reachable&quot; from the current one.</p>
<p>In TLA, each world corresponds to a moment in time,
and current and future times are reachable.
So <code>[]X</code> means &quot;X will always be true&quot;.
<code>&lt;&gt;X</code> means &quot;X will eventually be true&quot;.
For example:</p>
<ul>
<li><code>[](Sent = 4)</code> means that <code>Sent = 4</code> is true now and always will be.
</li>
<li><code>&lt;&gt;(Sent = 4)</code> means that <code>Sent = 4</code> is either true now or will be true at some point in the future.
</li>
</ul>
<p>Modal operators nest, so you can say things like <code>&lt;&gt;[](Sent = 4)</code>
(eventually <code>Sent</code> will equal 4 and will remain so from then onwards),
or <code>[](F =&gt; &lt;&gt;G)</code> (<code>F</code> always leads to <code>G</code>, which can also be written as <code>F ~&gt; G</code>).</p>
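<p>To make the nesting concrete, here is a small sketch using the variables from the channel spec.
The definition names are hypothetical and only for illustration.</p>
<pre><code>\* "Leads to" is just [] and &lt;&gt; nested:
LeadsToLong  == [](Sent = 4 =&gt; &lt;&gt;(Got &gt;= 4))
LeadsToShort == (Sent = 4) ~&gt; (Got &gt;= 4)      \* same meaning as LeadsToLong

\* Weak fairness is also built from nested modal operators.
\* WF_vars(Recv) is equivalent to:
RecvFair == &lt;&gt;[](ENABLED &lt;&lt;Recv&gt;&gt;_vars) =&gt; []&lt;&gt;&lt;&lt;Recv&gt;&gt;_vars
\* where &lt;&lt;Recv&gt;&gt;_vars means Recv /\ vars' # vars (a Recv step that changes vars).
</code></pre>
<p>Read <code>RecvFair</code> as: if a <code>Recv</code> step eventually becomes possible and stays possible forever,
then <code>Recv</code> steps happen infinitely often.</p>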
<h3 id="proving-temporal-claims-with-tlaps">Proving temporal claims with TLAPS</h3>
<p>In the proof of <code>Spec =&gt; []I</code> above, only one step required temporal logic (<code>BY PTL</code>).
Having few such steps is good because these steps can be tricky.</p>
<p>The main problem is that TLA propositions are written in <em>First Order Modal Logic</em>,
but TLAPS doesn't have any solvers that understand this!
Instead, it has a number of solvers that understand regular non-modal logic,
plus the <code>PTL</code> decision procedure that handles modal logic but not much else.</p>
<p>For example, consider this apparently simple claim:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that if <code>Sent</code> is always equal to 4 then it is currently greater than 0.
TLAPS can't prove this in a single step:</p>
<ul>
<li>The regular solvers don't understand the <code>[]</code> bit.
</li>
<li>The PTL procedure doesn't understand numbers.
</li>
</ul>
<p><a href="https://members.loria.fr/SMerz/papers/arqnl2014.html">Coalescing for Reasoning in First-Order Modal Logics</a> explains in detail what's going on,
but my rough understanding is that TLAPS replaces things a solver won't understand with
new fresh variables.</p>
<p>For example, <code>[](Sent = 4) =&gt; Sent &gt; 0</code> is received by the solver as something like:</p>
<ul>
<li><code>Blob1 =&gt; Sent &gt; 0</code> (for a regular solver), or
</li>
<li><code>[]Blob2 =&gt; Blob3</code> (for PTL)
</li>
</ul>
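<p>Neither form can be proved: the regular solver knows nothing about <code>Blob1</code>,
and PTL sees no connection between <code>Blob2</code> and <code>Blob3</code>.
Instead, you have to break it down into separate steps:</p>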
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>In the second step, the PTL solver will receive something like this, which it can prove:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">ASSUME</span> <span class="n">Blob2</span> <span class="ni">=&gt;</span> <span class="n">Blob3</span>
</span><span class="line"><span class="n">PROVE</span>  <span class="p">[]</span><span class="n">Blob2</span> <span class="ni">=&gt;</span> <span class="n">Blob3</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Annoyingly, TLAPS doesn't have an easy way to show you what it replaced in this way.
If it did, I think it would be much easier to learn how to use it.
You can use <code>tlapm spec.tla --debug tempfiles</code> and look at the files generated for each
solver backend, but they're quite hard to read.</p>
<h3 id="generalising">Generalising</h3>
<p>Things that can be proved without assumptions specific to the current time are always true.
In the above example, <code>PTL</code> converted <code>Blob2 =&gt; Blob3</code> from the regular solver
to <code>[](Blob2 =&gt; Blob3)</code>.
This was only possible because the proof of <code>Blob2 =&gt; Blob3</code> didn't depend on the current world/time.</p>
<p>For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">Spec</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span> <span class="cm">(* Fails *)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We're trying to prove that if <code>Sent</code> is always 4 then it's always greater than 0.
The problem is that we proved <code>Sent = 4 =&gt; Sent &gt; 0</code> in a context that assumed <code>Spec</code>,
so TLAPS won't let us generalise it (even though it looks identical to what we proved before,
and didn't actually use anything from <code>Spec</code>).</p>
<p>When <code>PTL</code> fails, the <code>Interesting Obligations</code> view shows e.g.</p>
<pre><code>ASSUME ...
       Sent = 4 =&gt; Sent &gt; 0 (* non-[] *),
       PTL 
PROVE  [](Sent = 4) =&gt; [](Sent &gt; 0)
</code></pre>
<p>The <code>(* non-[] *)</code> indicates that this can't be generalised to all times; we only know it's true now.</p>
<p>Instead, we can do:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">Spec</span> <span class="ni">=&gt;</span> <span class="p">([](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">))</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">ASSUME</span> <span class="n">Spec</span>
</span><span class="line">             <span class="n">PROVE</span>  <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line">    <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span> <span class="n">DEF</span> <span class="n">Spec</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Doing the <code>Sent = 4 =&gt; Sent &gt; 0</code> step before introducing <code>Spec</code> as an assumption allowed it to be generalised.
We could also have proved it as a separate lemma.</p>
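<p>The separate-lemma version might look something like this.
It is an unchecked sketch that hasn't been run through TLAPS, and <code>FourIsPositive</code> is a made-up name.</p>
<pre><code>\* Proved at the top level, with no assumptions in the context.
LEMMA FourIsPositive == Sent = 4 =&gt; Sent &gt; 0
OBVIOUS

THEOREM Spec =&gt; ([](Sent = 4) =&gt; [](Sent &gt; 0))
BY FourIsPositive, PTL
</code></pre>
<p>Because the lemma is proved without any time-specific assumptions, it holds at every time,
so <code>PTL</code> should be free to generalise it.</p>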
<p>Here's an example where we make use of a <code>non-[]</code> assumption:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">Spec</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="ni">&lt;&gt;</span><span class="p">(</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">Spec</span><span class="p">,</span> <span class="n">Init</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If <code>Spec</code> is true (we're at the start of the algorithm) then <code>Sent = 0</code>.
We can't generalise to saying that <code>Sent</code> is always 0,
but we can still use it to prove that <code>Sent</code> is (trivially) eventually zero.
But having <code>non-[]</code> assumptions usually indicates that something has gone wrong.</p>
<p>Some assumptions are safe to have in the context, however.
You can assume constants (e.g. <code>NEW n \in Nat</code>), and also things that will always be true (e.g. <code>[]I</code>).
This is fine:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="p">[]</span><span class="n">I</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="ni">&gt;</span> <span class="m">0</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="hiding-definitions">Hiding definitions</h3>
<p>Here's a simplified version of a problem I had:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">LEMMA</span> <span class="n">L1</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line"><span class="n">PROOF</span> <span class="n">OMITTED</span>
</span><span class="line">
</span><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">m</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">m</span> <span class="o">+</span> <span class="m">1</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">m</span> <span class="o">+</span> <span class="m">1</span><span class="p">)</span>
</span><span class="line"><span class="n">BY</span> <span class="n">L1</span>  <span class="cm">(* Fails! *)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>As usual, the regular solvers can't handle the <code>[]</code> and <code>PTL</code> can't handle the numbers or the for-all.
We could define an operator (<code>F</code>) for the temporal formula and then hide its definition:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">m</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">m</span> <span class="o">+</span> <span class="m">1</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">m</span> <span class="o">+</span> <span class="m">1</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">DEFINE</span> <span class="n">F</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="ni">==</span> <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">i</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="s">\A</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span> <span class="n">F</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="n">BY</span> <span class="n">L1</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">F</span><span class="p">(</span><span class="n">m</span> <span class="o">+</span> <span class="m">1</span><span class="p">)</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">HIDE</span> <span class="n">DEF</span> <span class="n">F</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>However, due to a bug in TLAPS, this is unreliable.
Often a proof fails, then you rename things a bit and it passes, then you rename them back and it continues to pass!
I reported this at <a href="https://github.com/tlaplus/tlapm/issues/247">https://github.com/tlaplus/tlapm/issues/247</a>.
Here's a screenshot of an extreme case, where TLAPS is failing to prove a conclusion that matches an assumption
character for character!</p>
<p><span class="caption-wrapper center"><img src="/blog/images/tla/failing-proof.png" title="TLAPS failing to prove something very obvious." class="caption"/><span class="caption-text">TLAPS failing to prove something very obvious.</span></span></p>
<p>Until that's fixed, I found that the following pattern works well as a work-around:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">L1_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="ni">==</span> <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line"><span class="n">LEMMA</span> <span class="n">L1</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">L1_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">USE</span> <span class="n">DEF</span> <span class="n">L1_prop</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">PROOF</span> <span class="n">OMITTED</span>
</span><span class="line">
</span><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">m</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">m</span> <span class="o">+</span> <span class="m">1</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">m</span> <span class="o">+</span> <span class="m">1</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">L1_prop</span><span class="p">(</span><span class="n">m</span><span class="o">+</span><span class="m">1</span><span class="p">)</span> <span class="n">BY</span> <span class="n">L1</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">L1_prop</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Even when this bug is fixed, I think it would be really nice if
you could say e.g. <code>BY L1(m+1)</code> to explicitly use <code>L1</code> with a particular value of <code>n</code>.</p>
<h2 id="proving-liveness-simple-case">Proving liveness (simple case)</h2>
<p>OK, let's use some temporal logic to prove <code>Liveness</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Liveness</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>    <span class="c">\* For all natural numbers n:</span>
</span><span class="line">    <span class="n">Sent</span> <span class="ni">=</span> <span class="n">n</span> <span class="o">~</span><span class="ni">&gt;</span>     <span class="c">\* if we have sent exactly n bytes, then eventually</span>
</span><span class="line">      <span class="n">Got</span> <span class="o">&gt;=</span> <span class="n">n</span>      <span class="c">\* we&#39;ll have received at least that many.</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>One way to prove <code>A ~&gt; B</code> (<code>A</code> always eventually leads to <code>B</code>)
is from an action <code>A =&gt; B'</code> (<code>A</code> leads to <code>B</code> in one step), plus a few side conditions.
We'll do that here because we conveniently have a <code>Recv</code> action that immediately gets us to our goal.
To prove <code>A ~&gt; B</code>, a generally useful pattern is:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">A</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">B</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">A</span> <span class="o">/\</span> <span class="n">Recv</span> <span class="ni">=&gt;</span> <span class="n">B</span><span class="err">&#39;</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">A</span> <span class="ni">=&gt;</span> <span class="n">ENABLED</span> <span class="ni">&lt;&lt;</span><span class="n">Recv</span><span class="ni">&gt;&gt;</span><span class="n">_vars</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">WF_vars</span><span class="p">(</span><span class="n">Recv</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">A</span> <span class="o">/\</span> <span class="p">[</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span> <span class="ni">=&gt;</span> <span class="p">(</span><span class="n">A</span> <span class="o">\/</span> <span class="n">B</span><span class="p">)</span><span class="err">&#39;</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="p">[][</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span></code></pre></td></tr></tbody></table></div></figure><ol>
<li>If we're at <code>A</code> and perform our useful action <code>Recv</code> then afterwards we'll be at <code>B</code>.
</li>
<li>When we're at <code>A</code> we can perform the useful action.
</li>
<li>If the useful action is continually possible then eventually it will happen.
</li>
<li>If we're at <code>A</code> and perform any action <code>Next</code>, we'll either stay at <code>A</code> or get to <code>B</code>.
</li>
<li>The system always performs <code>Next</code> steps.
</li>
</ol>
<p>Some examples of why these are necessary to prove <code>Liveness</code>:</p>
<ul>
<li>Without 2: if <code>Recv</code> required at least 2 bytes in the buffer, a single sent byte would never be received.
</li>
<li>Without 3: someone else has to perform <code>Recv</code> and we can't be sure they'll actually do it.
</li>
<li>Without 4: the sender can <code>Send</code> some data and then <code>Retract</code> it before it is read.
</li>
<li>Without 5: 4 wouldn't be useful.
</li>
</ul>
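<p>The five steps above line up with the WF1 rule from Lamport's paper "The Temporal Logic of Actions".
As a rough guide, here is that rule written out as a comment (it is a reading aid, not something TLAPS checks).
The rule's action is called <code>Act</code> here to avoid clashing with the state predicate <code>A</code> used in this post:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">(* WF1: P and Q are state predicates, N and Act are actions,
   and f is a state function.

     P /\ [N]_f                 =&gt; (P&#39; \/ Q&#39;)
     P /\ &lt;&lt;N /\ Act&gt;&gt;_f        =&gt; Q&#39;
     P                          =&gt; ENABLED &lt;&lt;Act&gt;&gt;_f
     -----------------------------------------------
     [][N]_f /\ WF_f(Act)       =&gt; (P ~&gt; Q)

   Reading it against the pattern above, with P = A, Q = B,
   N = Next, Act = Recv and f = vars:
     - the first premise is step 4;
     - the second premise follows from step 1;
     - the third premise is step 2;
     - the hypotheses [][N]_f and WF_f(Act) are steps 5 and 3. *)
</code></pre></td></tr></tbody></table></div></figure>
<p>Step 1 (<code>A /\ Recv =&gt; B'</code>) is slightly stronger than the rule's second premise, which only needs to consider <code>Recv</code> steps that are also <code>Next</code> steps and that change <code>vars</code>.</p>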
<p>(3) and (5) come directly from the definition of <code>Spec</code>.
We need to write proofs for the others, but they're all non-temporal proofs and fairly simple.
The only slightly tricky one is proving <code>ENABLED</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">LEMMA</span> <span class="n">EnabledRecv</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="n">I</span><span class="p">,</span>
</span><span class="line">         <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">,</span> <span class="o">~</span><span class="p">(</span><span class="n">Got</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">ENABLED</span> <span class="ni">&lt;&lt;</span><span class="n">Recv</span><span class="ni">&gt;&gt;</span><span class="n">_vars</span>
</span><span class="line"><span class="n">BY</span> <span class="n">AutoUSE</span><span class="p">,</span> <span class="n">ExpandENABLED</span> <span class="n">DEF</span> <span class="n">Recv</span><span class="p">,</span> <span class="n">I</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The magic <code>ExpandENABLED</code> causes the definition of <code>ENABLED</code> to be expanded in the version given to the solver.
This is only useful if the solver also has access to the definitions of all the actions used;
<code>AutoUSE</code> provides them automatically so you don't have to list them manually.</p>
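<p>To get a feel for what the solver sees after the expansion, here is a hand-written illustration.
The definitions of <code>vars</code>, <code>BufferUsed</code> and <code>Recv</code> below are assumptions made for this sketch (they may differ from those in example.tla),
and <code>EnabledRecvExpanded</code>, <code>Sent_next</code> and <code>Got_next</code> are made-up names:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">(* Illustration only: these definitions are assumptions,
   not necessarily the ones used in example.tla. *)
vars       == &lt;&lt;Sent, Got&gt;&gt;
BufferUsed == Sent - Got
Recv       == /\ BufferUsed &gt; 0
              /\ Got&#39; = Sent
              /\ UNCHANGED Sent

(* ENABLED &lt;&lt;Recv&gt;&gt;_vars asks whether some next state satisfies
   Recv /\ vars&#39; # vars. Expanding it replaces each primed variable
   with a fresh, existentially quantified one: *)
EnabledRecvExpanded ==
  \E Sent_next, Got_next :
    /\ BufferUsed &gt; 0
    /\ Got_next = Sent
    /\ Sent_next = Sent
    /\ &lt;&lt;Sent_next, Got_next&gt;&gt; # &lt;&lt;Sent, Got&gt;&gt;
</code></pre></td></tr></tbody></table></div></figure>
<p>Under these assumptions, and for integer <code>Sent</code> and <code>Got</code>, the expanded formula is true whenever <code>BufferUsed &gt; 0</code>:
choose <code>Got_next = Sent</code> and <code>Sent_next = Sent</code>, and the tuples differ because <code>Sent # Got</code>.</p>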
<p>However, the solver has to work quite hard to prove this and I found it times out on more realistic examples.
The following trick is more generally useful:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">LEMMA</span> <span class="n">EnabledRecv</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="n">I</span><span class="p">,</span>
</span><span class="line">         <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">,</span> <span class="o">~</span><span class="p">(</span><span class="n">Got</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">ENABLED</span> <span class="ni">&lt;&lt;</span><span class="n">Recv</span><span class="ni">&gt;&gt;</span><span class="n">_vars</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="ni">&lt;&lt;</span><span class="n">Recv</span><span class="ni">&gt;&gt;</span><span class="n">_vars</span> <span class="o">&lt;=</span><span class="ni">&gt;</span> <span class="n">Recv</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">vars</span><span class="p">,</span> <span class="n">Recv</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">ENABLED</span> <span class="ni">&lt;&lt;</span><span class="n">Recv</span><span class="ni">&gt;&gt;</span><span class="n">_vars</span> <span class="o">&lt;=</span><span class="ni">&gt;</span> <span class="n">ENABLED</span> <span class="n">Recv</span> <span class="n">BY</span> <span class="n">ENABLEDaxioms</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">ENABLED</span> <span class="n">Recv</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">BufferUsed</span> <span class="ni">&gt;</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">AutoUSE</span><span class="p">,</span> <span class="n">ExpandENABLED</span> <span class="n">DEF</span> <span class="n">Recv</span>
</span></code></pre></td></tr></tbody></table></div></figure><ul>
<li><code>Recv</code> always modifies <code>vars</code>, so <code>&lt;&lt;Recv&gt;&gt;_vars</code> is the same as <code>Recv</code>.
</li>
<li>Therefore, <code>ENABLED &lt;&lt;Recv&gt;&gt;_vars</code> is the same as <code>ENABLED Recv</code>.
</li>
<li>Therefore, we just need to prove <code>ENABLED Recv</code>, which is easier for the solver.
</li>
</ul>
<p>The above uses the magic <code>BY ENABLEDaxioms</code>.
This doesn't seem to be documented; I just found it in one of the examples.
It's really fussy about the syntax used.
For example, it works if you ask to prove it in both directions with <code>&lt;=&gt;</code>,
but if you ask for just <code>=&gt;</code> then it fails!
Also, it requires <code>&lt;&lt;Recv&gt;&gt;_vars &lt;=&gt; Recv</code> rather than <code>&lt;&lt;Recv&gt;&gt;_vars = Recv</code>,
though you can prove the first from the second.
The <a href="https://github.com/tlaplus/tlapm/blob/main/test/fast/enabled_cdot/ENABLEDaxioms_test.tla">ENABLEDaxioms_test.tla</a> test file is useful for getting the syntax right.</p>
<p>See <a href="https://github.com/talex5/spec-vchan/blob/blog-liveness-1/example.tla">example.tla</a> for the full proof.</p>
<h2 id="multi-step-liveness">Multi-step liveness</h2>
<p>The above proof was easy because the <code>Recv</code> step gets us to our goal immediately.
It's more complicated if we only say that each <code>Recv</code> step reads <em>some</em> of the buffer:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Recv</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\E</span> <span class="n">n</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">BufferUsed</span> <span class="p">:</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Got</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Got</span> <span class="o">+</span> <span class="n">n</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">Sent</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The trick is to define a distance metric:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="ni">==</span> <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="o">/\</span> <span class="n">n</span> <span class="o">-</span> <span class="n">Got</span> <span class="o">&lt;=</span> <span class="n">i</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Dist(n, i)</code> says we are within <code>i</code> bytes of having received the first <code>n</code> bytes.
Then a simple inductive argument will do:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">DEFINE</span> <span class="n">R</span><span class="p">(</span><span class="n">j</span><span class="p">)</span> <span class="ni">==</span> <span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">1</span> <span class="n">R</span><span class="p">(</span><span class="m">0</span><span class="p">)</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">2</span> <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">i</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="n">R</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="n">PROVE</span> <span class="n">R</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="m">1</span><span class="p">)</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="m">1</span><span class="p">)</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="n">BY</span> <span class="n">PTL</span><span class="p">,</span> <span class="n">Progress</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Dist</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span> <span class="n">BY</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">2</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="s">\A</span> <span class="n">i</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span> <span class="n">R</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">HIDE</span> <span class="n">DEF</span> <span class="n">R</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">NatInduction</span><span class="p">,</span> <span class="n">Isa</span><span class="p">,</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">1</span><span class="p">,</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, <code>R(i)</code> says that the algorithm works if we start at most <code>i</code> bytes away from our goal.
<code>R(0)</code> is obviously true, and we show that <code>R(i) =&gt; R(i+1)</code>.
By induction, <code>R</code> is true for any distance.
Notice again the use of <code>HIDE</code> to avoid confusing the solvers with temporal logic.</p>
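<p>The snippet stops at <code>\A i \in Nat : R(i)</code>.
One plausible way to get from there back to the <code>Liveness</code> property is sketched below.
It has not been checked with TLAPS, and it makes several assumptions: that the final step above is labelled <code>&lt;1&gt;3</code>,
that <code>n \in Nat</code> and <code>[]I</code> are in the context, and that <code>I</code> implies <code>Sent</code> and <code>Got</code> are natural numbers:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Unchecked sketch; step names and the contents of I are assumptions.
&lt;1&gt;4 R(n)                             \* instantiate the induction result at i = n
    &lt;2&gt; HIDE DEF R
    &lt;2&gt; QED BY &lt;1&gt;3
&lt;1&gt;5 I /\ Sent = n =&gt; Dist(n, n)      \* Got &gt;= 0, so n - Got &lt;= n
    BY DEF Dist, I
&lt;1&gt;6 I /\ Dist(n, 0) =&gt; Got &gt;= n      \* n - Got &lt;= 0 means Got &gt;= n
    BY DEF Dist, I
&lt;1&gt; QED BY &lt;1&gt;4, &lt;1&gt;5, &lt;1&gt;6, PTL      \* gives Sent = n ~&gt; Got &gt;= n
</code></pre></td></tr></tbody></table></div></figure>
<p>The idea is that having sent exactly <code>n</code> bytes puts us at most <code>n</code> bytes from the goal, and being 0 bytes away is the goal.</p>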
<p>Then we just need to show <code>Progress</code>: we eventually get closer to our goal (<code>Dist(n, i+1) ~&gt; Dist(n, i)</code>).
That can be proved from <code>Recv</code> in a similar way to before. See <a href="https://github.com/talex5/spec-vchan/blob/blog-liveness-2/example.tla">example.tla</a> for the full proof.</p>
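<p>As a rough idea of what "in a similar way to before" could look like, here is a hypothetical shape for such a lemma.
It is not the version from example.tla and has not been checked with TLAPS; the leaf proofs are deliberately omitted.
It assumes that <code>Next</code> is <code>Send \/ Recv</code>, that <code>Send</code> only increases <code>Sent</code>,
and that <code>I</code> says <code>Sent</code> and <code>Got</code> are natural numbers with <code>BufferUsed = Sent - Got</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Hypothetical sketch of the lemma&#39;s shape, not the author&#39;s proof.
LEMMA Progress ==
  ASSUME NEW n \in Nat, NEW i \in Nat,
         []I, [][Next]_vars, WF_vars(Recv)
  PROVE  Dist(n, i+1) ~&gt; Dist(n, i)
&lt;1&gt; DEFINE A == Dist(n, i+1) /\ ~Dist(n, i)   \* exactly i+1 bytes short
&lt;1&gt; DEFINE B == Dist(n, i)
&lt;1&gt;1 I /\ I&#39; /\ A /\ Recv =&gt; B&#39;               \* any Recv reads at least 1 byte
    PROOF OMITTED
&lt;1&gt;2 I /\ A =&gt; ENABLED &lt;&lt;Recv&gt;&gt;_vars          \* Got &lt; n &lt;= Sent, so the buffer is non-empty
    PROOF OMITTED
&lt;1&gt;3 I /\ I&#39; /\ A /\ [Next]_vars =&gt; (A \/ B)&#39; \* Send and stuttering keep A; Recv gives B
    PROOF OMITTED
&lt;1&gt;4 Dist(n, i+1) =&gt; A \/ B
    OBVIOUS
&lt;1&gt; QED BY &lt;1&gt;1, &lt;1&gt;2, &lt;1&gt;3, &lt;1&gt;4, PTL
</code></pre></td></tr></tbody></table></div></figure>
<p>Strengthening the starting state to <code>A</code> matters here: if we are already within <code>i</code> bytes of the goal then the buffer might be empty, and <code>Recv</code> need not be enabled.</p>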
<h2 id="the-real-protocol">The real protocol</h2>
<p>I've now updated <a href="https://github.com/talex5/spec-vchan/blob/master/vchan.tla">vchan.tla</a> to prove liveness for the real system.
It was more complicated than the example above (because it also looks at the data being sent,
has many steps, models sending interrupts, etc.), but the basic approach is the same.</p>
<p>Note: In the real system, <code>Sent</code> and <code>Got</code> are the actual messages, so the
property talks about their lengths. <code>Sent</code> is what the sending application has
asked the vchan library to transmit (even if it hasn't been written to the shared
buffer yet), and <code>Got</code> is what the receiving application has received from its
vchan library.</p>
<p>Ideally, we'd like to prove something like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>
</span><span class="line">    <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="ni">=</span> <span class="n">x</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>However, that doesn't work because the sender (and receiver) can decide to shut down the connection at any moment.
If the sender closes the connection before transmitting all the data then obviously it might not be received.
So it's actually defined like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* Any data that the write function reports has been sent successfully</span>
</span><span class="line"><span class="cm">   (i.e. data in Sent when it is back at &quot;sender_ready&quot;) will eventually</span>
</span><span class="line"><span class="cm">   be received (if the receiver doesn&#39;t close the connection).</span>
</span><span class="line"><span class="cm">   In particular, this says that it&#39;s OK for the sender to close its</span>
</span><span class="line"><span class="cm">   end immediately after sending some data.</span>
</span><span class="line"><span class="cm">   Note that this is not currently true of the C implementation. *)</span>
</span><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>
</span><span class="line">    <span class="n">x</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span>
</span><span class="line">      <span class="o">~</span><span class="ni">&gt;</span> <span class="o">\/</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span>
</span><span class="line">         <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: <code>SenderLive</code> and <code>ReceiverLive</code> are badly named.
<code>SenderLive = FALSE</code> means that the sender has written all data successfully and there is nothing more to come.
<code>ReceiverLive = FALSE</code> means that the receiver has decided to terminate the connection early,
and probably indicates an error (think <code>EPIPE</code>).</p>
<p>Although TLAPS couldn't prove liveness properties back in 2018, I had got most of the way there:</p>
<p>I'd defined <code>ReadLimit</code> as:</p>
<blockquote>
<p>The number of bytes that the receiver will eventually get without further action
from the sender (assuming the receiver doesn't decide to close the connection).</p>
</blockquote>
<p>and likewise <code>WriteLimit</code> as:</p>
<blockquote>
<p>The number of bytes that the sender will eventually send without further
action from the other processes or the client application, assuming the
connection isn't closed by either end.</p>
</blockquote>
<p>I'd proved some useful invariants about them. For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* Whenever the sender is blocked or idle, the receiver can read everything in</span>
</span><span class="line"><span class="cm">   the buffer without further action from any other process. *)</span>
</span><span class="line"><span class="n">THEOREM</span> <span class="n">ReadAllIfSenderBlocked</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">SenderLive</span><span class="p">,</span> <span class="n">ReceiverLive</span><span class="p">,</span>
</span><span class="line">         <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span>
</span><span class="line">         <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_blocked&quot;</span> <span class="o">/\</span> <span class="o">~</span><span class="n">SpaceAvailableInt</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">ReadLimit</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="cm">(* Whenever the receiver is blocked, the sender will fill the buffer (or write everything</span>
</span><span class="line"><span class="cm">   it wants to write) without further action from any other process. *)</span>
</span><span class="line"><span class="n">THEOREM</span> <span class="n">WriteAllIfReceiverBlocked</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">SenderLive</span><span class="p">,</span> <span class="n">ReceiverLive</span><span class="p">,</span>
</span><span class="line">         <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_await_data&quot;</span><span class="p">,</span> <span class="o">~</span><span class="n">DataReadyInt</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">WriteLimit</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">Min</span><span class="p">(</span><span class="n">BufferSize</span><span class="p">,</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="o">+</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">))</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If <code>ReadLimit</code> and <code>WriteLimit</code> accurately predict what will happen then these properties
are good evidence that the protocol works:</p>
<ul>
<li><code>WriteLimit</code> predicts that the sender will be able to fill the buffer initially.
</li>
<li>If the receiver doesn't read it then the buffer will become full and the sender will block,
at which point <code>ReadAllIfSenderBlocked</code> predicts all the data will be read.
</li>
<li>If the receiver runs out of data then it will eventually block,
at which point <code>WriteAllIfReceiverBlocked</code> predicts more data can be written.
</li>
</ul>
<p>I'd used TLC to check that both predictions were correct (on small models),
but I hadn't been able to prove them with the old version of TLAPS.</p>
<h3 id="proving-readlimit">Proving ReadLimit</h3>
<p>The actual proof followed the same outline as the example above.
The main extra work was showing that, if the receiver hasn't yet read up to <code>ReadLimit</code>,
then it will eventually reach the code that does the read,
believing that at least one byte is available.
Mixing <code>BY PTL</code> and regular solvers was particularly annoying here and required stating
lots of obvious facts like <code>PC \in {&quot;recv_got_len&quot;} =&gt; PC = &quot;recv_got_len&quot;</code> because <code>PTL</code>
doesn't understand sets, etc.</p>
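<p>A minimal sketch of the pattern, assuming the receiver's program counter is <code>pc[ReceiverReadID]</code> as in the theorems above (the lemma itself is illustrative, not taken from the real proof).
A regular solver proves the set fact as a plain state-level step, and <code>PTL</code> then uses that step as an opaque propositional fact:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla"><span class="line">LEMMA [](pc[ReceiverReadID] \in {&quot;recv_got_len&quot;})
</span><span class="line">        =&gt; [](pc[ReceiverReadID] = &quot;recv_got_len&quot;)
</span><span class="line">&lt;1&gt;1. pc[ReceiverReadID] \in {&quot;recv_got_len&quot;}
</span><span class="line">        =&gt; pc[ReceiverReadID] = &quot;recv_got_len&quot;
</span><span class="line">      OBVIOUS              \* set reasoning, left to the regular solvers
</span><span class="line">&lt;1&gt; QED BY &lt;1&gt;1, PTL       \* PTL only sees &lt;1&gt;1 as &quot;A =&gt; B&quot;
</span></code></pre></td></tr></tbody></table></div></figure>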
<p>It seems a bit strange to have to prove that each line of code leads to the next.
That's something you can mostly just assume when writing software, but in TLA you need to be explicit.
The only interesting bits were the usual things about showing code terminates:</p>
<ul>
<li>For loops, show that they will make progress each iteration.
</li>
<li>Show the code doesn't crash (e.g. no buffer overflow).
</li>
<li>Show that any explicit <code>await</code> will eventually resume.
</li>
</ul>
<p>To help with the proofs I defined <code>RSpec</code> as a variant of <code>Spec</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">RSpec</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[]</span><span class="n">I</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[][</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[]</span><span class="n">ReceiverLive</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">WF_vars</span><span class="p">(</span><span class="n">ReceiverRead</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Unlike <code>Spec</code>, which uses <code>Init</code> and so only describes behaviours that begin in the initial state,
this only requires our invariant to hold throughout (<code>[]I</code>).
Things proved using <code>RSpec</code> are therefore true from any point in the protocol,
which lets them be reused in other proofs.</p>
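<p>The reason this works is that every conjunct of <code>RSpec</code> is either a <code>[]</code> formula or a weak-fairness condition,
so if <code>RSpec</code> holds at some point then it holds at every later point too.
Stated as a lemma (a sketch only: the name <code>RSpecStable</code> is made up for illustration and the proof is left out):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla"><span class="line">\* Hypothetical lemma: RSpec is preserved along a behaviour.
</span><span class="line">LEMMA RSpecStable == RSpec =&gt; []RSpec
</span><span class="line">  OMITTED
</span></code></pre></td></tr></tbody></table></div></figure>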
<p>I was able to show that <code>ReadLimit</code>'s prediction will come true:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">ReadLimitEventually_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="ni">==</span> <span class="n">RSpec</span> <span class="o">/\</span> <span class="n">ReadLimit</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">n</span>
</span><span class="line"><span class="n">LEMMA</span> <span class="n">ReadLimitEventually</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">ReadLimitEventually_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>and from that, rather magically:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* When the sender is waiting, the receiver will get everything sent. *)</span>
</span><span class="line"><span class="n">ReaderLiveness_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="ni">==</span>
</span><span class="line">     <span class="o">/\</span> <span class="n">RSpec</span>
</span><span class="line">     <span class="o">/\</span> <span class="n">SenderLive</span>
</span><span class="line">     <span class="o">/\</span> <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_blocked&quot;</span>
</span><span class="line">     <span class="o">/\</span> <span class="n">BytesTransmitted</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="c">\* If we&#39;ve sent n bytes in total...</span>
</span><span class="line">     <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">n</span>         <span class="c">\* then at least n bytes will eventually be received.</span>
</span><span class="line"><span class="n">LEMMA</span> <span class="n">ReaderLiveness</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span>
</span><span class="line">  <span class="n">PROVE</span> <span class="n">ReaderLiveness_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That eliminates all mention of <code>ReadLimit</code>, and expresses the property in terms meaningful to the sender.</p>
<h3 id="proving-availability">Proving Availability</h3>
<p>I was expecting to have to prove <code>WriteLimit</code> was correct too.
But it turned out I didn't.
<code>Availability</code> only said that the receiver would receive data that had been transmitted;
it never claimed that the sending process would succeed in doing that!
It's right there in the comment even:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* Any data that the write function reports has been sent successfully ... *)</span>
</span><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>
</span><span class="line">    <span class="n">x</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span>
</span><span class="line">      <span class="o">~</span><span class="ni">&gt;</span> <span class="o">\/</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span>
</span><span class="line">         <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Well, this is pretty bad. The point of proofs is that instead of having to check hundreds of lines of C,
or dozens of lines of TLA actions, we only have to check that the definition of <code>Availability</code> is what we want.
My definition of <code>Availability</code> is still true if the sender blocks forever waiting for more space in the buffer.
In fact, if I change the definition of the sender process so that
the application asking for data to be sent is the only possible action,
then TLC reports, correctly, that <code>Availability</code> still holds.
The original blog post appeared on Reddit, Hacker News and Lobsters,
and nobody else seems to have spotted this either.</p>
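<p>To see how the original property can hold vacuously, here is a toy module.
It is not the vchan spec: <code>StuckSender</code>, <code>AppWrite</code> and the <code>sender_write</code> label are made up for this sketch, and the liveness flags are left out.
The application can hand over one byte, after which the sender never returns to <code>sender_ready</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla"><span class="line">---- MODULE StuckSender ----
</span><span class="line">EXTENDS Naturals, Sequences
</span><span class="line">
</span><span class="line">VARIABLES Sent, Got, pc
</span><span class="line">vars == &lt;&lt;Sent, Got, pc&gt;&gt;
</span><span class="line">
</span><span class="line">Init == Sent = &lt;&lt; &gt;&gt; /\ Got = &lt;&lt; &gt;&gt; /\ pc = &quot;sender_ready&quot;
</span><span class="line">
</span><span class="line">\* The application asks for one byte to be sent. This is the only action,
</span><span class="line">\* so nothing ever moves the data on or returns pc to &quot;sender_ready&quot;.
</span><span class="line">AppWrite ==
</span><span class="line">  /\ pc = &quot;sender_ready&quot;
</span><span class="line">  /\ Sent&#39; = Append(Sent, 0)
</span><span class="line">  /\ pc&#39; = &quot;sender_write&quot;
</span><span class="line">  /\ UNCHANGED Got
</span><span class="line">
</span><span class="line">Spec == Init /\ [][AppWrite]_vars
</span><span class="line">
</span><span class="line">\* Same shape as the original Availability, minus the liveness flags.
</span><span class="line">Availability ==
</span><span class="line">  \A x \in Nat :
</span><span class="line">    x = Len(Sent) /\ pc = &quot;sender_ready&quot; ~&gt; Len(Got) &gt;= x
</span><span class="line">====
</span></code></pre></td></tr></tbody></table></div></figure>
<p>The only states with <code>pc = &quot;sender_ready&quot;</code> have <code>Len(Sent) = 0</code>, so the left-hand side can only hold for <code>x = 0</code>,
and <code>Len(Got) &gt;= 0</code> is trivially true.
The property holds in every behaviour of this <code>Spec</code> even though nothing is ever transmitted.</p>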
<p>Though in fairness to my past self, I did mention checking an earlier version of <code>Availability</code>
that did require that it works end-to-end (but didn't make any claims about what happens on shutdown),
and I had also tested <code>WriteLimit</code>, so I'm not really worried that the system doesn't work.</p>
<p>And so, having learnt how to do liveness proofs, which was my real goal, I decided to stop here.</p>
<h3 id="proving-writelimit">Proving WriteLimit</h3>
<p>But proof assistants are addictive, and I found myself continuing anyway.
The proof of the correctness of <code>WriteLimit</code> was very similar to the one for <code>ReadLimit</code>, and I got it done much faster
now that I knew more about how to do these proofs.</p>
<p>However, I also wanted to fix <code>Availability</code> to something useful and updated it to this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* We can&#39;t ensure Availability if the sender shuts down</span>
</span><span class="line"><span class="cm">   before it has finished writing the message to the buffer,</span>
</span><span class="line"><span class="cm">   or if it tries to send another message after shutting down. *)</span>
</span><span class="line"><span class="n">CleanShutdownOnly</span> <span class="ni">==</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">SenderLive</span>
</span><span class="line">
</span><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[]</span><span class="n">CleanShutdownOnly</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[]</span><span class="n">ReceiverLive</span>
</span><span class="line">  <span class="ni">=&gt;</span> <span class="s">\A</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>
</span><span class="line">        <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="ni">=</span> <span class="n">n</span> <span class="o">~</span><span class="ni">&gt;</span>
</span><span class="line">          <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">n</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>(<code>msg</code> is the data yet to be transmitted)</p>
<p>That required strengthening some of the existing proofs to talk about clean shutdowns,
not just the connection never being closed, but it was fairly straightforward.</p>
<h3 id="end-to-end-liveness">End-to-end liveness</h3>
<p>I used the <code>ReadLimit</code> and <code>WriteLimit</code> lemmas to prove some things about the system as a whole, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* If WriteLimit says we&#39;ll transmit some data, it will also be received:</span>
</span><span class="line"><span class="cm">   1. If WriteLimit says we&#39;ll transmit some data, then we will.</span>
</span><span class="line"><span class="cm">   2. Eventually the sender will be at a safe point after that.</span>
</span><span class="line"><span class="cm">   3. Once the sender is at a safe point, the receiver will get all the data. *)</span>
</span><span class="line"><span class="n">EndToEndLive_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="ni">==</span> <span class="n">WriteLimit</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">n</span>
</span><span class="line"><span class="n">LEMMA</span> <span class="n">EndToEndLive</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="p">[]</span><span class="n">WSpec</span><span class="p">,</span> <span class="p">[]</span><span class="n">RSpec</span><span class="p">,</span> <span class="p">[]</span><span class="n">CleanShutdownI</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">EndToEndLive_prop</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="cm">(* If we&#39;ve received n bytes but wanted to send more, then eventually</span>
</span><span class="line"><span class="cm">   there will be space to send more. *)</span>
</span><span class="line"><span class="n">LEMMA</span> <span class="n">EventuallySpace</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span>
</span><span class="line">         <span class="p">[]</span><span class="n">WSpec</span><span class="p">,</span> <span class="p">[]</span><span class="n">RSpec</span><span class="p">,</span> <span class="p">[]</span><span class="n">CleanShutdownI</span><span class="p">,</span>
</span><span class="line">         <span class="p">[](</span><span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">),</span> <span class="p">[](</span><span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="ni">&gt;</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="ni">&lt;&gt;</span><span class="p">(</span><span class="n">WriteLimit</span> <span class="ni">&gt;</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="cm">(* If the application wants to send n bytes then eventually WriteLimit will</span>
</span><span class="line"><span class="cm">   predict that we will send them. *)</span>
</span><span class="line"><span class="n">LEMMA</span> <span class="n">SufficientSpace</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">n</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="p">[]</span><span class="n">WSpec</span><span class="p">,</span> <span class="p">[]</span><span class="n">RSpec</span><span class="p">,</span> <span class="p">[]</span><span class="n">CleanShutdownI</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">n</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">WriteLimit</span> <span class="o">&gt;=</span> <span class="n">n</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>and was then finally able to prove my improved version of <code>Availability</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">Spec</span> <span class="ni">=&gt;</span> <span class="n">Availability</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>See <a href="https://github.com/talex5/spec-vchan/blob/master/vchan.tla">vchan.tla</a> for the final version with the proofs.</p>
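<p>The core of the argument is that <code>~&gt;</code> is transitive, so the end-to-end lemmas chain together.
A sketch of that step, not copied from <code>vchan.tla</code> (the lemma name <code>SentIsReceived</code> is made up, and the real proof also has to establish the assumptions from <code>Spec</code> and the two <code>[]</code> hypotheses of <code>Availability</code>):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla"><span class="line">\* Hypothetical lemma combining SufficientSpace and EndToEndLive.
</span><span class="line">LEMMA SentIsReceived ==
</span><span class="line">  ASSUME NEW n \in Nat, []WSpec, []RSpec, []CleanShutdownI
</span><span class="line">  PROVE  Len(Sent) &gt;= n ~&gt; Len(Got) &gt;= n
</span><span class="line">&lt;1&gt;1. Len(Sent) &gt;= n ~&gt; WriteLimit &gt;= n
</span><span class="line">      BY SufficientSpace
</span><span class="line">&lt;1&gt;2. WriteLimit &gt;= n ~&gt; Len(Got) &gt;= n
</span><span class="line">      BY EndToEndLive DEF EndToEndLive_prop
</span><span class="line">&lt;1&gt; QED BY &lt;1&gt;1, &lt;1&gt;2, PTL   \* transitivity of ~&gt;
</span></code></pre></td></tr></tbody></table></div></figure>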
<h2 id="work-arounds-for-bugs">Work-arounds for bugs</h2>
<p>The biggest problem I had was the bug when mixing forall and temporal formulas,
until I found <a href="#hiding-definitions">the work-around above</a>.
Here are some other problems I hit and their solutions:</p>
<h3 id="suffices-doesnt-always-generalise">SUFFICES doesn't always generalise</h3>
<p>This doesn't work:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Foo</span> <span class="ni">==</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line"><span class="n">THEOREM</span> <span class="n">Foo</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">ASSUME</span> <span class="n">Foo</span> <span class="n">PROVE</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line">    <span class="n">BY</span> <span class="n">PTL</span> <span class="n">DEF</span> <span class="n">Foo</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">Foo</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span> <span class="n">DEF</span> <span class="n">Foo</span>  <span class="c">\* Fails!</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Even though the <code>SUFFICES</code> used <code>BY DEF Foo</code>, <code>Foo</code> still counts as a time-specific assumption.
A work-around is to <code>ASSUME []Foo</code> and then assert <code>Foo</code> in a separate step:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Foo</span> <span class="ni">==</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line"><span class="n">THEOREM</span> <span class="n">Foo</span> <span class="ni">=&gt;</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">ASSUME</span> <span class="p">[]</span><span class="n">Foo</span> <span class="n">PROVE</span> <span class="p">[](</span><span class="n">Sent</span> <span class="o">&gt;=</span> <span class="m">0</span><span class="p">)</span>
</span><span class="line">    <span class="n">BY</span> <span class="n">PTL</span> <span class="n">DEF</span> <span class="n">Foo</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Foo</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span> <span class="ni">=&gt;</span> <span class="n">Sent</span> <span class="o">&gt;=</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">Foo</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span> <span class="n">DEF</span> <span class="n">Foo</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="case-with-a-temporal-goal">CASE with a temporal goal</h3>
<p>This doesn't work:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="o">\/</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line">        <span class="n">PROVE</span>  <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="k k-Conditional">CASE</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="k k-Conditional">CASE</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It fails to parse, with <code>Non-constant CASE for temporal goal</code>.
However, if you replace <code>CASE</code> with the expanded form then it works:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="o">\/</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line">        <span class="n">PROVE</span>  <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="n">PROVE</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="n">PROVE</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span> <span class="n">BY</span> <span class="n">PTL</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="ptl-and-primes">PTL and primes</h3>
<p>This doesn't work:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="n">PROVE</span> <span class="n">Sent</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="m">0</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A work-around:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="n">PROVE</span> <span class="n">Sent</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="m">0</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="p">(</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span><span class="err">&#39;</span> <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">PTL</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="syntax-matters">Syntax matters</h3>
<p>Things you might expect to be the same after parsing are treated differently.
This fails:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="p">[](</span><span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span>
</span><span class="line">            <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line"><span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>But this passes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="p">[](</span><span class="n">Sent</span> <span class="ni">=</span> <span class="m">0</span>
</span><span class="line">            <span class="o">\/</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="m">4</span><span class="p">)</span>
</span><span class="line"><span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="other-bugs">Other bugs</h3>
<ul>
<li>
<p>If keyboard short-cuts stop working, move the focus out of the window and back in again.
Sometimes some keys stop working (e.g. Delete) while others continue (e.g. Backspace).</p>
</li>
<li>
<p>If the toolbox GUI fails to launch the prover due to <code>NullPointerException</code>,
try closing the specification and reopening it.</p>
</li>
<li>
<p>The solver doesn't actually tell you if everything passed; you have to look in the scrollbar for little red regions.
If a theorem is folded in the editor, the top-level bit will still show as red if any sub-step fails, but it will
NOT appear in the scrollbar!
Either expand everything and check by eye, or run <code>tlapm</code> from the command-line as a final check.
Note that the bright green <code>Spec Status</code> indicator in the status bar only means that <em>parsing</em> the spec succeeded,
not that the proof-checking passed.</p>
</li>
</ul>
<h2 id="conclusions">Conclusions</h2>
<p>Checking the specification with TLC was quick and easy and immediately found the shutdown bug.
Actually proving the specification correct using TLAPS was much more difficult,
took way longer, and didn't uncover any additional bugs.
It did reveal a mistake in my definition of <code>Availability</code>, but there are surely quicker ways of finding that.
So, in terms of value for time spent, I guess it's only worth it for very high-value code
(or if, like me, you just like proving stuff for fun).</p>
<p>However, I think it's good to know how to verify things, even if you don't usually do it.
It has certainly made me think more carefully about exactly what you need to check to verify liveness.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Linux mode setting, from the comfort of OCaml</title>
    <link href="https://roscidus.com/blog/blog/2025/11/16/libdrm-ocaml/"></link>
    <updated>2025-11-16T09:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2025/11/16/libdrm-ocaml</id>
    <content type="html"><![CDATA[<p>Linux provides the KMS (Kernel Mode Setting) API to let applications query and configure display settings.
It's used by Wayland compositors and other programs that need to configure the hardware directly.
I found the C API a little verbose and hard to follow so I made <a href="https://github.com/talex5/libdrm-ocaml">libdrm-ocaml</a>,
which lets us run commands interactively in a REPL.</p>
<!-- more -->
<p>We'll start by discovering what hardware is available and how it's currently configured,
then configure a monitor to display a simple bitmap, and then finally render a 3D animation.
The post should be a useful introduction to KMS even if you don't know OCaml.</p>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=45947822">Hacker News</a>.)</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#running-it-yourself">Running it yourself</a>
</li>
<li><a href="#querying-the-current-state">Querying the current state</a>
<ul>
<li><a href="#finding-devices">Finding devices</a>
</li>
<li><a href="#listing-resources">Listing resources</a>
</li>
<li><a href="#connectors">Connectors</a>
</li>
<li><a href="#modes">Modes</a>
</li>
<li><a href="#properties">Properties</a>
</li>
<li><a href="#encoders">Encoders</a>
</li>
<li><a href="#crt-controllers">CRT Controllers</a>
</li>
<li><a href="#framebuffers">Framebuffers</a>
</li>
<li><a href="#crtc-planes">CRTC planes</a>
</li>
<li><a href="#expanded-resources-diagram">Expanded resources diagram</a>
</li>
</ul>
</li>
<li><a href="#making-changes">Making changes</a>
<ul>
<li><a href="#non-atomic-mode-setting">Non-atomic mode setting</a>
</li>
<li><a href="#dumb-buffers">Dumb buffers</a>
</li>
<li><a href="#atomic-mode-setting">Atomic mode setting</a>
</li>
</ul>
</li>
<li><a href="#d-rendering">3D rendering</a>
</li>
<li><a href="#linux-vts">Linux VTs</a>
</li>
<li><a href="#debugging">Debugging</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="running-it-yourself">Running it yourself</h2>
<p>If you want to follow along, you'll need to install <a href="https://github.com/talex5/libdrm-ocaml">libdrm-ocaml</a> and an interactive REPL like <a href="https://github.com/ocaml-community/utop">utop</a>.
With Nix, you can set everything up like this:</p>
<pre><code>git clone https://github.com/talex5/libdrm-ocaml
cd libdrm-ocaml
nix develop
dune utop
</code></pre>
<p>You should see a <code>utop #</code> prompt, where you can enter OCaml expressions.
Use <code>;;</code> to tell the REPL you've finished typing and it's time to evaluate, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="mi">1</span><span class="o">+</span><span class="mi">1</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="kt">int</span> <span class="o">=</span> <span class="mi">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Alternatively, you can install things using <a href="https://opam.ocaml.org/">opam</a> (OCaml's package manager):</p>
<pre><code>opam install libdrm utop
utop
</code></pre>
<p>Then, at the utop prompt enter <code>#require &quot;libdrm&quot;;;</code> (including the leading <code>#</code>).</p>
<h2 id="querying-the-current-state">Querying the current state</h2>
<p>Before changing anything, we'll start by discovering what hardware is available.</p>
<p>I'll introduce the API as we go along, but you can check the <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/index.html">API reference docs</a>
if you want more information.</p>
<h3 id="finding-devices">Finding devices</h3>
<p>To list available graphics devices:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="n">list</span> <span class="bp">()</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="nn">Info</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line"><span class="o">[{</span><span class="n">primary_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/card0&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="n">render_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/renderD128&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="n">info</span> <span class="o">=</span> <span class="nc">PCI</span> <span class="o">{</span><span class="n">bus</span> <span class="o">=</span> <span class="o">{</span><span class="n">domain</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">bus</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span> <span class="n">dev</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">func</span> <span class="o">=</span> <span class="mi">0</span><span class="o">};</span>
</span><span class="line">              <span class="n">dev</span> <span class="o">=</span> <span class="o">{</span><span class="n">vendor_id</span> <span class="o">=</span> <span class="mh">0x1002</span><span class="o">;</span>
</span><span class="line">                     <span class="n">device_id</span> <span class="o">=</span> <span class="mh">0x67ff</span><span class="o">;</span>
</span><span class="line">                     <span class="n">subvendor_id</span> <span class="o">=</span> <span class="mh">0x1458</span><span class="o">;</span>
</span><span class="line">                     <span class="n">subdevice_id</span> <span class="o">=</span> <span class="mh">0x230b</span><span class="o">;</span>
</span><span class="line">                     <span class="n">revision_id</span> <span class="o">=</span> <span class="mh">0xff</span><span class="o">}}}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>libdrm scans the <code>/dev/dri/</code> directory looking for devices.
It uses <code>stat</code> to find the device major and minor numbers and uses the virtual <code>/sys</code> filesystem to get information about each one.
This is a PCI device, and the information corresponds to the values from <code>lspci</code>, e.g.</p>
<pre><code>$ lspci -nns 0:1:0.0
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI]
  Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev ff)
</code></pre>
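<p>As a rough sketch of the <code>stat</code> step (this is not libdrm's actual code), the device number can be split into major and minor parts and turned into a <code>/sys</code> path.
The bit layout below is the one glibc uses on Linux, <code>sysfs_path</code> is a hypothetical helper name, and the code assumes a 64-bit OCaml with the <code>unix</code> library loaded:</p>
<pre><code>(* Split a Linux device number into its major and minor parts,
   following the layout used by glibc's gnu_dev_major / gnu_dev_minor. *)
let major rdev = ((rdev lsr 32) land 0xfffff000) lor ((rdev lsr 8) land 0xfff)
let minor rdev = ((rdev lsr 12) land 0xffffff00) lor (rdev land 0xff)

(* Hypothetical helper (not part of libdrm-ocaml): the sysfs directory
   describing the character device at [node]. *)
let sysfs_path node =
  let st = Unix.stat node in
  Printf.sprintf &quot;/sys/dev/char/%d:%d&quot; (major st.Unix.st_rdev) (minor st.Unix.st_rdev)
</code></pre>
<p>DRM devices normally use major number 226, so <code>sysfs_path &quot;/dev/dri/card0&quot;</code> should give something like <code>/sys/dev/char/226:0</code>.</p>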
<p>Each graphics device can have a <em>primary</em> and a <em>render</em> node.
The primary node gives full access to the device, including configuring monitors,
while the render node just allows applications to render scenes to memory.
In the <a href="https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan/">last post</a> I was using the render node to create a 3D image,
and then sending it to the Wayland compositor for display.
This time we'll be doing the display ourselves, so we need to open the primary node:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">dev</span> <span class="o">=</span> <span class="nn">Unix</span><span class="p">.</span><span class="n">openfile</span> <span class="s2">&quot;/dev/dri/card0&quot;</span> <span class="o">[</span><span class="nc">O_CLOEXEC</span><span class="o">;</span> <span class="nc">O_RDWR</span><span class="o">]</span> <span class="mi">0</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">dev</span> <span class="o">:</span> <span class="nn">Unix</span><span class="p">.</span><span class="n">file_descr</span> <span class="o">=</span> <span class="o">&lt;</span><span class="n">abstr</span><span class="o">&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>To check the driver version:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="nn">Version</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="nn">Version</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="n">version</span> <span class="o">=</span> <span class="mi">3</span><span class="o">.</span><span class="mi">61</span><span class="o">.</span><span class="mi">0</span><span class="o">;</span> <span class="n">name</span> <span class="o">=</span> <span class="s2">&quot;amdgpu&quot;</span><span class="o">;</span> <span class="n">date</span> <span class="o">=</span> <span class="s2">&quot;0&quot;</span><span class="o">;</span> <span class="n">desc</span> <span class="o">=</span> <span class="s2">&quot;AMD GPU&quot;</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If you're familiar with the C API, this corresponds to the <code>drmGetVersion</code> function,
and <code>Drm.Device.list</code> corresponds to <code>drmGetDevices2</code>;
I reorganised things a bit to make better use of OCaml's modules.</p>
<h3 id="listing-resources">Listing resources</h3>
<p>Let's see what resources we've got to play with:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">resources</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Resources</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">resources</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Resources</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">fbs</span> <span class="o">=</span> <span class="bp">[]</span><span class="o">;</span>
</span><span class="line">   <span class="n">crtcs</span> <span class="o">=</span> <span class="o">[</span><span class="mi">57</span><span class="o">;</span> <span class="mi">60</span><span class="o">;</span> <span class="mi">63</span><span class="o">;</span> <span class="mi">66</span><span class="o">;</span> <span class="mi">69</span><span class="o">];</span>
</span><span class="line">   <span class="n">connectors</span> <span class="o">=</span> <span class="o">[</span><span class="mi">71</span><span class="o">;</span> <span class="mi">78</span><span class="o">;</span> <span class="mi">84</span><span class="o">];</span>
</span><span class="line">   <span class="n">encoders</span> <span class="o">=</span> <span class="o">[</span><span class="mi">70</span><span class="o">;</span> <span class="mi">76</span><span class="o">;</span> <span class="mi">83</span><span class="o">;</span> <span class="mi">86</span><span class="o">;</span> <span class="mi">87</span><span class="o">;</span> <span class="mi">88</span><span class="o">;</span> <span class="mi">89</span><span class="o">;</span> <span class="mi">90</span><span class="o">];</span>
</span><span class="line">   <span class="n">min_width</span><span class="o">,</span><span class="n">max_width</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">16384</span><span class="o">;</span>
</span><span class="line">   <span class="n">min_height</span><span class="o">,</span><span class="n">max_height</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">16384</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: The Kernel Mode Setting functions are in the <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/Kms/index.html">Drm.Kms</a> module.
The C API calls these functions <code>drmMode*</code>, but I found that confusing as
e.g. <code>drmModeGetResources</code> sounds like you're asking for the resources of a mode.</p>
<p>A <em>CRTC</em> is a CRT Controller, and typically controls a single monitor
(known as a <a href="https://en.wikipedia.org/wiki/Cathode_ray_tube">Cathode Ray Tube</a> for historical reasons).
<em>Framebuffers</em> provide image data to a CRTC (we create framebuffers as needed).
<em>Connectors</em> correspond to physical connectors (e.g. where you plug in a monitor cable).
An <em>Encoder</em> encodes data from the CRTC for a particular connector.</p>
<p><a href="/blog/images/libdrm/arch-simple.svg"><span class="caption-wrapper center"><img src="/blog/images/libdrm/arch-simple.svg" title="Resources diagram (simplified)" class="caption"/><span class="caption-text">Resources diagram (simplified)</span></span></a></p>
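<p>To make the relationships concrete, here is a toy model of how the resources link together.
The record types and the <code>fb_for_connector</code> function are hypothetical stand-ins for illustration, not the types libdrm-ocaml uses, and it needs OCaml 4.08 or later for <code>let*</code>:</p>
<pre><code>(* Toy model: hypothetical stand-in types, not libdrm-ocaml's own. *)
type connector = { conn_id : int; conn_encoder : int option }
type encoder = { enc_id : int; enc_crtc : int option }
type crtc = { crtc_id : int; crtc_fb : int option }

(* Follow connector -&gt; encoder -&gt; CRTC -&gt; framebuffer,
   giving None as soon as any link is missing. *)
let fb_for_connector ~encoders ~crtcs conn =
  let ( let* ) = Option.bind in
  let* eid = conn.conn_encoder in
  let* enc = List.find_opt (fun e -&gt; e.enc_id = eid) encoders in
  let* cid = enc.enc_crtc in
  let* crtc = List.find_opt (fun c -&gt; c.crtc_id = cid) crtcs in
  crtc.crtc_fb
</code></pre>
<p>Each link is optional because, for example, a connector with nothing plugged in need not have an encoder assigned to it.</p>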
<h3 id="connectors">Connectors</h3>
<p>To save a bit of typing, I'll create an alias for the <code>Drm.Kms</code> module:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">module</span> <span class="nc">K</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nc">Kms</span><span class="o">;;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You could also <code>open Drm.Kms</code> to avoid needing any prefix, but I'll keep using <code>K</code> for clarity.</p>
<p>To get details for the first connector (the head of the list):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">resources</span><span class="o">.</span><span class="n">connectors</span><span class="o">);;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span>
</span><span class="line"> <span class="n">connector_type</span> <span class="o">=</span> <span class="nc">DisplayPort</span><span class="o">;</span>
</span><span class="line"> <span class="n">connector_type_id</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span>
</span><span class="line"> <span class="n">connection</span> <span class="o">=</span> <span class="nc">Connected</span><span class="o">;</span>
</span><span class="line"> <span class="n">mm_width</span><span class="o">,</span><span class="n">mm_height</span> <span class="o">=</span> <span class="mi">700</span><span class="o">,</span><span class="mi">390</span><span class="o">;</span>
</span><span class="line"> <span class="n">subpixel</span> <span class="o">=</span> <span class="nc">Unknown</span><span class="o">;</span>
</span><span class="line"> <span class="n">modes</span> <span class="o">=</span> <span class="o">[</span><span class="mi">3840</span><span class="n">x2160</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">30</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">29</span><span class="o">.</span><span class="mi">97</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="mi">2560</span><span class="n">x1440</span> <span class="mi">59</span><span class="o">.</span><span class="mi">95</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="o">...];</span>
</span><span class="line"> <span class="n">props</span> <span class="o">=</span> <span class="o">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">77</span><span class="o">;</span> <span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">5</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">6</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">4</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">34</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">35</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">36</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">37</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">72</span><span class="o">:</span><span class="mi">8</span><span class="o">;</span> <span class="mi">73</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> 
</span><span class="line">          <span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">74</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">75</span><span class="o">:</span><span class="mi">15</span><span class="o">];</span>
</span><span class="line"> <span class="n">encoder_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">70</span><span class="o">;</span>
</span><span class="line"> <span class="n">encoders</span> <span class="o">=</span> <span class="o">[</span><span class="mi">70</span><span class="o">]}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This is DisplayPort connector 1 (usually called <code>DP-1</code>) and it's currently <code>Connected</code>.
The connector also says which modes are available on the connected monitor.</p>
<p>I was lucky in that the first connector was the one I'm using,
but really we should get all the connectors and filter them to find the connected ones.
<code>List.map</code> can be used to run <code>get</code> on each of them:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">connectors</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">)</span> <span class="n">resources</span><span class="o">.</span><span class="n">connectors</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">connectors</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span> <span class="o">...};</span>
</span><span class="line">   <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">78</span><span class="o">;</span> <span class="c">(* HDMI-A-1 *)</span> <span class="o">...};</span>
</span><span class="line">   <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">84</span><span class="o">;</span> <span class="c">(* DVI-D-1 *)</span> <span class="o">...}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then to filter:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">is_connected</span> <span class="o">(</span><span class="n">c</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">=</span> <span class="o">(</span><span class="n">c</span><span class="o">.</span><span class="n">connection</span> <span class="o">=</span> <span class="nc">Connected</span><span class="o">);;</span>
</span><span class="line"><span class="k">val</span> <span class="n">is_connected</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">bool</span> <span class="o">=</span> <span class="o">&lt;</span><span class="k">fun</span><span class="o">&gt;</span>
</span><span class="line">
</span><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">connected</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="n">is_connected</span> <span class="n">connectors</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">connected</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span> <span class="o">...}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We'll investigate <code>c</code>, the first connected one:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">c</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">connected</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">c</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span> <span class="o">...}</span>
</span></code></pre></td></tr></tbody></table></div></figure><h4 id="a-note-on-ids">A note on IDs</h4>
<p>In the libdrm C API, IDs are just integers.
To avoid mix-ups, I made them distinct types in the OCaml API.
For example, trying to use an encoder ID where a connector ID is expected is a type error:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">resources</span><span class="o">.</span><span class="n">encoders</span><span class="o">);;</span>
</span><span class="line">                           <span class="o">^^^^^^^^^^^^^^^^^^^^^^^^^^^^</span>
</span><span class="line"><span class="nc">Error</span><span class="o">:</span> <span class="nc">This</span> <span class="n">expression</span> <span class="n">has</span> <span class="k">type</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Encoder</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">       <span class="n">but</span> <span class="n">an</span> <span class="n">expression</span> <span class="n">was</span> <span class="n">expected</span> <span class="k">of</span> <span class="k">type</span>
</span><span class="line">         <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">       <span class="nc">These</span> <span class="n">two</span> <span class="n">variant</span> <span class="n">types</span> <span class="n">have</span> <span class="n">no</span> <span class="n">intersection</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Normally this is what you want, but for interactive use it's annoying that you can't just pass a plain integer.
For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="mi">71</span><span class="o">;;</span>
</span><span class="line">                           <span class="o">^^</span>
</span><span class="line"><span class="nc">Error</span><span class="o">:</span> <span class="nc">The</span> <span class="n">constant</span> <span class="mi">71</span> <span class="n">has</span> <span class="k">type</span> <span class="kt">int</span> <span class="n">but</span> <span class="n">an</span> <span class="n">expression</span> <span class="n">was</span> <span class="n">expected</span> <span class="k">of</span> <span class="k">type</span>
</span><span class="line">         <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You can get any kind of ID with <code>Drm.Id.of_int</code> (e.g. <code>K.Connector.get dev (Drm.Id.of_int 71)</code>),
but that's still a bit verbose, so you might prefer to (re)define a prefix operator for it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="o">(</span> <span class="o">!</span> <span class="o">)</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">of_int</span><span class="o">;;</span>
</span><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">!</span><span class="mi">71</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span> <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="o">...}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>(Note: <code>!</code> is the only single-character prefix operator that OCaml lets you define; redefining it shadows the standard dereference operator for <code>ref</code> cells.)</p>
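<p>The type error above suggests that IDs carry a phantom type parameter (<code>[ `Connector ] Drm.Id.t</code>).
Here is a minimal sketch of that technique, using a stand-in <code>Id</code> module rather than the real <code>Drm.Id</code>, whose implementation isn't shown here.
The names <code>connector_id</code>, <code>encoder_id</code> and <code>describe_connector</code> are made up for illustration:</p>

```ocaml
(* Hypothetical stand-in for an ID module (not the real Drm.Id):
   every ID is an int at runtime, but the phantom parameter ['a]
   records what kind of object the ID names. *)
module Id : sig
  type 'a t
  val of_int : int -> 'a t
  val to_int : 'a t -> int
end = struct
  type 'a t = int
  let of_int x = x
  let to_int x = x
end

type connector_id = [ `Connector ] Id.t
type encoder_id = [ `Encoder ] Id.t

(* Only accepts connector IDs. *)
let describe_connector (id : connector_id) =
  Printf.sprintf "connector %d" (Id.to_int id)

let enc : encoder_id = Id.of_int 70

(* Rejected by the type checker, because [enc] is an encoder ID:
   let _ = describe_connector enc *)

(* Shadows the standard dereference operator with the conversion. *)
let ( ! ) = Id.of_int

let example = describe_connector !71
```

<p>Because <code>of_int</code> returns an ID of any kind, the compiler infers the kind from the context in which the result is used.</p>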
<h3 id="modes">Modes</h3>
<p>Modes are shown in abbreviated form in the connector output.
To see the full list:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="n">c</span><span class="o">.</span><span class="n">modes</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Mode_info</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line"><span class="o">[</span><span class="mi">3840</span><span class="n">x2160</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">30</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">29</span><span class="o">.</span><span class="mi">97</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">2560</span><span class="n">x1440</span> <span class="mi">59</span><span class="o">.</span><span class="mi">95</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1920</span><span class="n">x1200</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1920</span><span class="n">x1080</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1920</span><span class="n">x1080</span> <span class="mi">59</span><span class="o">.</span><span class="mi">94</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1600</span><span class="n">x1200</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1680</span><span class="n">x1050</span> <span class="mi">59</span><span class="o">.</span><span class="mi">95</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1600</span><span class="n">x900</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x1024</span> <span class="mi">75</span><span class="o">.</span><span class="mi">02</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x1024</span> <span class="mi">60</span><span class="o">.</span><span class="mi">02</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1440</span><span class="n">x900</span> <span class="mi">59</span><span class="o">.</span><span class="mi">89</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x800</span> <span class="mi">59</span><span class="o">.</span><span class="mi">81</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1152</span><span class="n">x864</span> <span class="mi">75</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x720</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1280</span><span class="n">x720</span> <span class="mi">59</span><span class="o">.</span><span class="mi">94</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1024</span><span class="n">x768</span> <span class="mi">75</span><span class="o">.</span><span class="mi">03</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1024</span><span class="n">x768</span> <span class="mi">70</span><span class="o">.</span><span class="mi">07</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1024</span><span class="n">x768</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">832</span><span class="n">x624</span> <span class="mi">74</span><span class="o">.</span><span class="mi">55</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">800</span><span class="n">x600</span> <span class="mi">75</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">800</span><span class="n">x600</span> <span class="mi">72</span><span class="o">.</span><span class="mi">19</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">800</span><span class="n">x600</span> <span class="mi">60</span><span class="o">.</span><span class="mi">32</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">800</span><span class="n">x600</span> <span class="mi">56</span><span class="o">.</span><span class="mi">25</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">75</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">72</span><span class="o">.</span><span class="mi">81</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">66</span><span class="o">.</span><span class="mi">67</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">640</span><span class="n">x480</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">59</span><span class="o">.</span><span class="mi">94</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">720</span><span class="n">x400</span> <span class="mi">70</span><span class="o">.</span><span class="mi">08</span><span class="n">Hz</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: I annotated various pretty-printer functions with <code>[@@ocaml.toplevel_printer]</code>,
which causes utop to use them by default to display values of the corresponding type.
For example, showing a list of modes uses this short summary form.
Displaying an individual mode shows all the information.
Here's the first mode:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">c</span><span class="o">.</span><span class="n">modes</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Mode_info</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="n">name</span> <span class="o">=</span> <span class="s2">&quot;3840x2160&quot;</span><span class="o">;</span>
</span><span class="line"> <span class="n">typ</span> <span class="o">=</span> <span class="n">preferred</span><span class="o">+</span><span class="n">driver</span><span class="o">;</span>
</span><span class="line"> <span class="n">flags</span> <span class="o">=</span> <span class="n">phsync</span><span class="o">+</span><span class="n">nvsync</span><span class="o">;</span>
</span><span class="line"> <span class="n">stereo_mode</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line"> <span class="n">aspect_ratio</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line"> <span class="n">clock</span> <span class="o">=</span> <span class="mi">533250</span><span class="o">;</span>
</span><span class="line"> <span class="n">hdisplay</span><span class="o">,</span><span class="n">vdisplay</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">,</span><span class="mi">2160</span><span class="o">;</span>
</span><span class="line"> <span class="n">hsync_start</span> <span class="o">=</span> <span class="mi">3888</span><span class="o">;</span>
</span><span class="line"> <span class="n">hsync_end</span> <span class="o">=</span> <span class="mi">3920</span><span class="o">;</span>
</span><span class="line"> <span class="n">htotal</span> <span class="o">=</span> <span class="mi">4000</span><span class="o">;</span>
</span><span class="line"> <span class="n">hskew</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="n">vsync_start</span> <span class="o">=</span> <span class="mi">2163</span><span class="o">;</span>
</span><span class="line"> <span class="n">vsync_end</span> <span class="o">=</span> <span class="mi">2168</span><span class="o">;</span>
</span><span class="line"> <span class="n">vtotal</span> <span class="o">=</span> <span class="mi">2222</span><span class="o">;</span>
</span><span class="line"> <span class="n">vscan</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="n">vrefresh</span> <span class="o">=</span> <span class="mi">60</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="properties">Properties</h3>
<p>Some resources can also have extra <a href="https://drmdb.emersion.fr/properties">properties</a>.
Use <code>get_properties</code> to fetch them:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get_properties</span> <span class="n">dev</span> <span class="n">c</span><span class="o">.</span><span class="n">connector_id</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="nc">EDID</span> <span class="o">=</span> <span class="mi">92</span><span class="o">;</span> <span class="nc">DPMS</span> <span class="o">=</span> <span class="nc">On</span><span class="o">;</span> <span class="nc">TILE</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">link</span><span class="o">-</span><span class="n">status</span> <span class="o">=</span> <span class="nc">Good</span><span class="o">;</span> <span class="n">non</span><span class="o">-</span><span class="n">desktop</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="nc">HDR_OUTPUT_METADATA</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">scaling</span> <span class="n">mode</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">underscan</span> <span class="o">=</span> <span class="n">off</span><span class="o">;</span>
</span><span class="line"> <span class="n">underscan</span> <span class="n">hborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">underscan</span> <span class="n">vborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">max</span> <span class="n">bpc</span> <span class="o">=</span> <span class="mi">8</span><span class="o">;</span>
</span><span class="line"> <span class="nc">Colorspace</span> <span class="o">=</span> <span class="nc">Default</span><span class="o">;</span> <span class="n">vrr_capable</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">subconnector</span> <span class="o">=</span> <span class="nc">Native</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Linux only returns a subset of the properties until you enable the <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/Client_cap/index.html#val-atomic">atomic</a> feature.
Let's turn that on now:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Client_cap</span><span class="p">.</span><span class="o">(</span><span class="n">set</span> <span class="n">atomic</span><span class="o">)</span> <span class="n">dev</span> <span class="bp">true</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">(</span><span class="kt">unit</span><span class="o">,</span> <span class="nn">Unix</span><span class="p">.</span><span class="n">error</span><span class="o">)</span> <span class="n">result</span> <span class="o">=</span> <span class="nc">Ok</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>(<code>Module.(expr)</code> is shorthand that brings all of <code>Module</code>'s symbols into scope for <code>expr</code>,
so we don't have to repeat the module name for both <code>set</code> and <code>atomic</code>.)</p>
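<p>To make the local-open syntax concrete, here is a small self-contained sketch.
The <code>Cap</code> module is a made-up stand-in (it is not <code>Drm.Client_cap</code>, and its <code>set</code> just returns a string), so only the scoping behaviour is being illustrated:</p>

```ocaml
(* Hypothetical module, loosely shaped like a capability API. *)
module Cap = struct
  type t = Atomic | Universal_planes

  let atomic = Atomic

  let name = function
    | Atomic -> "atomic"
    | Universal_planes -> "universal_planes"

  (* Pretend to set a capability; just reports what was requested. *)
  let set cap enabled = Printf.sprintf "%s=%b" (name cap) enabled
end

(* Fully qualified: the module name is repeated for each identifier. *)
let qualified = Cap.set Cap.atomic true

(* Local open: both [set] and [atomic] are looked up in [Cap]. *)
let local_open = Cap.(set atomic) true

(* The longer equivalent form of a local open. *)
let let_open = let open Cap in set atomic true
```

<p>All three bindings evaluate to the same string.</p>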
<p>Fetching the properties again, we now also get <code>CRTC_ID</code>,
which tells us which controller this connector is currently using:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">c_props</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get_properties</span> <span class="n">dev</span> <span class="n">c</span><span class="o">.</span><span class="n">connector_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">c_props</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="nc">EDID</span> <span class="o">=</span> <span class="mi">92</span><span class="o">;</span> <span class="nc">DPMS</span> <span class="o">=</span> <span class="nc">On</span><span class="o">;</span> <span class="nc">TILE</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">link</span><span class="o">-</span><span class="n">status</span> <span class="o">=</span> <span class="nc">Good</span><span class="o">;</span> <span class="n">non</span><span class="o">-</span><span class="n">desktop</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="nc">HDR_OUTPUT_METADATA</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="nc">CRTC_ID</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span> <span class="n">scaling</span> <span class="n">mode</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line"> <span class="n">underscan</span> <span class="o">=</span> <span class="n">off</span><span class="o">;</span> <span class="n">underscan</span> <span class="n">hborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">underscan</span> <span class="n">vborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">max</span> <span class="n">bpc</span> <span class="o">=</span> <span class="mi">8</span><span class="o">;</span>
</span><span class="line"> <span class="nc">Colorspace</span> <span class="o">=</span> <span class="nc">Default</span><span class="o">;</span> <span class="n">vrr_capable</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">subconnector</span> <span class="o">=</span> <span class="nc">Native</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="encoders">Encoders</h3>
<p><a href="https://www.kernel.org/doc/html/latest/gpu/drm-kms.html">The Linux documentation</a> says:</p>
<blockquote>
<p>Those are really just internal artifacts of the helper libraries used to
implement KMS drivers. Besides that they make it unnecessarily more
complicated for userspace to figure out which connections between a CRTC and
a connector are possible, and what kind of cloning is supported, they serve
no purpose in the userspace API. Unfortunately encoders have been exposed to
userspace, hence can’t remove them at this point. Furthermore the exposed
restrictions are often wrongly set by drivers, and in many cases not powerful
enough to express the real restrictions.</p>
</blockquote>
<p>OK. Well, let's take a look anyway:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">e</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">(</span><span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="n">c</span><span class="o">.</span><span class="n">encoder_id</span><span class="o">);;</span>
</span><span class="line"><span class="k">val</span> <span class="n">e</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">encoder_id</span> <span class="o">=</span> <span class="mi">70</span><span class="o">;</span>
</span><span class="line">   <span class="n">encoder_type</span> <span class="o">=</span> <span class="nc">TMDS</span><span class="o">;</span>
</span><span class="line">   <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">   <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x1f</span><span class="o">;</span>
</span><span class="line">   <span class="n">possible_clones</span> <span class="o">=</span> <span class="mh">0x1</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: We need <code>Option.get</code> here because a connector might not have an encoder set yet.
Where the C API uses 0 to indicate no resource,
the OCaml API uses <code>None</code> to force us to think about that case.</p>
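<p>The <code>possible_crtcs</code> field is a bitmask.
As a minimal sketch, assuming bit <em>i</em> refers to the <em>i</em>-th CRTC in the device's resource list (its index, not its object ID),
it could be decoded like this (<code>set_bits</code> is a made-up helper, not part of the OCaml bindings):</p>
<pre><code>(* Assumption: the mask is a 32-bit field, and bit [i] selects the CRTC at
   index [i] in the resource list. *)

(* [bit_is_set mask i] is true if bit [i] of [mask] is 1. *)
let bit_is_set mask i = (mask lsr i) land 1 = 1

(* [set_bits mask] lists the indexes of the bits that are set in [mask]. *)
let set_bits mask = List.filter (bit_is_set mask) (List.init 32 Fun.id)

let () =
  print_endline (String.concat ", " (List.map string_of_int (set_bits 0x1f)))
</code></pre>
<p>For the encoder above, <code>0x1f</code> decodes to indexes 0 to 4.</p>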
<p>As the documentation says, the encoder is mainly useful to get the CRTC ID:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">crtc_id</span> <span class="o">=</span> <span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="n">e</span><span class="o">.</span><span class="n">crtc_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">crtc_id</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">57</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We could instead have got that directly from the connector using its properties:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="nn">Values</span><span class="p">.</span><span class="n">get_value_exn</span> <span class="n">c_props</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">crtc_id</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Crtc</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span> <span class="n">option</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="crt-controllers">CRT Controllers</h3>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">crtc</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="n">crtc_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">crtc</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">crtc_id</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">   <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">93</span><span class="o">;</span>
</span><span class="line">   <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">   <span class="n">width</span><span class="o">,</span><span class="n">height</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">,</span><span class="mi">2160</span><span class="o">;</span>
</span><span class="line">   <span class="n">mode</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>An active CRTC has a mode set (presumably from the connector's list of supported modes),
and a framebuffer with the image to be displayed.</p>
<p>If I keep calling <code>Crtc.get</code>, I see that it is sometimes showing framebuffer 93 and sometimes 94.
My Wayland compositor (Sway) updates one framebuffer while the other is being shown, then switches which one is displayed.</p>
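<p>This is double buffering. Here is a toy model of the idea, as a sketch only:
nothing in it talks to the kernel, and the <code>buffers</code> type and <code>flip</code> function are invented for illustration.</p>
<pre><code>(* A toy model of double buffering, using framebuffer IDs.
   [front] is being displayed; [back] is being drawn into. *)
type buffers = { front : int; back : int }

(* Swap the roles once the back buffer has been fully drawn. *)
let flip b = { front = b.back; back = b.front }

let () =
  let b = flip { front = 93; back = 94 } in
  Printf.printf "now showing framebuffer %d\n" b.front
</code></pre>
<p>In the real system the switch is requested from the kernel rather than done by swapping two fields, but the alternation between 93 and 94 is the same.</p>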
<h3 id="framebuffers">Framebuffers</h3>
<p>My CRTC is currently displaying the contents of framebuffer 93:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">fb_id</span> <span class="o">=</span> <span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="n">crtc</span><span class="o">.</span><span class="n">fb_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">fb_id</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">93</span>
</span></code></pre></td></tr></tbody></table></div></figure><figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">fb</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="n">fb_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">fb</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">fb_id</span> <span class="o">=</span> <span class="mi">93</span><span class="o">;</span>
</span><span class="line">   <span class="n">width</span><span class="o">,</span><span class="n">height</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">,</span><span class="mi">2160</span><span class="o">;</span>
</span><span class="line">   <span class="n">pixel_format</span><span class="o">,</span> <span class="n">modifier</span> <span class="o">=</span> <span class="nc">XR24</span><span class="o">,</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line">   <span class="n">interlaced</span> <span class="o">=</span> <span class="bp">false</span><span class="o">;</span>
</span><span class="line">   <span class="n">planes</span> <span class="o">=</span> <span class="o">[{</span><span class="n">handle</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">pitch</span> <span class="o">=</span> <span class="mi">15360</span><span class="o">;</span> <span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span><span class="o">}]}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A framebuffer has up to 4 framebuffer planes (not to be confused with CRTC planes; see later),
each of which references a <em>buffer object</em> (also known as a <em>BO</em> and referenced with a <em>GEM handle</em>).</p>
<p>This framebuffer is using the <code>XR24</code> format, where there is a single BO with 32 bits for each pixel
(8 for red, 8 green, 8 blue and 8 unused).
Some formats use e.g. a separate buffer for each component
(or a different part of the same buffer, using <code>offset</code>).</p>
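<p>As a rough sketch of what one <code>XR24</code> pixel looks like, assuming the layout given in the kernel's format header
(unused bits at the top, then red, green and blue), a pixel value could be packed like this
(<code>xr24</code> is a hypothetical helper, not something the bindings provide):</p>
<pre><code>(* Pack 8-bit red, green and blue values into one 32-bit XR24 pixel.
   Assumed layout: bits 31-24 unused, 23-16 red, 15-8 green, 7-0 blue. *)
let xr24 ~r ~g ~b =
  ((r land 0xff) lsl 16) lor ((g land 0xff) lsl 8) lor (b land 0xff)

let () = Printf.printf "orange = 0x%08x\n" (xr24 ~r:0xff ~g:0x80 ~b:0x00)
</code></pre>
<p>The <code>land 0xff</code> masks just keep each component within its 8 bits.</p>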
<p>Modern graphics cards also support format <em>modifiers</em>, but my card is too old so I just get <code>None</code>.
Linux's <a href="https://github.com/torvalds/linux/blob/master/include/uapi/drm/drm_fourcc.h">drm_fourcc.h</a> header file describes the various formats and modifiers.
Modifiers seem to be mainly used to specify the <a href="https://docs.mesa3d.org/isl/tiling.html">tiling</a>.</p>
<p>I don't have permission to see the buffer object, so it appears as <code>handle = None</code>.
The <code>pitch</code> is the number of bytes from one row to the next (also known as the <em>stride</em>).
Here, the 15360 is simply the width (3840) multiplied by the 4 bytes per pixel.</p>
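<p>To make the arithmetic concrete, here is a small sketch of how the pitch is used to locate a pixel,
assuming a single-plane format with 4 bytes per pixel (the function names are made up for this example):</p>
<pre><code>let bytes_per_pixel = 4   (* XR24 uses 32 bits per pixel *)

(* Smallest possible pitch for a row of [width] pixels. A driver may pad
   rows, so use the pitch reported for the framebuffer when addressing. *)
let min_pitch ~width = width * bytes_per_pixel

(* Byte position of pixel (x, y) within the buffer object. *)
let pixel_offset ~offset ~pitch ~x ~y =
  offset + (y * pitch) + (x * bytes_per_pixel)

let () =
  Printf.printf "min pitch = %d\n" (min_pitch ~width:3840);
  Printf.printf "pixel (10, 2) starts at byte %d\n"
    (pixel_offset ~offset:0 ~pitch:15360 ~x:10 ~y:2)
</code></pre>
<p>With the values above, the minimum pitch is 15360 and pixel (10, 2) starts at byte 30760.</p>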
<h3 id="crtc-planes">CRTC planes</h3>
<p>In fact, <code>Crtc.get</code> is an old API that only covers the basic case of a single framebuffer.
In reality, a CRTC can combine multiple <em>CRTC planes</em>, which for some reason aren't returned with the other resources
and must be requested separately:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">plane_ids</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">list</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">plane_ids</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">id</span> <span class="kt">list</span> <span class="o">=</span> <span class="o">[</span><span class="mi">40</span><span class="o">;</span> <span class="mi">43</span><span class="o">;</span> <span class="mi">46</span><span class="o">;</span> <span class="mi">49</span><span class="o">;</span> <span class="mi">52</span><span class="o">;</span> <span class="mi">55</span><span class="o">;</span> <span class="mi">58</span><span class="o">;</span> <span class="mi">61</span><span class="o">;</span> <span class="mi">64</span><span class="o">;</span> <span class="mi">67</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: you need to enable &quot;atomic&quot; mode before requesting planes (we already did that above).</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">planes</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">)</span> <span class="n">plane_ids</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">planes</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">formats</span> <span class="o">=</span> <span class="o">[</span><span class="nc">XR24</span><span class="o">;</span> <span class="nc">AR24</span><span class="o">;</span> <span class="nc">RA24</span><span class="o">;</span> <span class="nc">XR30</span><span class="o">;</span> <span class="nc">XB30</span><span class="o">;</span> <span class="nc">AR30</span><span class="o">;</span> <span class="nc">AB30</span><span class="o">;</span> <span class="nc">XR48</span><span class="o">;</span> <span class="nc">XB48</span><span class="o">;</span> 
</span><span class="line">               <span class="nc">AR48</span><span class="o">;</span> <span class="nc">AB48</span><span class="o">;</span> <span class="nc">XB24</span><span class="o">;</span> <span class="nc">AB24</span><span class="o">;</span> <span class="nc">RG16</span><span class="o">;</span> <span class="nc">XR4H</span><span class="o">;</span> <span class="nc">AR4H</span><span class="o">;</span> <span class="nc">XB4H</span><span class="o">;</span> <span class="nc">AB4H</span><span class="o">];</span>
</span><span class="line">    <span class="n">plane_id</span> <span class="o">=</span> <span class="mi">40</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line">    <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_x</span><span class="o">,</span><span class="n">crtc_y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x10</span><span class="o">};</span>
</span><span class="line">   <span class="o">...</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A lot of these planes aren't being used (don't have a CRTC),
which we can check for with a helper function:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">has_crtc</span> <span class="o">(</span><span class="n">x</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">=</span> <span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">crtc_id</span> <span class="o">&lt;&gt;</span> <span class="nc">None</span><span class="o">);;</span>
</span><span class="line"><span class="k">val</span> <span class="n">has_crtc</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">bool</span> <span class="o">=</span> <span class="o">&lt;</span><span class="k">fun</span><span class="o">&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Looks like Sway is using two planes at the moment:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">active_planes</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="n">has_crtc</span> <span class="n">planes</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">active_planes</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">formats</span> <span class="o">=</span> <span class="o">[</span><span class="nc">XR24</span><span class="o">;</span> <span class="nc">AR24</span><span class="o">;</span> <span class="nc">RA24</span><span class="o">;</span> <span class="nc">XR30</span><span class="o">;</span> <span class="nc">XB30</span><span class="o">;</span> <span class="nc">AR30</span><span class="o">;</span> <span class="nc">AB30</span><span class="o">;</span> <span class="nc">XR48</span><span class="o">;</span> <span class="nc">XB48</span><span class="o">;</span> 
</span><span class="line">               <span class="nc">AR48</span><span class="o">;</span> <span class="nc">AB48</span><span class="o">;</span> <span class="nc">XB24</span><span class="o">;</span> <span class="nc">AB24</span><span class="o">;</span> <span class="nc">RG16</span><span class="o">;</span> <span class="nc">XR4H</span><span class="o">;</span> <span class="nc">AR4H</span><span class="o">;</span> <span class="nc">XB4H</span><span class="o">;</span> <span class="nc">AB4H</span><span class="o">];</span>
</span><span class="line">    <span class="n">plane_id</span> <span class="o">=</span> <span class="mi">52</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">    <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">94</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_x</span><span class="o">,</span><span class="n">crtc_y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x1</span><span class="o">};</span>
</span><span class="line">   <span class="o">{</span><span class="n">formats</span> <span class="o">=</span> <span class="o">[</span><span class="nc">AR24</span><span class="o">];</span>
</span><span class="line">    <span class="n">plane_id</span> <span class="o">=</span> <span class="mi">55</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">    <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">98</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_x</span><span class="o">,</span><span class="n">crtc_y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x1</span><span class="o">}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>More information is available as properties:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">active_plane_ids</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">id</span> <span class="n">active_planes</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">active_plane_ids</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">id</span> <span class="kt">list</span> <span class="o">=</span> <span class="o">[</span><span class="mi">52</span><span class="o">;</span> <span class="mi">55</span><span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">get_properties</span> <span class="n">dev</span><span class="o">)</span> <span class="n">active_plane_ids</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Plane</span> <span class="o">]</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line"><span class="o">[{</span><span class="nc">CRTC_H</span> <span class="o">=</span> <span class="mi">2160</span><span class="o">;</span> <span class="nc">CRTC_ID</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span> <span class="nc">CRTC_W</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">;</span> <span class="nc">CRTC_X</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="nc">CRTC_Y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line">  <span class="nc">FB_ID</span> <span class="o">=</span> <span class="mi">93</span><span class="o">;</span> <span class="nc">IN_FENCE_FD</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="o">;</span> <span class="nc">SRC_H</span> <span class="o">=</span> <span class="mi">141557760</span><span class="o">;</span> <span class="nc">SRC_W</span> <span class="o">=</span> <span class="mi">251658240</span><span class="o">;</span>
</span><span class="line">  <span class="nc">SRC_X</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="nc">SRC_Y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">rotation</span> <span class="o">=</span> <span class="o">[</span><span class="n">rotate</span><span class="o">-</span><span class="mi">0</span><span class="o">];</span> <span class="k">type</span> <span class="o">=</span> <span class="nc">Primary</span><span class="o">;</span> <span class="n">zpos</span> <span class="o">=</span> <span class="mi">0</span><span class="o">};</span>
</span><span class="line"> <span class="o">{</span><span class="nc">CRTC_H</span> <span class="o">=</span> <span class="mi">128</span><span class="o">;</span> <span class="nc">CRTC_ID</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span> <span class="nc">CRTC_W</span> <span class="o">=</span> <span class="mi">128</span><span class="o">;</span> <span class="nc">CRTC_X</span> <span class="o">=</span> <span class="mi">3105</span><span class="o">;</span> <span class="nc">CRTC_Y</span> <span class="o">=</span> <span class="mi">1518</span><span class="o">;</span>
</span><span class="line">  <span class="nc">FB_ID</span> <span class="o">=</span> <span class="mi">98</span><span class="o">;</span> <span class="nc">IN_FENCE_FD</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="o">;</span> <span class="nc">SRC_H</span> <span class="o">=</span> <span class="mi">8388608</span><span class="o">;</span> <span class="nc">SRC_W</span> <span class="o">=</span> <span class="mi">8388608</span><span class="o">;</span> <span class="nc">SRC_X</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line">  <span class="nc">SRC_Y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="k">type</span> <span class="o">=</span> <span class="nc">Cursor</span><span class="o">;</span> <span class="n">zpos</span> <span class="o">=</span> <span class="mi">255</span><span class="o">}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><ul>
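<li>Plane 52 is a <code>Primary</code> plane and is using framebuffer 93, as we saw from the CRTC (the <code>Plane.get</code> output above showed 94, presumably because Sway had switched buffers in between).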
</li>
<li>Plane 55 is a <code>Cursor</code> plane, using framebuffer 98 (and the <code>AR24</code> format, with alpha/transparency).
</li>
</ul>
<p>A plane chooses which part of the framebuffer to show (<code>SRC_X</code>, <code>SRC_Y</code>, <code>SRC_W</code> and <code>SRC_H</code>)
and where it should appear on the screen (<code>CRTC_X</code>, <code>CRTC_Y</code>, <code>CRTC_W</code> and <code>CRTC_H</code>).
The source values are in 16.16 format (i.e. shifted left 16 bits).</p>
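<p>A quick sketch of the conversion, using the <code>SRC_W</code> and <code>SRC_H</code> values shown above
(<code>to_fixed</code> and <code>of_fixed</code> are names invented for this example):</p>
<pre><code>(* Convert between whole pixels and 16.16 fixed point. *)
let to_fixed pixels = pixels lsl 16
let of_fixed v = v lsr 16   (* drops any fractional part *)

let () =
  Printf.printf "SRC_W = %d, SRC_H = %d\n"
    (of_fixed 251658240) (of_fixed 141557760)
</code></pre>
<p>This prints <code>SRC_W = 3840, SRC_H = 2160</code>, matching the size of the framebuffer. The cursor's 8388608 works out to 128 in the same way.</p>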
<p>Oddly, <code>Plane.get</code> returned <code>crtc_x,crtc_y = 0,0</code> for both planes, but
the properties show the correct cursor location (<code>CRTC_X = 3105; CRTC_Y = 1518;</code>).</p>
<p>Having the cursor on a separate plane avoids having to modify the main screen image
whenever the mouse pointer moves. That is good for latency
(especially if the GPU is busy rendering something else at the time)
and for power consumption (the GPU can stay powered down).
It also allows showing an application's buffer full screen without the compositor
needing to modify the application's buffer.</p>
<p>You might also have some <code>Overlay</code> planes,
which can be <a href="https://zamundaaa.github.io/wayland/2025/10/23/more-kms-offloading.html">useful for displaying video</a>.
My graphics card seems to be too old for that.</p>
<h3 id="expanded-resources-diagram">Expanded resources diagram</h3>
<p>Here's an expanded diagram showing some more possibilities:</p>
<p><a href="/blog/images/libdrm/arch.svg"><span class="caption-wrapper center"><img src="/blog/images/libdrm/arch.svg" title="Expanded resources diagram" class="caption"/><span class="caption-text">Expanded resources diagram</span></span></a></p>
<ul>
<li>Some framebuffer formats take the input data from multiple buffers.
</li>
<li>A framebuffer can be shared by multiple CRTCs (perhaps with each plane showing a different part of it).
</li>
<li>A CRTC can have multiple planes (e.g. primary and cursor).
</li>
<li>A single CRTC can show the same image on multiple monitors.
</li>
</ul>
<h2 id="making-changes">Making changes</h2>
<p>If I try turning off the CRTC (by setting the mode to <code>None</code>) from my desktop environment, it fails:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">set</span> <span class="n">dev</span> <span class="n">crtc_id</span> <span class="o">~</span><span class="n">pos</span><span class="o">:(</span><span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">)</span> <span class="o">~</span><span class="n">connectors</span><span class="o">:</span><span class="bp">[]</span> <span class="nc">None</span><span class="o">;;</span>
</span><span class="line"><span class="nc">Exception</span><span class="o">:</span> <span class="nn">Unix</span><span class="p">.</span><span class="nc">Unix_error</span><span class="o">(</span><span class="nn">Unix</span><span class="p">.</span><span class="nc">EACCES</span><span class="o">,</span> <span class="s2">&quot;drmModeSetCrtc&quot;</span><span class="o">,</span> <span class="s2">&quot;&quot;</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The reason is that I'm currently running a graphical desktop and Sway owns the device
(so my <code>dev</code> is not the DRM &quot;master&quot;):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="n">is_master</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">false</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That can be fixed by switching to a different <a href="https://en.wikipedia.org/wiki/Virtual_console">VT</a> (e.g. with Ctrl-Alt-F2) and running it there.
However, this will result in a second problem: I won't be able to see what I'm doing!</p>
<p>If you have a second computer then you can SSH in and test things out from there, but
for simplicity we'll leave the utop REPL at this point and write some programs instead.</p>
<p>For example, <a href="https://github.com/talex5/libdrm-ocaml/blob/main/examples/query.ml">query.ml</a> shows the information we discovered above:</p>
<pre><code>dune exec -- ./examples/query.exe
</code></pre>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">devices</span><span class="o">:</span>                              
</span><span class="line">  <span class="o">[{</span><span class="n">primary_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/card0&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="n">render_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/renderD128&quot;</span><span class="o">;</span>
</span><span class="line"><span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="non-atomic-mode-setting">Non-atomic mode setting</h3>
<p>Linux provides two ways to configure modes: the old <em>non-atomic</em> API and the newer <em>atomic</em> one.</p>
<p><a href="https://github.com/talex5/libdrm-ocaml/blob/main/examples/nonatomic.ml">examples/nonatomic.ml</a> contains a simple example of the older (but simpler) API.
It starts by finding a device (the first one with a primary node supporting KMS), then
finds all connected connectors (as we did above), and calls <code>show_test_page</code> on each one:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Utils</span><span class="p">.</span><span class="n">with_device</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">t</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">connected</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="nn">Utils</span><span class="p">.</span><span class="n">is_connected</span> <span class="n">t</span><span class="o">.</span><span class="n">connectors</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Utils</span><span class="p">.</span><span class="n">restoring_afterwards</span> <span class="n">t</span> <span class="o">@@</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">List</span><span class="p">.</span><span class="n">iter</span> <span class="o">(</span><span class="n">show_test_page</span> <span class="n">t</span><span class="o">)</span> <span class="n">connected</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Unix</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>restoring_afterwards</code> stores the current configuration, runs the callback,
and then puts things back to normal when that finishes (or you press Ctrl-C).</p>
<p>The program waits for 2 seconds after showing the test page before exiting.</p>
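<p>The save-and-restore pattern behind <code>restoring_afterwards</code> can be sketched with the standard library alone.
This is not the code from <code>Utils</code>: the <code>~save</code> and <code>~restore</code> arguments are hypothetical stand-ins
for reading and re-applying the display configuration, and a plain <code>ref</code> plays the part of the display state:</p>
<pre><code class="ocaml">(* Sketch only: [save] and [restore] are hypothetical stand-ins for
   querying and re-applying the real display configuration. *)
let restoring_afterwards ~save ~restore fn =
  let saved = save () in
  (* [Fun.protect] runs [finally] whether [fn] returns or raises. *)
  Fun.protect ~finally:(fun () -&gt; restore saved) fn

let () =
  (* Turn Ctrl-C into the [Sys.Break] exception so that [finally] still runs. *)
  Sys.catch_break true;
  let mode = ref "desktop" in
  (try
     restoring_afterwards
       ~save:(fun () -&gt; !mode)
       ~restore:(fun m -&gt; mode := m)
       (fun () -&gt; mode := "test page"; failwith "interrupted")
   with Failure _ -&gt; ());
  (* The original value is back, even though the callback failed. *)
  assert (!mode = "desktop")
</code></pre>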
<p><code>show_test_page</code> finds the CRTC (as we did above),
takes the first supported mode, creates a test framebuffer of that size,
and configures the CRTC to display it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">show_test_page</span> <span class="o">(</span><span class="n">t</span> <span class="o">:</span> <span class="nn">Resources</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">(</span><span class="n">c</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="n">c</span><span class="o">.</span><span class="n">encoder_id</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">println</span> <span class="s2">&quot;%a has no encoder (skipping)&quot;</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">pp_name</span> <span class="n">c</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">encoder_id</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">match</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">get</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">encoder_id</span> <span class="k">with</span>
</span><span class="line">    <span class="o">|</span> <span class="o">{</span> <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="o">_</span> <span class="o">}</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">println</span> <span class="s2">&quot;%a&#39;s encoder has no CRTC (skipping)&quot;</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">pp_name</span> <span class="n">c</span>
</span><span class="line">    <span class="o">|</span> <span class="o">{</span> <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="n">crtc_id</span><span class="o">;</span> <span class="o">_</span> <span class="o">}</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">println</span> <span class="s2">&quot;Showing test page on %a&quot;</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">pp_name</span> <span class="n">c</span><span class="o">;</span>
</span><span class="line">      <span class="k">let</span> <span class="n">mode</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">c</span><span class="o">.</span><span class="n">modes</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">size</span> <span class="o">=</span> <span class="o">(</span><span class="n">mode</span><span class="o">.</span><span class="n">hdisplay</span><span class="o">,</span> <span class="n">mode</span><span class="o">.</span><span class="n">vdisplay</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">fb</span> <span class="o">=</span> <span class="nn">Test_image</span><span class="p">.</span><span class="n">create</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">size</span> <span class="k">in</span>
</span><span class="line">      <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">set</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">crtc_id</span> <span class="o">(</span><span class="nc">Some</span> <span class="n">mode</span><span class="o">)</span> <span class="o">~</span><span class="n">fb</span> <span class="o">~</span><span class="n">pos</span><span class="o">:(</span><span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">)</span>
</span><span class="line">        <span class="o">~</span><span class="n">connectors</span><span class="o">:[</span><span class="n">c</span><span class="o">.</span><span class="n">connector_id</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If the connector doesn't have a CRTC, we could find a suitable one and use that,
but for simplicity the example just skips such connectors.</p>
<p>To run the example (switch away from any graphical desktop first or it won't work):</p>
<pre><code>dune exec -- ./examples/nonatomic.exe
</code></pre>
<h3 id="dumb-buffers">Dumb buffers</h3>
<p>Typically the pixel data to be displayed comes from some complex rendering pipeline,
but Linux also provides <em>dumb buffers</em> for simple cases such as testing.
The <code>Test_image.create</code> function used above creates a dumb buffer with a test pattern:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">create_dumb</span> <span class="n">dev</span> <span class="n">size</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">dumb_buffer</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Buffer</span><span class="p">.</span><span class="nn">Dumb</span><span class="p">.</span><span class="n">create</span> <span class="n">dev</span> <span class="o">~</span><span class="n">bpp</span><span class="o">:</span><span class="mi">32</span> <span class="n">size</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">arr</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Buffer</span><span class="p">.</span><span class="nn">Dumb</span><span class="p">.</span><span class="n">map</span> <span class="n">dev</span> <span class="n">dumb_buffer</span> <span class="nc">Int32</span> <span class="k">in</span>
</span><span class="line">  <span class="k">for</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">snd</span> <span class="n">size</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span>
</span><span class="line">    <span class="k">for</span> <span class="n">col</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">fst</span> <span class="n">size</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span>
</span><span class="line">      <span class="k">let</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">        <span class="o">(</span><span class="n">row</span> <span class="ow">land</span> <span class="mh">0xff</span><span class="o">)</span> <span class="ow">lor</span>
</span><span class="line">        <span class="o">((</span><span class="n">col</span> <span class="ow">land</span> <span class="mh">0xff</span><span class="o">)</span> <span class="ow">lsl</span> <span class="mi">8</span><span class="o">)</span> <span class="ow">lor</span>
</span><span class="line">        <span class="o">(((</span><span class="n">row</span> <span class="n">lsr</span> <span class="mi">8</span><span class="o">)</span> <span class="ow">lor</span> <span class="o">(</span><span class="n">col</span> <span class="n">lsr</span> <span class="mi">8</span><span class="o">))</span> <span class="ow">lsl</span> <span class="mi">18</span><span class="o">)</span>
</span><span class="line">      <span class="k">in</span>
</span><span class="line">      <span class="n">arr</span><span class="o">.{</span><span class="n">row</span><span class="o">,</span> <span class="n">col</span><span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Int32</span><span class="p">.</span><span class="n">of_int</span> <span class="n">c</span>
</span><span class="line">    <span class="k">done</span><span class="o">;</span>
</span><span class="line">  <span class="k">done</span><span class="o">;</span>
</span><span class="line">  <span class="n">dumb_buffer</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Dumb.create</code> allocates memory for the image data.
<code>Dumb.map</code> makes it appear in host memory as an OCaml bigarray.
The loop sets each 32-bit int in the image to some colour <code>c</code>.</p>
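<p>To see what pattern this produces, the colour calculation can be pulled out into a pure function and split into channels.
An XR24 pixel is a little-endian 32-bit word with blue in the lowest byte, then green, then red, and an unused top byte.
<code>test_colour</code> and <code>channels</code> are names made up for this sketch:</p>
<pre><code class="ocaml">(* The same expression as in the loop above, as a standalone function. *)
let test_colour ~row ~col =
  (row land 0xff) lor
  ((col land 0xff) lsl 8) lor
  (((row lsr 8) lor (col lsr 8)) lsl 18)

(* Split an XRGB8888 pixel into (red, green, blue). *)
let channels c =
  ((c lsr 16) land 0xff, (c lsr 8) land 0xff, c land 0xff)

let () =
  let r, g, b = channels (test_colour ~row:0x105 ~col:0x20a) in
  (* Prints: red=12 green=10 blue=5 *)
  Printf.printf "red=%d green=%d blue=%d\n" r g b
</code></pre>
<p>So blue ramps up going down the screen, green ramps up going across,
and red steps up each time either coordinate crosses a multiple of 256.</p>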
<p>Then we wrap this data up as an XR24-format framebuffer with a single plane:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">create</span> <span class="n">dev</span> <span class="n">size</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">buffer</span> <span class="o">=</span> <span class="n">create_dumb</span> <span class="n">dev</span> <span class="n">size</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">planes</span> <span class="o">=</span> <span class="o">[</span><span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">v</span> <span class="n">buffer</span><span class="o">.</span><span class="n">handle</span> <span class="o">~</span><span class="n">pitch</span><span class="o">:</span><span class="n">buffer</span><span class="o">.</span><span class="n">pitch</span><span class="o">]</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">add</span> <span class="n">dev</span> <span class="o">~</span><span class="n">size</span> <span class="o">~</span><span class="n">planes</span> <span class="o">~</span><span class="n">pixel_format</span><span class="o">:</span><span class="nn">Drm</span><span class="p">.</span><span class="nn">Fourcc</span><span class="p">.</span><span class="n">xr24</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="atomic-mode-setting">Atomic mode setting</h3>
<p><a href="https://github.com/talex5/libdrm-ocaml/blob/main/examples/atomic.ml">examples/atomic.ml</a> demonstrates the newer atomic API:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Utils</span><span class="p">.</span><span class="n">with_device</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">t</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Drm</span><span class="p">.</span><span class="nn">Client_cap</span><span class="p">.</span><span class="o">(</span><span class="n">set_exn</span> <span class="n">atomic</span><span class="o">)</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="bp">true</span><span class="o">;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">connected</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="nn">Utils</span><span class="p">.</span><span class="n">is_connected</span> <span class="n">t</span><span class="o">.</span><span class="n">connectors</span> <span class="k">in</span>
</span><span class="line">  <span class="n">println</span> <span class="s2">&quot;Found %d connected connectors&quot;</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">length</span> <span class="n">connected</span><span class="o">);</span>
</span><span class="line">  <span class="k">let</span> <span class="n">free_planes</span> <span class="o">=</span> <span class="n">ref</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">list</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">rq</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">create</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">List</span><span class="p">.</span><span class="n">iter</span> <span class="o">(</span><span class="n">show_test_page</span> <span class="o">~</span><span class="n">free_planes</span> <span class="n">t</span> <span class="n">rq</span><span class="o">)</span> <span class="n">connected</span><span class="o">;</span>
</span><span class="line">  <span class="n">println</span> <span class="s2">&quot;Checking that commit will work...&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="k">match</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">commit</span> <span class="o">~</span><span class="n">test_only</span><span class="o">:</span><span class="bp">true</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">rq</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="k">exception</span> <span class="nn">Unix</span><span class="p">.</span><span class="nc">Unix_error</span> <span class="o">(</span><span class="n">code</span><span class="o">,</span> <span class="o">_,</span> <span class="o">_)</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">println</span> <span class="s2">&quot;Mode-setting would fail with error: %s&quot;</span> <span class="o">(</span><span class="nn">Unix</span><span class="p">.</span><span class="n">error_message</span> <span class="n">code</span><span class="o">)</span>
</span><span class="line">  <span class="o">|</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">println</span> <span class="s2">&quot;Pre-commit test passed.&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Utils</span><span class="p">.</span><span class="n">restoring_afterwards</span> <span class="n">t</span> <span class="o">@@</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">commit</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">rq</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Unix</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The steps are:</p>
<ol>
<li>Use <code>set_exn atomic</code> to enable atomic mode.
</li>
<li>Create an <em>atomic request</em> (<code>rq</code>).
</li>
<li>Use <code>show_test_page</code> to populate it with the desired property changes.
</li>
<li>(optional) Check that it will work (<code>~test_only:true</code>).
</li>
<li>Commit the changes (<code>Atomic_req.commit</code>).
</li>
</ol>
<p>The advantage here is that either all changes are successfully applied at once or nothing changes.
This avoids various problems with flickering or trying to roll back partial changes.</p>
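<p>The all-or-nothing behaviour can be illustrated with a toy model.
Nothing here is the libdrm-ocaml API: the state is just a hash table of named integer properties,
and the validity rule (no negative values) is invented for the example:</p>
<pre><code class="ocaml">(* Toy model, not the real API: a request is a list of pending changes. *)
type request = (string * int) list ref

let create () : request = ref []
let add_property (rq : request) name value = rq := (name, value) :: !rq

(* Check every change first; only touch [state] if they are all valid. *)
let commit ?(test_only = false) state (rq : request) =
  let changes = List.rev !rq in
  List.iter (fun (name, value) -&gt;
      (* Hypothetical validity rule, for illustration only. *)
      if value &lt; 0 then invalid_arg (name ^ ": negative value")
    ) changes;
  if not test_only then
    List.iter (fun (name, value) -&gt; Hashtbl.replace state name value) changes

let () =
  let state = Hashtbl.create 8 in
  Hashtbl.replace state "CRTC_W" 1920;
  let rq = create () in
  add_property rq "CRTC_W" 1280;
  add_property rq "CRTC_H" (-1);
  (match commit ~test_only:true state rq with
   | () -&gt; print_endline "would succeed"
   | exception Invalid_argument msg -&gt; print_endline ("would fail: " ^ msg));
  (* A real commit fails the same way, and the old value is untouched. *)
  (try commit state rq with Invalid_argument _ -&gt; ());
  assert (Hashtbl.find state "CRTC_W" = 1920)
</code></pre>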
<p><code>show_test_page</code> needs a couple of modifications.
First, we have to find a plane (rather than using the old <code>Crtc.set</code> which assumes a single plane),
and then we set the plane's <code>FB_ID</code> property to the new framebuffer in the request:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">add_property</span> <span class="n">rq</span> <span class="n">plane</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">fb_id</span> <span class="o">(</span><span class="nc">Some</span> <span class="n">fb</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>For the example, I actually set more properties and defined an operator to make the code a bit neater:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="o">(</span> <span class="o">.%{}&lt;-</span> <span class="o">)</span> <span class="n">obj</span> <span class="n">prop</span> <span class="n">value</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">add_property</span> <span class="n">rq</span> <span class="n">obj</span> <span class="n">prop</span> <span class="n">value</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">fb_id</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nc">Some</span> <span class="n">fb</span><span class="o">;</span>
</span><span class="line"><span class="c">(* Source region on frame-buffer: *)</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_x</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_y</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_w</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="o">(</span><span class="n">fst</span> <span class="n">size</span><span class="o">);</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_h</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="o">(</span><span class="n">snd</span> <span class="n">size</span><span class="o">);</span>
</span><span class="line"><span class="c">(* Destination region on CRTC: *)</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_x</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_y</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_w</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="n">fst</span> <span class="n">size</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_h</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="n">snd</span> <span class="n">size</span><span class="o">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>In libdrm-ocaml, properties are typed, so you can't forget to convert the source values to fixed point format.</p>
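<p>The kernel expects the <code>SRC_*</code> plane properties in 16.16 fixed-point format (16 integer bits and 16 fractional bits),
while the <code>CRTC_*</code> ones are plain integers.
The following is a sketch of how an abstract type stops the two being mixed up.
<code>Fixed16</code> is a hypothetical module written for this illustration, not the implementation of <code>Drm.Ufixed</code>:</p>
<pre><code class="ocaml">(* Hypothetical module for illustration; assumes 64-bit OCaml ints. *)
module Fixed16 : sig
  type t                      (* abstract, so a plain [int] is rejected *)
  val of_int : int -&gt; t
  val to_raw : t -&gt; int       (* the 16.16 encoding of the value *)
  val to_float : t -&gt; float
end = struct
  type t = int
  let of_int n =
    if n &lt; 0 || n &gt; 0xffff then invalid_arg "Fixed16.of_int: out of range";
    n lsl 16
  let to_raw t = t
  let to_float t = float_of_int t /. 65536.0
end

let () =
  let w = Fixed16.of_int 1920 in
  (* Prints: raw=125829120 value=1920.0 *)
  Printf.printf "raw=%d value=%.1f\n" (Fixed16.to_raw w) (Fixed16.to_float w)
</code></pre>
<p>Passing <code>1920</code> directly where a <code>Fixed16.t</code> is expected is a compile-time type error,
which is the same kind of protection the typed properties give.</p>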
<h2 id="d-rendering">3D rendering</h2>
<p>The examples above use a dumb buffer, but it's fairly simple to replace that with a Vulkan buffer.
The code in the last post <a href="https://github.com/talex5/vulkan-test/blob/5a93c76eb4e0205c63e0d0c5f7b4785cd15c208a/vulkan/swap_chain.ml#L42">exported the image memory from Vulkan</a> as a dmabuf FD and sent it to the Wayland compositor.
Now, instead of sending it, we just need to import it into our device (with <code>Drm.Dmabuf.to_handle</code>)
and use that handle instead of the dumb buffer's one.</p>
<p>I added a simple <a href="https://github.com/talex5/vulkan-test/commit/731e90d1d61d78b4ac557f9861839dbcc233e06b">surface abstraction</a> to the test code, wrapping the <code>Window</code> module's API
so that the rendering code doesn't need to care whether it's rendering to a Wayland window or directly to the screen.
Then I made a <a href="https://github.com/talex5/vulkan-test/blob/3f767e64f38b75e216a530b6b9747020d958f11b/src/vt.ml">Vt module</a> implementing the new <code>Surface.t</code> type for rendering directly to a Linux VT.</p>
<p>To get the animation working, I used <code>K.Crtc.page_flip</code> to update the framebuffer (I could also have used the atomic API).
The kernel waits until the encoder has finished sending the current frame before switching to the new one,
which avoids tearing.
We also need to ask the kernel to tell us when this happens, which is done by setting the optional <code>~event</code> argument to some number.
You can read events from the device file and parse them with <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/Event/index.html#val-parse">Drm.Event.parse</a>.</p>
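<p>For reference, this is roughly what such a parser has to do.
The sketch below decodes the kernel's <code>struct drm_event_vblank</code> layout (from <code>drm.h</code>) by hand, assuming a little-endian host.
The type and function names are invented for this illustration; in real code you would use <code>Drm.Event.parse</code> instead:</p>
<pre><code class="ocaml">(* Fields following the 8-byte header of [struct drm_event_vblank]. *)
type vblank = {
  user_data : int64;
  tv_sec : int32;
  tv_usec : int32;
  sequence : int32;
  crtc_id : int32;
}

type event =
  | Vblank of vblank          (* DRM_EVENT_VBLANK = 1 *)
  | Flip_complete of vblank   (* DRM_EVENT_FLIP_COMPLETE = 2 *)
  | Unknown of int

(* Parse one event at [off], returning it and the offset of the next one.
   Assumes a little-endian host (the kernel uses native byte order). *)
let parse_event buf off =
  let ty = Int32.to_int (Bytes.get_int32_le buf off) in
  let length = Int32.to_int (Bytes.get_int32_le buf (off + 4)) in
  if length &lt; 8 then invalid_arg "parse_event: bad length";
  let vblank () =
    if length &lt; 32 then invalid_arg "parse_event: short vblank event";
    { user_data = Bytes.get_int64_le buf (off + 8);
      tv_sec = Bytes.get_int32_le buf (off + 16);
      tv_usec = Bytes.get_int32_le buf (off + 20);
      sequence = Bytes.get_int32_le buf (off + 24);
      crtc_id = Bytes.get_int32_le buf (off + 28) }
  in
  let ev =
    match ty with
    | 1 -&gt; Vblank (vblank ())
    | 2 -&gt; Flip_complete (vblank ())
    | n -&gt; Unknown n
  in
  (ev, off + length)

let () =
  (* Build a fake flip-complete event to exercise the parser. *)
  let buf = Bytes.make 32 '\000' in
  Bytes.set_int32_le buf 0 2l;     (* type *)
  Bytes.set_int32_le buf 4 32l;    (* length *)
  Bytes.set_int64_le buf 8 42L;    (* user_data *)
  Bytes.set_int32_le buf 24 7l;    (* sequence *)
  match parse_event buf 0 with
  | Flip_complete { user_data = 42L; sequence = 7l; _ }, 32 -&gt; print_endline "ok"
  | _ -&gt; assert false
</code></pre>
<p>Each event starts with its type and total length, so a reader can skip event types it doesn't understand.</p>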
<p>If you want to try it, this should produce an animated room:</p>
<pre><code>git clone https://github.com/talex5/vulkan-test -b kms-3d
cd vulkan-test
nix develop
make download-example
dune exec -- ./src/main.exe 10000 viking_room.obj viking_room.png
</code></pre>
<p>If run with <code>$WAYLAND_DISPLAY</code> set, it will open a Wayland window (as before),
but if run from a text console then it should render the animation directly using KMS.</p>
<h2 id="linux-vts">Linux VTs</h2>
<p>When the user switches to another virtual terminal (e.g. with Ctrl-Alt-F3),
we should call <code>Drm.Device.drop_master</code> to give up being the master,
allowing the application running on the new terminal to take over.</p>
<p>We should also switch the VT to <code>KD_GRAPHICS</code> mode while using it,
to stop the kernel trying to manage it.</p>
<p>I didn't implement either of these features, but see <a href="https://dvdhrm.wordpress.com/2013/08/24/how-vt-switching-works/">How VT-switching works</a> for details.</p>
<h2 id="debugging">Debugging</h2>
<p>If you get an unhelpful error code from the kernel (e.g. <code>EINVAL</code>), enabling debug messages is often helpful.
Writing <code>4</code> to <code>/sys/module/drm/parameters/debug</code> enables KMS debug messages, which can be seen in the <code>dmesg</code> output.
Write <code>0</code> to the file afterwards to turn the messages off again.
<code>modinfo -p drm</code> lists the various options.</p>
<h2 id="conclusions">Conclusions</h2>
<p>I hope that being able to explore the libdrm API interactively from the OCaml top-level
has made it easier to learn how Linux manages displays.
As when doing <a href="https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan/">Vulkan in OCaml</a>,
a lot of the noise from C is removed and I think that the essentials of what is going on are easier to see.</p>
<p>I used <a href="https://github.com/yallop/ocaml-ctypes">ocaml-ctypes</a> for the C bindings, and this was my first time using it in &quot;stubs&quot; mode
(where it pre-generates C bindings from OCaml definitions).
This has the advantage that the C type checker checks that the definitions are correct,
and it worked well.
Dune's <a href="https://dune.readthedocs.io/en/latest/foreign-code.html#stub-generation-with-dune-ctypes">Stub Generation</a> feature generates the build rules for this semi-automatically.</p>
<p>Deciding what OCaml types to use for the C types was quite difficult.
For example, C has many different integer types (<code>int</code>, <code>long</code>, <code>uint32_t</code>, etc),
but using lots of types is more painful in OCaml where e.g. <code>+</code> only works on <code>int</code>.
I used OCaml's int type when possible, and other types only when the value might not fit
(e.g. an image size on a 32-bit platform might not fit into an OCaml int, which is one bit shorter).</p>
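<p>As a rough sketch of that trade-off (the helper name <code>int_of_uint32</code> is made up for illustration and is not part of the bindings),
a checked conversion from a wider C value to a plain OCaml int might look like this.
<code>Sys.int_size</code> is 63 on a 64-bit platform and 31 on a 32-bit one:</p>
<pre><code class="ocaml">(* Hypothetical helper, not part of libdrm-ocaml: take a C uint32_t value
   (carried in an int64) and return it as an OCaml int if it fits here. *)
let int_of_uint32 (v : int64) : int option =
  let non_negative = Int64.compare v 0L >= 0 in
  let fits = Int64.compare (Int64.of_int max_int) v >= 0 in
  if non_negative then (if fits then Some (Int64.to_int v) else None)
  else None

(* Number of bits in an OCaml int on this platform (31 or 63). *)
let int_bits = Sys.int_size
</code></pre>
<p>On a 64-bit platform every <code>uint32_t</code> value fits, so the <code>None</code> case only matters for 32-bit builds or for genuinely 64-bit values.</p>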
<p>The C API is somewhat inconsistent about types.
For example, <code>drmModePageFlipTarget</code> takes a <code>uint32_t target_vblank</code> argument for the sequence number,
while <code>page_flip_handler</code> confirms the event by giving it as <code>unsigned int sequence</code>.
Meanwhile, the <code>sequence_handler</code> event gives it as <code>uint64_t sequence</code>.
I'm not sure what happens if the sequence number gets too large to fit in a 32-bit integer.</p>
<p>Anyway, I think I understand mode setting a lot better now,
and I'm getting faster at debugging graphics problems on Linux
(e.g. when <a href="https://github.com/NixOS/nixpkgs/issues/458132">element-desktop failed to start</a> recently after I updated it).</p>
<p>Thanks to the OCaml Software Foundation for sponsoring this work.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Vulkan graphics in OCaml vs C</title>
    <link href="https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan/"></link>
    <updated>2025-09-20T09:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan</id>
    <content type="html"><![CDATA[<p>I convert my Vulkan test program from C to OCaml and compare the results,
then continue the Vulkan tutorial in OCaml, adding 3D, textures and depth buffering.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
</li>
<li><a href="#running-it-yourself">Running it yourself</a>
</li>
<li><a href="#the-direct-port">The direct port</a>
<ul>
<li><a href="#labelled-arguments">Labelled arguments</a>
</li>
<li><a href="#enums-and-bit-fields">Enums and bit-fields</a>
</li>
<li><a href="#optional-fields">Optional fields</a>
</li>
<li><a href="#loading-shaders">Loading shaders</a>
</li>
<li><a href="#logging">Logging</a>
</li>
<li><a href="#error-handling">Error handling</a>
</li>
</ul>
</li>
<li><a href="#refactored-version">Refactored version</a>
<ul>
<li><a href="#olivine-wrappers">Olivine wrappers</a>
</li>
<li><a href="#using-fibers--effects-for-control-flow">Using fibers / effects for control flow</a>
</li>
<li><a href="#using-the-cpu-and-gpu-in-parallel">Using the CPU and GPU in parallel</a>
</li>
<li><a href="#resizing-and-resource-lifetimes">Resizing and resource lifetimes</a>
</li>
</ul>
</li>
<li><a href="#the-3d-version">The 3D version</a>
</li>
<li><a href="#garbage-collection">Garbage collection</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>( this post also appeared on <a href="https://lobste.rs/s/pzhqdb/vulkan_graphics_ocaml_vs_c">Lobsters</a> )</p>
<h2 id="introduction">Introduction</h2>
<p>In <a href="https://roscidus.com/blog/blog/2025/06/24/graphics/">Investigating Linux graphics</a>,
I wrote a little C program to help me learn about GPUs by drawing a triangle.
But I wondered if using OCaml instead would make my life easier.
Unfortunately, there were no released OCaml Vulkan bindings,
but I found some <a href="https://github.com/Octachron/olivine">unfinished ones</a> by Florian Angeletti.
The bindings are generated mostly automatically from <a href="https://github.com/KhronosGroup/Vulkan-Docs/tree/main/xml">the Vulkan XML specification</a>,
and with <a href="https://github.com/talex5/olivine/commits/blog">a bit of effort</a> I got them working well enough to
continue with the <a href="https://docs.vulkan.org/tutorial/latest/00_Introduction.html">Vulkan tutorial</a>,
which resulted in this nice <a href="https://sketchfab.com/3d-models/viking-room-a49f1b8e4f5c4ecf9e1fe7d81915ad38">Viking room</a>:</p>
<p><a href="/blog/images/vulkan-ocaml/viking.png"><span class="caption-wrapper center"><img src="/blog/images/vulkan-ocaml/viking.png" title="Vulkan tutorial in OCaml" class="caption"/><span class="caption-text">Vulkan tutorial in OCaml</span></span></a></p>
<p>In this post, I'll be looking at how the C code compares to the OCaml.
First, I did a direct line-by-line port of the C, then I refactored it to take better advantage of OCaml.</p>
<p>(Note: the Vulkan tutorial is actually <a href="https://docs.vulkan.org/tutorial/latest/_attachments/28_model_loading.cpp">using C++</a>, but I'm comparing my C version to OCaml)</p>
<h2 id="running-it-yourself">Running it yourself</h2>
<p>If you want to try it yourself (note: it requires Wayland):</p>
<pre><code>git clone https://github.com/talex5/vulkan-test -b ocaml
cd vulkan-test
nix develop
dune exec -- ./src/main.exe 200
</code></pre>
<p>As the OCaml Vulkan bindings (Olivine) are unreleased,
I included a copy of my patched version in <code>vendor/olivine</code>.
The <code>dune exec</code> command will build them automatically.</p>
<p>The <code>ocaml</code> branch above just draws one triangle.
If you want to see the 3D room pictured above, use <code>ocaml-3d</code> instead:</p>
<pre><code>git clone https://github.com/talex5/vulkan-test -b ocaml-3d
cd vulkan-test
nix develop
make download-example
dune exec -- ./src/main.exe 10000 viking_room.obj viking_room.png
</code></pre>
<h2 id="the-direct-port">The direct port</h2>
<p>Porting the code directly, line by line, was pretty straightforward:</p>
<p><a href="/blog/images/vulkan-ocaml/meld.png"><span class="caption-wrapper center"><img src="/blog/images/vulkan-ocaml/meld.png" title="Comparing the code with meld" class="caption"/><span class="caption-text">Comparing the code with meld</span></span></a></p>
<p><a href="https://github.com/talex5/vulkan-test/tree/direct-port/src">The code</a> ended up slightly shorter, but not by much:</p>
<pre><code> 28 files changed, 1223 insertions(+), 1287 deletions(-)
</code></pre>
<p>This is only approximate; sometimes I added or removed blank lines, etc.
Some things were a bit easier and others a bit harder. It mostly balanced out.</p>
<p>As an example, one thing that makes the OCaml shorter is that arrays are passed as a single item,
whereas C takes the length separately.
On the other hand, single-item arrays can be passed in C by just giving the address of the pointer,
whereas OCaml requires an array to be constructed separately.
Also, I had to include some bindings for the libdrm C library.</p>
<h3 id="labelled-arguments">Labelled arguments</h3>
<p>The OCaml bindings use labelled arguments
(e.g. the <code>VK_TRUE</code> argument in the screenshot above became <code>~wait_all:true</code> in the OCaml),
which is longer but clearer.</p>
<p>The OCaml code uses functions to create C structures, which looks pretty similar due to labels.
For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">const</span><span class="w"> </span><span class="n">VkSemaphoreGetFdInfoKHR</span><span class="w"> </span><span class="n">get_fd_info</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">sType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">semaphore</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">semaphore</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">handleType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT</span><span class="p">,</span>
</span><span class="line"><span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">get_fd_info</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Semaphore_get_fd_info_khr</span><span class="p">.</span><span class="n">make</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">~</span><span class="n">semaphore</span>
</span><span class="line">    <span class="o">~</span><span class="n">handle_type</span><span class="o">:</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">External_semaphore_handle_type_flags</span><span class="p">.</span><span class="n">sync_fd</span>
</span><span class="line"><span class="k">in</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>An advantage is that the <code>sType</code> field gets filled in automatically.</p>
<h3 id="enums-and-bit-fields">Enums and bit-fields</h3>
<p>Enumerations and bit-fields are namespaced, which is a lot clearer
as you can see which part is the name of the enum and which part is the particular value.
For example, <code>VK_ATTACHMENT_STORE_OP_STORE</code> becomes <code>Vkt.Attachment_store_op.Store</code>.
Also, OCaml usually knows the expected type and you can omit the module, so:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="n">VkAttachmentDescription</span><span class="w"> </span><span class="n">colorAttachment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">format</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_SAMPLE_COUNT_1_BIT</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">loadOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_LOAD_OP_CLEAR</span><span class="p">,</span><span class="w">  </span><span class="c1">// Clear framebuffer before rendering</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">storeOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_STORE_OP_STORE</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">stencilLoadOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_LOAD_OP_DONT_CARE</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">stencilStoreOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_STORE_OP_DONT_CARE</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">initialLayout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_IMAGE_LAYOUT_UNDEFINED</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">finalLayout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_IMAGE_LAYOUT_GENERAL</span><span class="p">,</span>
</span><span class="line"><span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">color_attachment</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Attachment_description</span><span class="p">.</span><span class="n">make</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">~</span><span class="n">format</span><span class="o">:</span><span class="n">format</span>
</span><span class="line">    <span class="o">~</span><span class="n">samples</span><span class="o">:</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Sample_count_flags</span><span class="p">.</span><span class="n">n1</span>
</span><span class="line">    <span class="o">~</span><span class="n">load_op</span><span class="o">:</span><span class="nc">Clear</span>	<span class="c">(* Clear framebuffer before rendering *)</span>
</span><span class="line">    <span class="o">~</span><span class="n">store_op</span><span class="o">:</span><span class="nc">Store</span>
</span><span class="line">    <span class="o">~</span><span class="n">stencil_load_op</span><span class="o">:</span><span class="nc">Dont_care</span>
</span><span class="line">    <span class="o">~</span><span class="n">stencil_store_op</span><span class="o">:</span><span class="nc">Dont_care</span>
</span><span class="line">    <span class="o">~</span><span class="n">initial_layout</span><span class="o">:</span><span class="nc">Undefined</span>
</span><span class="line">    <span class="o">~</span><span class="n">final_layout</span><span class="o">:</span><span class="nc">General</span>
</span><span class="line"><span class="k">in</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Bit-fields and enums get their own types (they're not just integers), so you can't use them in the wrong place
or try to combine things that aren't bit-fields (and so the <code>_BIT</code> suffix isn't needed).
One particularly striking example of the difference is that</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="p">.</span><span class="n">colorWriteMask</span><span class="w"> </span><span class="o">=</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_R_BIT</span><span class="w"> </span><span class="o">|</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_G_BIT</span><span class="w"> </span><span class="o">|</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_B_BIT</span><span class="w"> </span><span class="o">|</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_A_BIT</span><span class="p">,</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="o">~</span><span class="n">color_write_mask</span><span class="o">:</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Color_component_flags</span><span class="p">.</span><span class="o">(</span><span class="n">r</span> <span class="o">+</span> <span class="n">g</span> <span class="o">+</span> <span class="n">b</span> <span class="o">+</span> <span class="n">a</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>Vkt.Color_component_flags.(...)</code> brings all the module's symbols into scope,
including the <code>+</code> operator for combining the flags.</p>
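<p>To show how this kind of API can work, here is a minimal sketch of a flags module.
It is not Olivine's actual implementation: the module body and <code>to_int</code> are assumptions for illustration, and only the bit values are taken from the Vulkan headers.
The abstract type <code>t</code> stops flags being mixed up with ordinary integers, and the local <code>+</code> is just bitwise or:</p>
<pre><code class="ocaml">(* Illustrative only: a flags module with an abstract type and its own [+]. *)
module Color_component_flags : sig
  type t
  val r : t
  val g : t
  val b : t
  val a : t
  val ( + ) : t -> t -> t   (* combine two sets of flags *)
  val to_int : t -> int     (* raw value, as passed to C *)
end = struct
  type t = int
  let r = 1
  let g = 2
  let b = 4
  let a = 8
  let ( + ) = ( lor )
  let to_int t = t
end

(* The local open brings r, g, b, a and the flag-specific [+] into scope. *)
let mask = Color_component_flags.(r + g + b + a)
</code></pre>
<p>Outside the module, <code>Color_component_flags.r + 1</code> is a type error, because <code>t</code> is abstract there.</p>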
<h3 id="optional-fields">Optional fields</h3>
<p>The specification says which fields are optional. In C you can ignore that, but OCaml enforces it.
This can be annoying sometimes, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="p">.</span><span class="n">blendEnable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_FALSE</span><span class="p">,</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="o">~</span><span class="n">blend_enable</span><span class="o">:</span><span class="bp">false</span>
</span><span class="line"><span class="o">~</span><span class="n">src_color_blend_factor</span><span class="o">:</span><span class="nc">One</span>
</span><span class="line"><span class="o">~</span><span class="n">dst_color_blend_factor</span><span class="o">:</span><span class="nc">Zero</span>
</span><span class="line"><span class="o">~</span><span class="n">color_blend_op</span><span class="o">:</span><span class="nc">Add</span>
</span><span class="line"><span class="o">~</span><span class="n">src_alpha_blend_factor</span><span class="o">:</span><span class="nc">One</span>
</span><span class="line"><span class="o">~</span><span class="n">dst_alpha_blend_factor</span><span class="o">:</span><span class="nc">Zero</span>
</span><span class="line"><span class="o">~</span><span class="n">alpha_blend_op</span><span class="o">:</span><span class="nc">Add</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>because the spec says these are all non-optional, rather than that they are only needed when blending is enabled.</p>
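<p>The mechanism behind this is the difference between OCaml's required and optional labelled arguments.
The following is a cut-down sketch with made-up types (<code>blend_state</code>, <code>factor</code> and this <code>make</code> are not the generated Olivine code):</p>
<pre><code class="ocaml">type factor = Zero | One

type blend_state = {
  blend_enable : bool;
  src_factor : factor;
  dst_factor : factor;
  write_mask : int;
}

(* Hypothetical constructor in the style of the generated [make] functions:
   [?write_mask] is optional, the other labelled arguments are required. *)
let make ?(write_mask = 0xf) ~blend_enable ~src_factor ~dst_factor () =
  { blend_enable; src_factor; dst_factor; write_mask }

(* All required labels are given; [write_mask] takes its default. *)
let no_blend = make ~blend_enable:false ~src_factor:One ~dst_factor:Zero ()
</code></pre>
<p>Leaving out <code>~dst_factor</code> here would give a function still waiting for that argument rather than a <code>blend_state</code>,
so the type checker complains as soon as the result is used as one.</p>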
<p>There's a similar situation with the Wayland code:
the OCaml compiler requires you to provide a handler for all possible events.
For example, OCaml forced me to write a handler for the window <code>close</code> event
(and so closing the window works in the OCaml version, but not in the C one).
Likewise, if the compositor returns an error from <code>create_immed</code> the OCaml version logs it,
while the C version ignored the error message, because the C compiler didn't remind me about that.</p>
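<p>One way to get this kind of checking, sketched here with a record of handlers, is shown below.
ocaml-wayland's real handler API is different; the <code>handlers</code> and <code>event</code> types are assumptions made only to illustrate the idea.
Leaving a field out of the record is a compile-time error:</p>
<pre><code class="ocaml">(* Illustrative handler table: every field must be provided. *)
type handlers = {
  on_configure : width:int -> height:int -> unit;
  on_close : unit -> unit;
}

type event = Configure of int * int | Close

let closed = ref false
let size = ref (640, 480)

(* Omitting [on_close] here would be rejected by the compiler. *)
let handlers = {
  on_configure = (fun ~width ~height -> size := (width, height));
  on_close = (fun () -> closed := true);
}

(* Route each incoming event to its handler. *)
let dispatch h = function
  | Configure (width, height) -> h.on_configure ~width ~height
  | Close -> h.on_close ()
</code></pre>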
<h3 id="loading-shaders">Loading shaders</h3>
<p>Loading the shaders was easier.
The C version has code to load the shader bytecode from disk, but in the OCaml I used <a href="https://github.com/johnwhitington/ppx_blob">ppx_blob</a>
to include it at compile time, producing a self-contained executable file:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">load_shader_module</span> <span class="n">device</span> <span class="o">[%</span><span class="n">blob</span> <span class="s2">&quot;./vert.spv&quot;</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="logging">Logging</h3>
<p>OCaml has a somewhat standard logging library, so I was able to get the log messages shown as I wanted
without having to pipe the output through <code>awk</code>.
And, as a bonus, the log messages get written in the correct order now.
For example, the C libwayland logs:
<pre><code>wl_display#1.delete_id(3)
...
wl_callback#3.done(59067)
</code></pre>
<p>which appears to show a callback firing some time after it was deleted,
while <a href="https://github.com/talex5/ocaml-wayland">ocaml-wayland</a> logs:</p>
<pre><code>&lt;- wl_callback@3.done callback_data:1388855
&lt;- wl_display@1.delete_id id:3
</code></pre>
<h3 id="error-handling">Error handling</h3>
<p>The OCaml bindings return a <code>result</code> type for functions that can return errors,
using polymorphic variants to say exactly which errors can be returned by each function.
That's clever, but I found it pretty useless in practice and I followed the Olivine example code
in immediately turning every <code>Error</code> result into an exception.
You can then handle errors at a higher level (unlike the C, which just calls <code>exit</code>).
Maybe Olivine should be changed to do that itself.</p>
<p>I thought I'd been rigorous about checking for errors in the C, but I missed some places (e.g. <code>vkMapMemory</code>).
The OCaml compiler forced me to handle those too, of course.</p>
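<p>A minimal sketch of turning results into exceptions follows.
The error set, the <code>Vk_error</code> exception and the helper name <code>or_raise</code> are all made up for illustration; the real bindings list the exact errors for each Vulkan function:</p>
<pre><code class="ocaml">exception Vk_error of string

(* Hypothetical error set for one function, as a polymorphic variant. *)
type map_error = [ `Out_of_host_memory | `Memory_map_failed ]

let string_of_error : map_error -> string = function
  | `Out_of_host_memory -> "out of host memory"
  | `Memory_map_failed -> "memory map failed"

(* Unwrap an [Ok], or raise an exception tagged with the call's name. *)
let or_raise name = function
  | Ok v -> v
  | Error e -> raise (Vk_error (name ^ ": " ^ string_of_error e))
</code></pre>
<p>A caller can then write <code>or_raise "map_memory" result</code> at each call site and catch <code>Vk_error</code> once, at whatever level makes sense.</p>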
<h2 id="refactored-version">Refactored version</h2>
<p>One reason to switch to OCaml was that I was finding it hard to see how all the C code fit together.
I felt that the overall structure was getting lost in the noise.
While the initial OCaml version was similar to the C,
I think <a href="https://github.com/talex5/vulkan-test/tree/ocaml/src">the refactored version</a> is quite a bit easier to read.</p>
<p>Moving code to separate files is much easier than in C.
There, you typically need to write a header file too, and then include it from the other files.
But in the OCaml I could just move e.g. <code>export_semaphore</code> to <code>export</code> in a new file called <code>semaphore.ml</code> and
refer to it as <code>Semaphore.export</code>.
Because each file gets its own namespace, you don't have to guess where functions are defined,
and you don't get naming conflicts between symbols in different files.
The build system (dune) automatically builds all modules in the correct order.</p>
<h3 id="olivine-wrappers">Olivine wrappers</h3>
<p>I added a <code>vulkan</code> directory with wrappers around the auto-generated Vulkan functions
with the aim of removing some noise.
For example, the wrappers take OCaml lists and convert them to C arrays as needed,
and raise exceptions on error instead of returning a result type.</p>
<p>Sometimes they do more, as in the case of <code>queue_submit</code>.
That took separate <code>wait_semaphores</code> and <code>wait_dst_stage_mask</code> arrays,
requiring them to be the same length.
By taking a list of tuples, the wrapper avoids the possibility of this error.
The old submit code:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">wait_semaphores</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Semaphore</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">t</span><span class="o">.</span><span class="n">image_available</span><span class="o">]</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">wait_stages</span> <span class="o">=</span> <span class="o">[</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Pipeline_stage_flags</span><span class="p">.</span><span class="n">color_attachment_output</span><span class="o">]</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">submit_info</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Submit_info</span><span class="p">.</span><span class="n">make</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">~</span><span class="n">wait_semaphores</span>
</span><span class="line">    <span class="o">~</span><span class="n">wait_dst_stage_mask</span><span class="o">:(</span><span class="nn">A</span><span class="p">.</span><span class="n">of_list</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Pipeline_stage_flags</span><span class="p">.</span><span class="n">ctype</span> <span class="n">wait_stages</span><span class="o">)</span>
</span><span class="line">    <span class="o">~</span><span class="n">command_buffers</span><span class="o">:(</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Command_buffer</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">t</span><span class="o">.</span><span class="n">command_buffer</span><span class="o">])</span>
</span><span class="line">    <span class="o">~</span><span class="n">signal_semaphores</span><span class="o">:(</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Semaphore</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">frame_state</span><span class="o">.</span><span class="n">render_finished</span><span class="o">])</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="nn">Vkc</span><span class="p">.</span><span class="n">queue_submit</span> <span class="n">t</span><span class="o">.</span><span class="n">graphics_queue</span> <span class="bp">()</span>
</span><span class="line">  <span class="o">~</span><span class="n">submits</span><span class="o">:(</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Submit_info</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">submit_info</span><span class="o">])</span>
</span><span class="line">  <span class="o">~</span><span class="n">fence</span><span class="o">:</span><span class="n">t</span><span class="o">.</span><span class="n">in_flight_fence</span> <span class="o">&lt;?&gt;</span> <span class="s2">&quot;queue_submit&quot;</span><span class="o">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Vulkan</span><span class="p">.</span><span class="nn">Cmd</span><span class="p">.</span><span class="n">submit</span> <span class="n">device</span> <span class="n">t</span><span class="o">.</span><span class="n">command_buffer</span>
</span><span class="line">  <span class="o">~</span><span class="n">wait</span><span class="o">:[</span><span class="n">t</span><span class="o">.</span><span class="n">image_available</span><span class="o">,</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Pipeline_stage_flags</span><span class="p">.</span><span class="n">color_attachment_output</span><span class="o">]</span>
</span><span class="line">  <span class="o">~</span><span class="n">signal_semaphores</span><span class="o">:[</span><span class="n">frame_state</span><span class="o">.</span><span class="n">render_finished</span><span class="o">]</span>
</span><span class="line">  <span class="o">~</span><span class="n">fence</span><span class="o">:</span><span class="n">t</span><span class="o">.</span><span class="n">in_flight_fence</span><span class="o">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Sometimes the new API drops features I don't use (or don't currently understand).
For example, my new <code>submit</code> only lets you submit one command buffer at a time
(though each buffer can have many commands).</p>
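<p>The tuple-list trick can be sketched in isolation like this.
The <code>semaphore</code> and <code>stage</code> types are stand-ins, and <code>wait_arrays</code> is a hypothetical helper rather than the wrapper's real code:</p>
<pre><code class="ocaml">(* Stand-ins for the real semaphore and pipeline-stage types. *)
type semaphore = Semaphore of string
type stage = Color_attachment_output | Transfer

(* Callers pass (semaphore, stage) pairs; the two parallel arrays that the
   C API wants are derived from the same list, so their lengths must match. *)
let wait_arrays (wait : (semaphore * stage) list) =
  let semaphores, stages = List.split wait in
  (Array.of_list semaphores, Array.of_list stages)
</code></pre>
<p>Because both arrays come from one list, there is no way for a caller to supply three semaphores and two stage masks.</p>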
<p>I moved various generic helper functions like <code>find_memory_type</code> to the wrapper library,
getting them out of the main application code.</p>
<p>Separating out these libraries made the code longer, but I think it makes it easier to read:</p>
<pre><code> 20 files changed, 843 insertions(+), 663 deletions(-)
</code></pre>
<h3 id="using-fibers--effects-for-control-flow">Using fibers / effects for control flow</h3>
<p>The C code has a single thread with a single stack,
using callbacks to redraw when the compositor is ready.
OCaml has fibers (light-weight cooperative threads), so we can use a plain loop:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">while</span> <span class="n">t</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;</span> <span class="n">frame_limit</span> <span class="k">do</span>
</span><span class="line">  <span class="k">let</span> <span class="n">next_frame_due</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">frame</span> <span class="n">window</span> <span class="k">in</span>
</span><span class="line">  <span class="n">draw_frame</span> <span class="n">t</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Promise</span><span class="p">.</span><span class="n">await</span> <span class="n">next_frame_due</span><span class="o">;</span>
</span><span class="line">  <span class="n">t</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;-</span> <span class="n">t</span><span class="o">.</span><span class="n">frame</span> <span class="o">+</span> <span class="mi">1</span>
</span><span class="line"><span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>Promise.await</code> suspends this fiber, allowing e.g. the Wayland code to handle incoming events.
I find that makes the logic easier to follow.</p>
<h3 id="using-the-cpu-and-gpu-in-parallel">Using the CPU and GPU in parallel</h3>
<p>Next I split off the input handling from the huge <code>render.ml</code> file into <a href="https://github.com/talex5/vulkan-test/tree/ocaml/src/input.ml">input.ml</a>.</p>
<p>The Vulkan tutorial creates one uniform buffer for the input data for each framebuffer, but this seems wasteful.
I think we need at most two: one for the GPU to read, and one for the CPU to write for the next frame,
if we want to do that in parallel.</p>
<p>To allow this parallel operation I also had to create a pair of command buffers.
The <a href="https://github.com/talex5/vulkan-test/tree/ocaml/src/duo.ml">duo.ml</a> module holds the two (input, command-buffer) jobs and swaps them on submit.</p>
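<p>The idea is roughly as follows. This is a simplified C sketch with invented names (<code>struct job</code>, <code>struct duo</code>, <code>duo_get</code>, <code>duo_swap</code>) and plain integers standing in for the Vulkan handles; it is not the code in <code>duo.ml</code>, and it leaves out waiting for the GPU to finish with a slot:</p>
<pre><code>/* Hypothetical job: the per-frame input data plus the commands that read it. */
struct job {
  int uniform_buffer;  /* Stand-in for a handle to the uniform (input) buffer */
  int command_buffer;  /* Stand-in for a handle to the command buffer */
};

/* Two jobs: the CPU prepares one while the GPU may still be running the other. */
struct duo {
  struct job jobs[2];
  int current;  /* Index (0 or 1) of the job the CPU may write to */
};

/* The job the CPU should fill in for the next frame. */
static struct job *duo_get(struct duo *d) {
  return &amp;d-&gt;jobs[d-&gt;current];
}

/* Called after submitting the current job to the GPU: switch to the other slot.
   A real version must first wait (e.g. on a fence) until the GPU has finished
   with that slot; this sketch omits the waiting. */
static void duo_swap(struct duo *d) {
  d-&gt;current = 1 - d-&gt;current;
}
</code></pre>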
<h3 id="resizing-and-resource-lifetimes">Resizing and resource lifetimes</h3>
<p>When the window size changes we need to destroy the old swap-chain and recreate all the images, views
and framebuffers.
My C code didn't bother, and just kept things at 640x480.</p>
<p>The main problem here is how to clean up the old resources.
We could use the garbage collector, but the framebuffers are rather large and I'd like to get them freed promptly.
Also, Vulkan requires things to be freed in the correct order, which the GC wouldn't ensure.</p>
<p>I added code to free resources by having each constructor take a <code>sw</code> switch argument.
When the switch is turned off, all resources attached to it are freed.
That makes it easy to scope things to the stack: when the <code>Switch.run</code> block ends, all resources it created are freed.</p>
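<p>For readers more used to C, the same idea can be sketched as a stack of release handlers that run newest-first when the scope ends, assuming that the reverse of the creation order is a safe order to free things in. The <code>scope</code> type and its functions are invented for this illustration and are not how the OCaml switch is actually implemented:</p>
<pre><code>#include &lt;stddef.h&gt;

#define SCOPE_MAX 32  /* Arbitrary limit for this sketch */

typedef void (*release_fn)(void *resource);

/* Hypothetical scope object: a fixed-size stack of pending release handlers. */
struct scope {
  release_fn fns[SCOPE_MAX];
  void *resources[SCOPE_MAX];
  size_t count;
};

/* Register a handler to run when the scope ends. Returns 0 on success, -1 if full. */
static int scope_on_release(struct scope *s, release_fn fn, void *resource) {
  if (s-&gt;count == SCOPE_MAX) return -1;
  s-&gt;fns[s-&gt;count] = fn;
  s-&gt;resources[s-&gt;count] = resource;
  s-&gt;count++;
  return 0;
}

/* End the scope: run handlers newest-first, so dependants go before their dependencies. */
static void scope_finish(struct scope *s) {
  while (s-&gt;count &gt; 0) {
    s-&gt;count--;
    s-&gt;fns[s-&gt;count](s-&gt;resources[s-&gt;count]);
  }
}
</code></pre>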
<p>But the life-cycle of the swap-chain is a little complicated.
I don't want to clutter the main application loop with the logic of adapting to size changes.
Again, OCaml's fibers make it easy to have multiple stacks, so I have another fiber run:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">render_loop</span> <span class="n">t</span> <span class="n">duo</span> <span class="o">=</span>
</span><span class="line">  <span class="k">while</span> <span class="bp">true</span> <span class="k">do</span>
</span><span class="line">    <span class="k">let</span> <span class="n">geometry</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">geometry</span> <span class="n">t</span><span class="o">.</span><span class="n">window</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">Switch</span><span class="p">.</span><span class="n">run</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">sw</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">framebuffers</span> <span class="o">=</span> <span class="n">create_swapchain</span> <span class="o">~</span><span class="n">sw</span> <span class="n">t</span> <span class="n">geometry</span> <span class="k">in</span>
</span><span class="line">    <span class="k">while</span> <span class="n">geometry</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">geometry</span> <span class="n">t</span><span class="o">.</span><span class="n">window</span> <span class="k">do</span>
</span><span class="line">      <span class="k">let</span> <span class="n">fb</span> <span class="o">=</span> <span class="nn">Vulkan</span><span class="p">.</span><span class="nn">Swap_chain</span><span class="p">.</span><span class="n">get_framebuffer</span> <span class="n">framebuffers</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">redraw_needed</span> <span class="o">=</span> <span class="n">next_as_promise</span> <span class="n">t</span><span class="o">.</span><span class="n">redraw_needed</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">job</span> <span class="o">=</span> <span class="nn">Duo</span><span class="p">.</span><span class="n">get</span> <span class="n">duo</span> <span class="k">in</span>
</span><span class="line">      <span class="n">record_commands</span> <span class="n">t</span> <span class="n">job</span> <span class="n">fb</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Duo</span><span class="p">.</span><span class="n">submit</span> <span class="n">duo</span> <span class="n">fb</span> <span class="n">job</span><span class="o">.</span><span class="n">command_buffer</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Window</span><span class="p">.</span><span class="n">attach</span> <span class="n">t</span><span class="o">.</span><span class="n">window</span> <span class="o">~</span><span class="n">buffer</span><span class="o">:</span><span class="n">fb</span><span class="o">.</span><span class="n">wl_buffer</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Promise</span><span class="p">.</span><span class="n">await</span> <span class="n">redraw_needed</span>
</span><span class="line">    <span class="k">done</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The C code created a fixed set of 4 framebuffers on each resize, but the OCaml version only creates them as needed.
When dragging the window to resize, that means we may only need to create one at each size,
and when keeping a steady size, it seems I only need 3 framebuffers with Sway.</p>
<p>The main loop changes slightly so that it just triggers the <code>render_loop</code> fiber:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">while</span> <span class="n">render</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;</span> <span class="n">frame_limit</span> <span class="k">do</span>
</span><span class="line">  <span class="k">let</span> <span class="n">next_frame_due</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">frame</span> <span class="n">window</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Render</span><span class="p">.</span><span class="n">trigger_redraw</span> <span class="n">render</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Promise</span><span class="p">.</span><span class="n">await</span> <span class="n">next_frame_due</span><span class="o">;</span>
</span><span class="line">  <span class="n">render</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;-</span> <span class="n">render</span><span class="o">.</span><span class="n">frame</span> <span class="o">+</span> <span class="mi">1</span>
</span><span class="line"><span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I'm not sure if freeing the framebuffers immediately is safe,
since in theory the GPU might still be using them if the display server requests a new frame
at a new size before the previous one has finished rendering on the GPU.
Possibly freed OCaml resources should instead get added to
a list of things to free on the C side the next time the GPU is idle.</p>
<h2 id="the-3d-version">The 3D version</h2>
<p>Although it looks a lot more impressive, the <a href="https://github.com/talex5/vulkan-test/commit/ocaml-3d">3D version</a> isn't that much more work than the 2D triangle.</p>
<p>I used the <a href="https://github.com/Chris00/ocaml-cairo">Cairo</a> library to load the PNG file with the textures
and then added a Vulkan <em>sampler</em> for it.
The shader code has to be modified to read the colour from the texture.
The most complex bit is that the texture needs to be copied
from Cairo's memory to host memory that's visible to the GPU,
and from there to fast local memory on the GPU (see <a href="https://github.com/talex5/vulkan-test/tree/ocaml-3d/src/texture.ml">texture.ml</a>).</p>
<p>Other changes needed:</p>
<ul>
<li>There's a bit of matrix stuff to position the model and project it in 3D.
</li>
<li>I added <a href="https://github.com/talex5/vulkan-test/tree/ocaml-3d/src/obj_format.ml">obj_format.ml</a> to parse the model data.
</li>
<li>The pipeline adds a depth buffer so near things obscure things behind them, regardless of the drawing order.
</li>
</ul>
<p>I didn't get my C version to do the 3D bits, but for comparison here's the Vulkan tutorial's official <a href="https://docs.vulkan.org/tutorial/latest/_attachments/28_model_loading.cpp">C++ version</a>.</p>
<h2 id="garbage-collection">Garbage collection</h2>
<p>To render smoothly at 60Hz, we have about 16.7ms (1/60 of a second) for each frame.
You might wonder if using a garbage collector would introduce pauses and cause us to miss frames,
but this doesn't seem to be a problem.</p>
<p>In C, you can improve performance for frame-based applications by using a <a href="https://en.wikipedia.org/wiki/Region-based_memory_management">bump allocator</a>:</p>
<ol>
<li>Create a fixed buffer with enough space for every allocation needed for one frame.
</li>
<li>Allocate memory sequentially from the region, just bumping the next-free-address pointer.
</li>
<li>At the end of each frame, reset the pointer.
</li>
</ol>
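<p>As a rough illustration of those three steps, here is a minimal bump allocator sketch in C. The names (<code>frame_arena</code>, <code>arena_alloc</code>, <code>arena_reset</code>) and the 16-byte alignment are assumptions made for this example, and it assumes the buffer itself starts at a suitably aligned address:</p>
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical per-frame arena: one fixed buffer plus a "next free" offset. */
struct frame_arena {
  uint8_t *base;    /* Start of the fixed buffer (step 1) */
  size_t capacity;  /* Total size of the buffer in bytes */
  size_t next;      /* Offset of the next free byte */
};

/* Step 2: allocate by bumping the offset.
   Returns NULL if the frame's budget is exhausted. */
static void *arena_alloc(struct frame_arena *a, size_t size) {
  size_t align = 16;  /* Assumed alignment, enough for common types */
  size_t start = (a-&gt;next + align - 1) &amp; ~(align - 1);  /* Round up */
  if (start &gt; a-&gt;capacity || size &gt; a-&gt;capacity - start) return NULL;
  a-&gt;next = start + size;
  return a-&gt;base + start;
}

/* Step 3: "free" everything allocated this frame by resetting the offset. */
static void arena_reset(struct frame_arena *a) {
  a-&gt;next = 0;
}
</code></pre>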
<p>This makes allocation really fast and freeing things at the end costs nothing.
Implementing this in C requires special code,
but OCaml works this way by default, allocating new values sequentially onto the <em>minor heap</em>.
At the end of each frame, we can call <code>Gc.minor</code> to reset the heap.</p>
<p><code>Gc.minor</code> scans the stack looking for pointers to values that are still in use
and copies any it finds to the <em>major heap</em>.
However, since we're at the end of the frame, the stack is pretty much empty and there's almost nothing to scan.
I captured a trace of running the 3D room version with a forced minor GC at the end of every frame:</p>
<pre><code>make &amp;&amp; eio-trace run ./_build/default/src/main.exe
</code></pre>
<p><a href="/blog/images/vulkan-ocaml/trace-3d.png"><span class="caption-wrapper center"><img src="/blog/images/vulkan-ocaml/trace-3d.png" title="Tracing the full 3D version" class="caption"/><span class="caption-text">Tracing the full 3D version</span></span></a></p>
<p>The four long grey horizontal bars are the main fibers.
From top to bottom they are:</p>
<ul>
<li>The main application loop (incrementing the frame counter and triggering the render loop fiber).
</li>
<li>An <a href="https://github.com/talex5/ocaml-wayland">ocaml-wayland</a> fiber, receiving messages from the display server
(and spawning some short-lived sending fibers).
</li>
<li>The <code>render_loop</code> fiber (sending graphics commands to the GPU).
</li>
<li>A fiber used internally by the IO system.
</li>
</ul>
<p>The green sections show when each fiber is running and the yellow background indicates when the process is sleeping.
The thin red columns indicate time spent in GC (which we're here triggering after every frame).</p>
<p>If I remove the forced <code>Gc.minor</code> after each frame then the GC happens less often,
but can take a bit longer when it does.
Still not nearly long enough to miss the deadline for rendering the frame though.</p>
<p>Collection of the major heap is done incrementally in small slices and doesn't cause any trouble.</p>
<p>So, we're only using a tiny fraction of the available time.
Also, I suspect the CPU is running in a slow power-saving mode due to all the sleeping;
if we had more work to do then it would probably speed up.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Doing Vulkan programming in OCaml has advantages (clearer code, easier refactoring),
but also disadvantages (unfinished and unreleased Vulkan bindings, some friction using a C API from OCaml,
and the need to write more support code, such as some bindings for libdrm).</p>
<p>As a C API, Vulkan is not safe and will happily segfault if passed incorrect arguments.
The OCaml bindings do not fix this, and so care is still needed.
I didn't bother about that because it wasn't a problem in practice,
and properly protecting against use-after-free will probably require some changes to OCaml
(e.g. <a href="https://github.com/ocaml/ocaml/pull/389">unmapping memory</a> isn't safe without something like the &quot;modes&quot; being prototyped in
<a href="https://oxcaml.org/">OxCaml</a>).</p>
<p>I'm slowly upstreaming my changes to Olivine; hopefully this will all be easier to use one day!</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Investigating Linux graphics</title>
    <link href="https://roscidus.com/blog/blog/2025/06/24/graphics/"></link>
    <updated>2025-06-24T09:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2025/06/24/graphics</id>
    <content type="html"><![CDATA[<p>I learn how to draw a triangle with a GPU, and then trace the code to find out how the graphics system works (or doesn't),
looking at Mesa3D, GLFW, OpenGL, Vulkan, Wayland and Linux DRM.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
</li>
<li><a href="#overview">Overview</a>
</li>
<li><a href="#opengl">OpenGL</a>
</li>
<li><a href="#vulkan">Vulkan</a>
</li>
<li><a href="#synchronisation">Synchronisation</a>
</li>
<li><a href="#first-attempt-at-tracing">First attempt at tracing</a>
</li>
<li><a href="#removing-glfw">Removing GLFW</a>
</li>
<li><a href="#removing-vulkans-wayland-extension">Removing Vulkan's Wayland extension</a>
</li>
<li><a href="#wayland-walk-through">Wayland walk-through</a>
</li>
<li><a href="#kernel-details-with-bpftrace">Kernel details with bpftrace</a>
<ul>
<li><a href="#start-up-and-library-loading">Start-up and library loading</a>
</li>
<li><a href="#enumerating-devices">Enumerating devices</a>
</li>
<li><a href="#setting-up-the-pipeline">Setting up the pipeline</a>
</li>
<li><a href="#rendering-one-frame">Rendering one frame</a>
</li>
</ul>
</li>
<li><a href="#re-examining-the-errors">Re-examining the errors</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>In the past, I've avoided graphics driver problems on Linux by using only Intel integrated graphics.
But, due to a poor choice of motherboard, I ended up needing a separate graphics card.
Now my computer takes 14s to resume from suspend and <code>dmesg</code> is spewing this kind of thing:</p>
<pre><code>[59829.886009] [drm] Fence fallback timer expired on ring sdma0
[59830.390003] [drm] Fence fallback timer expired on ring sdma0
[59830.894002] [drm] Fence fallback timer expired on ring sdma0

[79622.739495] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
[79622.909019] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.2 test failed (-110)
[79623.075056] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.3 test failed (-110)
[79623.241971] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.4 test failed (-110)
[79623.408604] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.6 test failed (-110)

[80202.893020] [drm] scheduler comp_1.0.1 is not ready, skipping
[80202.893023] [drm] scheduler comp_1.0.2 is not ready, skipping
[80202.893024] [drm] scheduler comp_1.0.3 is not ready, skipping
[80202.893025] [drm] scheduler comp_1.0.4 is not ready, skipping
[80202.893025] [drm] scheduler comp_1.0.6 is not ready, skipping
[80202.936910] [drm] scheduler comp_1.0.1 is not ready, skipping
</code></pre>
<p>But what is a &quot;fence&quot; or an &quot;sdma0&quot; ring? What are these <code>comp_</code> schedulers,
and why does Linux <a href="https://gitlab.freedesktop.org/drm/amd/-/issues/3579">Oops when enough of them aren't ready</a>?
And why does <a href="https://github.com/NixOS/nixpkgs/issues/409093">Firefox hang</a> when playing videos since upgrading NixOS?
I thought it was time I learnt something about how Linux graphics is supposed to work...</p>
<h2 id="overview">Overview</h2>
<p>To show something on the screen, we allocate a chunk of memory (called a <em>framebuffer</em>) to hold the colour of each pixel.
After calculating all the (millions of) colour values, we tell the <em>display hardware</em> the address of the framebuffer
and it sends all the values to the monitor for display.
While it's doing this, we can be rendering the next frame to another framebuffer.</p>
<p>Computers aren't well optimised for this kind of work, but a <em>graphics card</em> speeds things up.
A graphics card is like a second computer, with its own memory, processors and display hardware,
but optimised for graphics:</p>
<dl><dt>The main computer (host)</dt>
<dd>
Typically has a <em>small number</em> of <em>very fast</em> processors.
</dd>
<dt>A graphics card</dt>
<dd>
Has a <em>large number</em> of <em>relatively slow</em> processors.
</dd>
</dl>
<p>The graphics card architecture is useful because we can split the screen into many small tiles
and render them in parallel on different processors.
Running the processors relatively slowly saves energy (and heat), allowing us to have more of them.</p>
<p>Note: a GPU (Graphics Processing Unit) doesn't have to be on a separate card; it can also be part of the main computer
and use the main memory instead of dedicated RAM.</p>
<p>Usually we run multiple applications and have them share the screen.
Ideally, each application (e.g. Firefox) runs code on the GPU to render its window contents to GPU memory (1),
then shares the reference to that memory with the <em>display server</em> (2).
The display server (Sway in my case) runs more code on the GPU (3) to copy this window into the final image (4),
which the hardware sends to the screen (5):</p>
<p><a href="/blog/images/graphics/arch.svg"><span class="caption-wrapper center"><img src="/blog/images/graphics/arch.svg" title="Wayland desktop with a graphics card" class="caption"/><span class="caption-text">Wayland desktop with a graphics card</span></span></a></p>
<p>Application processes send instructions to the GPU via a Linux kernel driver (<code>amdgpu</code> in my case).
Every GPU has its own API, and these are very low-level,
so applications generally use the <a href="https://mesa3d.org/">Mesa</a> library to provide an API that works across all devices.
Mesa has backends for all the different GPUs, and also a software-rendering fallback if there isn't a GPU available.</p>
<p><a href="https://lwn.net/Articles/955376/">The Linux graphics stack in a nutshell</a> has more explanations, but I wanted to try it out myself...</p>
<h2 id="opengl">OpenGL</h2>
<p>Mesa supports OpenGL, a cross-platform standard API for graphics.
However, you also need some platform-specific code to open a window and connect up a suitable backend.
After a bit of searching I found <a href="https://www.glfw.org/">GLFW</a> (Graphics Library FrameWork),
which has a nice <a href="https://www.glfw.org/docs/latest/quick.html">tutorial</a> showing how to draw a triangle into a window.</p>
<p>That worked, and I got a window with a colourful rotating triangle,
animating smoothly even fullscreen, while showing no CPU load
(use <code>LIBGL_ALWAYS_SOFTWARE=true</code> to compare with software rendering):</p>
<p><a href="/blog/images/graphics/opengl.png"><span class="caption-wrapper center"><img src="/blog/images/graphics/opengl.png" title="OpenGL triangle" class="caption"/><span class="caption-text">OpenGL triangle</span></span></a></p>
<p>It starts by setting everything up:</p>
<ol>
<li>Create a window (<code>window = glfwCreateWindow(640, 480, ...)</code>)
</li>
<li>Make it the OpenGL target (<code>glfwMakeContextCurrent(window)</code>)
</li>
<li>Create a buffer for the triangle's vertices and send it to the graphics card.
</li>
<li>Compile vertex and fragment &quot;shaders&quot; for the graphics card.
</li>
</ol>
<p>Programs that run on the GPU are called &quot;shaders&quot; (whether they do shading or not).
They're written in a C-like language and embedded in the example's C source code as strings.</p>
<p>The vertex shader runs for each of the triangle's 3 vertices, rotating them according to an input parameter.
The fragment shader then runs for each screen pixel covered by the rotated triangle, choosing its colour.
This is basically just the identity function,
because OpenGL automatically interpolates the colours of the three vertices
and passes that as an input to the fragment shader.</p>
<p>The example then runs the main loop, rendering each frame:</p>
<ol>
<li>Set the OpenGL viewport to the size of the window.
</li>
<li>Clear to the background colour.
</li>
<li>Set the vertex shader input to the desired rotation.
</li>
<li>Draw the triangle, using the shaders.
</li>
<li>Make the final image be the one displayed (<code>glfwSwapBuffers(window)</code>).
</li>
</ol>
<p>Mesa offers several other APIs as alternatives to OpenGL.
OpenGL ES (OpenGL for Embedded Systems) is mostly a subset of OpenGL,
though with some minor improvements of its own.
But I was particularly interested in Vulkan...</p>
<h2 id="vulkan">Vulkan</h2>
<p>Vulkan describes itself as &quot;a low-level API that removes many of the
abstractions found in previous generation graphics APIs&quot;.
For example, each OpenGL driver includes a compiler for the shader language,
but in Vulkan you compile the shader sources to <a href="https://en.wikipedia.org/wiki/Standard_Portable_Intermediate_Representation">SPIR-V bytecode</a> with an external tool,
then just pass the bytecode to the driver.
So, a Vulkan driver should be simpler and easier to understand.</p>
<p>The <a href="https://docs.vulkan.org/tutorial/latest/00_Introduction.html">Vulkan tutorial</a> warns that
&quot;the price you pay for these benefits is that you have to work with a significantly more verbose API&quot;.
Indeed; while the OpenGL triangle example is 171 lines of C, Vulkan's triangle example is 900 lines of C++!</p>
<p>But this is just what I want; a detailed break-down of the different steps.
There are a lot of objects you need to create and I ended up making a diagram to keep them straight:</p>
<p><a href="/blog/images/graphics/vulkan-deps.svg"><span class="caption-wrapper center"><img src="/blog/images/graphics/vulkan-deps.svg" title="Vulkan object dependencies" class="caption"/><span class="caption-text">Vulkan object dependencies</span></span></a></p>
<p>Here are the set-up steps:</p>
<ul>
<li>As before, we start by creating a normal Wayland window.
</li>
<li>Make a Vulkan <em>instance</em>, requesting the Wayland-support extension.
</li>
<li>Use the instance and the window to create a Vulkan <em>surface</em>.
</li>
<li>Use the instance to get a list of the <em>physical devices</em>.
In my case, this finds two devices: <code>AMD Radeon RX 550 Series</code> and <code>llvmpipe</code> (software rendering).
Oddly, my Intel integrated GPU doesn't show up here.<br />
[Update: in the BIOS settings, &quot;Internal Graphics&quot; was set to &quot;Auto&quot;, which disabled it]
</li>
<li>Create a <em>device</em> for the physical device you want to use.
Each device can have queues of various &quot;families&quot;, and we need to say how many queues we want of each type.
We need to choose a device supporting queues that can a) render graphics, and b) present the rendered image on our Wayland surface.
My AMD GPU supports three queue families, of which two can present to the surface and one can render graphics.
I chose to use two queues, though I could have used one for both on my card.
</li>
<li>Create a <em>render pass</em> with metadata about what the rendering will do.
In the tutorial example,
this says that rendering will clear the framebuffer, and that it depends on getting an image to draw on.
</li>
<li>Create the <em>shaders</em> by loading the pre-compiled SPIR-V bytecode.
</li>
<li>Create a <em>graphics pipeline</em> to implement the render pass using the shaders.
</li>
<li>Create a <em>swap chain</em> for the device and surface.
The swap chain creates a number of <em>images</em> which can be rendered to and displayed.
</li>
<li>Wrap each image from the swap chain in an <em>image view</em>, which gives the format of the data.
</li>
<li>Wrap each image view with a <em>framebuffer</em> for the <em>render pass</em>.
</li>
<li>Create a <em>command pool</em> to manage command buffers.
</li>
<li>Get a <em>command buffer</em> from the pool.
</li>
</ul>
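<p>The queue-selection step above can be sketched in plain C. To stay self-contained, this sketch does not include the Vulkan headers: it copies the values of the standard <code>VK_QUEUE_*</code> flag bits, and takes presentation support as data supplied by the caller (in a real program that would come from <code>vkGetPhysicalDeviceSurfaceSupportKHR</code>). The <code>queue_family</code> struct and <code>pick_queue_families</code> function are invented for this example:</p>
<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;

/* Same values as VK_QUEUE_GRAPHICS_BIT etc. in the Vulkan headers. */
enum {
  QUEUE_GRAPHICS_BIT = 0x1,
  QUEUE_COMPUTE_BIT = 0x2,
  QUEUE_TRANSFER_BIT = 0x4,
  QUEUE_SPARSE_BINDING_BIT = 0x8
};

/* Hypothetical summary of one queue family. */
struct queue_family {
  unsigned flags;    /* Bitwise OR of the QUEUE_*_BIT values */
  bool can_present;  /* Whether it can present to our surface */
};

/* Find a family for rendering and one for presenting.
   Stores their indices and returns true if both were found. */
static bool pick_queue_families(const struct queue_family *families, size_t n,
                                size_t *graphics, size_t *present) {
  bool have_graphics = false, have_present = false;
  for (size_t i = 0; i &lt; n; i++) {
    if (!have_graphics &amp;&amp; (families[i].flags &amp; QUEUE_GRAPHICS_BIT)) {
      *graphics = i;
      have_graphics = true;
    }
    if (!have_present &amp;&amp; families[i].can_present) {
      *present = i;
      have_present = true;
    }
  }
  return have_graphics &amp;&amp; have_present;
}
</code></pre>
<p>This picks the first match for each role, so on a card where one family can both render and present, the two indices come out the same.</p>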
<p>Now we're ready for the main loop:</p>
<ul>
<li>Get the next <em>image</em> from the <em>swap chain</em> (asking Vulkan to notify a <em>semaphore</em> when the image is ready to be overwritten).
</li>
<li>Record the operations we want to perform to the <em>command buffer</em>:
<ol>
<li>Clear the image's <em>framebuffer</em> to the background colour.
</li>
<li>Set the pipeline input's viewport to the current window size.
</li>
<li>Use the <em>graphics pipeline</em> to render the scene (a single triangle in this case).
</li>
</ol>
</li>
<li>Enqueue the <em>command buffer</em> to the device's <em>graphics queue</em>.
We pass it the semaphore to wait on (as required by the <em>render pass</em>),
another semaphore to signal when it's done,
and a <em>fence</em> to signal the host that the command buffer can be reused.
A &quot;fence&quot; turns out to be just a semaphore shared with the host
(whereas Vulkan uses &quot;semaphore&quot; to mean one used to signal within the GPU).
</li>
<li>Enqueue a presentation of the <em>image</em> to the device's <em>present queue</em>,
asking it to present the image (using the display server) when the rendering semaphore is notified.
</li>
</ul>
<p>Well, that was a lot more work! And animating the triangle required quite a bit more effort on top of that.
But I think I have a better understanding of how things work than with OpenGL,
and I suspect this will be easier to trace.</p>
<p>The <code>vulkaninfo</code> command will dump out pages of information about detected GPUs.
Some highlights from mine:</p>
<pre><code>Instance Extensions: count = 24
    VK_KHR_wayland_surface : extension revision 6
    VK_KHR_xcb_surface     : extension revision 6
    VK_KHR_xlib_surface    : extension revision 6
GPU id : 0 (AMD Radeon RX 550 Series (RADV POLARIS11)) [VK_KHR_wayland_surface]:
    VkSurfaceCapabilitiesKHR:
        minImageCount = 4
VkQueueFamilyProperties:
    queueProperties[0]:
        queueFlags = QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT
    queueProperties[1]:
        queueFlags = QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT
    queueProperties[2]:
        queueFlags = QUEUE_SPARSE_BINDING_BIT
GPU id : 1 (llvmpipe (LLVM 19.1.7, 256 bits)) [VK_KHR_wayland_surface]:
</code></pre>
<p>One interesting feature of Vulkan is that you can enable &quot;validation layers&quot;,
which check that you're using the API correctly during development, but can be turned off
for better performance in production.
That showed, for example, that the synchronisation in the tutorial isn't quite right
(see <a href="https://docs.vulkan.org/guide/latest/swapchain_semaphore_reuse.html">Swapchain Semaphore Reuse</a> for details; the tutorial uses only one <code>renderFinishedSemaphore</code>,
not one per framebuffer).</p>
<h2 id="synchronisation">Synchronisation</h2>
<p>One interesting thing here is that the swap chain sends the image to the compositor (display server) without waiting for rendering to finish.
How does that work?</p>
<p>There are two ways to make sure the compositor doesn't try to use the image before rendering is complete:
<em>implicit</em> or <em>explicit</em> synchronisation.
With implicit synchronisation, the Linux kernel keeps track of which GPU jobs are accessing which buffers.
So when we submit a job to render our triangle, the kernel attaches the job's completion fence to the output buffer.
When the compositor (Sway in my case) submits a GPU job to copy the image, the kernel waits for the fence first.</p>
<p>This apparently has various problems.
For example, the compositor might prefer to use the previous already-complete frame rather than waiting for this one.
There is an explicit synchronisation Wayland protocol to fix that, but my version of Sway doesn't support it.</p>
<p>For more details, see <a href="https://www.collabora.com/news-and-blog/blog/2022/06/09/bridging-the-synchronization-gap-on-linux/">Bridging the synchronization gap on Linux</a> and <a href="https://zamundaaa.github.io/wayland/2024/04/05/explicit-sync.html">Explicit sync</a>.</p>
<h2 id="first-attempt-at-tracing">First attempt at tracing</h2>
<p>How does the rendered image get from the test application to the Wayland compositor (display server)?
I ran with <code>WAYLAND_DEBUG=1</code> to log all the messages sent and received by the tutorial application:</p>
<pre><code>$ WAYLAND_DEBUG=1 ./vulkan-test 
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#2)
{Default Queue}  -&gt; wl_display#1.sync(new id wl_callback#3)
{Display Queue} wl_display#1.delete_id(3)
...
</code></pre>
<p>But the output was confusing, and
<code>strace</code> showed that the test application was connecting to the display server 4 times:</p>
<pre><code>$ strace -e connect ./vulkan-test 2&gt;&amp;1 | grep /wayland
connect(3, {sa_family=AF_UNIX, sun_path=&quot;/run/user/1000/wayland-1&quot;}, 27) = 0
connect(5, {sa_family=AF_UNIX, sun_path=&quot;/run/user/1000/wayland-1&quot;}, 27) = 0
connect(23, {sa_family=AF_UNIX, sun_path=&quot;/run/user/1000/wayland-1&quot;}, 27) = 0
connect(23, {sa_family=AF_UNIX, sun_path=&quot;/run/user/1000/wayland-1&quot;}, 27) = 0
</code></pre>
<p>Further complicating things, the example is using 6 Wayland queues:</p>
<pre><code>$ WAYLAND_DEBUG=1 ./vulkan-test 2&gt;&amp;1 | sed -n 's/.*\({[^}]*}\).*/\1/p' | sort | uniq -c
    618 {Default Queue}
    135 {Display Queue}
    288 {mesa formats query}
     72 {mesa image count query}
    144 {mesa present modes query}
    385 {mesa vk display queue}
</code></pre>
<p>In total the test application asks Sway for the list of supported extensions (protocols) 14 times!
It binds the same extensions (and calls <code>zwp_linux_dmabuf_v1.get_default_feedback</code>) over and over again:</p>
<pre><code>$ WAYLAND_DEBUG=1 ./vulkan-test 2&gt;&amp;1 | grep get_registry
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#2)
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#3)
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#21)
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#2)
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#2)
{Default Queue}  -&gt; wl_display#1.get_registry(new id wl_registry#2)
{mesa image count query}  -&gt; wl_display#1.get_registry(new id wl_registry#52)
{mesa formats query}  -&gt; wl_display#1.get_registry(new id wl_registry#51)
{mesa formats query}  -&gt; wl_display#1.get_registry(new id wl_registry#50)
{mesa present modes query}  -&gt; wl_display#1.get_registry(new id wl_registry#43)
{mesa present modes query}  -&gt; wl_display#1.get_registry(new id wl_registry#44)
{mesa formats query}  -&gt; wl_display#1.get_registry(new id wl_registry#46)
{mesa formats query}  -&gt; wl_display#1.get_registry(new id wl_registry#48)
{mesa vk display queue}  -&gt; wl_display#1.get_registry(new id wl_registry#42)
</code></pre>
<p>What a mess!</p>
<p>There's too much going on here.
GLFW is pulling in support for Wayland and X11, OpenGL and Vulkan, plus keyboard and pointer support.
It's also using 11 CPU threads!</p>
<h2 id="removing-glfw">Removing GLFW</h2>
<p>I removed the GLFW library and replaced it with a minimal Wayland skeleton (from <a href="https://wayland-book.com/">wayland-book.com</a>).
The main integration point between Wayland and Vulkan is the <code>VkSurfaceKHR</code>, which was previously created by GLFW.
I just had to replace <code>glfwCreateWindowSurface</code> with <code>vkCreateWaylandSurfaceKHR</code>.
That takes the libwayland <code>wl_display</code> and <code>wl_surface</code> objects and stores them in a Vulkan surface struct.</p>
<p><code>strace</code> says we're now making 3 Wayland connections rather than 4,
using 5 threads rather than 11, and calling <code>get_registry</code> 11 times rather than 14.
Still too much!</p>
<p>The noise is made worse by the Vulkan API's convention of calling enumeration functions twice,
once to find out how big the results array needs to be, and then a second time once you've allocated it.
For example, this is how the tutorial suggests getting the devices:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="kt">uint32_t</span><span class="w"> </span><span class="n">deviceCount</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
</span><span class="line"><span class="n">vkEnumeratePhysicalDevices</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">deviceCount</span><span class="p">,</span><span class="w"> </span><span class="n">nullptr</span><span class="p">);</span>
</span><span class="line"><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">VkPhysicalDevice</span><span class="o">&gt;</span><span class="w"> </span><span class="n">devices</span><span class="p">(</span><span class="n">deviceCount</span><span class="p">);</span>
</span><span class="line"><span class="n">vkEnumeratePhysicalDevices</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">deviceCount</span><span class="p">,</span><span class="w"> </span><span class="n">devices</span><span class="p">.</span><span class="n">data</span><span class="p">());</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Mesa makes a fresh connection to Wayland both times.
It does this to ensure it lists the compositor's default device first, even though we don't care about that
(we want to select a device to match our specific Wayland surface, not a generic fallback).
This seems to be a hack for applications that always pick the first returned device.</p>
<h2 id="removing-vulkans-wayland-extension">Removing Vulkan's Wayland extension</h2>
<p>To simplify things further, I removed <code>VK_KHR_wayland_surface</code> (which handles Wayland integration) from the instance extensions,
and replaced the swapchain with my own code.
This was actually quite difficult, and involved a fair bit of stepping through things with gdb to see how Mesa did it.</p>
<p>Oddly, the present queue doesn't seem to be needed at all.
Some code-paths require blitting some data first, and I think the queue is just being used to wait for that,
but the way I'm using it with Wayland doesn't need it.</p>
<p>I had to fight a bit with warnings from the validation layer (though it was displaying the triangle fine).
For example, Mesa creates the images with <code>VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT</code>, which seems reasonable,
but that triggers <code>VUID-VkImageCreateInfo-pNext-00990</code>.
I'm now using <code>VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT</code> instead, which seems wrong
but works and makes validation happy.</p>
<p>Even this simple version was still using 3 threads!
Running under <code>gdb</code> with <code>break __clone3</code> showed that
the Radeon driver's <code>radv_physical_device_try_create</code> function creates a couple of disk caches
(both called <code>vulkan-:disk$0</code>), which create threads.
For tracing purposes, having a cache just adds noise and confusion,
so I set <code>MESA_SHADER_CACHE_DISABLE=1</code> in the code to disable it.</p>
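<p>As a rough sketch of what "setting it in the code" can look like (the helper name <code>disable_mesa_shader_cache</code> is made up for illustration, and the repository may do this differently),
the variable just needs to be in the environment before the driver gets a chance to create its caches:</p>

```c
#include <stdlib.h>

/* Disable Mesa's on-disk shader cache for this process.
 * Hypothetical helper: call it before creating the Vulkan instance,
 * so the variable is already set when the driver initialises.
 * Returns 0 on success, -1 on failure (same as setenv). */
static int disable_mesa_shader_cache(void) {
  /* The final argument (1) overwrites any value inherited from the shell. */
  return setenv("MESA_SHADER_CACHE_DISABLE", "1", 1);
}
```

<p>Doing this inside the program rather than in a wrapper script means the trace looks the same however the binary is launched.</p>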
<p>The final version uses a single thread, connects only once to the Wayland compositor, and calls <code>get_registry</code> only once.
If you'd like to follow along, the code is at <a href="https://github.com/talex5/vulkan-test">vulkan-test</a> and you can run it like this:</p>
<pre><code>git clone https://github.com/talex5/vulkan-test.git
cd vulkan-test
nix develop
make &amp;&amp; ./vulkan-test 200
</code></pre>
<p>You should see a triangle sliding to the right.
The <code>200</code> is the number of frames it will show before quitting.</p>
<h2 id="wayland-walk-through">Wayland walk-through</h2>
<p>To run the test application with Wayland logging enabled:</p>
<pre><code>$ make trace &gt; wayland.log
</code></pre>
<p>The generated <a href="/blog/data/graphics/wayland.log">wayland.log</a> shows all messages exchanged with Sway (my Wayland compositor),
as well as some <code>printf</code> messages I added in the code.</p>
<p>The <a href="https://github.com/talex5/vulkan-test/blob/blog/main.c#L227">main</a> function starts by creating a new Vulkan instance with some extensions:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="o">*</span><span class="w"> </span><span class="n">instanceExtensions</span><span class="p">[]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="n">VK_KHR_GET_PHYSICAL_DEVICE_PROPERTIES_2_EXTENSION_NAME</span><span class="p">,</span><span class="w"> </span><span class="c1">// Get Unix device ID to compare with Wayland</span>
</span><span class="line"><span class="w">  </span><span class="n">VK_KHR_EXTERNAL_MEMORY_CAPABILITIES_EXTENSION_NAME</span><span class="p">,</span><span class="w">	  </span><span class="c1">// Share images over Wayland</span>
</span><span class="line"><span class="w">  </span><span class="n">VK_KHR_EXTERNAL_SEMAPHORE_CAPABILITIES_EXTENSION_NAME</span><span class="p">,</span><span class="w">  </span><span class="c1">// Use Linux sync files</span>
</span><span class="line"><span class="p">};</span>
</span><span class="line"><span class="n">VkInstance</span><span class="w"> </span><span class="n">instance</span><span class="p">;</span>
</span><span class="line"><span class="p">...</span>
</span><span class="line"><span class="n">LOG</span><span class="p">(</span><span class="s">&quot;Create instance with %d layers</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">createInfo</span><span class="p">.</span><span class="n">enabledLayerCount</span><span class="p">);</span>
</span><span class="line"><span class="n">vkCreateInstance</span><span class="p">(</span><span class="o">&amp;</span><span class="n">createInfo</span><span class="p">,</span><span class="w"> </span><span class="nb">NULL</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">instance</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You can uncomment the <code>//#define VALIDATION</code> if you want to enable the validation layer,
but for tracing I've turned it off:</p>
<pre><code>Create instance with 0 layers
</code></pre>
<p>Next, the test application <a href="https://github.com/talex5/vulkan-test/blob/blog/wayland.c#L128">connects to Wayland</a> and requests the registry of supported extensions (protocols):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="n">state</span><span class="o">-&gt;</span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wl_display_connect</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
</span><span class="line">
</span><span class="line"><span class="k">struct</span><span class="w"> </span><span class="nc">wl_registry</span><span class="w"> </span><span class="o">*</span><span class="n">registry</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wl_display_get_registry</span><span class="p">(</span><span class="n">state</span><span class="o">-&gt;</span><span class="n">display</span><span class="p">);</span>
</span><span class="line"><span class="n">wl_registry_add_listener</span><span class="p">(</span><span class="n">registry</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">registry_listener</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">);</span>
</span><span class="line"><span class="n">wl_display_roundtrip</span><span class="p">(</span><span class="n">state</span><span class="o">-&gt;</span><span class="n">display</span><span class="p">);</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The logs show Sway replying with lots of extensions.
When we see one we want, we <em>bind</em> a particular version of it:</p>
<pre><code> -&gt; wl_display#1.get_registry(new id wl_registry#2)
 -&gt; wl_display#1.sync(new id wl_callback#3)
wl_display#1.delete_id(3)
wl_registry#2.global(1, &quot;wl_shm&quot;, 2)
wl_registry#2.global(2, &quot;zwp_linux_dmabuf_v1&quot;, 4)
 -&gt; wl_registry#2.bind(2, &quot;zwp_linux_dmabuf_v1&quot;, 4, new id [unknown]#4)
wl_registry#2.global(3, &quot;wl_compositor&quot;, 6)
 -&gt; wl_registry#2.bind(3, &quot;wl_compositor&quot;, 4, new id [unknown]#5)
wl_registry#2.global(4, &quot;wl_subcompositor&quot;, 1)
wl_registry#2.global(5, &quot;wl_data_device_manager&quot;, 3)
wl_registry#2.global(6, &quot;zwlr_gamma_control_manager_v1&quot;, 1)
wl_registry#2.global(7, &quot;zxdg_output_manager_v1&quot;, 3)
wl_registry#2.global(8, &quot;ext_idle_notifier_v1&quot;, 1)
wl_registry#2.global(9, &quot;zwp_idle_inhibit_manager_v1&quot;, 1)
wl_registry#2.global(10, &quot;zwlr_layer_shell_v1&quot;, 4)
wl_registry#2.global(11, &quot;xdg_wm_base&quot;, 5)
 -&gt; wl_registry#2.bind(11, &quot;xdg_wm_base&quot;, 1, new id [unknown]#6)
</code></pre>
<p>Lines starting with <code>-&gt;</code> are messages sent by us to the compositor.
For example, <code>-&gt; wl_display#1.get_registry(new id wl_registry#2)</code> means
we sent a <code>get_registry</code> request to object #1 (of type <a href="https://wayland.app/protocols/wayland#wl_display">wl_display</a>),
asking it to create a new <code>wl_registry</code> object with ID #2.</p>
<p>We bind <a href="https://wayland.app/protocols/linux-dmabuf-v1">zwp_linux_dmabuf_v1</a> (for sharing GPU memory),
<a href="https://wayland.app/protocols/wayland#wl_compositor">wl_compositor</a> (basic support for displaying things),
and <a href="https://wayland.app/protocols/xdg-shell#xdg_wm_base">xdg_wm_base</a> (creating desktop windows).</p>
<p>There are more extensions, but we don't care about them:</p>
<pre><code>wl_registry#2.global(12, &quot;zwp_tablet_manager_v2&quot;, 1)
wl_registry#2.global(13, &quot;org_kde_kwin_server_decoration_manager&quot;, 1)
wl_registry#2.global(14, &quot;zxdg_decoration_manager_v1&quot;, 1)
wl_registry#2.global(15, &quot;zwp_relative_pointer_manager_v1&quot;, 1)
wl_registry#2.global(16, &quot;zwp_pointer_constraints_v1&quot;, 1)
wl_registry#2.global(17, &quot;wp_presentation&quot;, 1)
wl_registry#2.global(18, &quot;wp_alpha_modifier_v1&quot;, 1)
wl_registry#2.global(19, &quot;zwlr_output_manager_v1&quot;, 4)
wl_registry#2.global(20, &quot;zwlr_output_power_manager_v1&quot;, 1)
wl_registry#2.global(21, &quot;zwp_input_method_manager_v2&quot;, 1)
wl_registry#2.global(22, &quot;zwp_text_input_manager_v3&quot;, 1)
wl_registry#2.global(23, &quot;ext_foreign_toplevel_list_v1&quot;, 1)
wl_registry#2.global(24, &quot;zwlr_foreign_toplevel_manager_v1&quot;, 3)
wl_registry#2.global(25, &quot;ext_session_lock_manager_v1&quot;, 1)
wl_registry#2.global(26, &quot;wp_drm_lease_device_v1&quot;, 1)
wl_registry#2.global(27, &quot;zwlr_export_dmabuf_manager_v1&quot;, 1)
wl_registry#2.global(28, &quot;zwlr_screencopy_manager_v1&quot;, 3)
wl_registry#2.global(29, &quot;zwlr_data_control_manager_v1&quot;, 2)
wl_registry#2.global(30, &quot;wp_security_context_manager_v1&quot;, 1)
wl_registry#2.global(31, &quot;wp_viewporter&quot;, 1)
wl_registry#2.global(32, &quot;wp_single_pixel_buffer_manager_v1&quot;, 1)
wl_registry#2.global(33, &quot;wp_content_type_manager_v1&quot;, 1)
wl_registry#2.global(34, &quot;wp_fractional_scale_manager_v1&quot;, 1)
wl_registry#2.global(35, &quot;wp_tearing_control_manager_v1&quot;, 1)
wl_registry#2.global(36, &quot;zxdg_exporter_v1&quot;, 1)
wl_registry#2.global(37, &quot;zxdg_importer_v1&quot;, 1)
wl_registry#2.global(38, &quot;zxdg_exporter_v2&quot;, 1)
wl_registry#2.global(39, &quot;zxdg_importer_v2&quot;, 1)
wl_registry#2.global(40, &quot;xdg_activation_v1&quot;, 1)
wl_registry#2.global(41, &quot;wp_cursor_shape_manager_v1&quot;, 1)
wl_registry#2.global(42, &quot;zwp_virtual_keyboard_manager_v1&quot;, 1)
wl_registry#2.global(43, &quot;zwlr_virtual_pointer_manager_v1&quot;, 2)
wl_registry#2.global(44, &quot;zwp_keyboard_shortcuts_inhibit_manager_v1&quot;, 1)
wl_registry#2.global(45, &quot;zwp_pointer_gestures_v1&quot;, 3)
wl_registry#2.global(46, &quot;ext_transient_seat_manager_v1&quot;, 1)
wl_registry#2.global(47, &quot;wl_seat&quot;, 9)
wl_registry#2.global(48, &quot;zwp_primary_selection_device_manager_v1&quot;, 1)
wl_registry#2.global(50, &quot;wl_output&quot;, 4)
wl_callback#3.done(59067)
</code></pre>
<p>Mesa's own Wayland support also binds
<a href="https://wayland.app/protocols/presentation-time">wp_presentation</a> (&quot;accurate presentation timing feedback to ensure smooth video playback&quot;)
and <a href="https://wayland.app/protocols/tearing-control-v1">wp_tearing_control_manager_v1</a> (&quot;reduce latency by accepting tearing&quot;), but I didn't implement either.
Mesa would also bind <a href="https://wayland.app/protocols/linux-drm-syncobj-v1">wp_linux_drm_syncobj_manager_v1</a>, <a href="https://wayland.app/protocols/fifo-v1">wp_fifo_manager_v1</a> and
<a href="https://wayland.app/protocols/commit-timing-v1">wp_commit_timing_manager_v1</a> if Sway supported them.</p>
<p>Next, we create a window and ask Sway for &quot;feedback&quot; about it:</p>
<pre><code> -&gt; wl_compositor#5.create_surface(new id wl_surface#3)
 -&gt; xdg_wm_base#6.get_xdg_surface(new id xdg_surface#7, wl_surface#3)
 -&gt; xdg_surface#7.get_toplevel(new id xdg_toplevel#8)
 -&gt; xdg_toplevel#8.set_title(&quot;Example client&quot;)
 -&gt; wl_surface#3.commit()
 -&gt; zwp_linux_dmabuf_v1#4.get_surface_feedback(new id zwp_linux_dmabuf_feedback_v1#9, wl_surface#3)
 -&gt; wl_display#1.sync(new id wl_callback#10)
wl_display#1.delete_id(10)
</code></pre>
<p>Sway checks we're still alive, and we confirm:</p>
<pre><code>xdg_wm_base#6.ping(59068)
 -&gt; xdg_wm_base#6.pong(59068)
</code></pre>
<p>Then the feedback arrives:</p>
<pre><code>zwp_linux_dmabuf_feedback_v1#9.main_device(array[8])
zwp_linux_dmabuf_feedback_v1#9.format_table(fd 4, 1600)
zwp_linux_dmabuf_feedback_v1#9.tranche_target_device(array[8])
zwp_linux_dmabuf_feedback_v1#9.tranche_flags(0)
zwp_linux_dmabuf_feedback_v1#9.tranche_formats(array[200])
Wayland compositor supports DRM_FORMAT_XRGB8888 with modifier 0xffffffffffffff
zwp_linux_dmabuf_feedback_v1#9.tranche_done()
zwp_linux_dmabuf_feedback_v1#9.done()
wl_callback#10.done(59069)
</code></pre>
<p><code>main_device</code> tells us &quot;the main device that the server prefers to use&quot;.
It's just a number (of type <code>dev_t</code>), which Wayland sends as an array of bytes.
We store it for later.</p>
<p>The format table lists Linux formats and we need to find a suitable one that Vulkan also supports.
The test application only supports <code>DRM_FORMAT_XRGB8888</code> and logs when it finds that.</p>
<p>As well as a format, images also have a <em>modifier</em> saying how the pixels are arranged (see <a href="https://docs.mesa3d.org/isl/tiling.html">Tiling</a>).
My card doesn't seem to support that so I'm using the legacy <code>DRM_FORMAT_RESERVED</code> (<code>0xffffffffffffff</code>).</p>
<pre><code>discarded xdg_toplevel#8.configure(0, 0, array[0])
xdg_surface#7.configure(59069)
 -&gt; xdg_surface#7.ack_configure(59069)
 -&gt; zwp_linux_dmabuf_feedback_v1#9.destroy()
</code></pre>
<p>A quirk of Sway is that it sends a <code>configure</code> with a size of 0x0 initially,
asking us to choose the size (which it will then ignore).
The log message says <code>discarded</code> because I didn't register a handler for this message.</p>
<p>The compositor can send further feedback while the window is open.
I guess if you drag a window to another screen, plugged into a different GPU,
it might tell you it now prefers to use that one.
But Sway sends updates anyway so to avoid log clutter I destroyed the feedback object.</p>
<p>Back in <code>main.c</code>, we now call <a href="https://github.com/talex5/vulkan-test/blob/blog/main.c#L186">find_wayland_device</a> to find the Vulkan device matching Wayland's &quot;main device&quot; above:</p>
<pre><code>Wayland compositor main device is 226,128
</code></pre>
<p>Those are the major and minor numbers of this device:</p>
<pre><code>$ stat --format=&quot;%Hr,%Lr&quot; /dev/dri/renderD128
226,128
</code></pre>
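<p>To illustrate how the <code>array[8]</code> from <code>main_device</code> turns into those two numbers, here is a minimal sketch.
It assumes Linux with glibc (where <code>dev_t</code> is 8 bytes); the function name and the plain pointer/size arguments are stand-ins for libwayland's <code>wl_array</code>, not the code from the repository:</p>

```c
#include <string.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Decode the payload of a main_device event into a dev_t.
 * `data` and `size` stand in for the fields of a struct wl_array.
 * Returns 0 and fills *out on success, -1 if the size is unexpected. */
static int decode_main_device(const void *data, size_t size, dev_t *out) {
  if (size != sizeof(dev_t))
    return -1;
  /* memcpy rather than a pointer cast: the bytes may not be aligned. */
  memcpy(out, data, sizeof(dev_t));
  return 0;
}
```

<p>Once decoded, <code>major(dev)</code> and <code>minor(dev)</code> from <code>&lt;sys/sysmacros.h&gt;</code> give the two parts, which is what <code>stat</code> prints as <code>%Hr</code> and <code>%Lr</code>.</p>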
<p>There doesn't seem to be a way to ask Vulkan for a specific device.
Instead, we call <code>vkEnumeratePhysicalDevices</code> to get all of them and search the list.
The test application lists the two returned devices.
It sees that the first device's ID matches the one from Wayland and chooses that.</p>
<pre><code>Vulkan found 2 physical devices
0: AMD Radeon RX 550 Series (RADV POLARIS11)
1: llvmpipe (LLVM 19.1.7, 256 bits)
Using device 0 (matches Wayland rendering node)
</code></pre>
<p>We use <code>vkGetPhysicalDeviceQueueFamilyProperties</code> to find a queue that can support graphics operations
and create a logical device with one such queue:</p>
<pre><code>Device has 3 queue families
Found graphics queue family (0)
Create logical device
</code></pre>
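<p>The search itself is a simple loop over the returned families.
The sketch below uses stand-in types so that it compiles without <code>vulkan.h</code>; <code>struct queue_family</code> and <code>find_graphics_family</code> are illustrative names, and only the bit value matches Vulkan's <code>VK_QUEUE_GRAPHICS_BIT</code>:</p>

```c
#include <stdint.h>

/* Stand-in for VK_QUEUE_GRAPHICS_BIT, which has the same value in Vulkan. */
#define QUEUE_GRAPHICS_BIT 0x00000001u

/* Stand-in for the two VkQueueFamilyProperties fields used here. */
struct queue_family {
  uint32_t queueFlags;
  uint32_t queueCount;
};

/* Return the index of the first family that has at least one queue
 * and supports graphics operations, or -1 if there is none. */
static int find_graphics_family(const struct queue_family *families,
                                uint32_t count) {
  for (uint32_t i = 0; i < count; i++) {
    if (families[i].queueCount > 0 &&
        (families[i].queueFlags & QUEUE_GRAPHICS_BIT))
      return (int)i;
  }
  return -1;
}
```

<p>With real Vulkan types, the array would come from <code>vkGetPhysicalDeviceQueueFamilyProperties</code>, and the returned index is what goes into the logical device's queue create info.</p>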
<p>Next we create a command pool, load the pre-compiled shader bytecode, and set up the rendering pipeline:</p>
<pre><code>Create command pool
Loaded shaders/vert.spv (1856 bytes)
Loaded shaders/frag.spv (572 bytes)
vkCreateDescriptorSetLayout
vkCreatePipelineLayout
vkCreateRenderPass
vkCreateDescriptorPool
vkAllocateDescriptorSets
Create uniform buffer 0
createBuffer
vkMapMemory
vkUpdateDescriptorSets
Create uniform buffer 1
createBuffer
vkMapMemory
vkUpdateDescriptorSets
Create uniform buffer 2
createBuffer
vkMapMemory
vkUpdateDescriptorSets
Create uniform buffer 3
createBuffer
vkMapMemory
vkUpdateDescriptorSets
vkCreateGraphicsPipeline
</code></pre>
<p>The uniform buffers are used to pass input parameters to the rendering code.
We'll pass one number for each frame, telling <a href="https://github.com/talex5/vulkan-test/blob/blog/shaders/shader.vert#L3-L5">shader.vert</a> how far to move the triangle to the right.</p>
<p>Allocating a buffer requires choosing what kind of memory to use.
Some memory types are fast to access but must be on the graphics card,
while others are slower but are also accessible to the CPU.
<code>vkGetBufferMemoryRequirements</code> tells us what memory types are suitable for the buffer,
and we search for one that also has the properties we want (with <a href="https://github.com/talex5/vulkan-test/blob/blog/helpers.c#L63">findMemoryType</a>).
The example code allocates memory that is also visible to the host and is coherent with it (doesn't require manual synchronisation).</p>
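<p>The selection logic can be sketched as follows.
This is not the repository's <code>findMemoryType</code>: it is a simplified version with stand-in types so that it builds without <code>vulkan.h</code>, although the three property bits have the same values as their Vulkan counterparts.
<code>allowed_mask</code> plays the role of the <code>memoryTypeBits</code> field reported by <code>vkGetBufferMemoryRequirements</code>:</p>

```c
#include <stdint.h>

/* Stand-ins with the same values as the Vulkan property bits. */
#define MEM_DEVICE_LOCAL  0x1u /* VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT */
#define MEM_HOST_VISIBLE  0x2u /* VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT */
#define MEM_HOST_COHERENT 0x4u /* VK_MEMORY_PROPERTY_HOST_COHERENT_BIT */

/* type_flags[i] holds the property flags of memory type i.
 * Bit i of allowed_mask says whether type i is usable for the resource.
 * Returns the first allowed type that has every wanted property, or -1. */
static int find_memory_type(const uint32_t *type_flags, uint32_t type_count,
                            uint32_t allowed_mask, uint32_t wanted) {
  for (uint32_t i = 0; i < type_count && i < 32; i++) {
    int allowed = (allowed_mask >> i) & 1u;
    int has_all = (type_flags[i] & wanted) == wanted;
    if (allowed && has_all)
      return (int)i;
  }
  return -1;
}
```

<p>For the uniform buffers, <code>wanted</code> would be the host-visible and host-coherent bits together; for memory that should live on the GPU it would be the device-local bit alone.</p>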
<p>Next, we create a <code>VkImage</code> for the first framebuffer, allocate some device memory for it, and bind the memory to the image:</p>
<pre><code>Create framebuffer 0
vkCreateImage
vkAllocateMemory
vkBindImageMemory
</code></pre>
<p>For the images, we choose memory that is fast (local to the GPU, rather than accessible to the host).</p>
<p>Now we need to tell the Wayland compositor about the framebuffer's memory.
We export a reference to the GPU memory as a Unix file descriptor
and use the <code>zwp_linux_dmabuf_v1</code> protocol to send it to the compositor,
along with some information about how Vulkan plans to lay it out:</p>
<pre><code>vkGetMemoryFdKHR
 -&gt; zwp_linux_dmabuf_v1#4.create_params(new id zwp_linux_buffer_params_v1#10)
 -&gt; zwp_linux_buffer_params_v1#10.add(fd 12, 0, 0, 2560, 16777215, 4294967295)
 -&gt; zwp_linux_buffer_params_v1#10.create_immed(new id wl_buffer#11, 640, 480, 875713112, 0)
 -&gt; zwp_linux_buffer_params_v1#10.destroy()
</code></pre>
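<p>The numbers in those requests are easier to read once decoded.
In the linux-dmabuf protocol, the arguments to <code>add</code> are the fd, plane index, offset, stride, and the high and low halves of the modifier, and <code>create_immed</code> takes the width, height, format and flags.
The helpers below are an illustrative sketch (the function names are made up) showing where 2560, 16777215/4294967295 and 875713112 come from:</p>

```c
#include <stdint.h>

/* Pack four characters into a DRM fourcc code, using the same scheme
 * as the fourcc_code macro in drm_fourcc.h. */
static uint32_t fourcc(char a, char b, char c, char d) {
  return (uint32_t)(unsigned char)a |
         ((uint32_t)(unsigned char)b << 8) |
         ((uint32_t)(unsigned char)c << 16) |
         ((uint32_t)(unsigned char)d << 24);
}

/* Rejoin the two 32-bit halves of the modifier carried by `add`. */
static uint64_t join_modifier(uint32_t hi, uint32_t lo) {
  return ((uint64_t)hi << 32) | lo;
}

/* Bytes per row for a tightly packed image (an assumption: a driver
 * is free to pad rows, so real code should ask Vulkan for the pitch). */
static uint32_t packed_stride(uint32_t width, uint32_t bytes_per_pixel) {
  return width * bytes_per_pixel;
}
```

<p>So 875713112 is <code>fourcc('X', 'R', '2', '4')</code>, i.e. <code>DRM_FORMAT_XRGB8888</code>; the two modifier halves join to <code>0xffffffffffffff</code>; and the stride of 2560 matches 640 pixels at 4 bytes each.</p>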
<p>We do the same for the other framebuffers:</p>
<pre><code>Create framebuffer 1
vkCreateImage
vkAllocateMemory
vkBindImageMemory
vkGetMemoryFdKHR
 -&gt; zwp_linux_dmabuf_v1#4.create_params(new id zwp_linux_buffer_params_v1#12)
 -&gt; zwp_linux_buffer_params_v1#12.add(fd 14, 0, 0, 2560, 16777215, 4294967295)
 -&gt; zwp_linux_buffer_params_v1#12.create_immed(new id wl_buffer#13, 640, 480, 875713112, 0)
 -&gt; zwp_linux_buffer_params_v1#12.destroy()
Create framebuffer 2
vkCreateImage
vkAllocateMemory
vkBindImageMemory
vkGetMemoryFdKHR
 -&gt; zwp_linux_dmabuf_v1#4.create_params(new id zwp_linux_buffer_params_v1#14)
 -&gt; zwp_linux_buffer_params_v1#14.add(fd 16, 0, 0, 2560, 16777215, 4294967295)
 -&gt; zwp_linux_buffer_params_v1#14.create_immed(new id wl_buffer#15, 640, 480, 875713112, 0)
 -&gt; zwp_linux_buffer_params_v1#14.destroy()
Create framebuffer 3
vkCreateImage
vkAllocateMemory
vkBindImageMemory
vkGetMemoryFdKHR
 -&gt; zwp_linux_dmabuf_v1#4.create_params(new id zwp_linux_buffer_params_v1#16)
 -&gt; zwp_linux_buffer_params_v1#16.add(fd 18, 0, 0, 2560, 16777215, 4294967295)
 -&gt; zwp_linux_buffer_params_v1#16.create_immed(new id wl_buffer#17, 640, 480, 875713112, 0)
 -&gt; zwp_linux_buffer_params_v1#16.destroy()
Start main loop
</code></pre>
<p>Now we're ready to run the main loop, displaying frames.
We start each frame by asking the compositor to tell us when it wants the frame after this one:</p>
<pre><code> -&gt; wl_surface#3.frame(new id wl_callback#18)
</code></pre>
<p>If a previous frame is still rendering we need to wait for that,
because we're reusing the same command buffer for all frames
(though obviously this doesn't matter for the first frame):</p>
<pre><code>Wait for inFlightFence
Rendering frame 0 with framebuffer 0
</code></pre>
<p>Now for the implicit synchronisation support.
We ask Linux to give us a <em>sync file</em> for any operations currently using the framebuffer
and attach it to our <code>imageAvailableSemaphore</code>.
As this is the first frame, there won't be any, but in general we might be reusing a framebuffer that the compositor's rendering job is still using.
Then we populate the command buffer with the rendering commands and the dependency on <code>imageAvailableSemaphore</code> and submit it to the GPU:</p>
<pre><code>Import imageAvailableSemaphore
Submit to graphicsQueue
</code></pre>
<p>While the GPU is rendering the frame, we export the Vulkan <code>renderFinishedSemaphore</code> as a Linux sync file
and get Linux to attach that to the image.
This ensures that the compositor won't try to use the image until rendering completes.
Finally, we attach the in-progress framebuffer (<code>wl_buffer#11</code> for framebuffer 0) to the surface
and tell the compositor to use it for the next frame:</p>
<pre><code>Export renderFinishedSemaphore
 -&gt; wl_surface#3.attach(wl_buffer#11, 0, 0)
 -&gt; wl_surface#3.damage(0, 0, 2147483647, 2147483647)
 -&gt; wl_surface#3.commit()
</code></pre>
<p>2147483647 is the maximum signed 32-bit value; using it for the width and height invalidates the whole window.</p>
<p>Now the first frame is done, we continue processing incoming Wayland messages.
The compositor confirms deletion of various objects (the feedback object and the buffer params for the 4 framebuffers):</p>
<pre><code>wl_display#1.delete_id(9)
wl_display#1.delete_id(10)
wl_display#1.delete_id(12)
wl_display#1.delete_id(14)
wl_display#1.delete_id(16)
</code></pre>
<p>The compositor notifies us that the next-frame callback no longer exists,
and then the callback notifies us that it's complete.
That seems like the wrong way round but it works somehow!</p>
<pre><code>wl_display#1.delete_id(18)
wl_callback#18.done(40055490)
</code></pre>
<p>So now we do the second frame, just as before:</p>
<pre><code> -&gt; wl_surface#3.frame(new id wl_callback#18)
Wait for inFlightFence
Rendering frame 1 with framebuffer 1
Import imageAvailableSemaphore
Submit to graphicsQueue
Export renderFinishedSemaphore
 -&gt; wl_surface#3.attach(wl_buffer#13, 0, 0)
 -&gt; wl_surface#3.damage(0, 0, 2147483647, 2147483647)
 -&gt; wl_surface#3.commit()
</code></pre>
<p>The second frame done, Sway finally gets around to letting us know how big the window is:</p>
<pre><code>discarded xdg_toplevel#8.configure(962, 341, array[8])
xdg_surface#7.configure(59070)
 -&gt; xdg_surface#7.ack_configure(59070)
</code></pre>
<p>We should recreate all the framebuffers at this point,
but I'm too lazy and just continue rendering at 640x480:</p>
<pre><code>wl_display#1.delete_id(18)
wl_callback#18.done(40055491)
 -&gt; wl_surface#3.frame(new id wl_callback#18)
Wait for inFlightFence
Rendering frame 2 with framebuffer 2
Import imageAvailableSemaphore
Submit to graphicsQueue
Export renderFinishedSemaphore
 -&gt; wl_surface#3.attach(wl_buffer#15, 0, 0)
 -&gt; wl_surface#3.damage(0, 0, 2147483647, 2147483647)
 -&gt; wl_surface#3.commit()
</code></pre>
<p>After sending frame 2 (the third frame), Sway lets us know it's done with frames 0 and 1
and asks for frame 3 (with even more dubious ordering of the <code>delete_id</code> and <code>done</code>!):</p>
<pre><code>discarded wl_buffer#11.release()
wl_display#1.delete_id(18)
discarded wl_buffer#13.release()
wl_callback#18.done(40055493)
</code></pre>
<p>In theory, we should wait for the <code>release</code> before trying to reuse a framebuffer,
because that lets us know that Sway has started displaying it and we can safely extract the sync file from it.
But we've got 4 framebuffers and it seems highly unlikely that
Sway would ask for frame 4 before starting to render frame 0,
so I didn't implement that for this test.</p>
<pre><code> -&gt; wl_surface#3.frame(new id wl_callback#18)
Wait for inFlightFence
Rendering frame 3 with framebuffer 3
Import imageAvailableSemaphore
Submit to graphicsQueue
Export renderFinishedSemaphore
 -&gt; wl_surface#3.attach(wl_buffer#17, 0, 0)
 -&gt; wl_surface#3.damage(0, 0, 2147483647, 2147483647)
 -&gt; wl_surface#3.commit()
discarded wl_buffer#15.release()
</code></pre>
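<p>For completeness, here is one way the skipped <code>release</code> tracking could look.
This is a hypothetical sketch, not something the test application does: every name in it is invented, and it only models the bookkeeping (a busy flag per framebuffer), leaving out the Wayland listener that would call <code>fb_release</code>:</p>

```c
#include <stdbool.h>
#include <stddef.h>

#define N_FRAMEBUFFERS 4

/* busy[i] is true from the moment framebuffer i is attached and committed
 * until the compositor's wl_buffer.release event for it arrives. */
struct fb_pool {
  bool busy[N_FRAMEBUFFERS];
};

/* Round-robin choice, as in the walk-through: frame n uses buffer n % 4. */
static size_t fb_for_frame(unsigned long frame) {
  return frame % N_FRAMEBUFFERS;
}

/* Try to claim the framebuffer for a frame.
 * Returns false if the compositor has not released it yet, in which case
 * the caller should keep dispatching Wayland events and try again. */
static bool fb_acquire(struct fb_pool *pool, unsigned long frame) {
  size_t i = fb_for_frame(frame);
  if (pool->busy[i])
    return false;
  pool->busy[i] = true;
  return true;
}

/* To be called from a (hypothetical) wl_buffer release handler. */
static void fb_release(struct fb_pool *pool, size_t i) {
  pool->busy[i] = false;
}
```

<p>With 4 framebuffers the acquire should almost never fail, which is the reasoning behind leaving it out of the test.</p>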
<p>And for frame 4 we reuse the first framebuffer (<code>#11</code>):</p>
<pre><code>wl_display#1.delete_id(18)
wl_callback#18.done(40055497)
 -&gt; wl_surface#3.frame(new id wl_callback#18)
Wait for inFlightFence
Rendering frame 4 with framebuffer 0
Import imageAvailableSemaphore
Submit to graphicsQueue
Export renderFinishedSemaphore
 -&gt; wl_surface#3.attach(wl_buffer#11, 0, 0)
 -&gt; wl_surface#3.damage(0, 0, 2147483647, 2147483647)
 -&gt; wl_surface#3.commit()
discarded wl_buffer#17.release()
wl_display#1.delete_id(18)
</code></pre>
<h2 id="kernel-details-with-bpftrace">Kernel details with bpftrace</h2>
<p>To see what it's doing in the kernel, I made a bpftrace script (<a href="https://github.com/talex5/vulkan-test/blob/blog/trace.bt">trace.bt</a>).
That script also intercepts the application's writes to stdout and stderr and includes them in its own output,
so everything appears in sequence.</p>
<pre><code>$ sudo bpftrace trace.bt &gt; bpftrace.log
</code></pre>
<p>Then I ran the application as before in another window.
The script traces various amdgpu-specific functions,
so if you want to try it yourself you'll need to modify it a bit depending on your GPU and kernel version.</p>
<p>For reference, here's the full log from my machine: <a href="/blog/data/graphics/bpftrace.log">bpftrace.log</a></p>
<h3 id="start-up-and-library-loading">Start-up and library loading</h3>
<p>The bpftrace log shows a few libraries being opened:</p>
<pre><code>Attaching 25 probes...
open(.../wayland-1.23.1/lib/libwayland-client.so.0) =&gt; 3
open(.../vulkan-loader-1.4.313.0/lib/libvulkan.so.1) =&gt; 3
open(.../libdrm-2.4.124/lib/libdrm.so.2) =&gt; 3
open(.../glibc-2.40-66/lib/libm.so.6) =&gt; 3
open(.../glibc-2.40-66/lib/libc.so.6) =&gt; 3
open(.../libffi-3.4.8/lib/libffi.so.8) =&gt; 3
open(.../glibc-2.40-66/lib/libdl.so.2) =&gt; 3
</code></pre>
<p>(I removed the <code>/nix/store/HASH-</code> prefixes to save space.
3 is the FD; I didn't trace closing and it's getting reused.)</p>
<p>These libraries are all expected; we're using <code>libwayland-client</code> to speak the Wayland protocol, and
<code>libvulkan</code> and <code>libdrm</code> to talk to the graphics card. <code>libc</code> and <code>libm</code> are standard libraries (C and maths),
and <code>ffi</code> and <code>dl</code> are for loading more libraries dynamically.</p>
<p>The Vulkan loader starts by scanning for layers:</p>
<pre><code>Create instance with 0 layers
open(/run/opengl-driver/share/vulkan/implicit_layer.d) =&gt; 3
open(/run/opengl-driver/share/vulkan/implicit_layer.d/VkLayer_MESA_device_select.json) =&gt; 3
open(.../vulkan-validation-layers-1.4.313.0/share/vulkan/explicit_layer.d) =&gt; 3
open(.../vulkan-validation-layers-1.4.313.0/share/vulkan/explicit_layer.d/VkLayer_khronos_validation.json) =&gt; 3
</code></pre>
<p>Then it loads the Radeon driver:</p>
<pre><code>open(/run/opengl-driver/share/vulkan/icd.d/radeon_icd.x86_64.json) =&gt; 3
open(.../mesa-25.0.7/lib/libvulkan_radeon.so) =&gt; 3
open(.../llvm-19.1.7-lib/lib/libLLVM.so.19.1) =&gt; 3
open(.../elfutils-0.192/lib/libelf.so.1) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb-dri3.so.0) =&gt; 3
open(.../zlib-1.3.1/lib/libz.so.1) =&gt; 3
open(.../zstd-1.5.7/lib/libzstd.so.1) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb.so.1) =&gt; 3
open(.../libX11-1.8.12/lib/libX11-xcb.so.1) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb-present.so.0) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb-xfixes.so.0) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb-sync.so.1) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb-randr.so.0) =&gt; 3
open(.../libxcb-1.17.0/lib/libxcb-shm.so.0) =&gt; 3
open(.../libxshmfence-1.3.3/lib/libxshmfence.so.1) =&gt; 3
open(.../xcb-util-keysyms-0.4.1/lib/libxcb-keysyms.so.1) =&gt; 3
open(.../systemd-minimal-libs-257.5/lib/libudev.so.1) =&gt; 3
open(.../expat-2.7.1/lib/libexpat.so.1) =&gt; 3
open(.../libdrm-2.4.124/lib/libdrm_amdgpu.so.1) =&gt; 3
open(.../gcc-14.2.1.20250322-lib/lib/libstdc++.so.6) =&gt; 3
open(.../gcc-14.2.1.20250322-lib/lib/libgcc_s.so.1) =&gt; 3
open(.../glibc-2.40-66/lib/librt.so.1) =&gt; 3
open(.../libxml2-2.13.8/lib/libxml2.so.2) =&gt; 3
open(.../xz-5.8.1/lib/liblzma.so.5) =&gt; 3
open(.../bzip2-1.0.8/lib/libbz2.so.1) =&gt; 3
open(.../libXau-1.0.12/lib/libXau.so.6) =&gt; 3
open(.../libXdmcp-1.1.5/lib/libXdmcp.so.6) =&gt; 3
open(.../libcap-2.75-lib/lib/libcap.so.2) =&gt; 3
open(.../glibc-2.40-66/lib/libpthread.so.0) =&gt; 3
</code></pre>
<p>I see some AMD Radeon driver stuff (<code>libvulkan_radeon.so</code> and <code>libdrm_amdgpu.so</code>),
plus 4 compression libraries, 2 XML parsers, and some compiler stuff.</p>
<p>And 12 libraries for the old X11 protocol (<code>libxcb</code>, etc).
That's quite surprising, as I'm not using any X11 stuff.</p>
<p>After a mysterious 7ms delay, Vulkan then loads the fallback software renderer (&quot;lavapipe&quot;):</p>
<pre><code>[7 ms]
open(/run/opengl-driver/share/vulkan/icd.d/lvp_icd.x86_64.json) =&gt; 3
open(.../mesa-25.0.7/lib/libvulkan_lvp.so) =&gt; 3
</code></pre>
<p>Normally it would load more drivers, but I used <code>VK_DRIVER_FILES</code> to disable them,
just to make the trace a bit shorter.</p>
<p>Then it loads a Vulkan layer for device selection:</p>
<pre><code>open(.../mesa-25.0.7/lib/libVkLayer_MESA_device_select.so) =&gt; 3
</code></pre>
<p>That's the thing that was making the unwanted Wayland connections to sort the drivers list.
Since removing the Wayland extension it hasn't been doing that any more.</p>
<p>Then Mesa loads some configuration files.
These seem to be lists of buggy applications and the corresponding workarounds:</p>
<pre><code>open(.../mesa-25.0.7/share/drirc.d) =&gt; 3
open(.../mesa-25.0.7/share/drirc.d/00-mesa-defaults.conf) =&gt; 3
open(.../mesa-25.0.7/share/drirc.d/00-radv-defaults.conf) =&gt; 3
</code></pre>
<h3 id="enumerating-devices">Enumerating devices</h3>
<p>After Wayland tells us which device to use, we need to call <code>vkEnumeratePhysicalDevices</code> to find it.
This tries all the drivers Mesa knows about, looking for all available GPUs.
The lavapipe software driver goes first, probing to check it can export sync files:</p>
<pre><code>Wayland compositor main device is 226,128
open(/sys/devices/system/cpu/possible) =&gt; 4
open(/dev/udmabuf) =&gt; 4
ioctl(/dev/udmabuf, UDMABUF_CREATE) =&gt; fd 6
ioctl(6, DMA_BUF_IOCTL_EXPORT_SYNC_FILE) =&gt; fd 7
</code></pre>
<p>Then Mesa's Radeon driver lists the <code>/dev/dri</code> directory,
getting the <code>dev_t</code> ID for each device and looking up information about it in <code>/sys</code>:</p>
<pre><code>stat(/dev/dri/card0) =&gt; rdev=226,0
readlink(/sys/dev/char/226:0) =&gt; ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/card
open(/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vendor) =&gt; 10
...
stat(/dev/dri/renderD128) =&gt; rdev=226,128
readlink(/sys/dev/char/226:128) =&gt; ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/renderD12
open(/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vendor) =&gt; 10
...
</code></pre>
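<p>The <code>rdev</code> values in the trace are just major,minor pairs, and the <code>/sys/dev/char</code> link name is built from them.
Here is a minimal Python sketch of that mapping (<code>sysfs_link</code> is a made-up helper name, not something Mesa provides):</p>
<pre><code>import os

def sysfs_link(rdev):
    """Path of the /sys symlink for a character device with this dev_t."""
    return f"/sys/dev/char/{os.major(rdev)}:{os.minor(rdev)}"

# 226,128 is the render node from the trace; makedev packs it into a dev_t.
print(sysfs_link(os.makedev(226, 128)))   # /sys/dev/char/226:128

# On a real system the value would come from stat() instead:
#   rdev = os.stat("/dev/dri/renderD128").st_rdev
#   os.readlink(sysfs_link(rdev))
</code></pre>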
<p>Mesa then opens <code>/dev/dri/renderD128</code>, which causes various interesting things to happen:</p>
<pre><code>  drm_sched_entity_init ([sdma0, sdma1])
  drm_sched_entity_init ([sdma0, sdma1])
  drm_sched_job (sched=918:1 finished=919:1)
  drm_sched_job (sched=918:2 finished=919:2)
  drm_sched_job (sched=918:3 finished=919:3)
open(/dev/dri/renderD128) =&gt; 9
</code></pre>
<p>Note that the <code>open</code> call is logged when it returns, not when it starts.
I'm indenting things that happen inside the kernel.</p>
<p>The main computer sends commands to a GPU by writing them to a <em>ring</em>.
Each device can have several rings.
For my device, <code>/sys/kernel/debug/dri/0000:01:00.0/amdgpu_fence_info</code> lists 18 rings,
including 1 graphics ring, 8 compute rings and 2 SDMA rings
(see <a href="https://docs.kernel.org/gpu/amdgpu/debugfs.html">https://docs.kernel.org/gpu/amdgpu/debugfs.html</a> for details).</p>
<p>SDMA is &quot;System DMA&quot; according to the <a href="https://docs.kernel.org/gpu/amdgpu/driver-core.html">Core Driver Infrastructure</a> docs.
DMA is <a href="https://en.wikipedia.org/wiki/Direct_memory_access">Direct Memory Access</a>.
So I think SDMA rings are how we ask the GPU to transfer things to and from host memory.</p>
<p>There can be many processes wanting to use the GPU, and they get their own &quot;entity&quot; queues.
The <em>GPU scheduler</em> takes jobs from these queues and puts them on the rings, deciding which processes take priority.
Each entity queue can use some set of rings.</p>
<p>Opening the device called <code>amdgpu_vm_init_entities</code>, which created two entity queues
(<code>immediate</code> and <code>delayed</code>), both of which can submit to both the <code>sdma0</code> and <code>sdma1</code> rings.
I'm not sure why it uses two queues.</p>
<p><code>drm_sched_job</code> indicates that a job got added to an entity queue (but, confusingly, not that it has been scheduled yet).
<code>sched=X</code> means that fence X will be signalled when the job is written to the hardware ring,
and <code>finished=Y</code> means that fence Y will be signalled when the GPU has finished the job.</p>
<p>There seem to be several different things called fences:</p>
<ul>
<li>
<p>A <code>dma_fence</code> looks like what I would call a &quot;promise&quot;.
It's initially unsignalled and has a list of callbacks to call when signalled.
Then it becomes signalled and notifies the callbacks.
It cannot be reused.</p>
<p>(I'm not very keen on this terminology.
&quot;Making a promise&quot;, &quot;Fulfilling a promise&quot; and &quot;Waiting for a promise to be fulfilled&quot; make sense.
&quot;Signalling a fence&quot; or &quot;Waiting on a fence&quot; doesn't make sense.)</p>
</li>
<li>
<p>A <code>drm_sched_fence</code> is a collection of 3 fences: <code>scheduled</code> and <code>finished</code> (the <code>sched</code> and <code>finished</code> fences above), plus
<code>parent</code>, which is the GPU's fence.
<code>parent</code> is like <code>finished</code>, except that <code>parent</code> doesn't exist until the job has been added to the ring.</p>
<p>Again, the terminology is odd here. I'd expect a parent to exist before its child.</p>
</li>
<li>
<p>There's also Vulkan's <code>VkFence</code>, which is more like a mutable container of <code>dma_fence</code>s
and can be reused.</p>
</li>
</ul>
<p>DMA fences get unique IDs of the form <code>CONTEXT:SEQNO</code>.
Each entity queue allocates two context IDs from the global pool
(for <code>sched</code> and <code>finished</code> fences), and then numbers requests sequentially within that context.</p>
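<p>As a rough illustration of that numbering scheme, here is a small Python model.
The class names are invented for the sketch, and the starting context ID of 918 is chosen only so that the output lines up with the trace above:</p>
<pre><code>import itertools

class ContextPool:
    """Hypothetical stand-in for the kernel's global fence-context counter."""
    def __init__(self, start):
        self._next = itertools.count(start)

    def alloc(self):
        return next(self._next)

class EntityQueue:
    def __init__(self, pool):
        # Two contexts per queue: one for sched fences, one for finished fences.
        self.sched_ctx = pool.alloc()
        self.finished_ctx = pool.alloc()
        self.seqno = 0

    def add_job(self):
        # Both fences of a job share the same sequence number.
        self.seqno += 1
        return (f"{self.sched_ctx}:{self.seqno}",
                f"{self.finished_ctx}:{self.seqno}")

pool = ContextPool(start=918)
queue = EntityQueue(pool)
for _ in range(3):
    sched, finished = queue.add_job()
    print(f"drm_sched_job (sched={sched} finished={finished})")
</code></pre>
<p>This prints the same three <code>drm_sched_job</code> lines that appeared when the device was opened.</p>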
<p>The <code>open</code> call then returned and Mesa queried some details about the device.
Shortly after that, the kernel submitted the first two previously-queued jobs to the ring:</p>
<pre><code>ioctl(/dev/dri/renderD128, VERSION) =&gt;  3.61
ioctl(/dev/dri/renderD128, VERSION) =&gt; amdgpu 3.61
  amdgpu_job_run on sdma1 (finished=919:1) =&gt; parent=12:19449
  dma_fence_signaled 918:1
  amdgpu_job_run on sdma1 (finished=919:2) =&gt; parent=12:19450
  dma_fence_signaled 918:2
</code></pre>
<p><code>parent=Z</code> indicates the creation of a device fence Z.
The <code>sched</code> fence gets signalled to indicate that the job has been scheduled to run on the device.</p>
<p>The reason the <code>sched</code> fences are needed is that we can only ask the device to wait for a job that has a device fence.
So if job A depends on some job B that hasn't been submitted yet then the kernel must wait for B's <code>sched</code> fence first.
Then it knows B's <code>parent</code> fence and can specify that as the job's dependency when submitting A to the GPU.</p>
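<p>The relationship between the three fences can be sketched in a few lines of Python.
This is only a toy model of the behaviour described above, not the kernel's implementation, and all the names (<code>Fence</code>, <code>Job</code>, <code>submit_after</code>) are made up for the example:</p>
<pre><code>class Fence:
    """One-shot promise: starts unsignalled, is signalled once, never reused."""
    def __init__(self, name):
        self.name = name
        self.signalled = False
        self.callbacks = []

    def add_callback(self, fn):
        if self.signalled:
            fn()                      # already signalled: run straight away
        else:
            self.callbacks.append(fn)

    def signal(self):
        assert not self.signalled, "a fence cannot be signalled twice"
        self.signalled = True
        for fn in self.callbacks:
            fn()

class Job:
    def __init__(self, name):
        self.name = name
        self.sched = Fence(name + ".sched")
        self.finished = Fence(name + ".finished")
        self.parent = None            # device fence: created by run()
        self.waits_for = None         # device fence the GPU was told to wait for

    def run(self, waits_for=None):
        """Put the job on the ring (the amdgpu_job_run step in the trace)."""
        self.waits_for = waits_for
        self.parent = Fence(self.name + ".parent")
        self.parent.add_callback(self.finished.signal)
        self.sched.signal()

def submit_after(a, b):
    """Queue job a, which must not start on the GPU until job b is done."""
    b.sched.add_callback(lambda: a.run(waits_for=b.parent))

a, b = Job("A"), Job("B")
submit_after(a, b)
print(a.parent)                # None: B has no device fence yet, so A must wait
b.run()
print(a.waits_for.name)        # B.parent
b.parent.signal()              # the GPU finishes B
print(b.finished.signalled)    # True
</code></pre>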
<p>Eventually the <code>parent</code> fences get signalled, which then signals the <code>finished</code> fences too:</p>
<pre><code>  dma_fence_signaled 12:19449
  dma_fence_signaled 919:1
  dma_fence_signaled 12:19450
  dma_fence_signaled 919:2
</code></pre>
<p>This is triggered by <code>amdgpu_fence_process</code>, which is called from <code>amdgpu_irq_dispatch</code>;
you can always ask bpftrace to print the kernel stack (<code>kstack</code>) if you want to see why something is getting called.
I'm not tracing these interrupt handlers because they get called for all activity on the GPU, not just my test application,
and it clutters up the trace.</p>
<h3 id="setting-up-the-pipeline">Setting up the pipeline</h3>
<p>Mesa eventually returns the list of devices. We choose the one we want and create a <code>VkDevice</code> for it.
Various jobs get submitted to the device:</p>
<pre><code>Create logical device
ioctl(/dev/dri/renderD128, AMDGPU_CTX)
ioctl(/dev/dri/renderD128, AMDGPU_GEM_CREATE) =&gt; handle 1
  drm_sched_job (sched=918:4 finished=919:4)
  drm_sched_job (sched=918:5 finished=919:5)
  drm_sched_job (sched=918:6 finished=919:6)
ioctl(/dev/dri/renderD128, AMDGPU_GEM_VA, 1)
ioctl(/dev/dri/renderD128, AMDGPU_GEM_CREATE) =&gt; handle 2
  drm_sched_job (sched=918:7 finished=919:7)
ioctl(/dev/dri/renderD128, AMDGPU_GEM_VA, 2)
ioctl(/dev/dri/renderD128, AMDGPU_GEM_MMAP)
  amdgpu_job_run on sdma0 (finished=919:4) =&gt; parent=11:46118
  dma_fence_signaled 918:4
  amdgpu_job_run on sdma0 (finished=919:5) =&gt; parent=11:46119
  dma_fence_signaled 918:5
ioctl(/dev/dri/renderD128, AMDGPU_GEM_CREATE) =&gt; handle 3
...
</code></pre>
<p>Note that the <code>ioctl</code> lines are logged when the call returns, not when it starts.</p>
<p>Creating the command buffer also allocates some device memory,
as does creating the descriptor sets, each uniform buffer (for passing input data to the shaders),
and the framebuffers.</p>
<p>The most interesting setup step for me was <code>vkUpdateDescriptorSets</code>.
This doesn't make any system calls, and it doesn't use the entity queues above, but it does submit jobs to another entity queue.
It looks like Mesa's <a href="https://gitlab.freedesktop.org/mesa/mesa/-/blob/35721f19866d07dc671d4a83d6f6b77240629cb6/src/amd/common/ac_descriptors.c#L807">ac_build_buffer_descriptor</a> just tries to write the memory for the descriptor and that triggers a fault,
which runs a job to transfer the memory from the GPU to the host first:</p>
<pre><code>vkUpdateDescriptorSets
  (via amdgpu_bo_fault_reserve_notify)
  drm_sched_job (sched=789:865 finished=790:865)
  (via amdgpu_bo_fault_reserve_notify)
  drm_sched_job (sched=789:866 finished=790:866)
  amdgpu_job_run on sdma0 (finished=790:865) =&gt; parent=11:46127
  dma_fence_signaled 789:865
  amdgpu_job_run on sdma0 (finished=790:866) =&gt; parent=11:46129
  dma_fence_signaled 789:866
</code></pre>
<p>I noticed this because I was originally tracing all <code>dma_fence_init</code> calls.
I'm not doing that in the final version because it includes fences made for other processes.</p>
<h3 id="rendering-one-frame">Rendering one frame</h3>
<p>Finally, let's take a look at the rendering of one frame with the extra tracing.</p>
<p>We start by waiting for <code>inFlightFence</code> (which signals when we've finished rendering the previous frame).
That's implemented as a Linux <a href="https://docs.kernel.org/gpu/drm-mm.html#drm-sync-objects">sync obj</a>.
A sync obj (not to be confused with a <em>sync file</em>) is a container for a <code>dma_fence</code>.
Sync obj fences can be removed or replaced, making them reusable.</p>
<pre><code>Wait for inFlightFence
ioctl(/dev/dri/renderD128, SYNCOBJ_WAIT, 6)
ioctl(/dev/dri/renderD128, SYNCOBJ_RESET, 6)
Rendering frame 0 with framebuffer 0
</code></pre>
<p>Next, we export the <code>dma_fence</code> of the rendering job Sway ran to use this framebuffer last time.
We export from the dmabuf (the image) to a <a href="https://www.kernel.org/doc/html/latest/driver-api/sync_file.html#sync-file-api-guide">sync file</a> (an immutable container of a single, fixed <code>dma_fence</code>),
and then import that into a fresh <em>sync obj</em> (I don't know why Vulkan doesn't reuse the existing one):</p>
<pre><code>Import imageAvailableSemaphore
  dma_resv_get_fences =&gt; 0 fences
ioctl(11, DMA_BUF_IOCTL_EXPORT_SYNC_FILE) =&gt; fd 19
ioctl(/dev/dri/renderD128, SYNCOBJ_CREATE) =&gt; handle 7
ioctl(/dev/dri/renderD128, SYNCOBJ_FD_TO_HANDLE, fd 19, handle 7)
</code></pre>
<p>As this is the first frame, Sway can't be using it yet and we just get a sync file with no fence.
But even for the later frames (which reuse framebuffers) I still always see <code>0 fences</code> here.
With 4 framebuffers (the number Mesa uses), and rendering frames when Sway asks for them,
it doesn't seem likely we'll ever need to wait for anything.</p>
<p>When we submit the rendering job, Mesa first waits for the sync obj to have been filled in;
this seems to be something to do with supporting multi-threaded code:</p>
<pre><code>Submit to graphicsQueue
ioctl(/dev/dri/renderD128, SYNCOBJ_TIMELINE_WAIT, 7, ALL|FOR_SUBMIT|AVAILABLE)
</code></pre>
<p>Mesa submits the rendering job with <code>DRM_AMDGPU_CS</code> (Command Submission? Command Stream?).
The kernel lazily creates an entity queue that submits to the graphics ring the first time we do this:</p>
<pre><code>  drm_sched_entity_init ([gfx])
  drm_sched_job (sched=918:28 finished=919:28)
  drm_sched_job (sched=918:29 finished=919:29)
  amdgpu_job_run on sdma1 (finished=919:28) =&gt; parent=12:19458
  dma_fence_signaled 918:28
  drm_sched_job (sched=921:1 finished=922:1)
ioctl(/dev/dri/renderD128, AMDGPU_CS) =&gt; handle 1
ioctl(/dev/dri/renderD128, SYNCOBJ_DESTROY, 7)
</code></pre>
<p>We export the semaphore that says when rendering is complete and attach it to the image
(where Sway will find it):</p>
<pre><code>Export renderFinishedSemaphore
ioctl(/dev/dri/renderD128, SYNCOBJ_TIMELINE_WAIT, 1, ALL|FOR_SUBMIT|AVAILABLE)
ioctl(/dev/dri/renderD128, SYNCOBJ_HANDLE_TO_FD, 1) =&gt; fd 19
ioctl(/dev/dri/renderD128, SYNCOBJ_RESET, 1)
ioctl(11, DMA_BUF_IOCTL_IMPORT_SYNC_FILE)
</code></pre>
<p>We finish by notifying Sway of the new frame:</p>
<pre><code> -&gt; wl_surface#3.attach(wl_buffer#11, 0, 0)
 -&gt; wl_surface#3.damage(0, 0, 2147483647, 2147483647)
 -&gt; wl_surface#3.commit()
</code></pre>
<p>Soon afterwards, the kernel submits the job to the device's graphics ring:</p>
<pre><code>  amdgpu_job_run on gfx (finished=922:1) =&gt; parent=1:1525895
  dma_fence_signaled 921:1
</code></pre>
<h2 id="re-examining-the-errors">Re-examining the errors</h2>
<p>OK, time to see if the error messages at the start make more sense now.
First, the &quot;Fence fallback&quot;:</p>
<pre><code>[59829.886009 &lt;    0.504003&gt;] [drm] Fence fallback timer expired on ring sdma0
[59830.390003 &lt;    0.503994&gt;] [drm] Fence fallback timer expired on ring sdma0
[59830.894002 &lt;    0.503999&gt;] [drm] Fence fallback timer expired on ring sdma0
</code></pre>
<p>When Linux sends a request to the GPU, it expects the GPU to trigger an interrupt when done.
But in case that doesn't happen, Linux also sets a timer for 0.5s and checks manually.
If the interrupt arrives, the timer is cancelled (or reset for another 0.5s if there are more fences pending).
This log message is written if the timer fires <em>and</em> completed fences were found.
So, it indicates that progress is being made, but very slowly.
My log had 20 of these in a row, so it looks like interrupts weren't getting delivered for 10s for some reason.
Upgrading from Linux 6.6.89 to 6.12.28 seems to have fixed the problem.</p>
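<p>As a quick sanity check of those numbers, the gaps between the logged timestamps can be recomputed.
This Python snippet only uses the three timestamps shown above, plus the count of 20 messages:</p>
<pre><code>from decimal import Decimal

# Timestamps (in seconds) copied from the three log lines above.
stamps = [Decimal("59829.886009"),
          Decimal("59830.390003"),
          Decimal("59830.894002")]

# Gap between each message and the one before it.
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
print(gaps)                    # [Decimal('0.503994'), Decimal('0.503999')]

# 20 expiries of a 0.5s timer cover about 10s.
print(20 * Decimal("0.5"))     # 10.0
</code></pre>
<p>The gaps match the deltas printed in the log, and each is just over the 0.5s timer period.</p>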
<p>Now the ring tests:</p>
<pre><code>[79622.739495 &lt;    0.001128&gt;] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
[79622.909019 &lt;    0.169524&gt;] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.2 test failed (-110)
[79623.075056 &lt;    0.166037&gt;] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.3 test failed (-110)
[79623.241971 &lt;    0.166915&gt;] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.4 test failed (-110)
[79623.408604 &lt;    0.166633&gt;] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.6 test failed (-110)
</code></pre>
<p>I ran <code>bpftrace</code> on <code>amdgpu_ring_test_helper</code> and got it to print a kernel stack-trace when called.
It seems to be called on each ring when suspending (<code>amdgpu_device_ip_suspend_phase2</code> / <code>gfx_v8_0_hw_fini</code>) and resuming (<code>amdgpu_device_ip_resume_phase2</code> / <code>gfx_v8_0_hw_init</code>).</p>
<p><code>gfx_v8_0_ring_test_ring</code> runs and sets <code>mmSCRATCH_REG0</code> (whatever that is) to <code>0xCAFEDEAD</code>,
submits a job to the ring to change it to <code>0xDEADBEEF</code> and waits for <code>mmSCRATCH_REG0</code> to change,
checking every 1μs for up to 100ms. The error code <code>-110</code> is <code>-ETIMEDOUT</code>.</p>
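<p>The shape of that test is easy to model.
Here is a Python sketch with a fake GPU (<code>FakeGpu</code> and <code>ring_test</code> are invented names, and nothing here touches real hardware or actually sleeps); each call to <code>poll</code> stands in for one 1μs wait:</p>
<pre><code>import errno

SCRATCH_BEFORE = 0xCAFEDEAD   # value written directly to the register
SCRATCH_AFTER = 0xDEADBEEF    # value the ring job is supposed to write

class FakeGpu:
    """Hypothetical stand-in for a GPU with one scratch register."""
    def __init__(self, ring_works):
        self.ring_works = ring_works
        self.scratch = 0
        self.pending = None

    def submit_write(self, value):
        # Queue a job on the ring; a hung ring never gets round to it.
        self.pending = value

    def poll(self):
        # One simulated microsecond passes, then we read the register.
        if self.ring_works and self.pending is not None:
            self.scratch = self.pending
            self.pending = None
        return self.scratch

def ring_test(gpu, timeout_us=100_000):
    gpu.scratch = SCRATCH_BEFORE
    gpu.submit_write(SCRATCH_AFTER)
    for _ in range(timeout_us):
        if gpu.poll() == SCRATCH_AFTER:
            return 0
    return -errno.ETIMEDOUT       # -110 on Linux

print(ring_test(FakeGpu(ring_works=True)))    # 0
print(ring_test(FakeGpu(ring_works=False)))   # -110 on Linux
</code></pre>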
<p>Dumping the ring, it looks like the command is there (I see <code>beef dead</code>):</p>
<pre><code># hexdump /sys/kernel/debug/dri/0000:01:00.0/amdgpu_ring_comp_1.0.1
0000000 0100 0000 0100 0000 0100 0000 7900 c001
0000010 0040 0000 beef dead 1000 ffff 1000 ffff
...
</code></pre>
<p>Though I suppose that could be from a previous test.</p>
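<p>The reason <code>beef dead</code> counts as a match is that <code>hexdump</code> prints 16-bit words in host byte order by default,
so on a little-endian machine the two words need swapping to get the 32-bit value.
A small Python sketch of the conversion, assuming a little-endian host (the helper name is made up):</p>
<pre><code>def hexdump_words_to_dwords(line):
    """Turn a row of 16-bit hexdump words into 32-bit little-endian values."""
    words = [int(w, 16) for w in line.split()]
    # Rebuild the original byte stream, then re-read it 4 bytes at a time.
    raw = b"".join(w.to_bytes(2, "little") for w in words)
    return [int.from_bytes(raw[i:i + 4], "little")
            for i in range(0, len(raw), 4)]

dwords = hexdump_words_to_dwords("0040 0000 beef dead 1000 ffff 1000 ffff")
print([hex(d) for d in dwords])
# ['0x40', '0xdeadbeef', '0xffff1000', '0xffff1000']
</code></pre>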
<p>After timing out, the rings are marked as not ready.
When trying to schedule a compute job, this gets logged by <code>drm_sched_pick_best</code> as:</p>
<pre><code>[80202.893020 &lt;  576.587829&gt;] [drm] scheduler comp_1.0.1 is not ready, skipping
</code></pre>
<p>But why are we using a compute ring anyway? Tracing with bpftrace again,
it seems that opening a new window with Sway submits a compute job if scaling is in effect.</p>
<p>I found a <a href="https://lore.kernel.org/lkml/ede4dabb-d3e9-45bf-8e56-aebbb8a37ae5@amd.com/">Linux kernel mailing list post</a> that said:</p>
<blockquote>
<p>Either we need to tell Mesa to stop using the compute queues by default
(what is that good for anyway?) or we need to get the compute queues
reliable working after a resume.</p>
</blockquote>
<p>I tried setting <code>RADV_DEBUG=nocompute</code> (from <a href="https://docs.mesa3d.org/envvars.html">RADV driver environment variables</a>),
but that didn't help; Sway is still causing compute jobs to be submitted somehow.
It looks like AMD engineers are aware this doesn't work and don't know how to fix it either,
so I'm going to give up trying to fix it myself.</p>
<p>Finally, I took a look at the Firefox problem.
Running with <code>WAYLAND_DEBUG=1</code> and trying to play a YouTube video, the last thing logged was:</p>
<pre><code>{mesa egl surface queue}  -&gt; wl_display#1.sync(new id wl_callback#78)
</code></pre>
<p>So Mesa is sending a sync request (ping) to Sway and expecting a response,
which it doesn't get. <code>ss -x -p</code> showed that it wasn't reading Sway's response:</p>
<pre><code>Netid State Recv-Q Send-Q  Local Address:Port   Peer Address:Port  Process
u_str ESTAB 6480   0                 * 519138            * 528576 users:((&quot;.firefox-wrappe&quot;,pid=70663,fd=9))
</code></pre>
<p>With a bit of bpftrace, I could see that while the Mesa thread sends events, it relies on Firefox's main thread to read them.
And checking that with gdb showed that the main thread was stuck waiting for a D-Bus service.
Looks like some kind of race, as it doesn't always happen and running <code>strace</code> or <code>dbus-monitor</code> often makes the problem go away.
Running with <code>DBUS_SESSION_BUS_ADDRESS= firefox</code> to stop Firefox from using D-Bus seems to have fixed it.
(I don't know why people use D-Bus; it doesn't seem to add anything over using a simple socket.)</p>
<h2 id="conclusions">Conclusions</h2>
<p>I've wanted to understand a bit about how graphics works for a long time,
but could never find anything that was both detailed and beginner-friendly.
Creating and then tracing this simple example program seems to have worked well,
though it took a long time.
I now understand what the errors from Linux mean, even if not how to fix them.</p>
<p><code>WAYLAND_DEBUG</code> and <code>bpftrace</code> make tracing things on Linux much easier than it used to be,
and being able to see Wayland messages, application logs and kernel internals in a single trace is really useful.
Nix's ability to <a href="https://github.com/symphorien/nixseparatedebuginfod">download debug symbols on demand</a> as gdb needs them also worked really well.</p>
<p>Vulkan seems like a good low-level API and made it much easier to understand how things work.
However, it's disappointing that Vulkan's Wayland support forces you to use the libwayland C library.
I'd probably want to use e.g. ocaml-wayland instead, and that seems to require a different approach that doesn't use the swapchain.
Some useful functions, such as mapping Linux DRM formats to Vulkan formats,
are not exposed in Mesa's public API and have to be duplicated.</p>
<p>It's also a shame the Mesa drivers depend on so many libraries,
including 12 libraries for the deprecated X11 protocol.
We've already seen (e.g. with the <a href="https://en.wikipedia.org/wiki/XZ_Utils_backdoor">XZ Utils backdoor</a>) how this makes supply-chain attacks easier
(merely linking a C library is dangerous, even if you don't use it for anything).</p>
<p>I hope this approach of going through the trace line-by-line was useful to someone.
When explaining how software works it's easy to skip over the boring or awkward bits and this helps prevent that.
Though obviously it would be better to read a description from someone who already knows how it works.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Trying Tamarin on Applied Cryptography</title>
    <link href="https://roscidus.com/blog/blog/2025/04/09/tamarin/"></link>
    <updated>2025-04-09T09:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2025/04/09/tamarin</id>
    <content type="html"><![CDATA[<p>Tamarin is a tool for checking cryptographic security protocols (such as TLS or WireGuard).
In this post, I try it out on some (very old, but simple) protocols from <em>Applied Cryptography</em>.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
</li>
<li><a href="#key-exchange-with-symmetric-cryptography">Key Exchange with Symmetric Cryptography</a>
<ul>
<li><a href="#checking-its-usable">Checking it's usable</a>
</li>
<li><a href="#checking-its-secure">Checking it's secure</a>
</li>
</ul>
</li>
<li><a href="#key-exchange-with-public-key-cryptography">Key Exchange with Public-Key Cryptography</a>
</li>
<li><a href="#interlock-protocol">Interlock Protocol</a>
</li>
<li><a href="#key-and-message-transmission">Key and Message Transmission</a>
<ul>
<li><a href="#fixing-it">Fixing it</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>I recently came across <a href="https://tamarin-prover.com/">Tamarin</a> while reading about <a href="https://www.wireguard.com/formal-verification/">WireGuard</a>, and decided to try it out.
You give Tamarin a description of a security protocol and some properties you'd like to check,
and for each one it tries to prove that it's true (or find a counter-example).
It can normally do this automatically, but there's also a web UI that lets you direct the solver manually:</p>
<p><a href="/blog/images/tamarin/tamarin-ui.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/tamarin-ui.png" title="The Tamarin UI" class="caption"/><span class="caption-text">The Tamarin UI</span></span></a></p>
<p>This interested me because I don't feel that I have a good understanding of what makes a cryptographic protocol secure,
and I thought making formal models might help.
The Tamarin site has a tutorial and a free (draft) book,
but you should be able to follow this blog post without having read them.</p>
<p><a href="https://www.schneier.com/books/applied-cryptography/">Applied Cryptography</a> was a popular (but now long-obsolete) book showing
&quot;how programmers and electronic communications professionals can use cryptography&quot;.
I have the second edition, from 1996.
Testing Tamarin on very old protocols has several advantages:</p>
<ol>
<li>Old protocols tend to be much simpler than newer ones.
</li>
<li>It's more likely I'll be able to find some weakness.
</li>
<li>If I do find a problem, it doesn't need reporting anywhere.
</li>
</ol>
<p>Chapter 1 of the book introduces some basic concepts.
Chapter 2 introduces the idea of protocols and the conventional names used
(Alice wants to use the protocol to talk to Bob; Trent is a trusted third party;
Mallory is an attacker, intercepting, modifying and replaying messages).
Chapter 3 starts going through some useful protocols, so I started there...</p>
<h2 id="key-exchange-with-symmetric-cryptography">Key Exchange with Symmetric Cryptography</h2>
<p>Chapter 3 (&quot;Basic Protocols&quot;) starts with a key exchange protocol in which
Alice and Bob communicate using a session key that they get from Trent,
a trusted third party.
Alice and Bob have each previously agreed long-term keys to communicate with Trent:</p>
<blockquote>
<ol>
<li>Alice calls Trent and requests a session key to communicate with Bob.
</li>
<li>Trent generates a random session key.
He encrypts two copies of it: one in Alice's key and the other in Bob's key.
Trent sends both copies to Alice.
</li>
<li>Alice decrypts her copy of the session key.
</li>
<li>Alice sends Bob his copy of the session key.
</li>
<li>Bob decrypts his copy of the session key.
</li>
<li>Both Alice and Bob use this session key to communicate securely.
</li>
</ol>
</blockquote>
<p>A Tamarin file has this basic structure:</p>
<pre><code>theory Example1
begin

builtins: symmetric-encryption

...

end
</code></pre>
<p><code>symmetric-encryption</code> gets us a function <code>senc(msg, key)</code> to represent <code>msg</code> encrypted with <code>key</code>,
plus an equation allowing the attacker to recover <code>msg</code> if they have <code>key</code>.</p>
<p>Tamarin represents the state of the world as a collection of <em>state facts</em>.
You can use whatever fact names you want, but there are a few special built-in ones:</p>
<ul>
<li><code>Fr(x)</code> indicates that <code>x</code> is a fresh unguessable value that isn't known to anyone yet.
There is an endless supply of such facts, as you can always generate a new random number.
</li>
<li><code>Out(msg)</code> says that <code>msg</code> was sent to the network (which is controlled by the adversary).
</li>
<li><code>In(msg)</code> says that <code>msg</code> was received from the network/adversary.
</li>
</ul>
<p>The behaviour of the system is given by a set of rules, each of which can consume state facts and produce new ones.</p>
<p>The first rule allows people to register with the trusted authority Trent:</p>
<pre><code>rule Register:
    [ Fr(k) ]
  --&gt;
    [ !Key($A, 'Trent', k) ]
</code></pre>
<p>This rule consumes a fresh unguessable value <code>k</code>
and produces a fact that some public name <code>$A</code> has registered that key.
Variables starting with <code>$</code> are restricted to <em>public</em> (guessable) values.
The adversary knows all public values without having to see them first.</p>
<p>I'm using the state fact <code>!Key('Alice', 'Trent', k)</code> to mean that Alice has registered symmetric key <code>k</code> with Trent.
The second argument of <code>!Key</code> is always <code>'Trent'</code>; I just included it for clarity.
<code>!Key</code> starts with <code>!</code> to indicate that this is a <em>persistent</em> fact - it can be used multiple times without being consumed.</p>
<p>The Tamarin book says we also need a rule to represent someone revealing their registered key:</p>
<pre><code>rule Reveal_key:
    [ !Key($A, 'Trent', k) ]
  --[ Reveal($A) ]-&gt;
    [ Out(k) ]
</code></pre>
<p>This says that if <code>$A</code> is using key <code>k</code> then they might send it out on the network (<code>Out</code>),
which has the effect of handing it to the adversary.
The <code>--&gt;</code> arrow here is annotated with an <em>action fact</em> <code>Reveal($A)</code>.
These facts are like log messages, and are used when checking properties.
They allow us to say things like &quot;Assuming Alice doesn't reveal her key, ...&quot;.</p>
<p>There are two reasons to include this rule.
First, we could model what happens if e.g. Alice reveals her key and Bob doesn't.
Second, and more important, this allows the attacker to get hold of registered keys.
Without this, we ignore the possibility of Mallory having access to any keys at all.</p>
<p>I didn't model the possibility of Trent revealing any keys, as that's obviously fatal.</p>
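<p>To make the consume/produce behaviour of <code>Register</code> and <code>Reveal_key</code> concrete, here is a toy Python model of the state as a multiset of facts.
It is only a sketch: it works on fixed (ground) facts rather than doing Tamarin's pattern matching, it ignores action facts, and <code>apply_rule</code> is an invented helper:</p>
<pre><code>from collections import Counter

def apply_rule(state, consumes, produces):
    """Fire one ground rule instance; return the new state, or None if it can't fire."""
    persistent = [f for f in consumes if f[0].startswith("!")]
    linear = Counter(f for f in consumes if not f[0].startswith("!"))
    if any(state[f] == 0 for f in persistent):
        return None                  # a required persistent fact is missing
    if linear - state:
        return None                  # a required linear fact is missing
    new_state = state - linear       # linear facts are used up
    new_state.update(produces)       # persistent facts were left in place
    return new_state

key = ("!Key", "Alice", "Trent", "k1")
state = Counter([("Fr", "k1")])

# Register: consumes Fr(k), produces !Key($A, 'Trent', k).
state = apply_rule(state, [("Fr", "k1")], [key])

# Reveal_key: needs !Key($A, 'Trent', k), produces Out(k).
state = apply_rule(state, [key], [("Out", "k1")])

print(state[key], state[("Out", "k1")], state[("Fr", "k1")])   # 1 1 0

# The fresh value is gone, so Register can't fire on it again.
print(apply_rule(state, [("Fr", "k1")], [key]))                # None
</code></pre>
<p>The persistent <code>!Key</code> fact survives being used by <code>Reveal_key</code>, while <code>Fr(k1)</code> is consumed by <code>Register</code>.</p>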
<p>Then I wrote four rules describing the protocol.
<code>$A</code> (e.g. Alice) sends a message to Trent saying she wants to talk to <code>$B</code> (e.g. Bob):</p>
<pre><code>rule A1:
    [ ]
  --&gt;
    [ Out(&lt;$A, $B&gt;),
      A1_sent($A, $B) ]
</code></pre>
<p>She also adds an <code>A1_sent</code> fact to remember that she requested this.
The convention is that the first argument of a fact is the actor owning it,
when that makes sense.</p>
<p>Note: the <code>Out</code> fact doesn't say where the message is going.
That would be pointless, since the network isn't trusted and Mallory can route messages anywhere.</p>
<p>Trent receives the request,
gets the registered keys for <code>$A</code> and <code>$B</code>, and
generates a fresh session key for them (<code>kab</code>):</p>
<pre><code>rule T:
    [ In(&lt;$A, $B&gt;),
      !Key($A, 'Trent', kat),
      !Key($B, 'Trent', kbt),
      Fr(kab) ]
  --[ Secret(kab), Honest($A), Honest($B) ]-&gt;
    [ Out( &lt;senc(kab, kat), senc(kab, kbt)&gt; ) ]
</code></pre>
<p>Trent replies with <code>kab</code> encrypted for each of them.</p>
<p>There are some more action facts here:
<code>Secret(kab)</code> means that <code>kab</code> is supposed to be secret (not known to the adversary),
assuming all the <code>Honest</code> parties don't reveal their registered keys.
The names <code>Secret</code> and <code>Honest</code> aren't special, but they are suggested by the Tamarin book.
We'll write properties using them later.</p>
<p><code>$A</code> gets the two encrypted copies, decrypting the first using her key <code>kat</code>.
She outputs <code>mb</code> (the encrypted message for <code>$B</code>, which she can't decrypt)
and her message (<code>msg</code>) encrypted with the new session key:</p>
<pre><code>rule A2:
    [ A1_sent($A, $B),
      !Key($A, 'Trent', kat),
      In(&lt;senc(kab, kat), mb&gt;),
      Fr(msg) ]
  --[ Sent($A, $B, msg),
      Secret(msg), Honest($A), Honest($B) ]-&gt;
    [ Out( &lt;mb, senc(msg, kab)&gt; ) ]
</code></pre>
<p>Action facts indicate that <code>msg</code> is supposed to be secret,
and that <code>$A</code> sent <code>msg</code> to <code>$B</code>, as that's something we'll want to check.</p>
<p>Finally, <code>$B</code> uses his registered key <code>kbt</code> to decrypt the session key <code>kab</code>,
and uses that to decrypt the message, recording the fact that he got it:</p>
<pre><code>rule B:
    [ !Key($B, 'Trent', kbt),
      In( &lt;senc(kab, kbt), senc(msg, kab)&gt; ) ]
  --[ Received($B, $A, msg) ]-&gt;
    [ ]
</code></pre>
<h3 id="checking-its-usable">Checking it's usable</h3>
<p>The book recommends checking that the protocol works in at least one case,
which can be done like this:</p>
<pre><code>lemma works:
  exists-trace
    &quot;Ex msg #i #j.
      Sent('Alice', 'Bob', msg)@i
      &amp; Received('Bob', 'Alice', msg)@j&quot;
</code></pre>
<p>This says that there exists some trace (possible sequence of action facts) in which
there exists a message <code>msg</code> and time-points <code>#i</code> and <code>#j</code>,
such that Alice sent <code>msg</code> to Bob at time <code>i</code> and Bob received it at time <code>j</code>.
Tamarin finds this example of it working:</p>
<p><a href="/blog/images/tamarin/kdc-works.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/kdc-works.png" title="An example trace" class="caption"/><span class="caption-text">An example trace</span></span></a></p>
<p>The boxes represent the rules being used.
For example, the top two <code>Register</code> boxes show Alice and Bob registering their keys.
Ellipses represent actions of the adversary/network.
They are also generated by rules, but are shown differently by default.</p>
<p>The example isn't quite what I initially expected.
Here's what happened (from top to bottom):</p>
<ol>
<li>Alice and Bob registered their keys.
</li>
<li>Mallory asked Trent to generate a session key for Alice to talk to Bob.
</li>
<li>Trent generated a session key and sent out the two encrypted copies.
</li>
<li>Mallory removed the encrypted message for Bob and replaced it with an arbitrary message before forwarding to Alice.
</li>
<li>Alice sent a request to Trent asking for a session key with Bob, which got discarded.
</li>
<li>Alice received the session key from Trent, along with Mallory's arbitrary message.
She forwarded Mallory's part, adding her message (encrypted with the session key).
</li>
<li>Mallory replaced the arbitrary message with the original message to Bob from Trent.
</li>
<li>Bob received and decrypted the message.
</li>
</ol>
<p>This is a bit more general than the expected path:</p>
<ul>
<li>Trent can receive the request for the key before Alice sends it,
as there's nothing to stop Mallory sending requests.
</li>
<li>Alice forwarding Trent's message to Bob is irrelevant for modelling purposes.
Mallory controls the network and anything can be sent anywhere.
</li>
</ul>
<p>There doesn't seem to be a built-in way to ask for an example where Mallory just forwards messages.
However, you can replace <code>In</code> and <code>Out</code> with something else (e.g. <code>Secure_net</code>) to model a network
without an attacker:</p>
<p><a href="/blog/images/tamarin/kdc-works-secnet.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/kdc-works-secnet.png" title="A protocol example without Mallory" class="caption"/><span class="caption-text">A protocol example without Mallory</span></span></a></p>
<h3 id="checking-its-secure">Checking it's secure</h3>
<p>Next, I ask Tamarin to check that no secrets are leaked,
using a standard pattern recommended by the Tamarin book.</p>
<p>For all secrets <code>k</code> (marked <code>Secret(k)</code> at time <code>#i</code>),
if all actors <code>A</code> required at <code>#i</code> to be honest never reveal their keys,
then there is no time <code>#j</code> at which Mallory knows <code>k</code> (<code>K(x)</code> means the adversary knows <code>x</code>):</p>
<pre><code>lemma secrets:
  &quot;All k #i. Secret(k)@i ==&gt;
     (All A. Honest(A)@i ==&gt; not Ex #j. Reveal(A)@j)
     ==&gt; (not Ex #j. K(k)@j)&quot;
</code></pre>
<p>Tamarin immediately shows a counter-example:</p>
<p><a href="/blog/images/tamarin/kdc-broken.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/kdc-broken.png" title="Mallory reads the secret message" class="caption"/><span class="caption-text">Mallory reads the secret message</span></span></a></p>
<p>Here's what happened:</p>
<ol>
<li>Alice (<code>$A</code>) and Mallory (<code>$B.1</code>) register keys with Trent.
</li>
<li>Mallory asks Trent for a session key between Alice and himself.
</li>
<li>Alice asks Trent for a session key between herself and Bob.
</li>
<li>Mallory sends the reply to his request to Alice.
</li>
<li>Alice encrypts her message to Bob with the Alice-Mallory session key and sends it.
</li>
<li>Mallory decrypts it.
</li>
</ol>
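<p>To see why this works, the counter-example can be replayed by hand in a few lines of OCaml.
This is only a rough sketch and not part of the Tamarin model:
the <code>term</code> type, <code>sdec</code>, <code>mallory_reads</code> and the key names are invented for illustration,
and Alice's <code>A1_sent</code> state is left out.</p>
<pre><code>(* Symbolic terms, loosely mirroring the Tamarin model (hypothetical names). *)
type term =
  | Key of string                (* a symmetric key, named for readability *)
  | Msg of string                (* a plaintext payload *)
  | Senc of term * term          (* senc(plaintext, key) *)

(* Symbolic decryption: succeeds only with exactly the right key. *)
let sdec key = function
  | Senc (m, k) when k = key -&gt; Some m
  | _ -&gt; None

(* Rule T: Trent wraps a session key for both named parties.
   Nothing in the reply says who the key is for. *)
let trent ~ka ~kb ~session = (Senc (session, ka), Senc (session, kb))

(* Rule A2: Alice decrypts the first half with her key and uses whatever
   session key she finds to encrypt her message. *)
let alice_a2 ~kat ~msg (for_a, mb) =
  match sdec kat for_a with
  | Some kab -&gt; Some (mb, Senc (msg, kab))
  | None -&gt; None

(* Returns the plaintext Mallory ends up with, if any. *)
let mallory_reads () =
  let kat = Key &quot;kat&quot; and kmt = Key &quot;kmt&quot; in
  (* Mallory asks Trent for an Alice-Mallory session key... *)
  let reply = trent ~ka:kat ~kb:kmt ~session:(Key &quot;kam&quot;) in
  (* ...and delivers the reply to Alice, who wanted to talk to Bob. *)
  match alice_a2 ~kat ~msg:(Msg &quot;for Bob only&quot;) reply with
  | None -&gt; None
  | Some (mb, ciphertext) -&gt;
    (* Mallory unwraps his copy of the session key, then the message. *)
    Option.bind (sdec kmt mb) (fun kam -&gt; sdec kam ciphertext)

let () =
  assert (mallory_reads () = Some (Msg &quot;for Bob only&quot;));
  print_endline &quot;Mallory read Alice's message&quot;
</code></pre>
<p>The sketch makes the underlying problem visible:
Alice has no way to tell which request a reply from Trent belongs to.</p>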
<p>I'm not surprised this protocol doesn't work.
The Tamarin book starts with a more complicated version of it,
where Trent says who each key is for,
and shows that it still has problems.</p>
<p>I find it odd that <em>Applied Cryptography</em> doesn't warn about this when describing the protocol.
It does mention various improvements to it later in the chapter, so the flaws are obviously known.
The introduction says the book intends to be &quot;both a lively introduction to the field of cryptography
and a comprehensive reference&quot;.
For a reference aimed at programmers, I'd expect to see warnings on protocols with known problems.
The book does warn that this protocol relies completely on Trent though,
and is perhaps just here to contrast with the next example, using public-key cryptography.</p>
<h2 id="key-exchange-with-public-key-cryptography">Key Exchange with Public-Key Cryptography</h2>
<p>With separate public and private keys, the next protocol is much simpler:</p>
<blockquote>
<ol>
<li>Alice gets Bob's public key from the KDC [Key Distribution Centre].
</li>
<li>Alice generates a random session key, encrypts it using Bob's public key, and sends it to Bob.
</li>
<li>Bob then decrypts Alice's message using his private key.
</li>
<li>Both of them encrypt their communications using the same session key.
</li>
</ol>
</blockquote>
<p>The Tamarin book recommends modelling a PKI (Public Key Infrastructure) like this:</p>
<pre><code>builtins: asymmetric-encryption

rule Register:
    [ Fr(ltk) ]
  --&gt;
    [ !Ltk($A, ltk), !Pk($A, pk(ltk)), Out(pk(ltk)) ]

rule Reveal_key:
    [ !Ltk($A, ltk) ]
  --[ Reveal($A) ]-&gt;
    [ Out(ltk) ]
</code></pre>
<p>Here, we have <code>!Ltk</code> for the long-term (private) key and <code>!Pk</code> for the public one.
<code>pk(x)</code> represents the public key corresponding to private key <code>x</code>.
<code>Register</code> immediately leaks the public key to the adversary,
as we want to prove the protocol works assuming that public keys are known.</p>
<p>The behaviours are simpler than before:</p>
<pre><code>rule A:
    [ !Pk($B, pkB),
      Fr(sesk) ]
  --[ Running($A, $B, &lt;'R', 'I', sesk&gt;),
      Secret(sesk), Honest($B) ]-&gt;
    [ Out(aenc(sesk, pkB)) ]

rule B:
    [ !Ltk($B, ltk),
      In(aenc(sesk, pk(ltk))) ]
  --[ Commit($B, $A, &lt;'R', 'I', sesk&gt;) ]-&gt;
    [ ]
</code></pre>
<p>The <code>Running</code> and <code>Commit</code> action facts here are suggested by the Tamarin book:</p>
<ul>
<li><code>Commit(B, A, p)</code> means that <code>B</code> believes he has agreed parameters <code>p</code> with <code>A</code>.
</li>
<li><code>Running(A, B, p)</code> means that <code>A</code> is trying to agree <code>p</code> with <code>B</code>.
</li>
<li><code>&lt;'R', 'I', sesk&gt;</code> means that the committing party is the Responder and the other is the Initiator.
The roles are ordered to match the <code>Commit</code> fact in both cases.
</li>
</ul>
<p>Here's an example of it working:</p>
<pre><code>lemma works:
  exists-trace
    &quot;Ex p #i #j.
      Running('Alice', 'Bob', p)@i
      &amp; Commit('Bob', 'Alice', p)@j&quot;
</code></pre>
<p><a href="/blog/images/tamarin/pubkey-works.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/pubkey-works.png" title="Using asymmetric encryption" class="caption"/><span class="caption-text">Using asymmetric encryption</span></span></a></p>
<p>The <code>secrets</code> lemma from before passes (and only Bob needs to be honest).</p>
<p>The Tamarin book also recommends checking agreement, which I did like this:</p>
<pre><code>lemma agree_injective:
  &quot;All A B p #i.
     Commit(B, A, p)@i ==&gt;
       (All X. Honest(X)@i ==&gt; not Ex #j. Reveal(X)@j) ==&gt;
         (Ex #j. Running(A, B, p)@j &amp; j &lt; i)
         &amp; (All A2 B2 #k. Commit(B2, A2, p)@k ==&gt; #i = #k)&quot;
</code></pre>
<p>This says that if <code>B</code> thinks he has agreed <code>p</code> with <code>A</code>,
and all parties required to be honest are, then:</p>
<ul>
<li><code>A</code> was trying to agree <code>p</code> with <code>B</code> in the past, and
</li>
<li>this is the only <code>Commit</code> with <code>p</code>.
</li>
</ul>
<p>Tamarin finds an obvious flaw with this protocol:</p>
<p><a href="/blog/images/tamarin/pubkey-broken.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/pubkey-broken.png" title="Mallory tricks Bob" class="caption"/><span class="caption-text">Mallory tricks Bob</span></span></a></p>
<p>Here:</p>
<ol>
<li>Mallory (<code>$A.1</code>) generates a session key and encrypts it for Bob.
</li>
<li>Bob decrypts the key and believes he is talking to Alice.
</li>
</ol>
<p>Not very surprising; Alice doesn't use her private key for anything in this protocol!
Is <em>Applied Cryptography</em> only considering passive eavesdroppers?
But the next section (&quot;Man-in-the-Middle Attack&quot;) does consider active attacks when sharing the public keys
(which I didn't model here), so this does seem to be in-scope.</p>
<p>This flaw would be pretty obvious if you were writing a service with multiple clients,
because how do you know which client it is?
But you might conceivably miss it if there were only one client.</p>
<h2 id="interlock-protocol">Interlock Protocol</h2>
<p>The next section presents the Interlock Protocol.
This is designed to stop man-in-the-middle attacks.
The idea is that Alice sends half of her message first (e.g. every other bit),
so that Mallory can't decrypt it yet even if he tricked Alice into using his public key.
Then Alice expects the first half of Bob's response before she sends the rest of her message.</p>
<p>This makes no sense to me. How can Bob commit to his response before being able to decrypt Alice's request?
I read the <a href="https://en.wikipedia.org/wiki/Interlock_protocol">Interlock Protocol</a>'s Wikipedia page,
and that seems to suggest that Alice and Bob also require pre-shared passwords.
But then why not pre-share their public keys and solve the problem more simply?
Also, Wikipedia points out a simple attack on it.
I'm going to skip this one.</p>
<h2 id="key-and-message-transmission">Key and Message Transmission</h2>
<p>In this section, the book finally offers a solution to active attackers.
Curiously, it's not the main topic, which is that Alice can send her message (M) without waiting for Bob:</p>
<blockquote>
<ol>
<li>Alice generates a random session key, K, and encrypts M using K.
</li>
<li>Alice gets Bob's public key from the database.
</li>
<li>Alice encrypts K with Bob's public key.
</li>
<li>Alice sends both the encrypted message and encrypted session key to Bob.<br />
For added security against man-in-the-middle attacks, Alice can sign the transmission.
</li>
<li>Bob decrypts Alice's session key K, using his private key.
</li>
<li>Bob decrypts Alice's message using the session key.
</li>
</ol>
</blockquote>
<p>Modifying the previous example to model this was easy enough,
but including the message M clutters things up a bit, so for simplicity I removed it for the blog post:</p>
<pre><code>builtins: asymmetric-encryption, signing

rule A:
  let c = aenc(sesk, pkB) in
    [ !Ltk($A, ltk),
      !Pk($B, pkB),
      Fr(sesk) ]
  --[ Running($A, $B, &lt;'R', 'I', sesk&gt;),
      Secret(sesk), Honest($B) ]-&gt;
    [ Out( &lt;c, sign(c, ltk)&gt; ) ]

rule B:
  let c = aenc(sesk, pk(ltk)) in 
    [ !Ltk($B, ltk),
      !Pk($A, pkA),
      In( &lt;c, sig&gt; ) ]
  --[ _restrict( verify(sig, c, pkA) = true ), Unique(&lt;$B, sig&gt;),
      Honest($A), Honest($B),
      Commit($B, $A, &lt;'R', 'I', sesk&gt;) ]-&gt;
    [ ]

restriction r_unique:
  &quot;All m #i. Unique(m)@i ==&gt;
     All #j. Unique(m)@j ==&gt; #i = #j&quot;
</code></pre>
<ul>
<li><code>sign(c, ltk)</code> is Alice's signature of the ciphertext <code>c</code> with her long-term private key.
</li>
<li><code>_restrict( verify(sig, c, pkA) = true )</code> says rule <code>B</code> can only be used if the signature matches.
</li>
<li><code>Unique(&lt;$B, sig&gt;)</code> says Bob will reject signatures he has seen before, to prevent replay attacks.
</li>
</ul>
<p>Tamarin immediately finds a counter-example:</p>
<p><a href="/blog/images/tamarin/sign-broken.png"><span class="caption-wrapper center"><img src="/blog/images/tamarin/sign-broken.png" title="Doesn't work even with signing" class="caption"/><span class="caption-text">Doesn't work even with signing</span></span></a></p>
<p>Here's what happened:</p>
<ol>
<li>Alice (<code>$A</code>), Bob (<code>$B</code>) and Mallory (<code>$A.1</code>) register public keys with the PKI.
</li>
<li>Alice generates a new session key (<code>~n.2</code>), encrypts it for Bob, and signs the ciphertext with her private key.
</li>
<li>Mallory removes Alice's signature and re-signs with his key (<code>c_sign</code>).
</li>
<li>Bob commits to using <code>~n.2</code> to talk to Mallory, but is actually talking to Alice.
</li>
</ol>
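<p>The same trace can be replayed symbolically in OCaml.
Again, this is a hedged sketch rather than anything Tamarin produces:
the <code>term</code> type and the helper functions are made up for the example,
and <code>verify</code> only imitates the behaviour of the <code>signing</code> builtin.</p>
<pre><code>type term =
  | Nonce of string              (* a fresh value such as a session key *)
  | Ltk of string                (* long-term private key of a named agent *)
  | Pk of term                   (* pk(ltk) *)
  | Aenc of term * term          (* aenc(plaintext, public key) *)
  | Sign of term * term          (* sign(message, private key) *)

(* Imitates verify(sig, m, pk): true only for sign(m, ltk) and pk(ltk). *)
let verify signature m pk =
  match signature, pk with
  | Sign (m', ltk), Pk ltk' -&gt; m' = m &amp;&amp; ltk = ltk'
  | _ -&gt; false

(* Rule A: encrypt the session key for B and sign the ciphertext. *)
let alice ~ltk_a ~pk_b sesk =
  let c = Aenc (sesk, pk_b) in
  (c, Sign (c, ltk_a))

(* Rule B: Bob accepts (c, sig) as coming from [claimed] if the signature
   verifies under that agent's public key and he can decrypt c. *)
let bob ~ltk_b ~claimed ~pk_claimed (c, signature) =
  match c with
  | Aenc (sesk, Pk l) when l = ltk_b &amp;&amp; verify signature c pk_claimed -&gt;
    Some (claimed, sesk)         (* Commit(B, claimed, sesk) *)
  | _ -&gt; None

(* Mallory can't read c, but he can drop Alice's signature and add his own. *)
let mallory ~ltk_m (c, _alice_sig) = (c, Sign (c, ltk_m))

let resign_attack () =
  let ltk_a = Ltk &quot;Alice&quot; and ltk_b = Ltk &quot;Bob&quot; and ltk_m = Ltk &quot;Mallory&quot; in
  let sent = alice ~ltk_a ~pk_b:(Pk ltk_b) (Nonce &quot;n.2&quot;) in
  bob ~ltk_b ~claimed:&quot;Mallory&quot; ~pk_claimed:(Pk ltk_m) (mallory ~ltk_m sent)

let () =
  assert (resign_attack () = Some (&quot;Mallory&quot;, Nonce &quot;n.2&quot;));
  print_endline &quot;Bob committed Alice's session key to Mallory&quot;
</code></pre>
<p>Nothing inside the ciphertext ties it to Alice, so the signature is the only link to a sender and anyone can replace it.</p>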
<p>This could be a problem if, say, Bob is a file-sync service and Alice is trying to send her files.
The result would be Alice's files getting stored in Mallory's account at Bob rather than her own.</p>
<p>This seems like a more interesting attack than the previous ones;
a non-cryptographer like me could easily miss this.</p>
<p>However, I think it's also a bit of a fluke that Tamarin detected it.
The error is detected because Bob also gets Alice's original message, and now has two conflicting Commits.
But to make <code>agree_injective</code> pass you need to prevent simple replay attacks (where Mallory just sends Alice's message twice).
I did that using <code>Unique(&lt;$B, sig&gt;)</code> to ignore duplicate signatures,
but I could instead have used <code>Unique(&lt;$B, sesk&gt;)</code> to ignore duplicate session keys.
In that case, Tamarin wouldn't spot the problem.</p>
<p>I think the real issue here is that I'm only checking things from Bob's point of view.
The Tamarin book generally also has a <code>Commit</code> on Alice, to check that if Alice thinks she's talking to Bob,
then Bob thinks he's talking to Alice.
Checking just Bob's <code>Commit</code> isn't enough;
he thinks he's talking to Mallory and Mallory isn't <code>Honest</code> so we don't care.
But in this protocol Alice commits to Bob without exchanging any previous message, so this <code>Running</code>/<code>Commit</code> system won't do.</p>
<h3 id="fixing-it">Fixing it</h3>
<p>If I add Alice's name to the encrypted session key (e.g. <code>let c = aenc(&lt;$A, sesk&gt;, pkB)</code>)
then Tamarin reports everything is fine.
Is this actually safe, though?
Tamarin assumes that the only thing Mallory can do with an encrypted message is decrypt it (given the key).
But what if he modifies the ciphertext instead?</p>
<p><em>Applied Cryptography</em> has previously introduced us to the one-time pad encryption scheme.
If Alice encrypts &quot;Alice,1234&quot; with a one-time pad and signs it as from &quot;Alice&quot;, then
Mallory knows the first 5 characters are &quot;Alice&quot; and can recover the first 5 characters of the pad,
then replace the name with any other 5-character name he has registered without disturbing the key.</p>
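<p>The splice is easy to check concretely.
The following OCaml sketch is an illustration only:
the pad bytes are arbitrary, the function names are invented,
and &quot;Carol&quot; stands in for whatever 5-character name Mallory has registered.</p>
<pre><code>(* XOR two equal-length strings byte by byte. *)
let xor a b =
  assert (String.length a = String.length b);
  String.init (String.length a) (fun i -&gt;
    Char.chr (Char.code a.[i] lxor Char.code b.[i]))

(* With a one-time pad, encryption and decryption are the same XOR. *)
let otp ~pad text = xor pad text

(* Mallory's edit: XOR the known prefix and its replacement into the
   ciphertext. He never learns the pad bytes covering the session key. *)
let swap_prefix ~known ~replacement ciphertext =
  let n = String.length known in
  assert (String.length replacement = n);
  let delta = xor known replacement in
  xor (String.sub ciphertext 0 n) delta
  ^ String.sub ciphertext n (String.length ciphertext - n)

let () =
  (* 10 arbitrary pad bytes, one per plaintext character. *)
  let pad = &quot;\x3a\x91\x07\xc4\x5e\x22\xf0\x8b\x19\x6d&quot; in
  let ciphertext = otp ~pad &quot;Alice,1234&quot; in
  let forged = swap_prefix ~known:&quot;Alice&quot; ~replacement:&quot;Carol&quot; ciphertext in
  assert (otp ~pad forged = &quot;Carol,1234&quot;);
  print_endline (otp ~pad forged)   (* prints Carol,1234 *)
</code></pre>
<p>Tamarin's symbolic <code>aenc</code> has no equivalent of <code>swap_prefix</code>, which is why the model can't see this kind of tampering.</p>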
<p>I suppose better encryption schemes prevent that sort of thing, but I'm not sure.
My fix relies on encryption (rather than signatures) to prevent tampering, which seems risky.</p>
<p>Chapter 16 of the Tamarin book (&quot;Advanced modeling of cryptographic primitives&quot;) notes that
Tamarin's default models of digital signatures and hashes
make assumptions that are not true for many real-world algorithms.
It has little to say about encryption though, except this:</p>
<blockquote>
<p>Real-world symmetric encryption schemes, such as AES, protect against an
adversary trying to learn the message that was encrypted. However, they do not
provide authentication: if we provide a random string to the AES decryption
algorithm instead of a real ciphertext, it will not fail, but instead produce a
(random looking) output.</p>
</blockquote>
<p>Thinking more about this, I wonder if I've misinterpreted the protocol.
The text in <em>Applied Cryptography</em> is:</p>
<blockquote>
<p>For added security against man-in-the-middle attacks, Alice can sign the transmission.</p>
</blockquote>
<p>(I dislike the word &quot;added&quot; here. It would be more reassuring without it.
Is it just saying that signing helps a bit, but not completely?)</p>
<p>I assumed that &quot;the transmission&quot; meant the ciphertext,
but I don't think &quot;transmission&quot; has been formally defined.
Could &quot;the transmission&quot; mean &quot;the plaintext&quot;?
Though if so, there are two things that could be signed here (the session key and the message),
and it doesn't say which one it means.</p>
<p>OK, reading around a bit further, I see that Chapter 2 says &quot;signing before encrypting is a prudent practice&quot;,
so I guess that's what it meant
(looks like this is still being debated today, e.g. <a href="https://crypto.stackexchange.com/questions/5458/should-we-sign-then-encrypt-or-encrypt-then-sign">on crypto.stackexchange.com</a>).
If so, this still seems like a good reason to use Tamarin - checking that you haven't misunderstood the protocol!</p>
<p>Hmm, but Tamarin says it still doesn't work if Alice signs the session key first, then encrypts:</p>
<ol>
<li>Alice generates a session key to talk with Mallory, signs it, and encrypts it for Mallory.
</li>
<li>Mallory decrypts the signed key and re-encrypts it for Bob.
</li>
<li>Bob thinks he's talking to Alice, but he's really talking to Mallory.
</li>
</ol>
<p>Well then I have no idea what the book meant.</p>
<h2 id="conclusions">Conclusions</h2>
<p>I found Tamarin fairly easy to use, though a few things caused me a bit of trouble:</p>
<ul>
<li>
<p>I forgot the <code>!</code> in <code>!Key</code> at first, so the fact could only be used once.
Then Tamarin couldn't find an example of it working and it took me quite a while to realise why.</p>
</li>
<li>
<p>If you forget a close quote in a lemma body, it reports a confusing syntax error.</p>
</li>
<li>
<p>Using a function without declaring it (e.g. <code>pk(x)</code>) is reported as a confusing syntax error.</p>
</li>
</ul>
<p>As well as using <code>$</code> for public values, the book also uses <code>~</code> for &quot;fresh&quot; ones.
However, it also says that doing that can affect the traces (see &quot;Effects of type annotations&quot; in the book)
so I avoided using it myself.
I'm not sure why it would be needed.</p>
<p>Tamarin's solver can usually solve lemmas in a second or so by itself.
Sometimes though it gets stuck in a loop.
This happened with my <code>Secure_net</code> example in this post.
When that happens, you can either direct the solver manually using the UI
(which is mostly useful for discovering what the problem is),
give hints to tell it to try exploring other branches first,
or write a helper lemma (tagged with <code>[reuse]</code>).</p>
<p>One important thing to be aware of is that Tamarin files can execute external commands,
so never ask Tamarin to verify a protocol without reading it first
(this feature is mentioned on page 253 of the book, under &quot;Oracles&quot;).</p>
<p>The main problem I have with Tamarin is that the built-in models of e.g. encryption seem a bit unrealistic,
and so I'm not sure how much I trust a verified proof to be secure with a real algorithm rather than the idealised one,
as it seems hard to know what assumptions it's making.</p>
<p>Whereas after <a href="https://roscidus.com/blog/blog/2019/01/01/using-tla-plus-to-understand-xen-vchan/">Using TLA+ to Understand Xen Vchan</a> I feel confident that the (fixed) protocol actually works,
the situation seems different with cryptography.
I still wouldn't feel confident that I'd designed a secure protocol, even if Tamarin verified it.</p>
<p>Making models of the protocols in <em>Applied Cryptography</em> was a useful way to understand them better.
Trying to formalise them certainly made me realise how vague the book is!
The book often didn't indicate what properties you could expect the protocols to provide
(e.g. whether a protocol was secure with an active attacker),
or mention problems that would be addressed later in the book!</p>
<p>I suppose things were different when the book was published in 1996.
SSL and SSH had only just arrived in 1995 (with OpenSSH in 1999), and most things still used plain text then;
Internet security relied on people being too polite to peek at other people's traffic.</p>
<p>As well as finding flaws in protocols, and discovering what properties protocols provide,
Tamarin also seems useful to check that you've understood the description of a protocol.
Including a Tamarin model when describing a protocol would avoid a lot of confusion.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">OCaml 5 performance part 2</title>
    <link href="https://roscidus.com/blog/blog/2024/07/22/performance-2/"></link>
    <updated>2024-07-22T11:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2024/07/22/performance-2</id>
    <content type="html"><![CDATA[<p>The <a href="/blog/blog/2024/07/22/performance/">last post</a> looked at using various tools to understand why an OCaml 5 program was waiting a long time for IO.
In this post, I'll be trying out some tools to investigate a compute-intensive program that uses multiple CPUs.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-problem">The problem</a>
</li>
<li><a href="#threadsanitizer">ThreadSanitizer</a>
</li>
<li><a href="#perf">perf</a>
</li>
<li><a href="#mpstat">mpstat</a>
</li>
<li><a href="#offcputime">offcputime</a>
</li>
<li><a href="#the-ocaml-garbage-collector">The OCaml garbage collector</a>
</li>
<li><a href="#statmemprof">statmemprof</a>
</li>
<li><a href="#magic-trace">magic-trace</a>
</li>
<li><a href="#tuning-gc-parameters">Tuning GC parameters</a>
</li>
<li><a href="#simplifying-further">Simplifying further</a>
</li>
<li><a href="#perf-sched">perf sched</a>
</li>
<li><a href="#olly">olly</a>
</li>
<li><a href="#magic-trace-on-the-simple-allocator">magic-trace on the simple allocator</a>
</li>
<li><a href="#perf-annotate">perf annotate</a>
</li>
<li><a href="#perf-c2c">perf c2c</a>
</li>
<li><a href="#perf-stat">perf stat</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
<li><a href="#update-2024-08-22">Update 2024-08-22</a>
</li>
</ul>
<p>Further discussion about this post can be found on <a href="https://discuss.ocaml.org/t/ocaml-5-performance/15014/15">discuss.ocaml.org</a>.</p>
<h2 id="the-problem">The problem</h2>
<p>OCaml 4 allowed running multiple &quot;system threads&quot;, but only one can have the OCaml runtime lock,
so only one can be running OCaml code at a time.
OCaml 5 allows running multiple &quot;domains&quot;, all of which can be running OCaml code at the same time
(each domain can also have multiple system threads; only one system thread can be running OCaml code per domain).</p>
<p>The <a href="https://github.com/ocurrent/ocaml-ci/">ocaml-ci</a> service provides CI for many OCaml programs,
and its first step when testing a commit is to run a solver to select compatible versions for its dependencies.
Running a solve typically only takes about a second, but it has to do it for each possible test platform,
which includes versions of the OCaml compiler from 4.02 to 4.14 and 5.0 to 5.2,
multiple architectures (32-bit and 64-bit x86, 32-bit and 64-bit ARM, PPC64 and s390x),
operating systems (Alpine, Debian, Fedora, FreeBSD, macOS, OpenSUSE and Ubuntu, in multiple versions), etc.
In total, this currently does 132 solver runs per commit being tested
(which seems too high to me, but let's ignore that for now).</p>
<p>The solves are done by <a href="https://github.com/ocurrent/solver-service">the solver-service</a>,
which runs on a couple of ARM machines with 160 cores each.
The old OCaml 4 version used to work by spawning lots of sub-processes,
but when OCaml 5 came out, I ported it to use a single process with multiple domains.
That removed the need for lots of communication logic,
and allowed sharing common data such as the package definitions.
The code got a lot shorter and simpler, and I'm told it's been much more reliable too.</p>
<p>But the performance was surprisingly bad.
Here's a graph showing how the number of solves per second scales with the number of CPUs (workers) being used:</p>
<p><a href="/blog/images/perf/solver-arm-orig.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-orig.svg" title="Processes scaling better than domains" class="caption"/><span class="caption-text">Processes scaling better than domains</span></span></a></p>
<p>The &quot;Processes&quot; line shows performance when forking multiple processes to do the work, which looks pretty good.
The &quot;Domains&quot; line shows what happens if you instead spawn domains inside a single process.</p>
<p>Note: The original service used many libraries (a mix of Eio and Lwt ones),
but to make investigation easier I simplified it by removing most of them.
The <a href="https://github.com/talex5/solver-service/tree/simplify">simplified version</a> doesn't use Eio or Lwt;
it just spawns some domains/processes and has each of them do the same solve in a loop a fixed number of times.</p>
<h2 id="threadsanitizer">ThreadSanitizer</h2>
<p>When converting a single-domain OCaml 4 program to use multiple cores, it's easy to introduce races.
OCaml has <a href="https://ocaml.org/manual/5.2/tsan.html">ThreadSanitizer</a> (TSan) support which can detect these.
To use it, install an OCaml compiler with the <code>tsan</code> option:</p>
<pre><code>$ opam switch create 5.2.0-tsan ocaml-variants.5.2.0+options ocaml-option-tsan
</code></pre>
<p>Things run a lot slower and require more memory with this compiler, but it's good to check:</p>
<pre><code>$ ./_build/default/stress/stress.exe --internal-workers=2
[...]
WARNING: ThreadSanitizer: data race (pid=133127)
  Write of size 8 at 0x7ff2b7814d38 by thread T4 (mutexes: write M88):
    #0 camlOpam_0install__Model.group_ors_1288 lib/model.ml:70 (stress.exe+0x1d2bba)
    #1 camlOpam_0install__Model.group_ors_1288 lib/model.ml:120 (stress.exe+0x1d2b47)
    ...

  Previous write of size 8 at 0x7ff2b7814d38 by thread T1 (mutexes: write M83):
    #0 camlOpam_0install__Model.group_ors_1288 lib/model.ml:70 (stress.exe+0x1d2bba)
    #1 camlOpam_0install__Model.group_ors_1288 lib/model.ml:120 (stress.exe+0x1d2b47)
    ...

  Mutex M88 (0x558368b95358) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1295 (libtsan.so.2+0x50468)
    #1 caml_plat_mutex_init runtime/platform.c:57 (stress.exe+0x4763b2)
    #2 caml_init_domains runtime/domain.c:943 (stress.exe+0x44ebfe)
    ...

  Mutex M83 (0x558368b95240) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1295 (libtsan.so.2+0x50468)
    #1 caml_plat_mutex_init runtime/platform.c:57 (stress.exe+0x4763b2)
    #2 caml_init_domains runtime/domain.c:943 (stress.exe+0x44ebfe)
    ...

  Thread T4 (tid=133132, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
    #1 caml_domain_spawn runtime/domain.c:1265 (stress.exe+0x4504c4)
    ...

  Thread T1 (tid=133129, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
    #1 caml_domain_spawn runtime/domain.c:1265 (stress.exe+0x4504c4)
    ...

SUMMARY: ThreadSanitizer: data race lib/model.ml:70 in camlOpam_0install__Model.group_ors_1288
</code></pre>
<p>The two mutexes mentioned in the output, M83 and M88, are the <code>domain_lock</code>,
used to ensure only one sys-thread runs at a time in each domain.
In this program we only have one sys-thread per domain and so can ignore them.</p>
<p>The output reveals that the solver used a global variable to generate unique IDs:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">fresh_id</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line">  <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">incr</span> <span class="n">i</span><span class="o">;</span>           <span class="c">(* model.ml:70 *)</span>
</span><span class="line">    <span class="o">!</span><span class="n">i</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>With that fixed, TSan finds no further problems (in this simplified version).
This gives us good confidence that there isn't any shared state:
TSan would report use of shared state not protected by a mutex,
and since the program was written for OCaml 4 it won't be using any mutexes.</p>
<p>That's good, because if one thread writes to a location that another reads then that requires coordination between CPUs,
which is relatively slow
(though we could still experience slow-downs due to <a href="https://en.wikipedia.org/wiki/False_sharing">false sharing</a>,
where two separate mutable items end up in the same cache line).
However, while fixing the race was important for correctness, it didn't make any noticeable difference to the benchmark results.</p>
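<p>For reference, here are two common ways to make a counter like <code>fresh_id</code> safe with multiple domains.
This is a sketch of the general technique, not necessarily the change made in the solver;
<code>make_fresh_id</code> is a hypothetical name.</p>
<pre><code>(* Option 1: one counter shared by all domains, updated atomically.
   IDs are unique across the whole process. *)
let fresh_id =
  let i = Atomic.make 0 in
  fun () -&gt; Atomic.fetch_and_add i 1 + 1

(* Option 2: no sharing at all; each solve creates its own generator.
   IDs are only unique within one generator. *)
let make_fresh_id () =
  let i = ref 0 in
  fun () -&gt; incr i; !i

let () =
  (* Four domains each take 1000 IDs from the shared counter. *)
  let worker () = List.init 1000 (fun _ -&gt; fresh_id ()) in
  let domains = List.init 4 (fun _ -&gt; Domain.spawn worker) in
  let ids = List.concat_map Domain.join domains in
  assert (List.length (List.sort_uniq compare ids) = 4000)
</code></pre>
<p>The first option is still shared state, so every call involves some coordination between CPUs.
The second avoids that entirely, which is preferable when IDs only need to be unique within a single solve.</p>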
<h2 id="perf">perf</h2>
<p><a href="https://perf.wiki.kernel.org/">perf</a> is the obvious tool to use when facing CPU performance problems.
<code>perf record -g PROG</code> takes samples of the program's stack regularly,
so that functions that run a lot or for a long time will appear often.
<code>perf report</code> provides a UI to explore the results:</p>
<pre><code>$ perf report
  Children      Self  Command     Shared Object      Symbol
+   59.81%     0.00%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.do_solve_2283
+   59.44%     0.00%  stress.exe  stress.exe         [.] Opam_0install.Solver.solve_1428
+   59.25%     0.00%  stress.exe  stress.exe         [.] Dune.exe.Domain_worker.solve_951
+   58.88%     0.00%  stress.exe  stress.exe         [.] Dune.exe.Stress.run_worker_332
+   58.18%     0.00%  stress.exe  stress.exe         [.] Stdlib.Domain.body_735
+   57.91%     0.00%  stress.exe  stress.exe         [.] caml_start_program
+   34.39%     0.69%  stress.exe  stress.exe         [.] Stdlib.List.iter_366
+   34.39%     0.03%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.lookup_845
+   34.39%     0.09%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.process_dep_2024
+   33.14%     0.03%  stress.exe  stress.exe         [.] Zeroinstall_solver.Sat.run_solver_1446
+   27.28%     0.00%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.build_problem_2092
+   26.27%     0.02%  stress.exe  stress.exe         [.] caml_call_gc
</code></pre>
<p>Looks like we're spending most of our time solving, as expected.
But this can be misleading.
Because perf only records stack traces when the code is running, it doesn't report any time the process spent sleeping.</p>
<pre><code>$ /usr/bin/time ./_build/default/stress/stress.exe --count=10 --internal-workers=7
73.08user 0.61system 0:12.65elapsed 582%CPU (0avgtext+0avgdata 596608maxresident)k
</code></pre>
<p>With 7 workers, we'd expect to see <code>700%CPU</code>, but we only see <code>582%</code>.</p>
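<p>As a quick check of that figure, the percentage reported by <code>time</code> is just CPU time divided by wall-clock time. A minimal sketch, using the numbers from the output above (the name <code>cpu_percent</code> is made up for illustration):</p>

```ocaml
(* Figures copied from the /usr/bin/time output above. *)
let user_s = 73.08
let system_s = 0.61
let elapsed_s = 12.65

(* CPU utilisation as a percentage of one core:
   (user + system CPU time) / wall-clock time. *)
let cpu_percent ~user ~system ~elapsed = 100. *. (user +. system) /. elapsed

let () =
  let actual = cpu_percent ~user:user_s ~system:system_s ~elapsed:elapsed_s in
  (* 7 workers, each ideally keeping one core fully busy. *)
  let ideal = 700. in
  Printf.printf "actual: %.1f%%, shortfall: %.1f%% of a core\n"
    actual (ideal -. actual)
```

<p>The shortfall comes to a little more than one core's worth of CPU time, spread across the seven workers.</p>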
<h2 id="mpstat">mpstat</h2>
<p><a href="https://www.man7.org/linux/man-pages/man1/mpstat.1.html">mpstat</a> can show a per-CPU breakdown.
Here are a couple of one second intervals on my machine while the solver was running:</p>
<pre><code>$ mpstat --dec=0 -P ALL 1
16:24:39     CPU    %usr   %sys %iowait    %irq   %soft  %steal   %idle
16:24:40     all      78      1       2       1       0       0      18
16:24:40       0      19      1       0       1       0       1      78
16:24:40       1      88      1       0       1       0       0      10
16:24:40       2      88      1       0       1       0       0      10
16:24:40       3      88      0       0       0       0       1      11
16:24:40       4      89      1       0       0       0       0      10
16:24:40       5      90      0       0       1       0       0       9
16:24:40       6      79      1       0       1       1       1      17
16:24:40       7      86      0      12       1       1       0       0

16:24:40     CPU    %usr   %sys %iowait    %irq   %soft  %steal   %idle
16:24:41     all      80      1       2       1       0       0      17
16:24:41       0      85      0      12       1       0       1       1
16:24:41       1      91      1       0       1       0       0       7
16:24:41       2      90      0       0       1       1       0       8
16:24:41       3      89      1       0       1       0       0       9
16:24:41       4      67      1       0       1       0       0      31
16:24:41       5      52      1       0       0       0       1      46
16:24:41       6      76      1       0       1       0       0      22
16:24:41       7      90      1       0       0       0       0       9
</code></pre>
<p>Note: I removed some columns with all zero values to save space.</p>
<p>We might expect to see 7 CPUs running at 100% and one idle CPU,
but in fact they're all moderately busy.
On the other hand, none of them spent more than 91% of its time running the solver code.</p>
<h2 id="offcputime">offcputime</h2>
<p><a href="https://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html">offcputime</a> will show why a process wasn't using a CPU
(it's like <code>offwaketime</code>, which we saw earlier, but doesn't record the waker).
Here I'm using <a href="https://www.man7.org/linux/man-pages/man1/pidstat.1.html">pidstat</a> to see all running threads and then examining one of the workers,
to avoid the problem we saw last time where the diagram included multiple threads:</p>
<pre><code>$ pidstat 1 -t
...
^C
Average:      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000     78304         -  550.50    9.41    0.00    0.00  559.90     -  stress.exe
Average:     1000         -     78305   91.09    1.49    0.00    0.00   92.57     -  |__stress.exe
Average:     1000         -     78307    8.42    0.99    0.00    0.00    9.41     -  |__stress.exe
Average:     1000         -     78308   90.59    1.49    0.00    0.00   92.08     -  |__stress.exe
Average:     1000         -     78310   90.59    1.49    0.00    0.00   92.08     -  |__stress.exe
Average:     1000         -     78312   91.09    1.49    0.00    0.00   92.57     -  |__stress.exe
Average:     1000         -     78314   89.11    1.49    0.00    0.00   90.59     -  |__stress.exe
Average:     1000         -     78316   89.60    1.98    0.00    0.00   91.58     -  |__stress.exe

$ sudo offcputime-bpfcc -f -t 78310 &gt; off-cpu
</code></pre>
<p>Note: The ARM machine's kernel was too old to run <code>offcputime</code>, so I ran this on my machine instead,
with one main domain and six workers.
As I needed good stacks for C functions too, I ran stress.exe in an Ubuntu 24.04 docker container,
as recent versions of Ubuntu compile with <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html">frame pointers by default</a>.</p>
<p>The raw output was very noisy, showing it waiting in many different places.
Looking at a few, it was clear it was mostly the GC (which can run from almost anywhere).
The output is just a text file with one line per stack-trace, and a bit of <code>sed</code> cleaned it up:</p>
<pre><code>$ sed -E 's/stress.exe;.*;(caml_call_gc|caml_handle_gc_interrupt|caml_poll_gc_work|asm_sysvec_apic_timer_interrupt|asm_sysvec_reschedule_ipi);/stress.exe;\1;/' off-cpu &gt; off-cpu-gc
$ flamegraph.pl --colors=blue off-cpu-gc &gt; off-cpu-gc.svg
</code></pre>
<p>That removes the part of the stack-trace before any of various interrupt-type functions that can be called from anywhere.
The graph is blue to indicate that it shows time when the process wasn't running.</p>
<p><a href="/blog/images/perf/off-cpu-gc.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/off-cpu-gc.svg" title="Time spent off-CPU" class="caption"/><span class="caption-text">Time spent off-CPU</span></span></a></p>
<p>There are rather a lot of traces where we missed the user stack.
However, the results seem clear enough: when our worker is waiting, it's in the garbage collector,
calling <code>caml_plat_spin_wait</code>.
This is used to sleep when a spin-lock has been spinning for too long (after 1000 iterations).</p>
<h2 id="the-ocaml-garbage-collector">The OCaml garbage collector</h2>
<p>OCaml has a <em>major heap</em> for long-lived values, plus one fixed-size <em>minor heap</em> for each domain.
New allocations are made sequentially on the allocating domain's minor heap
(which is very fast, just adjusting a pointer by the size required).</p>
<p>When the minor heap is full the program performs a <em>minor GC</em>,
moving any values that are still reachable to the major heap
and leaving the minor heap empty.</p>
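<p>To make the &quot;just adjusting a pointer&quot; part concrete, here is a toy model of a minor heap. It is only an illustration, not the runtime's implementation: the type and function names are invented, and the &quot;minor GC&quot; simply empties the heap without promoting anything.</p>

```ocaml
(* A toy minor heap: a fixed-size region plus one allocation pointer. *)
type minor_heap = {
  size : int;               (* capacity in words *)
  mutable next : int;       (* offset of the next free word *)
  mutable minor_gcs : int;  (* how many times the heap has been emptied *)
}

let create size = { size; next = 0; minor_gcs = 0 }

(* Allocate [words] words and return the offset of the new block.
   If the request does not fit, pretend to run a minor GC (which leaves
   the heap empty) and then allocate from the start. *)
let alloc heap words =
  if words > heap.size then invalid_arg "alloc: larger than the minor heap";
  if heap.next + words > heap.size then begin
    heap.minor_gcs <- heap.minor_gcs + 1;
    heap.next <- 0
  end;
  let addr = heap.next in
  heap.next <- heap.next + words;
  addr
```

<p>The fast path is a comparison and an addition, which is why allocation is cheap; the cost only appears when the heap fills up and a minor GC is needed.</p>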
<p>Garbage collection of the major heap is done in small slices so that the application doesn't pause for long,
and domains can do marking and sweeping work without needing to coordinate
(except at the very end of a major cycle, when they briefly synchronise to agree a new cycle is starting).</p>
<p>However, as minor GCs move values that other domains may be using, they do require all domains to stop.</p>
<p>Although the simplified test program doesn't use Eio, we can still use <a href="https://github.com/ocaml-multicore/eio-trace">eio-trace</a> to record GC events
(we just don't see any fibers).
Here's a screenshot of the solver running with 24 domains on the ARM machine,
showing it performing GC work (not all domains are visible in the picture):</p>
<p><a href="/blog/images/perf/solver-arm-gc-24.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-gc-24.svg" title="GC work shown in eio-trace" class="caption"/><span class="caption-text">GC work shown in eio-trace</span></span></a></p>
<!-- 12.5503s -->
<p>The orange/red parts show when the GC is running and the yellow regions show when the domain is waiting for other domains.
The thick columns with yellow edges are minor GCs,
while the thin (almost invisible) red columns without any yellow between them are major slices.
The second minor GC from the left took longer than usual because the third domain from the top took a while to respond.
It also didn't do a major slice before that; perhaps it was busy doing something, or maybe Linux scheduled a different process to run then.</p>
<p>Traces recorded by eio-trace can also be viewed in Perfetto, which shows the nesting better.
Here's a close-up of a single minor GC, corresponding to the bottom two domains from the second column from the left:</p>
<p><a href="/blog/images/perf/solver-arm-gc-24-perfetto.png"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-gc-24-perfetto.png" title="Close-up in Perfetto" class="caption"/><span class="caption-text">Close-up in Perfetto</span></span></a></p>
<ul>
<li>The domain triggering the GC (the bottom one here) enters a &quot;stw_leader&quot; (stop-the-world) phase
and waits for the other domains to stop.
</li>
<li>One by one, the other domains stop and enter &quot;stw_api_barrier&quot; until all domains have stopped.
</li>
<li>All domains perform a minor GC, clearing their minor heaps.
</li>
<li>They then enter a &quot;minor_leave_barrier&quot; phase, waiting until all domains have finished.
</li>
<li>Each domain returns to running application code.
</li>
</ul>
<p>We can now see why the solver spends so much time sleeping;
when a domain performs a minor GC, it spends most of the time waiting for other domains.</p>
<p>(The above is a slight simplification; domains may do some work on the major GC while waiting.)</p>
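<p>A toy model shows why one slow domain hurts everyone. This is a simplification under stated assumptions, not a description of the runtime: <code>arrive</code> and <code>work</code> are hypothetical inputs giving, for each domain, the time at which it reaches the first barrier and how long its own minor collection takes. Both arrays are assumed to be non-empty and of equal length.</p>

```ocaml
(* Toy model of one stop-the-world minor GC.
   Returns, for each domain, the total time it spends idle at the
   two barriers (same time unit as the inputs). *)
let idle_times ~arrive ~work =
  let max_of a = Array.fold_left max neg_infinity a in
  (* Nobody can start collecting until the slowest domain has stopped. *)
  let start = max_of arrive in
  (* Nobody can resume until the slowest collection has finished. *)
  let finish = start +. max_of work in
  Array.mapi
    (fun i a -> (start -. a) +. (finish -. (start +. work.(i))))
    arrive

let () =
  (* Three domains respond promptly; the fourth is late. *)
  let arrive = [| 0.; 1.; 2.; 50. |] and work = [| 5.; 5.; 5.; 5. |] in
  Array.iteri
    (fun i t -> Printf.printf "domain %d idle for %.0f time units\n" i t)
    (idle_times ~arrive ~work)
```

<p>In this model the prompt domains each do 5 units of useful GC work but sit idle for almost 50, which matches the shape of the traces above.</p>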
<h2 id="statmemprof">statmemprof</h2>
<p>One obvious solution to GC slowness is to produce less garbage in the first place.
To do that, we need to find out where the most costly allocations are coming from.
Tracing every memory allocation tends to make programs unusably slow,
so OCaml instead provides a <em>statistical</em> memory profiler.</p>
<p>It was temporarily removed in OCaml 5 because it needed updating for the new multicore GC,
but has recently been brought back and will be in OCaml 5.3.
There's a backport to 5.2, but <a href="https://github.com/janestreet/memtrace/pull/22#issuecomment-2199600729">I couldn't get it to work</a>,
so I just removed the domains stuff from the test and did a single-domain run on OCaml 4.14.
You need the <a href="https://github.com/janestreet/memtrace">memtrace</a> library to collect samples and <a href="https://github.com/janestreet/memtrace_viewer">memtrace_viewer</a> to view them:</p>
<pre><code>$ opam install memtrace memtrace_viewer
</code></pre>
<p>Put this at the start of the program to enable it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Memtrace</span><span class="p">.</span><span class="n">trace_if_requested</span> <span class="o">~</span><span class="n">context</span><span class="o">:</span><span class="s2">&quot;solver-test&quot;</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then running with <code>MEMTRACE</code> set records a trace:</p>
<pre><code>$ MEMTRACE=solver.ctf ./stress.exe --count=10
Solved warm-up request in: 1.99s
Running another 10 * 1 solves...

$ memtrace-viewer solver.ctf
Processing solver.ctf...
Serving http://localhost:8080/
</code></pre>
<p><a href="/blog/images/perf/memtrace-1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/memtrace-1.png" title="The memtrace viewer UI" class="caption"/><span class="caption-text">The memtrace viewer UI</span></span></a></p>
<p>The flame graph in the middle shows functions scaled by the amount of memory they allocated.
Initially it showed two groups, one for the warm-up request and one for the 10 runs.
To simplify the display, I used the filter panel (on the left) to show only allocations after the 2 second warm-up.
We can immediately see that <code>OpamVersionCompare.compare</code> is the source of most memory use.</p>
<p>Focusing on that function shows that it performed 54.1% of all allocations.
The display now shows the allocations performed within it in green above,
and all the places it's called from in blue below:</p>
<p><a href="/blog/images/perf/memtrace-2.png"><span class="caption-wrapper center"><img src="/blog/images/perf/memtrace-2.png" title="The compare function is expensive!" class="caption"/><span class="caption-text">The compare function is expensive!</span></span></a></p>
<p>The bulk of the allocations are coming from <a href="https://github.com/ocaml/opam/blob/a1c9c34417735687fd9310e7dc5c4c177e020441/src/core/opamVersionCompare.ml#L20-L27">this loop</a>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(* [skip_while_from i f w m] yields the index of the leftmost character</span>
</span><span class="line"><span class="c"> * in the string [s], starting from [i], and ending at [m], that does</span>
</span><span class="line"><span class="c"> * not satisfy the predicate [f], or [length w] if no such index exists.  *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">skip_while_from</span> <span class="n">i</span> <span class="n">f</span> <span class="n">w</span> <span class="n">m</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="k">rec</span> <span class="n">loop</span> <span class="n">i</span> <span class="o">=</span>
</span><span class="line">    <span class="k">if</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m</span> <span class="k">then</span> <span class="n">i</span>
</span><span class="line">    <span class="k">else</span> <span class="k">if</span> <span class="n">f</span> <span class="n">w</span><span class="o">.[</span><span class="n">i</span><span class="o">]</span> <span class="k">then</span> <span class="n">loop</span> <span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="k">else</span> <span class="n">i</span>
</span><span class="line">  <span class="k">in</span> <span class="n">loop</span> <span class="n">i</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">skip_zeros</span> <span class="n">x</span> <span class="n">xi</span> <span class="n">xl</span> <span class="o">=</span> <span class="n">skip_while_from</span> <span class="n">xi</span> <span class="o">(</span><span class="k">fun</span> <span class="n">c</span> <span class="o">-&gt;</span> <span class="n">c</span> <span class="o">=</span> <span class="sc">&#39;0&#39;</span><span class="o">)</span> <span class="n">x</span> <span class="n">xl</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It's used when processing a version like <code>1.2.3</code> to skip any leading &quot;0&quot; characters
(so that would compare equal to <code>1.02.3</code>).
The <code>loop</code> function refers to other variables (such as <code>f</code>) from its context,
and so OCaml allocates a closure on the heap to hold these variables.
Even though these allocations are small, one is needed for every component of every version.
And we compare versions a lot:
for every version of a package that says it requires e.g. <code>libfoo { &gt;= &quot;1.2&quot; }</code>,
we have to check the formula against every version of libfoo.</p>
<p>The solution is rather simple (and shorter than the original!):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">skip_while_from</span> <span class="n">i</span> <span class="n">f</span> <span class="n">w</span> <span class="n">m</span> <span class="o">=</span>
</span><span class="line">  <span class="k">if</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m</span> <span class="k">then</span> <span class="n">i</span>
</span><span class="line">  <span class="k">else</span> <span class="k">if</span> <span class="n">f</span> <span class="n">w</span><span class="o">.[</span><span class="n">i</span><span class="o">]</span> <span class="k">then</span> <span class="n">skip_while_from</span> <span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="n">f</span> <span class="n">w</span> <span class="n">m</span> <span class="k">else</span> <span class="n">i</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Removing the other allocations from <code>compare</code> too reduces total memory allocations
from 21.8G to 9.6G!
The processes benchmark got about 14% faster, while the domains one was 23% faster:</p>
<p><a href="/blog/images/perf/solver-arm-no-alloc.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-no-alloc.svg" title="Effect of reducing allocations. Old values are shown in grey." class="caption"/><span class="caption-text">Effect of reducing allocations. Old values are shown in grey.</span></span></a></p>
<p>A nice optimisation,
but using domains is still nowhere close to even the original version with separate processes.</p>
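<p>The closure allocation described above can be checked without a profiler by comparing <code>Gc.minor_words</code> before and after a run of calls. This is a rough sketch: <code>skip_while_from_closure</code>, <code>is_zero</code> and <code>words_allocated</code> are names invented here, and the exact numbers depend on the compiler version and optimisation settings (flambda, for instance, may behave differently).</p>

```ocaml
(* Original shape: [loop] captures [f], [w] and [m], so a closure may be
   allocated on every call. *)
let skip_while_from_closure i f w m =
  let rec loop i =
    if i = m then i
    else if f w.[i] then loop (i + 1) else i
  in loop i

(* Rewritten shape: everything is passed as an argument. *)
let rec skip_while_from i f w m =
  if i = m then i
  else if f w.[i] then skip_while_from (i + 1) f w m else i

let is_zero c = c = '0'

(* Words allocated on the minor heap while calling [skip] [n] times. *)
let words_allocated skip n =
  let s = "0001" in
  let before = Gc.minor_words () in
  for _i = 1 to n do
    ignore (Sys.opaque_identity (skip 0 is_zero s (String.length s)))
  done;
  Gc.minor_words () -. before

let () =
  let n = 1_000_000 in
  Printf.printf "closure version: %.0f words\n"
    (words_allocated skip_while_from_closure n);
  Printf.printf "direct version:  %.0f words\n"
    (words_allocated skip_while_from n)
```

<p>Both functions return the same index; only the amount of garbage they produce should differ.</p>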
<h2 id="magic-trace">magic-trace</h2>
<p>The traces above show the solver taking a long time for all domains to enter the <code>stw_api_barrier</code> phase.
What was the slow domain doing to cause that?
<code>magic-trace</code> lets us tell it when to save the ring buffer and we can use this to get detailed information.
Tracing multiple threads with magic-trace doesn't seem to work well
(each thread gets a very small buffer, they don't stop at quite the same time, and triggers don't work)
so I find it's better to trace just one thread.</p>
<p>I modified the OCaml runtime so that the leader (the domain requesting the GC) records the time.
As each domain enters <code>stw_api_barrier</code> it checks how late it is and calls a function to print a warning if it's above a threshold.
Then I attached magic-trace to one of the worker threads and told it to save a sample when that function got called:</p>
<p><a href="/blog/images/perf/gc-magic-1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/gc-magic-1.png" title="A domain being slow to join a minor GC" class="caption"/><span class="caption-text">A domain being slow to join a minor GC</span></span></a></p>
<p>In the example above,
magic-trace saved about 7ms of the history of a domain up to the point where it entered <code>stw_api_barrier</code>.
The first few ms show the solver working normally.
Then it needs to do a minor GC and tries to become the leader.
But another domain has the lock and so it spins, calling <code>handle_incoming</code> 293,711 times in a loop for 2.5ms.</p>
<p>I had a look at the code in the OCaml runtime.
When a domain wants to perform a minor GC, the steps are:</p>
<ol>
<li>Acquire <code>all_domains_lock</code>.
</li>
<li>Populate the <code>stw_request</code> global.
</li>
<li>Interrupt all domains.
</li>
<li>Release <code>all_domains_lock</code>.
</li>
<li>Wait for all domains to get the interrupt.
</li>
<li>Mark self as ready, allowing GC work to start.
</li>
<li>Do minor GC.
</li>
<li>The last domain to finish its minor GC signals <code>all_domains_cond</code> and everyone resumes.
</li>
</ol>
<p>I added some extra event reporting to the GC, showing when a domain is trying to perform a GC (<code>try</code>),
when the leader is signalling other domains (<code>signal</code>), and when a domain is sleeping waiting for something (<code>sleep</code>).
Here's what that looks like (in some places):</p>
<p><a href="/blog/images/perf/solver-try.png"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-try.png" title="One sleeping domain delays all the others" class="caption"/><span class="caption-text">One sleeping domain delays all the others</span></span></a></p>
<ol>
<li>The top domain finished its minor collection quickly (as it's mostly idle and had nothing to do),
and started waiting for the other domains to finish. For some reason, this sleep call took 3ms to run.
</li>
<li>The other domains resume work. One by one, they fill their minor heaps and try to start a GC.
</li>
<li>They can't start a new GC, as the old one hasn't completely finished yet, so they spin.
</li>
<li>Eventually the top domain wakes up and finishes the previous STW section.
</li>
<li>One of the other domains immediately starts a new minor GC and the pattern repeats.
</li>
</ol>
<p>These <code>try</code> events seem useful;
the program is spending much more time stuck in GC than the original traces indicated!</p>
<p>One obvious improvement here would be for idle domains to opt out of GC.
Another would be to tell the kernel when to wake instead of using sleeps —
and I see there's a PR already:
<a href="https://github.com/ocaml/ocaml/pull/12579">OS-based Synchronisation for Stop-the-World Sections</a>.</p>
<p>Another possibility would be to let domains perform minor GCs independently.
The OCaml developers did make a version that worked that way,
but it requires changes to all C code that uses the OCaml APIs,
since a value in another domain's minor heap might move while it's running.</p>
<p>Finally, I wonder if the code could be simplified a bit using a compare-and-set instead of taking a lock to become leader.
That would eliminate the <code>try</code> state, where a domain knows another domain is the leader, but doesn't know what it wants to do.
It's also strange that there's a state where
the top domain has finished its critical section and allowed the other domains to resume,
but is not quite finished enough to let a new GC start.</p>
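<p>As a rough idea of what the compare-and-set approach could look like, here is a sketch in OCaml using the standard <code>Atomic</code> module. The real runtime is written in C and tracks more state than this (such as the <code>stw_request</code> global), so the names <code>no_leader</code>, <code>stw_leader</code>, <code>try_become_leader</code> and <code>release_leadership</code> are purely hypothetical.</p>

```ocaml
(* Hypothetical sketch of leader election by compare-and-set.
   None of these names come from the OCaml runtime. *)
let no_leader = -1

(* Holds the id of the domain leading the current STW section, if any. *)
let stw_leader = Atomic.make no_leader

(* Returns [true] if domain [id] became the leader. A domain that loses
   the race learns immediately that a request is already in progress. *)
let try_become_leader id = Atomic.compare_and_set stw_leader no_leader id

(* Called by the leader once the STW section is completely finished. *)
let release_leadership id =
  if not (Atomic.compare_and_set stw_leader id no_leader) then
    invalid_arg "release_leadership: caller is not the leader"

let () =
  if try_become_leader 1 then print_endline "domain 1 leads this GC";
  if not (try_become_leader 2) then print_endline "domain 2 joins instead";
  release_leadership 1
```

<p>This only shows the election step; whether it would actually simplify the runtime depends on how the rest of the protocol uses the lock.</p>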
<p>We can work around this problem by having the main domain do work too.
That could be a problem for interactive applications (where the main domain is running the UI and needs to respond fast),
but it should be OK for the solver service.
This was about 15% faster on my machine, but appeared to have no effect on the ARM server.
Lesson: get traces on the target machine!</p>
<h2 id="tuning-gc-parameters">Tuning GC parameters</h2>
<p>Another way to reduce the synchronisation overhead of minor GCs is to make them less frequent.
We can do that by increasing the size of the minor heap,
doing a few long GCs rather than many short ones.
The size is controlled by the <code>s</code> parameter, e.g. <code>OCAMLRUNPARAM=s=8192k</code>.
On my machine, this actually makes things slower, but it's about 18% faster on the ARM server with 80 domains.</p>
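<p>The same setting can be inspected and changed from inside the program with the standard <code>Gc</code> module. The sketch below is illustrative only: the helper names are invented, and how a runtime change interacts with domains that are already running is not covered in this post, so it is safest to treat it as something to call at startup. Note that the size is in words, not bytes.</p>

```ocaml
(* The minor heap size is measured in words, not bytes. *)
let bytes_per_word = Sys.word_size / 8

let minor_heap_bytes () = (Gc.get ()).Gc.minor_heap_size * bytes_per_word

(* Roughly the in-program counterpart of OCAMLRUNPARAM=s=[words]. *)
let set_minor_heap_words words =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = words }

let () =
  Printf.printf "current minor heap: %d KiB\n" (minor_heap_bytes () / 1024);
  (* What s=8192k would mean in bytes on this platform. *)
  Printf.printf "s=8192k would be:   %d MiB per domain\n"
    (8192 * 1024 * bytes_per_word / (1024 * 1024))
```

<p>Since every domain has its own minor heap, the memory cost of a larger setting scales with the number of domains.</p>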
<p>Here are the first few domains (from a total of 24) on the ARM server with different minor heap sizes
(both are showing 1s of execution):</p>
<p><a href="/blog/images/perf/small-heap-24.png"><span class="caption-wrapper center"><img src="/blog/images/perf/small-heap-24.png" title="The default minor heap size (256k words)" class="caption"/><span class="caption-text">The default minor heap size (256k words)</span></span></a>
<a href="/blog/images/perf/big-heap-24.png"><span class="caption-wrapper center"><img src="/blog/images/perf/big-heap-24.png" title="With a larger minor heap (8192k words)" class="caption"/><span class="caption-text">With a larger minor heap (8192k words)</span></span></a>
Note that the major slices also get fewer and larger, as they happen half way between minor GCs.</p>
<p>Also, there's still a lot of variation between the time each domain spends doing GC
(despite the fact that they're all running exactly the same task), so they still end up waiting a lot.</p>
<h2 id="simplifying-further">Simplifying further</h2>
<p>This is all still pretty odd, though.
We're getting small performance increases, but still nothing like when forking.
Can the test-case be simplified further?
Yes, it turns out!
This <a href="https://gitlab.com/talex5/slow">simple function</a> takes much longer to run when using domains, compared to forking!</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">run_worker</span> <span class="n">n</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="n">n</span> <span class="o">*</span> <span class="mi">10000000</span> <span class="k">do</span>
</span><span class="line">    <span class="n">ignore</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">opaque_identity</span> <span class="o">(</span><span class="n">ref</span> <span class="bp">()</span><span class="o">))</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>ref ()</code> allocates a small block (2 words, including the header) on the minor heap.
<code>opaque_identity</code> is to make sure the compiler doesn't optimise this pointless allocation away.</p>
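<p>The allocation rate of this loop can be checked with the standard <code>Gc</code> counters. The sketch below takes the iteration count directly (rather than <code>n * 10000000</code>) so that it is easy to experiment with. At 2 words per iteration and the default 256k-word minor heap, one would expect a domain to fill its heap roughly every 131,072 iterations.</p>

```ocaml
(* The loop from above, with the iteration count as a parameter. *)
let run_worker iterations =
  for _i = 1 to iterations do
    ignore (Sys.opaque_identity (ref ()))
  done

let () =
  let iterations = 10_000_000 in
  let words0 = Gc.minor_words () in
  let gcs0 = (Gc.quick_stat ()).Gc.minor_collections in
  run_worker iterations;
  let words = Gc.minor_words () -. words0 in
  let gcs = (Gc.quick_stat ()).Gc.minor_collections - gcs0 in
  Printf.printf "%.1f words per iteration, %d minor GCs\n"
    (words /. float_of_int iterations) gcs
```

<p>With several domains running this loop at once, every one of those minor GCs is a stop-the-world event for all of them.</p>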
<p><a href="/blog/images/perf/loop-arm.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/loop-arm.svg" title="Time to run the loop on the 160-core ARM server (lower is better)" class="caption"/><span class="caption-text">Time to run the loop on the 160-core ARM server (lower is better)</span></span></a></p>
<p>Here's what I would expect here:</p>
<ol>
<li>The domains all start to fill their minor heaps. One fills it and triggers a minor GC.
</li>
<li>The triggering domain sets an indicator in each domain saying a GC is due.
None of the domains is sleeping, so the OS isn't involved in any wake-ups here.
</li>
<li>The other domains check the indicator on their next allocation,
which happens immediately since that's all they're doing.
</li>
<li>The GCs all proceed quickly, since there's nothing to scan and nothing to promote
(except possibly the current single allocation).
</li>
<li>They all resume quickly and continue.
</li>
</ol>
<p>So ideally the lines would be flat.
In practice, we may hit physical limits due to memory bandwidth, CPU temperature or kernel limitations;
I assume this is why the &quot;Processes&quot; time starts to rise eventually.
But it looks like this minor slow-down causes knock-on effects in the &quot;Domains&quot; case.</p>
<p>If I remove the allocation, then the domains and processes versions take the same amount of time.</p>
<h2 id="perf-sched">perf sched</h2>
<p><code>perf sched record</code> records kernel scheduling events, allowing it to show what is running on each CPU at all times.
<code>perf sched timehist</code> displays a report:</p>
<pre><code>$ sudo perf sched record -k CLOCK_MONOTONIC
^C

$ sudo perf sched timehist
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  185296.715345 [0000]  sway[175042]                        1.694      0.025      0.775 
  185296.716024 [0002]  crosvm_vcpu2[178276/178217]         0.012      0.000      2.957 
  185296.717031 [0003]  main.exe[196519]                    0.006      0.000      4.004 
  185296.717044 [0003]  rcu_preempt[18]                     4.004      0.015      0.012 
  185296.717260 [0001]  main.exe[196526]                    1.760      0.000      2.633 
  185296.717455 [0001]  crosvm_vcpu1[193502/193445]        63.809      0.015      0.194 
  ...
</code></pre>
<p>The first line here shows that <code>sway</code> needed to wait for 1.694 ms for some reason (possibly a sleep),
and then once it was due to resume, had to wait a further 0.025 ms for CPU 0 to be free. It then ran for 0.775 ms.
I decided to use <code>perf sched</code> to find out what the system was doing when a domain failed to respond quickly.</p>
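<p>The three timing columns are easy to pull out of the text report. The sketch below parses one <code>perf sched timehist</code> line with <code>Scanf</code>; it is a different technique from the generated Python script mentioned later in this section, the record and function names are invented, and it assumes task names contain no spaces.</p>

```ocaml
type sched_event = {
  time : float;      (* timestamp in seconds *)
  cpu : int;
  task : string;     (* e.g. "sway[175042]" *)
  wait_ms : float;   (* time spent off the CPU before being ready to run *)
  delay_ms : float;  (* time spent ready to run but waiting for a CPU *)
  run_ms : float;    (* time spent running *)
}

(* Parse one data line of the report (not the header lines). *)
let parse_line line =
  Scanf.sscanf line " %f [%d] %s %f %f %f"
    (fun time cpu task wait_ms delay_ms run_ms ->
       { time; cpu; task; wait_ms; delay_ms; run_ms })

(* True if the task had to queue for a CPU for longer than the threshold. *)
let slow_to_schedule ~threshold_ms e = e.delay_ms > threshold_ms

let () =
  let e =
    parse_line
      "  185296.715345 [0000]  sway[175042]      1.694      0.025      0.775 "
  in
  Printf.printf "%s on CPU %d: waited %.3f ms, queued %.3f ms, ran %.3f ms\n"
    e.task e.cpu e.wait_ms e.delay_ms e.run_ms
```

<p>Filtering on <code>delay_ms</code> is one way to find the moments when a domain was ready to respond to a GC request but could not get a CPU.</p>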
<p>To make the output easier to read, I hacked eio-trace to display it on the traces.
<code>perf script -g python</code> will generate a skeleton Python script that can format all the events found in the <code>perf.data</code> file,
and I used that to convert the output to CSV.
To correlate OCaml domains with Linux threads, I also modified OCaml to report the thread ID (TID) for each new domain
(it was previously reporting the PID instead for some reason).</p>
<p>Here's a trace of the simple allocator from the previous section:</p>
<p><a href="/blog/images/perf/slow-no-affinity1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/slow-no-affinity1.png" title="eio-trace with perf sched data" class="caption"/><span class="caption-text">eio-trace with perf sched data</span></span></a></p>
<!-- 404ms -->
<p>Note: the colour of <code>stw_api_barrier</code> has changed: previously eio-trace coloured it yellow to indicate sleeping,
but now we have the individual <code>sleep</code> events we can see exactly which part of it was sleeping.</p>
<p>The horizontal green bars show when each domain was running on the CPU.
Here, we see that most of the domains ran until they called <code>sleep</code>.
When the sleep timeout expires, the thread is ready to run again and goes on the run-queue.
Time spent waiting on the queue is shown with a black bar.</p>
<p>When switching to or from another process, the process name is shown.
Here we can see that <code>crosvm_vcpu6</code> interrupted one of our domains, making it late to respond to the GC request.</p>
<p>Here we see another odd feature of the protocol: even though the late domain was the last to be ready,
it wasn't able to start its GC even then, because only the leader is allowed to say when everyone is ready.
Several domains wake after the late one is ready and have to go back to sleep again.</p>
<p>The diagram also shows when Linux migrated our OCaml domains between CPUs.
For example:</p>
<ol>
<li>The bottom domain was initially running on CPU 0.
</li>
<li>After sleeping briefly, it spent a while waiting to resume and Linux moved it to CPU 6 (the leader domain, which was idle then).
</li>
<li>Once there, the bottom domain slept briefly again, and again was slow to wake, getting moved to CPU 7.
</li>
</ol>
<p>Here's another example:</p>
<p><a href="/blog/images/perf/slow-no-affinity2.png"><span class="caption-wrapper center"><img src="/blog/images/perf/slow-no-affinity2.png" title="Two domains on the same CPU" class="caption"/><span class="caption-text">Two domains on the same CPU</span></span></a></p>
<ol>
<li>The bottom domain's sleep finished a while ago, and it's been stuck on the queue because it's on the same CPU as another domain.
</li>
<li>All the other domains are spinning, trying to become the leader for the next minor GC.
</li>
<li>Eventually, Linux preempts the 5th domain from the top to run the bottom domain
(the vertical green line indicates a switch between domains in the same process).
</li>
<li>The bottom domain finishes the previous minor GC, allowing the 3rd from top to start a new one.
</li>
<li>The new GC is delayed because the 5th domain is now waiting while the bottom domain spins.
</li>
<li>Eventually the bottom domain sleeps, allowing 5 to join and the GC starts.
</li>
</ol>
<p>I tried using the <a href="https://github.com/haesbaert/ocaml-processor">processor</a> package to pin each domain to a different CPU.
That cleaned up the traces a fair bit, but didn't make much difference to the runtime on my machine.</p>
<p>I also tried using <a href="https://www.man7.org/linux/man-pages/man1/chrt.1.html">chrt</a> to run the program as a high-priority &quot;real-time&quot; task,
which also didn't seem to help.
I wrote a <code>bpftrace</code> script to report if one of our domains was ready to resume and the scheduler instead ran something else.
That turned up several different causes.
Often Linux was migrating something else out of the way and we had to wait for that,
but there were also some kernel tasks that seemed to be even higher priority, such as GPU drivers or uring workers.
I suspect that to make this work you'd need to set the affinity of all the other processes to keep them away from the cores being used
(but that wouldn't work in this example because I'm using all of them!).
Come to think of it, running a CPU-intensive task on every CPU at real-time priority was a dumb idea;
had it worked I wouldn't have been able to do anything else with the computer!</p>
<h2 id="olly">olly</h2>
<p>Exploring the scheduler behaviour was interesting, and might be needed for latency-sensitive tasks,
but how often do migrations and delays really cause trouble?
The slow GCs are interesting, but there are also sections like this where everything is going smoothly,
and minor GCs take less than 4 microseconds:</p>
<p><a href="/blog/images/perf/slow-no-affinity3.png"><span class="caption-wrapper center"><img src="/blog/images/perf/slow-no-affinity3.png" title="GCs going well" class="caption"/><span class="caption-text">GCs going well</span></span></a></p>
<p><a href="https://github.com/tarides/runtime_events_tools/">olly</a> can be used to get summary statistics:</p>
<pre><code>$ olly gc-stats './_build/default/stress/stress.exe --count=6 --internal-workers=24'
...
Solved 144 requests in 25.44s (0.18s/iter) (5.66 solves/s)

Execution times:
Wall time (s):	28.17
CPU time (s):	1.66
GC time (s):	169.88
GC overhead (% of CPU time):	10223.84%

GC time per domain (s):
Domain0: 	0.47
Domain1: 	9.34
Domain2: 	6.90
Domain3: 	6.97
Domain4: 	6.68
Domain5: 	6.85
Domain6: 	6.59
...
</code></pre>
<p>10223.84% GC overhead sounds like a lot, but I think this is misleading, for a few reasons:</p>
<ol>
<li>The CPU time looks wrong. <code>time</code> reports about 6 minutes, which sounds more likely.
</li>
<li>GC time (as we've seen) includes time spent sleeping, while CPU time doesn't.
</li>
<li>It doesn't include time spent trying to become a GC leader.
</li>
</ol>
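<p>As a rough sanity check (a sketch, not olly's actual code), the overhead figure is just GC time divided by CPU time.
The <code>360.</code> below is an assumption: it treats the &quot;about 6 minutes&quot; from <code>time</code> as the total CPU time.</p>
<pre><code class="ocaml">(* GC overhead as a percentage of CPU time. *)
let gc_overhead ~gc_time ~cpu_time = 100. *. gc_time /. cpu_time

let () =
  (* Figures reported by olly above. *)
  Printf.printf "olly's CPU time: %.0f%%\n"
    (gc_overhead ~gc_time:169.88 ~cpu_time:1.66);
  (* Assumption: roughly 360s of CPU time, as reported by time. *)
  Printf.printf "time's CPU time: %.0f%%\n"
    (gc_overhead ~gc_time:169.88 ~cpu_time:360.)
</code></pre>
<p>With the larger CPU time the figure drops below 50%, though the other two points still apply.</p>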
<p>To double-check, I modified eio-trace to report GC statistics for a saved trace:</p>
<pre><code>Solved 144 requests in 26.84s (0.19s/iter) (5.36 solves/s)
...

$ eio-trace gc-stats trace.fxt
./trace.fxt:

Ring  GC/s     App/s    Total/s   %GC
  0   10.255   19.376   29.631    34.61
  1    7.986   10.201   18.186    43.91
  2    8.195   10.648   18.843    43.49
  3    9.521   14.398   23.919    39.81
  4    9.775   16.537   26.311    37.15
  5    8.084   10.635   18.719    43.19
  6    7.977   10.356   18.333    43.51
...
 24    7.920   10.802   18.722    42.30

All  213.332  308.578  521.910    40.88

Note: all times are wall-clock and so include time spent blocking.
</code></pre>
<p>It ran slightly slower under eio-trace, perhaps because recording a trace file is more work than maintaining some counters,
but it's similar.
So this indicates that with 24 domains GC is taking about 40% of the total time (including time spent sleeping).</p>
<p>But something doesn't add up, on my machine at least:</p>
<ul>
<li>With processes, the simple allocator test's main process spends 2% of its time in GC and takes 2.4s to run.
</li>
<li>With domains, the main domain spends 20% of its time in GC and takes 8.2s.
</li>
</ul>
<p>Even if that 20% were removed completely, it should only save 20% of the 8.2s.
So with domains, the code must be running more slowly even when it's not in the GC.</p>
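<p>To make the arithmetic concrete, here is a small sketch (the function name is just for illustration)
that computes the best-case run time if the GC share were eliminated entirely:</p>
<pre><code class="ocaml">(* Best-case run time if the fraction [gc] of the run spent in GC were eliminated. *)
let without_gc ~total ~gc = total *. (1. -. gc)

let () =
  let domains = without_gc ~total:8.2 ~gc:0.20 in
  let processes = 2.4 in
  Printf.printf "domains without GC: %.2fs (%.1fx the processes version)\n"
    domains (domains /. processes)
</code></pre>
<p>Even with GC free of charge, the domains version would still take well over twice as long as the processes one.</p>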
<h2 id="magic-trace-on-the-simple-allocator">magic-trace on the simple allocator</h2>
<p>I tried running magic-trace to see what it was doing outside of the GC.
Since it wasn't calling any functions, it didn't show anything, but we can fix that:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">foo</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="mi">100</span> <span class="k">do</span>
</span><span class="line">    <span class="n">ignore</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">opaque_identity</span> <span class="o">(</span><span class="n">ref</span> <span class="bp">()</span><span class="o">))</span>
</span><span class="line">  <span class="k">done</span>
</span><span class="line"><span class="o">[@@</span><span class="n">inline</span> <span class="n">never</span><span class="o">]</span> <span class="o">[@@</span><span class="n">local</span> <span class="n">never</span><span class="o">]</span> <span class="o">[@@</span><span class="n">specialise</span> <span class="n">never</span><span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">run_worker</span> <span class="n">n</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="n">n</span> <span class="o">*</span> <span class="mi">100000</span> <span class="k">do</span>
</span><span class="line">    <span class="n">foo</span> <span class="bp">()</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here we do blocks of 100 allocations in a function called <code>foo</code>.
The annotations are to ensure the compiler doesn't inline it.
The trace was surprisingly variable!</p>
<p><a href="/blog/images/perf/foo-magic.png"><span class="caption-wrapper center"><img src="/blog/images/perf/foo-magic.png" title="magic-trace of foo between GCs" class="caption"/><span class="caption-text">magic-trace of foo between GCs</span></span></a></p>
<p>I see times for <code>foo</code> ranging from 50ns to around 750ns!</p>
<p>Note: the extra <code>foo</code> call above was probably due to a missed end event somewhere.</p>
<h2 id="perf-annotate">perf annotate</h2>
<p>I ran <code>perf record</code> on the simplified version:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">foo</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="mi">100</span> <span class="k">do</span>
</span><span class="line">    <span class="n">ignore</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">opaque_identity</span> <span class="o">(</span><span class="n">ref</span> <span class="bp">()</span><span class="o">))</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here the code is simple enough that we don't need stack-traces (so no <code>-g</code>):</p>
<pre><code>$ sudo perf record ./_build/default/main.exe
$ sudo perf annotate

       │    camlDune__exe__Main.foo_273():
       │      mov  $0x3,%eax
  0.04 │      cmp  $0xc9,%rax
       │    ↓ jg   39
  7.34 │ d:   sub  $0x10,%r15
 13.37 │      cmp  (%r14),%r15
  0.09 │    ↓ jb   3f
  0.21 │16:   lea  0x8(%r15),%rbx
 70.26 │      movq $0x400,-0x8(%rbx)
  6.66 │      movq $0x1,(%rbx)
  0.73 │      mov  %rax,%rbx
  0.00 │      add  $0x2,%rax
  0.01 │      cmp  $0xc9,%rbx
  0.66 │    ↑ jne  d
  0.28 │39:   mov  $0x1,%eax
  0.34 │    ← ret
  0.00 │3f: → call caml_call_gc
       │    ↑ jmp  16
</code></pre>
<p>The code starts by (pointlessly) checking if 1 &gt; 100 in case it can skip the whole loop.
After being disappointed, it:</p>
<ol>
<li>Decreases <code>%r15</code> (<code>young_ptr</code>) by 0x10 (two words).
</li>
<li>Checks if that's now below <code>young_limit</code>, calling <code>caml_call_gc</code> if so to clear the minor heap.
</li>
<li>Writes 0x400 to the first newly-allocated word (the block header, indicating 1 word of data).
</li>
<li>Writes 1 to the second word, which represents <code>()</code>.
</li>
<li>Increments the loop counter and loops, unless we're at the end.
</li>
<li>Returns <code>()</code>.
</li>
</ol>
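<p>The odd-looking constants in the assembly come from OCaml's value representation.
A small sketch reproducing them (the helper names are mine, not the runtime's, and the header layout assumes a default 64-bit build):</p>
<pre><code class="ocaml">(* An OCaml int [n] is stored as the machine word [2n + 1]. *)
let tagged n = (n lsl 1) lor 1

(* Block header: size in words in the high bits, then 2 colour bits, then an 8-bit tag. *)
let header ~wosize ~tag = (wosize lsl 10) lor tag

let () =
  Printf.printf "loop start: 0x%x\n" (tagged 1);    (* mov $0x3,%eax *)
  Printf.printf "loop end:   0x%x\n" (tagged 100);  (* cmp $0xc9,%rax *)
  Printf.printf "header:     0x%x\n" (header ~wosize:1 ~tag:0);  (* movq $0x400,-0x8(%rbx) *)
  Printf.printf "unit:       0x%x\n" (tagged 0)     (* movq $0x1,(%rbx) *)
</code></pre>
<p>This is also why the loop counter is incremented with <code>add $0x2</code>: adding 2 to a tagged integer adds 1 to the value it represents.</p>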
<p>Looks like we spent most of the time (77%) writing the block, which makes sense.
Reading <code>young_limit</code> took 13% of the time, which seems reasonable too.
If there was contention between domains, we'd expect to see it here.</p>
<p>The output looked similar whether using domains or processes.</p>
<h2 id="perf-c2c">perf c2c</h2>
<p>To double-check, I also tried <code>perf c2c</code>.
This reports on cache-to-cache transfers, where two CPUs are accessing the same memory,
which requires the processors to communicate and is therefore relatively slow.</p>
<pre><code>$ sudo perf c2c record
^C

$ sudo perf c2c report
  Load Operations                   :      11898
  Load L1D hit                      :       4140
  Load L2D hit                      :         93
  Load LLC hit                      :       3750
  Load Local HITM                   :        251
  Store Operations                  :     116386
  Store L1D Hit                     :     104763
  Store L1D Miss                    :      11622
...
# ----- HITM -----  ------- Store Refs ------  ------- CL --------                      ---------- cycles ----------    Total       cpu                                    Shared                       
# RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A    Off  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol    Object      Source:Line  Node
...
      7        0        7        4        0        0      0x7f90b4002b80
  ----------------------------------------------------------------------
    0.00%  100.00%    0.00%    0.00%    0.00%    0x0     0       1            0x44a704         0       144       107        8         1  [.] Dune.exe.Main.foo_273       main.exe  main.ml:7        0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x0     0       1            0x4ba7b9         0         0         0        1         1  [.] caml_interrupt_all_signal_  main.exe  domain.c:318     0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x0     0       1            0x4ba7e2         0         0       323       49         1  [.] caml_reset_young_limit      main.exe  domain.c:1658    0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x8     0       1            0x4ce94d         0         0         0        1         1  [.] caml_empty_minor_heap_prom  main.exe  minor_gc.c:622   0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x8     0       1            0x4ceed2         0         0         0        1         1  [.] caml_alloc_small_dispatch   main.exe  minor_gc.c:874   0
</code></pre>
<p>This shows a list of cache lines (memory addresses) and how often we loaded from a modified address.
There's a lot of information here and I don't understand most of it.
But I think the above is saying that address 0x7f90b4002b80 (<code>young_limit</code>, at offset 0) was accessed by these places across domains:</p>
<ul>
<li><code>main.ml:7</code> (<code>ref ()</code>) checks against <code>young_limit</code> to see if we need to call into the GC.
</li>
<li><code>domain.c:318</code> sets the limit to <code>UINTNAT_MAX</code> to signal that another domain wants a GC.
</li>
<li><code>domain.c:1658</code> sets it back to <code>young_trigger</code> after being signalled.
</li>
</ul>
<p>The same cache line was also accessed at offset 8, which contains <code>young_ptr</code> (address of last allocation):</p>
<ul>
<li><code>minor_gc.c:622</code> sets <code>young_ptr</code> to <code>young_end</code> after a GC.
</li>
<li><code>minor_gc.c:874</code> adjusts <code>young_ptr</code> to re-do the allocation that triggered the GC.
</li>
</ul>
<p>This indicates false sharing: <code>young_ptr</code> only gets accessed from one domain but it's in the same cache line as <code>young_limit</code>.</p>
<p>The main thing is that the counts are all very low, indicating that this doesn't happen often.</p>
<p>I tried adding an <code>incr x</code> on a global variable in the loop, and got some more operations reported.
But using <code>Atomic.incr</code> massively increased the number of records:</p>
<table class="table"><thead><tr><th> </th><th style="text-align: right">    Original </th><th style="text-align: right">  incr     </th><th style="text-align: right"> Atomic.incr</th></tr></thead><tbody><tr><td>Load Operations    </td><td style="text-align: right">     11,898  </td><td style="text-align: right">  25,860   </td><td style="text-align: right">  2,658,364</td></tr><tr><td>Load L1D hit       </td><td style="text-align: right">      4,140  </td><td style="text-align: right">  15,181   </td><td style="text-align: right">    326,236</td></tr><tr><td>Load L2D hit       </td><td style="text-align: right">         93  </td><td style="text-align: right">     163   </td><td style="text-align: right">        295</td></tr><tr><td>Load LLC hit       </td><td style="text-align: right">      3,750  </td><td style="text-align: right">   3,173   </td><td style="text-align: right">  2,321,704</td></tr><tr><td>Load Local HITM    </td><td style="text-align: right">        251  </td><td style="text-align: right">     299   </td><td style="text-align: right">  2,317,885</td></tr><tr><td>Store Operations   </td><td style="text-align: right">    116,386  </td><td style="text-align: right"> 462,162   </td><td style="text-align: right">  3,909,500</td></tr><tr><td>Store L1D Hit      </td><td style="text-align: right">    104,763  </td><td style="text-align: right"> 389,492   </td><td style="text-align: right">  3,908,947</td></tr><tr><td>Store L1D Miss     </td><td style="text-align: right">     11,622  </td><td style="text-align: right">  72,667   </td><td style="text-align: right">        550</td></tr></tbody></table><p>See <a href="https://joemario.github.io/blog/2016/09/01/c2c-blog/">C2C - False Sharing Detection in Linux Perf</a> for more information about all this.</p>
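<p>A minimal sketch of what the two variants might look like
(the names <code>counter</code>, <code>atomic_counter</code>, <code>foo_incr</code> and <code>foo_atomic</code> are hypothetical, not the exact code that was benchmarked):</p>
<pre><code class="ocaml">(* Hypothetical globals shared by every domain. *)
let counter = ref 0
let atomic_counter = Atomic.make 0

(* Plain increment: an ordinary load and store, with no synchronisation. *)
let foo_incr () =
  for _i = 1 to 100 do
    ignore (Sys.opaque_identity (ref ()));
    incr counter
  done

(* Atomic increment: a read-modify-write that needs exclusive ownership of the cache line. *)
let foo_atomic () =
  for _i = 1 to 100 do
    ignore (Sys.opaque_identity (ref ()));
    Atomic.incr atomic_counter
  done

let () =
  foo_incr ();
  foo_atomic ();
  Printf.printf "%d %d\n" !counter (Atomic.get atomic_counter)
</code></pre>
<p>Run from several domains at once, the plain version can lose updates, while the atomic one stays correct at the cost of bouncing the cache line between cores.</p>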
<h2 id="perf-stat">perf stat</h2>
<p><code>perf stat</code> shows statistics about a process.
I ran it with <code>-I 1000</code> to collect one-second samples.
Here are two samples from the test case on my machine,
one when it was running processes and one while it was using domains:</p>
<pre><code>$ perf stat -I 1000

# Processes
      8,032.71 msec cpu-clock         #    8.033 CPUs utilized
         2,475      context-switches  #  308.115 /sec
            51      cpu-migrations    #    6.349 /sec
            44      page-faults       #    5.478 /sec
35,268,665,452      cycles            #    4.391 GHz
48,673,075,188      instructions      #    1.38  insn per cycle
 9,815,905,270      branches          #    1.222 G/sec
    48,986,037      branch-misses     #    0.50% of all branches

# Domains
      8,008.11 msec cpu-clock         #    8.008 CPUs utilized
        10,970      context-switches  #    1.370 K/sec
           133      cpu-migrations    #   16.608 /sec
           232      page-faults       #   28.971 /sec
34,606,498,021      cycles            #    4.321 GHz
25,120,741,129      instructions      #    0.73  insn per cycle
 5,028,578,807      branches          #  627.936 M/sec
    24,402,161      branch-misses     #    0.49% of all branches
</code></pre>
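<p>The &quot;insn per cycle&quot; column is simply instructions divided by cycles.
A quick sketch to check the two samples (the function name is illustrative):</p>
<pre><code class="ocaml">(* Instructions per cycle, from the perf stat counters above. *)
let ipc ~instructions ~cycles = float_of_int instructions /. float_of_int cycles

let () =
  Printf.printf "processes: %.2f\n"
    (ipc ~instructions:48_673_075_188 ~cycles:35_268_665_452);
  Printf.printf "domains:   %.2f\n"
    (ipc ~instructions:25_120_741_129 ~cycles:34_606_498_021)
</code></pre>
<p>Both samples burn nearly the same number of cycles, but the domains version retires only about half as many instructions in them.</p>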
<p>We're doing a lot more context switches with domains, as expected due to the sleeps,
and we're executing far fewer instructions, which isn't surprising.
Reporting the counts for individual CPUs gets more interesting though:</p>
<pre><code>$ sudo perf stat -I 1000 -e instructions -Aa
# Processes
     1.000409485 CPU0        5,106,261,160      instructions
     1.000409485 CPU1        2,746,012,554      instructions
     1.000409485 CPU2       14,235,084,764      instructions
     1.000409485 CPU3        7,545,940,906      instructions
     1.000409485 CPU4        2,605,655,333      instructions
     1.000409485 CPU5        6,023,131,238      instructions
     1.000409485 CPU6        2,860,656,865      instructions
     1.000409485 CPU7        8,195,416,048      instructions
     2.001406580 CPU0        5,674,686,033      instructions
     2.001406580 CPU1        2,774,756,912      instructions
     2.001406580 CPU2       12,231,014,682      instructions
     2.001406580 CPU3        8,292,824,909      instructions
     2.001406580 CPU4        2,592,461,540      instructions
     2.001406580 CPU5        7,182,922,668      instructions
     2.001406580 CPU6        2,742,731,223      instructions
     2.001406580 CPU7        7,219,186,119      instructions
     3.002394302 CPU0        4,676,179,731      instructions
     3.002394302 CPU1        2,773,345,921      instructions
     3.002394302 CPU2       13,236,080,365      instructions
     3.002394302 CPU3        5,142,640,767      instructions
     3.002394302 CPU4        2,580,401,766      instructions
     3.002394302 CPU5       13,600,129,246      instructions
     3.002394302 CPU6        2,667,830,277      instructions
     3.002394302 CPU7        4,908,168,984      instructions

$ sudo perf stat -I 1000 -e instructions -Aa
# Domains
     1.002680009 CPU0        3,134,933,139      instructions
     1.002680009 CPU1        3,140,191,650      instructions
     1.002680009 CPU2        3,155,579,241      instructions
     1.002680009 CPU3        3,059,035,269      instructions
     1.002680009 CPU4        3,102,718,089      instructions
     1.002680009 CPU5        3,027,660,263      instructions
     1.002680009 CPU6        3,167,151,483      instructions
     1.002680009 CPU7        3,214,267,081      instructions
     2.003692744 CPU0        3,009,806,420      instructions
     2.003692744 CPU1        3,015,194,636      instructions
     2.003692744 CPU2        3,093,562,866      instructions
     2.003692744 CPU3        3,005,546,617      instructions
     2.003692744 CPU4        3,067,126,726      instructions
     2.003692744 CPU5        3,042,259,123      instructions
     2.003692744 CPU6        3,073,514,980      instructions
     2.003692744 CPU7        3,158,786,841      instructions
     3.004694851 CPU0        3,069,604,047      instructions
     3.004694851 CPU1        3,063,976,761      instructions
     3.004694851 CPU2        3,116,761,158      instructions
     3.004694851 CPU3        3,045,677,304      instructions
     3.004694851 CPU4        3,101,053,228      instructions
     3.004694851 CPU5        2,973,005,489      instructions
     3.004694851 CPU6        3,109,177,113      instructions
     3.004694851 CPU7        3,158,349,130      instructions
</code></pre>
<p>In the domains case all CPUs are doing roughly the same amount of work.
But when running separate processes the CPUs differ wildly!
Over the last 1-second interval, for example, CPU5 executed 5.3 times as many instructions as CPU4.
And indeed, some of the test processes are finishing much sooner than the others,
even though they all do the same work.</p>
<p>Setting <code>/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference</code> to <code>performance</code> didn't make it faster,
but setting it to <code>power</code> (power-saving mode) did make the processes benchmark much slower,
while having little effect on the domains case!</p>
<p>So I <em>think</em> what's happening here with separate processes is that
the CPU is boosting the performance of one or two cores at a time,
allowing them to make lots of progress.</p>
<p>But with domains this doesn't happen, either because no domain runs long enough before sleeping to trigger the boost,
or because as soon as it does it needs to stop and wait for the other domains for a GC and loses it.</p>
<h2 id="conclusions">Conclusions</h2>
<p>The main profiling and tracing tools used were:</p>
<ul>
<li><code>perf</code> to take samples of CPU use, find hot functions and hot instructions within them,
record process scheduling, look at hardware counters, and find sources of cache contention.
</li>
<li><code>statmemprof</code> to find the source of allocations.
</li>
<li><code>eio-trace</code> to visualise GC events and as a generic canvas for custom visualisations.
</li>
<li><code>magic-trace</code> to see very detailed traces of recent activity when something goes wrong.
</li>
<li><code>olly</code> to report on GC statistics.
</li>
<li><code>bpftrace</code> for quick experiments about kernel behaviour.
</li>
<li><code>offcputime</code> to see why a process is sleeping.
</li>
</ul>
<p>I think OCaml 5's runtime events tracing was the star of the show here, making it much easier to see what was going on with GC,
especially in combination with <code>perf sched</code>.
<code>statmemprof</code> is also an essential tool for OCaml, and I'll be very glad to get it back with OCaml 5.3.
I think I need to investigate <code>perf</code> more; I'd never used many of these features before.
It is important, though, to use it with <code>offcputime</code> etc. to check you're not missing samples due to sleeping.</p>
<p>Unlike the previous post's example, where the cause was pretty obvious and led to a massive easy speed-up,
this one took a lot of investigation and revealed several problems, none of which seem very easy to fix.
I'm also a lot less confident that I really understand what's happening here, but here is a summary of my current guess:</p>
<ul>
<li>OCaml applications typically allocate lots of short-lived values.
</li>
<li>With a single domain this isn't much of a problem; minor GCs are fast.
With multiple domains however we have to wait for every domain to enter the
GC, and then wait again for them all to exit.
</li>
<li>This can be very fast (4 microseconds or so per GC),
but if one domain is late due to OS scheduling then it can be much longer
(several ms in some cases).
</li>
<li>When a domain needs to wait for another it spins for a bit and then sleeps.
If the other domain runs on the same CPU then spinning delays it from running.
On the other hand, sleeping introduces longer delays and can cause the CPU to slow down.
</li>
<li>Idle domains are currently expensive.
An idle domain requires a syscall to wake it, and often causes all the other domains to sleep waiting for it.
When the idle domain does wake, it still can't start the GC and has to wait again for the leader.
</li>
<li>If the leader gets suspended while holding the lock, all the other domains will spin waiting for it (without ever sleeping).
This time isn't accounted for in the GC events reported by OCaml 5.2.
</li>
</ul>
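<p>To illustrate the spin-then-sleep trade-off, here is a minimal OCaml 5 sketch of the general technique.
It is not the runtime's implementation (which is written in C); the <code>event</code> type and the spin budget of 1000 are made up for the example.</p>
<pre><code class="ocaml">(* Requires OCaml 5. A one-shot event that waiters first spin on, then sleep on. *)
type event = { ready : bool Atomic.t; mutex : Mutex.t; cond : Condition.t }

let create () =
  { ready = Atomic.make false; mutex = Mutex.create (); cond = Condition.create () }

(* Spin up to [spins] times (an arbitrary illustrative budget) before sleeping. *)
let wait ?(spins = 1000) t =
  let rec spin n =
    if Atomic.get t.ready then true
    else if n = 0 then false
    else (Domain.cpu_relax (); spin (n - 1))
  in
  if not (spin spins) then begin
    (* Slow path: sleep until signalled. Waking us up will need a syscall. *)
    Mutex.lock t.mutex;
    while not (Atomic.get t.ready) do Condition.wait t.cond t.mutex done;
    Mutex.unlock t.mutex
  end

(* Setting the flag under the mutex ensures a sleeping waiter cannot miss it. *)
let signal t =
  Mutex.lock t.mutex;
  Atomic.set t.ready true;
  Condition.broadcast t.cond;
  Mutex.unlock t.mutex

let () =
  let t = create () in
  let waiter = Domain.spawn (fun () -&gt; wait t; "woken") in
  signal t;
  print_endline (Domain.join waiter)
</code></pre>
<p>A larger spin budget avoids the cost of sleeping and waking, but wastes CPU time that a descheduled domain on the same core could have used.</p>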
<p>Since the sleeping mechanism will be changing in OCaml 5.3,
it would probably be worthwhile checking how that performs too.
I think there are some opportunities to improve the GC, such as letting idle domains opt out of GC after one collection,
and it looks like there are opportunities to reduce the amount of synchronisation done
(e.g. by letting late arrivers start the GC without having to wait for the leader,
or using a lock-free algorithm for becoming leader).</p>
<p>For the solver, it would be good to try experimenting with CPU affinity to keep a subset of the 160 cores reserved for the solver.
Increasing the minor heap size and doing work in the main domain should also reduce the overhead of GC,
and improving the version compare function in the opam library would greatly reduce the need for it.
And if my goal was really to make it fast (rather than to improve multicore OCaml and its tooling)
then I'd probably switch it back to using processes.</p>
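<p>For example, the minor heap size can be raised from within the program using the standard <code>Gc</code> module.
This is only a sketch: the 8M-word figure is an arbitrary value for illustration, not a tuned recommendation.</p>
<pre><code class="ocaml">(* Hypothetical size: 8M words (64MB on a 64-bit machine). *)
let minor_heap_words = 8 * 1024 * 1024

let () =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = minor_heap_words };
  Printf.printf "minor heap: %d words\n" (Gc.get ()).Gc.minor_heap_size
</code></pre>
<p>The same setting can be made without recompiling by putting <code>s=8M</code> in the <code>OCAMLRUNPARAM</code> environment variable.</p>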
<p>Finally, it was really useful that both of these blog posts examined performance regressions,
so I knew it must be possible to go faster.
Without a good idea of how fast something should be, it's easy to give up too early.</p>
<p>Anyway, I hope you found some useful new tool in these posts!</p>
<h2 id="update-2024-08-22">Update 2024-08-22</h2>
<p>I reported above that using <code>chrt</code> to make the process high priority didn't help on my machine.
It also didn't help on the ARM server using the real service.
However, it <em>did</em> help <em>a lot</em> when running the simplified version of the solver on the ARM server!</p>
<p>Some more investigation showed that the real service had an additional problem:
it spawned a <code>git log</code> subprocess after every solve, and this was causing all the domains to pause
for about 50ms during the fork operation.
<a href="https://github.com/ocurrent/solver-service/pull/79">Use OCaml code to find the oldest commit</a>
eliminated the problem (and is also faster, as it can cache the history, although that doesn't matter much).</p>
<p>Here are some benchmarks of various combinations of fixes:</p>
<ul>
<li>
<p><code>Processes, sched-other</code> is the original version using multiple processes.
Oddly, using <code>sched-rr</code> to make it high-priority actually slows this one down!</p>
</li>
<li>
<p><code>Domains, sched-other</code> is the original version using domains.
The results are noisy but using <code>sched-rr</code> doesn't seem to help.</p>
</li>
<li>
<p><code>ocaml-git</code> indicates that the <code>git log</code> subprocess has been replaced by OCaml code.</p>
</li>
<li>
<p><code>new opam</code> indicates using the latest Git version of the opam library,
which includes my <a href="https://github.com/ocaml/opam/pull/6144">Reduce allocations in opamVersionCompare</a>
as well as Kate's <a href="https://github.com/ocaml/opam/pull/5518">Speedup OpamVersionCompare by 25%</a>.</p>
<p><a href="/blog/images/perf/real-service.png"><span class="caption-wrapper center"><img src="/blog/images/perf/real-service.png" title="Performance of the full service" class="caption"/><span class="caption-text">Performance of the full service</span></span></a></p>
</li>
</ul>
<p>And with that, the new multicore solver service is finally faster than the old process-based one!</p>
]]></content>
  </entry>
  <entry>
    <title type="html">OCaml 5 performance problems</title>
    <link href="https://roscidus.com/blog/blog/2024/07/22/performance/"></link>
    <updated>2024-07-22T10:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2024/07/22/performance</id>
    <content type="html"><![CDATA[<p>Linux and OCaml provide a huge range of tools for investigating performance problems.
In this post I try using some of them to understand a network performance problem.
In <a href="/blog/blog/2024/07/22/performance-2/">part 2</a>, I'll investigate a problem in a CPU-intensive multicore program.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-problem">The problem</a>
</li>
<li><a href="#time">time</a>
</li>
<li><a href="#eio-trace">eio-trace</a>
</li>
<li><a href="#strace">strace</a>
</li>
<li><a href="#bpftrace">bpftrace</a>
</li>
<li><a href="#tcpdump">tcpdump</a>
</li>
<li><a href="#ss">ss</a>
</li>
<li><a href="#offwaketime">offwaketime</a>
</li>
<li><a href="#magic-trace">magic-trace</a>
</li>
<li><a href="#summary-script">Summary script</a>
</li>
<li><a href="#fixing-it">Fixing it</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="the-problem">The problem</h2>
<p>While porting <a href="https://github.com/mirage/capnp-rpc">capnp-rpc</a> from <a href="https://github.com/ocsigen/lwt/">Lwt</a> to <a href="https://github.com/ocaml-multicore/eio">Eio</a>,
to take advantage of OCaml 5's new effects system,
I tried running the benchmark to see if it got any faster:</p>
<pre><code>$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44933.359573 # The old Lwt version
echo_bench.exe: [INFO] rate = 511.963565   # The (buggy) Eio version
</code></pre>
<p>The benchmark records the number of echo RPCs per second.
Clearly, something is very wrong here!
In fact, the new version was so slow I had to reduce the number of iterations so it would finish.</p>
<h2 id="time">time</h2>
<p>The old <code>time</code> command can immediately give us a hint:</p>
<pre><code>$ /usr/bin/time ./echo_bench.exe
1.85user 0.42system 0:02.31elapsed 98%CPU  # Lwt
0.16user 0.05system 0:01.95elapsed 11%CPU  # Eio (buggy)
</code></pre>
<p>(many shells provide their own <code>time</code> built-in with different output formats; I'm using <code>/usr/bin/time</code> here)</p>
<p><code>time</code>'s output shows time spent in user-mode (running the application's code on the CPU),
time spent in the kernel, and the total wall-clock time.
Both versions ran for around 2 seconds (doing a different number of iterations),
but the Lwt version was using the CPU 98% of the time, while the Eio version was mostly sleeping.</p>
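<p>The CPU percentage is just user plus system time divided by elapsed time.
A quick sketch reproducing the figures above (the function name is illustrative):</p>
<pre><code class="ocaml">(* Percentage of the wall-clock time that was spent on the CPU. *)
let cpu_percent ~user ~system ~elapsed = 100. *. (user +. system) /. elapsed

let () =
  Printf.printf "Lwt: %.0f%%\n" (cpu_percent ~user:1.85 ~system:0.42 ~elapsed:2.31);
  Printf.printf "Eio: %.0f%%\n" (cpu_percent ~user:0.16 ~system:0.05 ~elapsed:1.95)
</code></pre>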
<h2 id="eio-trace">eio-trace</h2>
<p><a href="https://github.com/ocaml-multicore/eio-trace">eio-trace</a> can be used to see what an Eio program is doing.
Tracing is always available (you don't need to recompile the program to get it).</p>
<pre><code>$ eio-trace run -- ./echo_bench.exe
</code></pre>
<p><code>eio-trace run</code> runs the command and displays the trace in a window.
You can also use <code>eio-trace record</code> to save a trace and examine it later.</p>
<p><a href="/blog/images/perf/capnp-eio-slow-many.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow-many.png" title="Trace of slow benchmark (12 concurrent requests)" class="caption"/><span class="caption-text">Trace of slow benchmark (12 concurrent requests)</span></span></a></p>
<p>The benchmark runs 12 test clients at once, making it a bit noisy.
To simplify things, I set it to run only one client:</p>
<p><a href="/blog/images/perf/capnp-eio-slow.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow.png" title="Trace of slow benchmark (one request at a time)" class="caption"/><span class="caption-text">Trace of slow benchmark (one request at a time)</span></span></a></p>
<p>I've zoomed the image to show the first four iterations.
The first is so quick it's not really visible, but the next three take about 40ms each.
The yellow regions labelled &quot;suspend-domain&quot; show when the program is sleeping, waiting for an event from Linux.
Each horizontal bar is a fiber (a light-weight thread). From top to bottom they are:</p>
<ul>
<li>Three rows for the test client:
<ul>
<li>The main application fiber performing the RPC call (mostly awaiting responses).
</li>
<li>The network's write fiber, sending outgoing messages (mostly waiting for something to send).
</li>
<li>The network's read fiber, reading incoming messages (mostly waiting for something to read).
</li>
</ul>
</li>
<li>Four rows for the server:
<ul>
<li>A loop accepting new incoming TCP connections.
</li>
<li>A short-lived fiber that accepts the new connection, then short-lived fibers each handling one request.
</li>
<li>The server's network write fiber.
</li>
<li>The server's network read fiber.
</li>
</ul>
</li>
<li>One fiber owned by Eio itself (used to wake up the event loop in some situations).
</li>
</ul>
<p>This trace immediately raises a couple of questions:</p>
<ul>
<li>
<p>Why is there a 40ms delay in each iteration of the test loop?</p>
</li>
<li>
<p>Why does the program briefly wake up in the middle of the first delay, do nothing, and return to sleep?
(notice the extra &quot;suspend-domain&quot; at the top)</p>
</li>
</ul>
<p>Zooming in on a section between the delays, let's see what it's doing when it's not sleeping:</p>
<p><a href="/blog/images/perf/capnp-eio-slow-zoom1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow-zoom1.png" title="Zoomed in on the active part" class="caption"/><span class="caption-text">Zoomed in on the active part</span></span></a></p>
<p>After a 40ms delay, the server's read fiber receives the next request (the running fiber is shown in green).
The read fiber spawns a fiber to handle the request, which finishes quickly, starts the next read,
and then the write fiber transmits the reply.</p>
<p>The client's read fiber gets the reply, the write fiber outputs a message, then the application fiber runs
and another message is sent.
The server reads something (presumably the first message, though it happens after the client has sent both),
then there is another long 40ms delay, then (far off the right of the image) the pattern repeats.</p>
<p>To get more context in the trace,
I <a href="https://ocaml-multicore.github.io/eio/eio/Eio/Private/Trace/index.html#val-log">configured</a>
the logging library to write the (existing) debug-level log messages to the trace buffer too:</p>
<p><a href="/blog/images/perf/capnp-eio-slow-zoom1-debug.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow-zoom1-debug.png" title="With log messages" class="caption"/><span class="caption-text">With log messages</span></span></a></p>
<p>Log messages tend to be a bit long for the trace display, so they overlap and you have to zoom right in to read them,
but they do help navigate.
With this, I can see that the first client write is &quot;Send finish&quot; and the second is &quot;Calling Echo.ping&quot;.</p>
<p>Looks like we're not buffering the output, so it's doing two separate writes rather than combining them.
That's a little inefficient, and if you've done much network programming,
you probably already know why this might cause a 40ms delay,
but let's pretend we don't know so we can play with a few more tools...</p>
<h2 id="strace">strace</h2>
<p><a href="https://github.com/strace/strace">strace</a> can be used to trace interactions between applications and the Linux kernel
(<code>-tt -T</code> shows when each call was started and how long it took):</p>
<pre><code>$ strace -tt -T ./echo_bench.exe
...
11:38:58.079200 write(2, &quot;echo_bench.exe: [INFO] Accepting&quot;..., 73) = 73 &lt;0.000008&gt;
11:38:58.079253 io_uring_enter(4, 4, 0, 0, NULL, 8) = 4 &lt;0.000032&gt;
11:38:58.079341 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000020&gt;
11:38:58.079408 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000021&gt;
11:38:58.079471 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000018&gt;
11:38:58.079525 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000019&gt;
11:38:58.079580 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000013&gt;
11:38:58.079611 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 &lt;0.000009&gt;
11:38:58.079637 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffc1661a480, 24) = -1 ETIME (Timer expired) &lt;0.018913&gt;
11:38:58.098669 futex(0x5584542b767c, FUTEX_WAKE_PRIVATE, 1) = 1 &lt;0.000105&gt;
11:38:58.098889 futex(0x5584542b7690, FUTEX_WAKE_PRIVATE, 1) = 1 &lt;0.000059&gt;
11:38:58.098976 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 &lt;0.021355&gt;
</code></pre>
<p>On Linux, Eio defaults to using the <a href="https://github.com/axboe/liburing">io_uring</a> mechanism for submitting work to the kernel.
<code>io_uring_enter(4, 2, 0, 0, NULL, 8) = 2</code> means we asked to submit 2 new operations to the ring on FD 4,
and the kernel accepted them.</p>
<p>The call at <code>11:38:58.079637</code> timed out after 19ms.
It then woke up some <a href="https://www.man7.org/linux/man-pages/man2/futex.2.html">futexes</a> and then waited again, getting woken up after a further 21ms (for a total of 40ms).</p>
<p>Futexes are used to coordinate between system threads.
<code>strace -f</code> will follow all spawned threads (and processes), not just the main one:</p>
<pre><code>$ strace -T -f ./echo_bench.exe
...
[pid 48451] newfstatat(AT_FDCWD, &quot;/etc/resolv.conf&quot;, {st_mode=S_IFREG|0644, st_size=40, ...}, 0) = 0 &lt;0.000011&gt;
...
[pid 48451] futex(0x561def43296c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY &lt;unfinished ...&gt;
...
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffe1d5d1c90, 24) = -1 ETIME (Timer expired) &lt;0.018899&gt;
[pid 48449] futex(0x561def43296c, FUTEX_WAKE_PRIVATE, 1) = 1 &lt;0.000106&gt;
[pid 48451] &lt;... futex resumed&gt;)        = 0 &lt;0.019981&gt;
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 &lt;unfinished ...&gt;
...
[pid 48451] exit(0)                     = ?
[pid 48451] +++ exited with 0 +++
[pid 48449] &lt;... io_uring_enter resumed&gt;) = 0 &lt;0.021205&gt;
...
</code></pre>
<p>The benchmark connects to <code>&quot;127.0.0.1&quot;</code> and Eio uses <code>getaddrinfo</code> to look up addresses (we can't use uring for this).
Since <code>getaddrinfo</code> can block for a long time, Eio creates a new system thread (pid 48451) to handle it
(we can guess this thread is doing name resolution because we see it read <code>resolv.conf</code>).</p>
<p>As creating system threads is a little slow, Eio keeps the thread around for a bit after it finishes in case it's needed again.
The timeout is when Eio decides that the thread isn't needed any longer and asks it to exit.
So this isn't relevant to our problem (and only happens on the first 40ms delay, since we don't look up any further addresses).</p>
<p>However, strace doesn't tell us what the uring operations were, or their return values.
One option is to switch to the <code>posix</code> backend (which is the default on other Unix systems).
In fact, it's a good idea with any performance problem to check if it still happens with a different backend:</p>
<pre><code>$ EIO_BACKEND=posix strace -T -tt ./echo_bench.exe
...
11:53:52.935976 writev(7, [{iov_base=&quot;\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0&quot;..., iov_len=40}], 1) = 40 &lt;0.000170&gt;
11:53:52.936308 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, {tv_sec=0, tv_nsec=0}, NULL, 8) = 1 ([{fd=8, revents=POLLIN}], left {tv_sec=0, tv_nsec=0}) &lt;0.000044&gt;
11:53:52.936500 writev(7, [{iov_base=&quot;\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0&quot;..., iov_len=136}], 1) = 136 &lt;0.000055&gt;
11:53:52.936831 readv(8, [{iov_base=&quot;\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0&quot;..., iov_len=4096}], 1) = 40 &lt;0.000056&gt;
11:53:52.937516 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, NULL, NULL, 8) = 1 ([{fd=8, revents=POLLIN}]) &lt;0.038972&gt;
11:53:52.977751 readv(8, [{iov_base=&quot;\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0&quot;..., iov_len=4096}], 1) = 136 &lt;0.000398&gt;
</code></pre>
<p>(to reduce clutter, I removed calls that returned <code>EAGAIN</code> and <code>ppoll</code> calls that returned 0 ready descriptors)</p>
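<p>That kind of filtering is easy to script.
Here is a minimal sketch, assuming the usual strace line layout (<code>call(args) = result ...</code>);
the names <code>keep</code> and <code>declutter</code> are hypothetical, and this is not necessarily how the output above was produced:</p>
<pre><code>import re

# Matches a ppoll call that returned 0 (no descriptors ready).
PPOLL_IDLE = re.compile(r'\bppoll\(.*\) = 0\b')

def keep(line):
    '''Return False for the strace lines treated as clutter here.'''
    if ' = -1 EAGAIN ' in line:
        return False          # non-blocking call that had nothing to do
    if PPOLL_IDLE.search(line):
        return False          # poll that found no ready descriptors
    return True

def declutter(lines):
    '''Keep only the lines that are not clutter, in their original order.'''
    return [line for line in lines if keep(line)]
</code></pre>
<p>For example, <code>declutter(open('trace.log'))</code> would filter a log saved with <code>strace -o trace.log</code> (the file name is just an example).</p>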
<p>The problem still occurs, and now we can see the two writes:</p>
<ul>
<li>The client writes 40 bytes to its end of the socket (FD 7), after which the server's end (FD 8) is ready for reading (<code>revents=POLLIN</code>).
</li>
<li>The client then writes another 136 bytes.
</li>
<li>The server reads 40 bytes and then uses <code>ppoll</code> to await further data.
</li>
<li>After 39ms, <code>ppoll</code> says FD 8 is now ready, and the server reads the other 136 bytes.
</li>
</ul>
<h2 id="bpftrace">bpftrace</h2>
<p>Alternatively, we can trace uring operations using <a href="https://github.com/bpftrace/bpftrace">bpftrace</a>.
bpftrace is a little scripting language similar to awk,
except that instead of editing a stream of characters,
it live-patches the running Linux kernel.
Apparently this is safe to run in production
(and I haven't managed to crash my kernel with it yet).</p>
<p>Here is a list of uring tracepoints we can probe:</p>
<pre><code>$ sudo bpftrace -l 'tracepoint:io_uring:*'
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_cqe_overflow
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_defer
tracepoint:io_uring:io_uring_fail_link
tracepoint:io_uring:io_uring_file_get
tracepoint:io_uring:io_uring_link
tracepoint:io_uring:io_uring_local_work_run
tracepoint:io_uring:io_uring_poll_arm
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_req_failed
tracepoint:io_uring:io_uring_short_write
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_task_add
tracepoint:io_uring:io_uring_task_work_run
</code></pre>
<p><code>io_uring_complete</code> looks promising:</p>
<pre><code>$ sudo bpftrace -vl tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_complete
    void * ctx
    void * req
    u64 user_data
    int res
    unsigned cflags
    u64 extra1
    u64 extra2
</code></pre>
<p>Here's a script to print out the time, process, operation name and result for each completion:</p>
<figure class="code"><figcaption><span>uringtrace.bt</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nc">BEGIN</span> <span class="o">{</span>
</span><span class="line">  <span class="o">@</span><span class="n">op</span><span class="o">[</span><span class="nc">IORING_OP_NOP</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;NOP&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="o">@</span><span class="n">op</span><span class="o">[</span><span class="nc">IORING_OP_READV</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;READV&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="o">...</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="n">tracepoint</span><span class="o">:</span><span class="n">io_uring</span><span class="o">:</span><span class="n">io_uring_complete</span> <span class="o">{</span>
</span><span class="line">  <span class="o">$</span><span class="n">req</span> <span class="o">=</span> <span class="o">(</span><span class="k">struct</span> <span class="n">io_kiocb</span> <span class="o">*)</span> <span class="n">args</span><span class="o">-&gt;</span><span class="n">req</span><span class="o">;</span>
</span><span class="line">  <span class="n">printf</span><span class="o">(</span><span class="s2">&quot;%dms: %s: %s %d</span><span class="se">\n</span><span class="s2">&quot;</span><span class="o">,</span>
</span><span class="line">    <span class="n">elapsed</span> <span class="o">/</span> <span class="mf">1e6</span><span class="o">,</span>
</span><span class="line">    <span class="n">comm</span><span class="o">,</span>
</span><span class="line">    <span class="o">@</span><span class="n">op</span><span class="o">[$</span><span class="n">req</span><span class="o">-&gt;</span><span class="n">opcode</span><span class="o">],</span>
</span><span class="line">    <span class="n">args</span><span class="o">-&gt;</span><span class="n">res</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="nc">END</span> <span class="o">{</span>
</span><span class="line">  <span class="n">clear</span><span class="o">(@</span><span class="n">op</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><pre><code>$ sudo bpftrace uringtrace.bt
Attaching 3 probes...
...
1743ms: echo_bench.exe: WRITE_FIXED 40
1743ms: echo_bench.exe: READV 40
1743ms: echo_bench.exe: WRITE_FIXED 136
1783ms: echo_bench.exe: READV 136
</code></pre>
<p>In this output, the order is slightly different:
we see the server's read get the 40 bytes before the client sends the rest,
but we still see the 40ms delay between the completion of the second write and the corresponding read.
The change in order is because we're seeing when the kernel knew the read was complete,
not when the application found out about it.</p>
<h2 id="tcpdump">tcpdump</h2>
<p>An obvious step with any networking problem is to look at the packets going over the network.
<a href="https://www.tcpdump.org/">tcpdump</a> can be used to capture packets, which can be displayed on the console or in a GUI with <a href="https://www.wireshark.org/">wireshark</a>.</p>
<pre><code>$ sudo tcpdump -n -ttttt -i lo
...
...041330 IP ...37640 &gt; ...7000: Flags [P.], ..., length 40
...081975 IP ...7000 &gt; ...37640: Flags [.], ..., length 0
...082005 IP ...37640 &gt; ...7000: Flags [P.], ..., length 136
...082071 IP ...7000 &gt; ...37640: Flags [.], ..., length 0
</code></pre>
<p>Here we see the client (on port 37640) sending 40 bytes to the server (port 7000),
and the server replying with an ACK (with no payload) 40ms later.
After getting the ACK, the client socket sends the remaining 136 bytes.</p>
<p>This shows that while the application made the two writes in quick succession,
TCP waited before sending the second one.
Searching for &quot;delayed ack 40ms&quot; will turn up an explanation.</p>
<h2 id="ss">ss</h2>
<p><a href="https://www.man7.org/linux/man-pages/man8/ss.8.html">ss</a> displays socket statistics.
<code>ss -tin</code> shows all TCP sockets (<code>-t</code>) with internals (<code>-i</code>):</p>
<pre><code>$ ss -tin 'sport = 7000 or dport = 7000'
State   Recv-Q   Send-Q  Local Address:Port  Peer Address:Port
ESTAB   0        0       127.0.0.1:7000      127.0.0.1:56224
 ato:40 lastsnd:34 lastrcv:34 lastack:34
ESTAB   0        176     127.0.0.1:56224     127.0.0.1:7000
 ato:40 lastsnd:34 lastrcv:34 lastack:34 unacked:1 notsent:136
</code></pre>
<p>There's a lot of output here; I've removed the irrelevant bits.
<code>ato:40</code> says there's a 40ms timeout for &quot;delay ack mode&quot;.
<code>lastsnd</code>, etc, say that nothing had happened for 34ms when this information was collected.
<code>unacked</code> and <code>notsent</code> aren't documented in the man-page,
but I guess they mean that the client (now port 56224) is waiting for 1 packet to be ACK'd and has 136 bytes held back until then.</p>
<p>The client socket still has both messages (176 bytes total) in its queue;
it can't forget about the first message until the server confirms receiving it,
since the client might need to send it again if it got lost.</p>
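<p>The bookkeeping can be checked mechanically.
This is a small sketch, assuming that <code>Send-Q</code> counts both unacknowledged and unsent bytes (which matches the numbers above);
<code>parse_ss_info</code> is a hypothetical helper that only understands plain <code>key:integer</code> fields:</p>
<pre><code>def parse_ss_info(line):
    '''Turn the key:value fields of an ss -i detail line into a dict.'''
    fields = {}
    for word in line.split():
        key, sep, value = word.partition(':')
        if sep and value.isdigit():
            fields[key] = int(value)   # keep only plain integer fields
    return fields

client = parse_ss_info(' ato:40 lastsnd:34 lastrcv:34 lastack:34 unacked:1 notsent:136')
send_q = 176                           # Send-Q column for the client socket
in_flight = send_q - client['notsent']
print(in_flight)                       # bytes sent but not yet ACKed
</code></pre>
<p>This prints 40, the size of the first message, which is the one packet still waiting for its ACK.</p>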
<p>This doesn't quite lead us to the solution, though.</p>
<h2 id="offwaketime">offwaketime</h2>
<p><a href="https://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html">offwaketime</a> records why a program stopped using the CPU, and what caused it to resume:</p>
<pre><code>$ sudo offwaketime-bpfcc -f -p (pgrep echo_bench.exe) &gt; wakes
$ flamegraph.pl --colors=chain wakes &gt; wakes.svg
</code></pre>
<p><a href="/blog/images/perf/wakes.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/wakes.svg" title="Time spent suspended along with wakeup reason" class="caption"/><span class="caption-text">Time spent suspended along with wakeup reason</span></span></a></p>
<p><code>offwaketime</code> records a stack-trace when a process is suspended (shown at the bottom and going up)
and pairs it with the stack-trace of the thread that caused it to be resumed (shown above it and going down).</p>
<p>The taller column on the right shows Eio being woken up due to TCP data being received from the network,
confirming that it was the TCP ACK that got things going again.</p>
<p>The shorter column on the left was unexpected, and the <code>[UNKNOWN]</code> in the stack is annoying
(probably C code compiled without frame pointers).
<code>gdb</code> gets a better stack trace.
It turned out to be OCaml's tick thread, which wakes every 50ms to prevent one sys-thread from hogging the CPU:</p>
<pre><code>$ strace -T -e pselect6 -p (pgrep echo_bench.exe) -f
strace: Process 20162 attached with 2 threads
...
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) &lt;0.050441&gt;
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) &lt;0.050318&gt;
</code></pre>
<p>Having multiple threads shown on the same diagram is a bit confusing.
I should probably have used <code>-t</code> to focus only on the main one.</p>
<p>Also, note that when using profiling tools that record the OCaml stack,
it's useful to compile with frame pointers enabled.
To install e.g. OCaml 5.2.0 with frame pointers enabled, use:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="sh"><span class="line">$<span class="w"> </span>opam<span class="w"> </span>switch<span class="w"> </span>create<span class="w"> </span><span class="m">5</span>.2.0-fp<span class="w"> </span>ocaml-variants.5.2.0+options<span class="w"> </span>ocaml-option-fp
</span></code></pre></td></tr></tbody></table></div></figure><h2 id="magic-trace">magic-trace</h2>
<p><a href="https://magic-trace.org/">magic-trace</a> allows capturing a short trace of everything the CPUs were doing just before some event.
It uses Intel Processor Trace to have the CPU record all control flow changes (calls, branches, etc) to a ring-buffer,
with fairly low overhead (2% to 10%, due to extra memory bandwidth needed).
When something interesting happens, we save the buffer and use it to reconstruct the recent history.</p>
<p>Normally we'd need to set up a trigger to grab the buffer at the right moment,
but since this program is mostly idle it doesn't record much
and I just attached at a random point and immediately pressed Ctrl-C to grab a snapshot and detach:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="sh"><span class="line">$<span class="w"> </span>sudo<span class="w"> </span>magic-trace<span class="w"> </span>attach<span class="w"> </span>-multi-thread<span class="w"> </span>-trace-include-kernel<span class="w"> </span><span class="se">\</span>
</span><span class="line"><span class="w">    </span>-p<span class="w"> </span><span class="o">(</span>pgrep<span class="w"> </span>echo_bench.exe<span class="o">)</span>
</span><span class="line"><span class="o">[</span><span class="w"> </span>Attached.<span class="w"> </span>Press<span class="w"> </span>Ctrl-C<span class="w"> </span>to<span class="w"> </span>stop<span class="w"> </span>recording.<span class="w"> </span><span class="o">]</span>
</span><span class="line">^C
</span></code></pre></td></tr></tbody></table></div></figure><p>As before, we see 40ms periods of waiting, with bursts of activity between them:</p>
<p><a href="/blog/images/perf/capnp-magic-1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-magic-1.png" title="Magic trace showing 40ms delays" class="caption"/><span class="caption-text">Magic trace showing 40ms delays</span></span></a></p>
<p>The output is a bit messed up because magic-trace doesn't understand that there are multiple OCaml fibers here,
each with its own stack. It also doesn't seem to know that exceptions unwind the stack.</p>
<p>In each 40ms column, <code>Eio_posix.Flow.single_read</code> (3rd line from top) tried to do a read
with <code>readv</code>, which got <code>EAGAIN</code> and called <code>Sched.next</code> to switch to the next fiber.
Since there was nothing left to run, the Eio scheduler called <code>ppoll</code>.
Linux didn't have anything ready for this process,
and called the <code>schedule</code> kernel function to switch to another process.</p>
<p>I recorded an eio-trace at the same time, to see the bigger picture.
Here's the eio-trace zoomed in to show the two client writes (just before the 40ms wait),
with the relevant bits of the magic-trace stack pasted below them:</p>
<p><a href="/blog/images/perf/capnp-magic-2.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-magic-2.png" title="Zoomed in on the two client writes, showing eio-trace and magic-trace output together" class="caption"/><span class="caption-text">Zoomed in on the two client writes, showing eio-trace and magic-trace output together</span></span></a></p>
<p>We can see the OCaml code calling <code>writev</code>, entering the kernel, <code>tcp_write_xmit</code> being called to handle it,
writing the IP packet to the network and then, because this is the loopback interface, the network receive logic
handling the packet too.
The second call is much shorter; <code>tcp_write_xmit</code> returns quickly without sending anything.</p>
<p>Note: I used the <code>eio_posix</code> backend here so it's easier to correlate the kernel operations to the application calls
(uring queues them up and runs them later).
The <a href="https://github.com/koonwen/uring-trace">uring-trace</a> project should make this easier in future, but doesn't integrate with eio-trace yet.</p>
<p>Zooming in further, it's easy to see the difference between the two calls to <code>tcp_write_xmit</code>:</p>
<p><a href="/blog/images/perf/tcp_write_xmit.png"><span class="caption-wrapper center"><img src="/blog/images/perf/tcp_write_xmit.png" title="The start of the first tcp_write_xmit and the whole of the second" class="caption"/><span class="caption-text">The start of the first tcp_write_xmit and the whole of the second</span></span></a></p>
<p>Looking at the source for <a href="https://github.com/torvalds/linux/blob/v6.6/net/ipv4/tcp_output.c#L2727-L2731"><code>tcp_write_xmit</code></a>,
we finally find the magic word &quot;<a href="https://en.wikipedia.org/wiki/Nagle's_algorithm">nagle</a>&quot;!</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">tcp_nagle_test</span><span class="p">(</span><span class="n">tp</span><span class="p">,</span><span class="w"> </span><span class="n">skb</span><span class="p">,</span><span class="w"> </span><span class="n">mss_now</span><span class="p">,</span>
</span><span class="line"><span class="w">			     </span><span class="p">(</span><span class="n">tcp_skb_is_last</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span><span class="w"> </span><span class="n">skb</span><span class="p">)</span><span class="w"> </span><span class="o">?</span>
</span><span class="line"><span class="w">			      </span><span class="nl">nonagle</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">TCP_NAGLE_PUSH</span><span class="p">))))</span>
</span><span class="line"><span class="w">	</span><span class="k">break</span><span class="p">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><h2 id="summary-script">Summary script</h2>
<p>Having identified a load of interesting events,
I wrote <a href="/blog/data/perf/summary-posix.bt">summary-posix.bt</a>, a bpftrace script to summarise them.
This includes log messages written by the application (by tracing <code>write</code> calls to stderr),
reads and writes on the sockets,
and various probed kernel functions seen in the magic-trace output and when reading the kernel source.</p>
<p>The output is specialised to this application (for example, TCP segments sent to port 7000
are displayed as &quot;to server&quot;, while others are &quot;to client&quot;).
I think this is a useful way to double-check my understanding, and any fix:</p>
<pre><code>$ sudo bpftrace summary-posix.bt
[...]
844ms: server: got ping request; sending reply
844ms: server reads from socket (EAGAIN)
844ms: server: writev(96 bytes)
844ms:   tcp_write_xmit (to client, nagle-on, packets_out=0)
844ms:   tcp_v4_send_check: sending 96 bytes to client
844ms: tcp_v4_rcv: got 96 bytes
844ms:   timer_start (tcp_delack_timer, 40 ms)
844ms: client reads 96 bytes from socket
844ms: client: enqueue finish message
844ms: client: enqueue ping call
844ms: client reads from socket (EAGAIN)
844ms: client: writev(40 bytes)
844ms:   tcp_write_xmit (to server, nagle-on, packets_out=0)
844ms:   tcp_v4_send_check: sending 40 bytes to server
845ms: tcp_v4_rcv: got 40 bytes
845ms:   timer_start (tcp_delack_timer, 40 ms)
845ms: client: writev(136 bytes)
845ms:   tcp_write_xmit (to server, nagle-on, packets_out=1)
845ms: server reads 40 bytes from socket
845ms: server reads from socket (EAGAIN)
885ms: tcp_delack_timer_handler (ACK to client)
885ms:   tcp_v4_send_check: sending 0 bytes to client
885ms: tcp_delack_timer_handler (ACK to server)
885ms: tcp_v4_rcv: got 0 bytes
885ms:   tcp_write_xmit (to server, nagle-on, packets_out=0)
885ms:   tcp_v4_send_check: sending 136 bytes to server
</code></pre>
<ol>
<li>The server replies to a ping request, sending a 96 byte reply.
Nagle is on, but nothing is awaiting an ACK (<code>packets_out=0</code>) so it gets sent immediately.
</li>
<li>The client receives the data. It starts a 40ms timer to send an ACK for it.
</li>
<li>The client enqueues a &quot;finish&quot; message, followed by another &quot;ping&quot; request.
</li>
<li>The client's write fiber sends the 40 byte &quot;finish&quot; message.
Nothing is awaiting an ACK (<code>packets_out=0</code>) so the kernel sends it immediately.
</li>
<li>The client sends the 136 byte ping request. As the previous message hasn't been ACK'd (<code>packets_out=1</code>), Nagle's algorithm holds it back.
</li>
<li>The server receives the 40 byte finish message.
</li>
<li>40ms pass. The server's delayed ACK timer fires and it sends the ACK to the client.
</li>
<li>The client's delayed ACK timer fires, but there's nothing to do (it sent the ACK with the &quot;finish&quot;).
</li>
<li>The client socket gets the ACK for its &quot;finish&quot; message and sends the delayed ping request.
</li>
</ol>
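<p>The steps above can be reduced to a toy model.
This is only a sketch under strong assumptions: every write is smaller than one segment, the network has zero latency,
and the receiver never has data of its own to send, so each segment is ACKed only when the receiver's delayed-ACK timer fires.
<code>nagle_send_times</code> is a hypothetical name, and the real kernel logic has many more cases:</p>
<pre><code>DELACK_MS = 40   # delayed-ACK timeout seen in the traces above

def nagle_send_times(writes, delack_ms=DELACK_MS):
    '''Return [(time_ms, n_bytes)] for each segment put on the wire.

    writes: list of (time_ms, n_bytes) application writes, in time order.
    '''
    sent = []
    ack_at = None    # when the in-flight segment will be ACKed, if any
    held = 0         # bytes held back, waiting for that ACK
    # The final (inf, 0) entry just drains any ACKs still outstanding.
    for now, size in list(writes) + [(float('inf'), 0)]:
        while ack_at is not None and now >= ack_at:
            if held:
                sent.append((ack_at, held))   # the ACK releases held data
                ack_at, held = ack_at + delack_ms, 0
            else:
                ack_at = None                 # nothing left in flight
        if size == 0:
            continue                          # the drain entry sends nothing
        if ack_at is None:
            sent.append((now, size))          # packets_out=0: send at once
            ack_at = now + delack_ms
        else:
            held += size                      # packets_out=1: wait for the ACK
    return sent

print(nagle_send_times([(0, 40), (0, 136)]))   # two separate writes
print(nagle_send_times([(0, 176)]))            # one combined write
</code></pre>
<p>With two separate writes the model sends 40 bytes at 0ms and the other 136 at 40ms; with a single combined write all 176 bytes leave at 0ms.</p>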
<h2 id="fixing-it">Fixing it</h2>
<p>The problem seemed clear: while porting from Lwt to Eio I'd lost the output buffering.
So I looked at the Lwt code to see how it did it and... it doesn't! So how was it working?</p>
<p>As I did with Eio, I set the Lwt benchmark's concurrency to 1 to simplify it for tracing,
and discovered that Lwt with 1 client thread has exactly the same problem as the Eio version.
Well, that's embarrassing!
But why is Lwt fast with 12 client threads?</p>
<p>With only minor changes (e.g. <code>write</code> vs <code>writev</code>), the summary script above also worked for tracing the Lwt version.
With 1 or 2 client threads, Lwt is slow, but with 3 it's fairly fast.
The delay only happens if the client sends a &quot;finish&quot; message when the server has no replies queued up
(otherwise the finish message unblocks the replies, which carry the ACK to the client immediately).
So, it works mostly by fluke!
Lwt just happens to schedule the threads in such a way that Nagle's algorithm mostly doesn't trigger with 12 concurrent requests.</p>
<p>Anyway, adding buffering to the Eio version fixed the problem:</p>
<p><a href="/blog/images/perf/capnp-before.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-before.png" title="Before" class="caption"/><span class="caption-text">Before</span></span></a>
<a href="/blog/images/perf/capnp-after.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-after.png" title="After (same scale)" class="caption"/><span class="caption-text">After (same scale)</span></span></a></p>
<p>An interesting thing to notice here is that not only did the long delay go away,
but the CPU operations while it was active were faster too!
I think the reason is that the CPU goes into power-saving mode during the long delays.
<code>cpupower monitor</code> shows my CPUs running at around 1 GHz with the old code and
around 4.7 GHz when running the new version.</p>
<p>Here are the results for the fixed version:</p>
<pre><code>$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44425.962625 # The old Lwt version
echo_bench.exe: [INFO] rate = 59653.451934 # The fixed Eio version
</code></pre>
<p>60k RPC requests per second doesn't seem that impressive, but at least it's faster than the old version,
which is good enough for now! There's clearly scope for improvement here (for example, the buffering I
added is quite inefficient, making two extra copies of every message, as the framing library copies it from
a cstruct to a string, and then I have to copy the string back to a cstruct for the kernel).</p>
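<p>To show the general idea of the buffering (this is not the actual change made to the Eio code, just a language-neutral sketch written in Python),
queued messages can be joined and handed to the socket in one call.
<code>BufferedWriter</code> and its <code>send</code> callback are hypothetical names; in real code <code>send</code> might be a socket's <code>sendall</code> method:</p>
<pre><code>class BufferedWriter:
    '''Collects outgoing messages and sends them with a single call.'''

    def __init__(self, send):
        self._send = send     # callable that takes one bytes object
        self._queue = []

    def write(self, msg):
        self._queue.append(msg)

    def flush(self):
        if self._queue:
            data = b''.join(self._queue)   # note: this copies the data
            self._queue.clear()
            self._send(data)

calls = []                     # records each write that would reach the socket
w = BufferedWriter(calls.append)
w.write(bytes(40))             # stands in for the 'finish' message
w.write(bytes(136))            # stands in for the ping call
w.flush()
print([len(c) for c in calls])
</code></pre>
<p>This prints <code>[176]</code>: the kernel sees one write containing both messages, so there is no small second segment left waiting for an ACK.</p>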
<h2 id="conclusions">Conclusions</h2>
<p>There are lots of great tools available to help understand why something is running slowly (or misbehaving),
and since programmers usually don't have much time for profiling,
a little investigation will often turn up something interesting!
Even when things are working correctly, these tools are a good way to learn more about how things work.</p>
<p><code>time</code> will quickly tell you if the program is taking lots of time in application code, in the kernel, or just sleeping.
If the problem is sleeping, <code>offcputime</code> and <code>offwaketime</code> can tell you why it was waiting and what woke it in the end.
My own <code>eio-trace</code> tool will give a quick visual overview of what an Eio application is doing.
<code>strace</code> is great for tracing interactions between applications and the kernel,
but it doesn't help much when the application is using uring.
To fix that, you can either switch to the <code>eio_posix</code> backend or use <code>bpftrace</code> with the uring tracepoints.
<code>tcpdump</code>, <code>wireshark</code> and <code>ss</code> are all useful to examine network problems specifically.</p>
<p>I've found <code>bpftrace</code> to be really useful for all kinds of tasks.
Being able to write quick one-liners or short scripts gives it great flexibility.
Since the scripts run in the kernel you can also filter and aggregate data efficiently
without having to pass it all to userspace, and you can examine any kernel data structures.
We didn't need that here because the program was running so slowly, but it's great for many problems.
In addition to using well-defined tracepoints,
it can also probe any (non-inlined) function in the kernel or the application.
Using it to create a &quot;summary script&quot; to confirm a problem and its solution also seems useful,
though this is the first time I've tried doing that.</p>
<p><code>magic-trace</code> is great for getting really detailed function-by-function tracing through the application and kernel.
Its ability to report the last few ms of activity after you notice a problem is extremely useful
(though not needed in this example).
It would be really useful if you could trigger magic-trace from a bpftrace script, but I didn't see a way to do that.</p>
<p>However, it was surprisingly difficult to get any of the tools to point directly
at the combination of Nagle's algorithm with delayed ACKs as the cause of this common problem!</p>
<p>This post was mainly focused on what was happening in the kernel.
In <a href="/blog/blog/2024/07/22/performance-2/">part 2</a>, I'll investigate a CPU-intensive problem instead.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Lambda Capabilities</title>
    <link href="https://roscidus.com/blog/blog/2023/04/26/lambda-capabilities/"></link>
    <updated>2023-04-26T10:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2023/04/26/lambda-capabilities</id>
    <content type="html"><![CDATA[<p>&quot;Is this software safe?&quot; is a question software engineers should be able to answer,
but doing so can be difficult.
Capabilities offer an elegant solution, but seem to be little known among functional programmers.
This post is an introduction to capabilities in the context of ordinary programming
(using plain functions, in the style of the lambda calculus).</p>
<!-- more -->
<p>Even if you're not interested in security,
capabilities provide a useful way to understand programs;
when trying to track down buggy behaviour,
it's very useful to know that some component <em>couldn't</em> have been the problem.</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-problem">The Problem</a>
</li>
<li><a href="#option-1-security-as-a-separate-concern">Option 1: Security as a separate concern</a>
</li>
<li><a href="#option-2-purity">Option 2: Purity</a>
</li>
<li><a href="#option-3-capabilities">Option 3: Capabilities</a>
<ul>
<li><a href="#attenuation">Attenuation</a>
</li>
<li><a href="#web-server-example">Web-server example</a>
</li>
<li><a href="#use-at-different-scales">Use at different scales</a>
</li>
<li><a href="#key-points">Key points</a>
</li>
</ul>
</li>
<li><a href="#practical-considerations">Practical considerations</a>
<ul>
<li><a href="#plumbing-capabilities-everywhere">Plumbing capabilities everywhere</a>
</li>
<li><a href="#levels-of-support">Levels of support</a>
</li>
<li><a href="#running-on-a-traditional-os">Running on a traditional OS</a>
</li>
<li><a href="#use-with-existing-security-mechanisms">Use with existing security mechanisms</a>
</li>
<li><a href="#thread-local-storage">Thread-local storage</a>
</li>
<li><a href="#symlinks">Symlinks</a>
</li>
<li><a href="#time-and-randomness">Time and randomness</a>
</li>
<li><a href="#power-boxes">Power boxes</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>( this post also appeared on <a href="https://www.reddit.com/r/ProgrammingLanguages/comments/130an3z/lambda_capabilities/">Reddit</a>, <a href="https://news.ycombinator.com/item?id=35723557">Hacker News</a> and <a href="https://lobste.rs/s/uyj3vj/lambda_capabilities">Lobsters</a> )</p>
<h2 id="the-problem">The Problem</h2>
<p>We have some application (for example, a web-server) that we want to run.
The application is many thousands of lines long and depends on dozens of third-party libraries,
which get updated on a regular basis.
I would like to be able to check, quickly and easily, that the application cannot do any of these things:</p>
<ul>
<li>Delete my files.
</li>
<li>Append a line to my <code>~/.ssh/authorized_keys</code> file.
</li>
<li>Act as a relay, allowing remote machines to attack other computers on my local network.
</li>
<li>Send telemetry to a third party.
</li>
<li>Anything else bad that I forget to think about.
</li>
</ul>
<p>For example, here are some of the OCaml packages I use just to generate this blog:</p>
<p><a href="/blog/images/lambda-caps/blog-deps.svg"><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/blog-deps.svg" title="Dependency graph for this blog" class="caption"/><span class="caption-text">Dependency graph for this blog</span></span></a></p>
<p>Having to read every line of every version of each of these packages in order to decide whether it's safe
to generate the blog clearly isn't practical.</p>
<p>I'll start by looking at traditional solutions to this problem, using e.g. containers or VMs,
and then show how to do better using capabilities.</p>
<h2 id="option-1-security-as-a-separate-concern">Option 1: Security as a separate concern</h2>
<p>A common approach to access control treats securing software as a separate activity to writing it.
Programmers write (insecure) software, and a security team writes a policy saying what it can do.
Examples include firewalls, containers, virtual machines, seccomp policies, SELinux and AppArmor.</p>
<p>The great advantage of these schemes is that security can be applied after the software is written, treating it as a black box.
However, it comes with many problems:</p>
<dl><dt>Confused deputy problem</dt>
<dd>
<p>Some actions are OK for one use but not for another.</p>
<p>For example, if the client of a web-server requests <code>https://example.com/../../etc/httpd/server-key.pem</code>
then we don't want the server to read this file and send it to them.
But the server does need to read this file for other reasons, so the policy must allow it.</p>
</dd>
<dt>Coarse-grained controls</dt>
<dd>
<p>All the modules making up the program are treated the same way,
even though you probably trust some more than others.</p>
<p>For example, we might trust the TLS implementation with the server's private key, but not the templating engine,
and I know the modules I wrote myself are not malicious.</p>
</dd>
<dt>Even well-typed programs go wrong</dt>
<dd>
<p>Programming in a language with static types is supposed to ensure that if the program compiles then it won't crash.
But the security policy can cause the program to fail even though it passed the compiler's checks.</p>
<p>For example, the server might sometimes need to send an email notification.
If it didn't do that while the security policy was being written, then that will be blocked.
Or perhaps the web-server didn't even have a notification system when the policy was written,
but has since been updated.</p>
</dd>
<dt>Policy language limitations</dt>
<dd>
<p>The security configuration is written in a new language, which must be learned.
It's usually not worth learning this just for one program,
so the people who write the program struggle to write the policy.
Also, the policy language often cannot express the desired policy,
since it may depend on concepts unique to the program
(e.g. controlling access based on a web-app user's ID, rather than local Unix user ID).</p>
</dd>
</dl>
<p>All of the above problems stem from trying to separate security from the code.
If the code were fully correct, we wouldn't need the security layer.
Checking that code is fully correct is hard,
but maybe there are easy ways to check automatically that it does at least satisfy our security requirements...</p>
<h2 id="option-2-purity">Option 2: Purity</h2>
<p>One way to prevent programs from performing unwanted actions is to prevent <em>all</em> actions.
In pure functional languages, such as Haskell, the only way to interact with the outside world is to return the action you want to perform from <code>main</code>. For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="haskell"><span class="line"><span class="nf">f</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Int</span><span class="w"> </span><span class="ow">-&gt;</span><span class="w"> </span><span class="kt">String</span>
</span><span class="line"><span class="nf">f</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="nf">main</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">IO</span><span class="w"> </span><span class="nb">()</span>
</span><span class="line"><span class="nf">main</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="n">putStr</span><span class="w"> </span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="mi">42</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Even if we don't look at the code of <code>f</code>, we can be sure it only returns a <code>String</code> and performs no other actions
(assuming <a href="https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/safe_haskell.html">Safe Haskell</a> is being used).
Assuming we trust <code>putStr</code>, we can be sure this program will only output a string to stdout and not perform any other actions.</p>
<p>However, writing only pure code is quite limiting. Also, we still need to audit all IO code.</p>
<h2 id="option-3-capabilities">Option 3: Capabilities</h2>
<p>Consider this code (written in a small OCaml-like functional language, where <code>ref n</code> allocates a new memory location
initially containing <code>n</code>, and <code>!x</code> reads the current value of <code>x</code>):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span> <span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">5</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">y</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">10</span> <span class="k">in</span>
</span><span class="line">  <span class="n">f</span> <span class="n">x</span><span class="o">;</span>
</span><span class="line">  <span class="k">assert</span> <span class="o">(!</span><span class="n">y</span> <span class="o">=</span> <span class="mi">10</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Can we be sure that the assert won't fail, without knowing the definition of <code>f</code>?
Assuming the language doesn't provide unsafe backdoors (such as OCaml's <code>Obj.magic</code>), we can.
<code>f x</code> cannot change <code>y</code>, because <code>f x</code> does not have access to <code>y</code>.</p>
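<p>To make that concrete, here is a minimal runnable sketch in real OCaml.
The body I've given <code>f</code> is purely hypothetical; the point is that whatever we put there,
<code>y</code> is not in scope and was not passed in, so only <code>x</code> can change:</p>

```ocaml
(* A hypothetical definition of [f]: it may do anything it likes with its
   argument, but [y] is not in scope here, so it has no way to reach it. *)
let f a =
  a := !a * 2 + 1

let () =
  let x = ref 5 in
  let y = ref 10 in
  f x;
  assert (!x = 11);   (* [f] was given [x], so it was able to change it *)
  assert (!y = 10)    (* [f] was never given [y], so it is untouched *)
```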
<p>So here is an access control system, built into the lambda calculus itself!
At first glance this might not look very promising.
For example, while <code>f</code> doesn't have access to <code>y</code>, it does have access to any global variables defined before <code>f</code>.
It also, typically, has access to the file-system and network,
which are effectively globals too.</p>
<p>To make this useful, we ban global variables.
Then any top-level function like <code>f</code> can only access things passed to it explicitly as arguments.
Avoiding global variables is usually considered good practice, and some systems ban them for other reasons anyway
(for example, safe Rust doesn't allow unsynchronised global mutable state, as it wouldn't be able to prevent races accessing it from multiple threads).</p>
<p>Returning to the Haskell example above (but now in OCaml syntax),
it looks like this in our capability system:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">f</span> <span class="n">x</span> <span class="o">=</span> <span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">ch</span> <span class="o">=</span> <span class="n">output_string</span> <span class="n">ch</span> <span class="o">(</span><span class="n">f</span> <span class="mi">42</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Since <code>f</code> is a top-level function, we know it does not close over any mutable state, and our <code>42</code> argument is pure data.
Therefore, the call <code>f 42</code> does not have access to, and therefore cannot affect,
any pre-existing state (including the filesystem).
Internally, it can use mutation (creating arrays, etc),
but it has nowhere to store any mutable values and so they will get GC'd after it returns.
<code>f</code> therefore appears as a pure function, and calling it multiple times will always give the same result,
just as in the Haskell version.</p>
<p><code>output_string</code> is also a top-level function, closing over no mutable state.
However, the function resulting from evaluating <code>output_string ch</code> is not top-level,
and without knowing anything more about it we should assume it has full access to the output channel <code>ch</code>.</p>
<p>If <code>main</code> is invoked with standard output as its argument, it may output a message to it,
but cannot affect other pre-existing state.</p>
<p>In this way, we can reason about the pure parts of our code as easily as with Haskell,
but we can also reason about the parts with side-effects.
Haskell's purity is just a special case of a more general rule:
the effects of a (top-level) function are bounded by its arguments.</p>
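<p>As a rough illustration of that rule, here is a sketch where the capability passed to <code>main</code>
is just a function that appends to a buffer. The body of <code>f</code> and the use of <code>Buffer</code> are my own choices for the example.
Note that real OCaml also provides ambient functions such as <code>print_string</code>,
so this only models the discipline described above rather than enforcing it:</p>

```ocaml
(* Pure data in, pure data out (a hypothetical body for [f]). *)
let f x = string_of_int (x * 2)

(* [main] can only affect the outside world through [out],
   the single capability it is given. *)
let main out = out (f 21)

let () =
  let buf = Buffer.create 16 in
  let untouched = ref 0 in
  (* Grant [main] the ability to append to [buf], and nothing else. *)
  main (Buffer.add_string buf);
  assert (Buffer.contents buf = "42");
  assert (!untouched = 0)
```

<p>Passing a function that writes to standard output instead would let the same <code>main</code> print its result,
without changing anything in <code>main</code> or <code>f</code>.</p>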
<h3 id="attenuation">Attenuation</h3>
<p>So far, we've been thinking about what values are reachable through other values.
For example, the set of ref-cells that can be modified by <code>f x</code> is bounded by
the union of the set of ref cells reachable from the closure <code>f</code>
with the set of ref cells reachable from <code>x</code>.</p>
<p>One powerful aspect of capabilities is that we can use functions to implement whatever access controls we want.
For example, let's say we only want <code>f</code> to be able to set the ref-cell, but not read it.
We can just pass it a suitable function:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="n">x</span> <span class="o">:=</span> <span class="n">v</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="n">f</span> <span class="n">set</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or perhaps we only want to allow setting it to positive integers:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">if</span> <span class="n">v</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">then</span> <span class="n">set</span> <span class="n">v</span>
</span><span class="line">  <span class="k">else</span> <span class="n">invalid_arg</span> <span class="s2">&quot;Positive values only!&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or we can allow access to be revoked:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ref</span> <span class="o">(</span><span class="nc">Some</span> <span class="n">set</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="o">!</span><span class="n">r</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">fn</span> <span class="o">-&gt;</span> <span class="n">fn</span> <span class="n">v</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">invalid_arg</span> <span class="s2">&quot;Access revoked!&quot;</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="o">...</span>
</span><span class="line"><span class="n">r</span> <span class="o">:=</span> <span class="nc">None</span>		<span class="c">(* Revoke *)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or we could limit the number of times it can be used:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">used</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">if</span> <span class="o">!</span><span class="n">used</span> <span class="o">&lt;</span> <span class="mi">3</span> <span class="k">then</span> <span class="o">(</span><span class="n">incr</span> <span class="n">used</span><span class="o">;</span> <span class="n">set</span> <span class="n">v</span><span class="o">)</span>
</span><span class="line">  <span class="k">else</span> <span class="n">invalid_arg</span> <span class="s2">&quot;Quota exceeded&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or log each time it is used, tagged with a label that's meaningful to us
(e.g. the function to which we granted access):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">log</span> <span class="o">=</span> <span class="n">ref</span> <span class="bp">[]</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">name</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="n">sprintf</span> <span class="s2">&quot;%S set it to %d&quot;</span> <span class="n">name</span> <span class="n">v</span> <span class="k">in</span>
</span><span class="line">  <span class="n">log</span> <span class="o">:=</span> <span class="n">msg</span> <span class="o">::</span> <span class="o">!</span><span class="n">log</span><span class="o">;</span>
</span><span class="line">  <span class="n">set</span> <span class="n">v</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="n">f</span> <span class="o">(</span><span class="n">set</span> <span class="s2">&quot;f&quot;</span><span class="o">);</span>
</span><span class="line"><span class="n">g</span> <span class="o">(</span><span class="n">set</span> <span class="s2">&quot;g&quot;</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or all of the above.</p>
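<p>Here is one way the wrappers might be combined, as a self-contained sketch.
The helper names (<code>positive_only</code>, <code>with_quota</code> and <code>revocable</code>) are made up for this example;
each one takes a setter and returns a more restricted setter, so they can simply be stacked:</p>

```ocaml
(* Each wrapper takes a setter and returns a more restricted one. *)
let positive_only set v =
  if v > 0 then set v else invalid_arg "Positive values only!"

let with_quota n set =
  let used = ref 0 in
  fun v ->
    if !used < n then (incr used; set v)
    else invalid_arg "Quota exceeded"

(* Returns the wrapped setter together with a function that revokes it. *)
let revocable set =
  let r = ref (Some set) in
  let set v =
    match !r with
    | Some fn -> fn v
    | None -> invalid_arg "Access revoked!"
  in
  (set, fun () -> r := None)

let () =
  let x = ref 0 in
  let set, revoke =
    revocable (with_quota 2 (positive_only (fun v -> x := v)))
  in
  set 1;
  set 7;
  assert (!x = 7);
  (* The quota of two calls is now used up. *)
  assert (try set 8; false with Invalid_argument _ -> true);
  revoke ();
  assert (try set 9; false with Invalid_argument _ -> true);
  assert (!x = 7)
```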
<p>In these examples, our function <code>f</code> never got direct access (permission) to <code>x</code>, yet was still able to affect it.
Therefore, in capability systems people often talk about &quot;authority&quot; rather than permission.
Roughly speaking, the <em>authority</em> of a subject is the set of actions that the subject could cause to happen,
now or in the future, on currently-existing resources.
Since it's only things that <em>might</em> happen, and we don't want to read all the code to find out exactly what
it might do, we're usually only interested in getting an upper-bound on a subject's authority,
to show that it <em>can't</em> do something.</p>
<p>The examples here all used a single function.
We may want to allow multiple operations on a single value (e.g. getting and setting a ref-cell),
and the usual techniques are available for doing that (e.g. having the function take the operation as its first argument,
or collecting separate functions together in a record, module or object).</p>
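<p>For instance, the record approach might look something like this.
The <code>cell</code> type and the <code>make_cell</code> and <code>read_only</code> helpers are hypothetical names invented for the sketch:</p>

```ocaml
(* A cell capability offering two operations, collected in a record. *)
type cell = {
  get : unit -> int;
  set : int -> unit;
}

let make_cell init =
  let x = ref init in
  { get = (fun () -> !x); set = (fun v -> x := v) }

(* Attenuate a cell so that holders can read it but not change it. *)
let read_only c =
  { c with set = (fun _ -> invalid_arg "Read-only!") }

let () =
  let c = make_cell 5 in
  let ro = read_only c in
  c.set 6;
  assert (ro.get () = 6);   (* [ro] sees updates made through [c] *)
  assert (try ro.set 7; false with Invalid_argument _ -> true);
  assert (c.get () = 6)
```

<p>We can keep <code>c</code> for ourselves and hand out <code>ro</code> to code that should only observe the value.</p>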
<h3 id="web-server-example">Web-server example</h3>
<p>Let's look at a more realistic example.
Here's a simple web-server (we are defining the <code>main</code> function, which takes two arguments):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">net</span> <span class="n">htdocs</span> <span class="o">=</span>
</span><span class="line">  <span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>To use it, we pass it access to some network (<code>net</code>) and a directory tree with the content (<code>htdocs</code>).
Immediately we can see that this server does not access any part of the file-system outside of <code>htdocs</code>,
but that it may use the network. Here's a picture of the situation:</p>
<p><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/web1.svg" title="Initial reference graph" class="caption"/><span class="caption-text">Initial reference graph</span></span></p>
<p>Notes on reading the diagram:</p>
<ul>
<li>The diagram shows a model of the reference graph, where each node represents some value (function, record, tuple, etc)
or aggregated group of values.
</li>
<li>An arrow from A to B indicates the possibility that some value in the group A holds a reference to
some value in the group B.
</li>
<li>The model is typically an <em>over-approximation</em>, so the lack of an arrow from A to B means that no such reference
exists, while the presence of an arrow just means we haven't ruled it out.
</li>
<li>Orange nodes here represent OCaml values.
</li>
<li>White boxes are directories.
They include all contained files and subdirectories, except those shown separately.
I've pulled out <code>htdocs</code> so we can see that <code>app</code> doesn't have access to the rest of <code>home</code>.
Just for emphasis, I also show <code>.ssh</code> separately.
I'm assuming here that a directory doesn't give access to its parent,
so <code>htdocs</code> can only be used to read files within that sub-tree.
</li>
<li><code>net</code> represents the network and everything else connected to it.
</li>
<li>In most operating systems, directories exist in the kernel's address space,
and so you cannot have a direct reference to them.
That's not a problem, but for now you may find it easier to imagine a system where the kernel and applications
are all a single program, in a single programming language.
</li>
<li>This diagram represents the state at a particular moment in time (when starting the application).
We could also calculate and show all the references that might ever come to exist,
given what we know about the behaviour of <code>app</code> and <code>net</code>.
Since we don't yet know anything about either,
we would have to assume that <code>app</code> might give <code>net</code> access to <code>htdocs</code> and to itself.
</li>
</ul>
<p>So, the diagram above shows the application <code>app</code> has been given references to <code>net</code> and to <code>htdocs</code> as arguments.</p>
<p>Looking at our checklist from the start:</p>
<ul>
<li>It can't delete all my files, but it might delete the ones in <code>htdocs</code>.
</li>
<li>It can't edit <code>~/.ssh/authorized_keys</code>.
</li>
<li>It might act as a relay, allowing remote machines to attack other computers on my local network.
</li>
<li>It might send telemetry to a third party.
</li>
</ul>
<p>We can read the body of the function to learn more:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">net</span> <span class="n">htdocs</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">socket</span> <span class="o">=</span> <span class="nn">Net</span><span class="p">.</span><span class="n">listen</span> <span class="n">net</span> <span class="o">(`</span><span class="nc">Tcp</span> <span class="mi">8080</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">handler</span> <span class="o">=</span> <span class="n">static_files</span> <span class="n">htdocs</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="n">handler</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: <code>Net.listen net</code> is typical OCaml style for performing the <code>listen</code> operation on <code>net</code>.
We could also have used a record and written <code>net.listen</code> instead, which may look more familiar to some readers.</p>
<p>Here's an updated diagram, showing the moment when <code>Http.serve</code> is called.
The <code>app</code> group has been opened to show <code>socket</code> and <code>handler</code> separately:</p>
<p><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/web2.svg" title="After reading the code of main" class="caption"/><span class="caption-text">After reading the code of main</span></span></p>
<p>We can see that the code in the HTTP library can only access the network via <code>socket</code>,
and can only access <code>htdocs</code> by using <code>handler</code>.
Assuming <code>Net.listen</code> is trustworthy (we'll normally trust the platform's networking layer),
it's clear that the application doesn't make outbound connections,
since <code>net</code> is used only to create a listening socket.</p>
<p>To know what the application might do to <code>htdocs</code>, we only have to read the definition of <code>static_files</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">static_files</span> <span class="n">dir</span> <span class="n">request</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Path</span><span class="p">.</span><span class="n">load</span> <span class="o">(</span><span class="n">dir</span> <span class="o">/</span> <span class="n">request</span><span class="o">.</span><span class="n">path</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Now we can see that the application doesn't change any files; it only uses <code>htdocs</code> to read them.</p>
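<p>To experiment with this reasoning without a real file-system, here is a toy model.
Everything in it is an assumption made for illustration: the <code>dir</code> type is just a list of files,
<code>load</code> stands in for <code>Path.load</code>, and the <code>request</code> record has only a <code>path</code> field.
As in the diagram, the directory value gives no route to its parent:</p>

```ocaml
(* A toy stand-in for a directory capability: the (name, contents) pairs
   for the files in that sub-tree. This is not a real file-system API. *)
type dir = (string * string) list

(* The only operation offered on a [dir]; there is no save or delete. *)
let load (dir : dir) name = List.assoc_opt name dir

type request = { path : string }

let static_files dir request = load dir request.path

let () =
  let htdocs = [ ("index.html", "Hello") ] in
  let handler = static_files htdocs in
  assert (handler { path = "index.html" } = Some "Hello");
  (* No name leads out of [htdocs], so this lookup just fails. *)
  assert (handler { path = "../.ssh/authorized_keys" } = None)
```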
<p>Finally, expanding <code>Http.serve</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">serve</span> <span class="n">socket</span> <span class="n">handle_request</span> <span class="o">=</span>
</span><span class="line">  <span class="k">while</span> <span class="bp">true</span> <span class="k">do</span>
</span><span class="line">    <span class="k">let</span> <span class="n">conn</span> <span class="o">=</span> <span class="nn">Net</span><span class="p">.</span><span class="n">accept</span> <span class="n">socket</span> <span class="k">in</span>
</span><span class="line">    <span class="n">handle_connection</span> <span class="n">conn</span> <span class="n">handle_request</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We see that <code>handle_connection</code> has no way to share telemetry information between connections,
given that <code>handle_request</code> never stores anything.</p>
<p>We can tell these things after only looking at the code for a few seconds, even though dozens of libraries are being used.
In particular, we didn't have to read <code>handle_connection</code> or any of the HTTP parsing logic.</p>
<p>Now let's enable TLS. For this, we will require a configuration directory containing the server's key:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="o">~</span><span class="n">tls_config</span> <span class="n">net</span> <span class="n">htdocs</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">socket</span> <span class="o">=</span> <span class="nn">Net</span><span class="p">.</span><span class="n">listen</span> <span class="n">net</span> <span class="o">(`</span><span class="nc">Tcp</span> <span class="mi">8443</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">tls_socket</span> <span class="o">=</span> <span class="nn">Tls</span><span class="p">.</span><span class="n">wrap</span> <span class="o">~</span><span class="n">tls_config</span> <span class="n">socket</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">handler</span> <span class="o">=</span> <span class="n">static_files</span> <span class="n">htdocs</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">tls_socket</span> <span class="n">handler</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>OCaml syntax note: I used <code>~</code> to make <code>tls_config</code> a named argument; we wouldn't want to get this directory confused with <code>htdocs</code>!</p>
<p>We can see that only the TLS library gets access to the key.
The HTTP library interacts only with the TLS socket, which presumably does not reveal it.</p>
<p><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/web3.svg" title="Updated graph showing TLS" class="caption"/><span class="caption-text">Updated graph showing TLS</span></span></p>
<p>Notice too how this fixes the problem we had with our original policy enforcement system.
There, an attacker could request <code>https://example.com/../tls_config/server.key</code> and the HTTP server might send the key.
But here, the handler cannot do that even if it wants to.
When <code>handler</code> loads a file, it does so via <code>htdocs</code>, which does not have access to <code>tls_config</code>.</p>
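<p>As a minimal sketch of this argument, the following uses only the OCaml standard library.
It models a directory capability as a closure over an in-memory list of files;
the <code>dir</code> type, <code>make_dir</code> and the file contents are hypothetical stand-ins for illustration (not the real <code>Path</code> API),
and <code>static_files</code> is simplified to take the path as a plain string:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* A directory capability, modelled as a lookup over its own files only.
   Hypothetical stand-in for a real directory handle. *)
type dir = { load : string -&gt; string option }

(* Build a capability from a list of (name, contents) pairs. *)
let make_dir files = { load = (fun name -&gt; List.assoc_opt name files) }

(* As in the post: the handler reaches files only through [dir]. *)
let static_files dir path = dir.load path

let htdocs = make_dir [ (&quot;index.html&quot;, &quot;Hello&quot;) ]
let tls_config = make_dir [ (&quot;server.key&quot;, &quot;SECRET&quot;) ]

(* [handler] closes over [htdocs] only; nothing in it refers to [tls_config]. *)
let handler = static_files htdocs

let () =
  assert (handler &quot;index.html&quot; = Some &quot;Hello&quot;);
  (* In this toy model the traversal attempt is just an unknown name. *)
  assert (handler &quot;../tls_config/server.key&quot; = None);
  (* Only code that was handed [tls_config] can read the key. *)
  assert (tls_config.load &quot;server.key&quot; = Some &quot;SECRET&quot;)
</code></pre></td></tr></tbody></table></div></figure>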
<p>The above server has pretty good security properties,
even though we didn't make any special effort to write secure code.
Security-conscious programmers will try to wrap powerful capabilities (like <code>net</code>)
with less powerful ones (like <code>socket</code>) as early as possible, making the code easier to understand.
A programmer uninterested in readability is likely to mix in more irrelevant code you have to skip through,
but even so it shouldn't take too long to track down where things like <code>net</code> and <code>htdocs</code> end up.
And even if they spread them throughout their entire application,
at least you avoid having to read all the libraries too!</p>
<p>By contrast, consider a more traditional (non-capability) style.
We start with:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">htdocs</span> <span class="o">=</span> <span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, <code>htdocs</code> would be a plain string rather than a reference to a directory,
and the network would be reached through a global.
We can't tell anything about what this server could do from looking at this one line,
and even if we expand it, we won't be able to tell what all the functions it calls do, either.
We will end up having to follow every function call recursively through all of the server's
dependencies, and our analysis will be out of date as soon as any of them changes.</p>
<h3 id="use-at-different-scales">Use at different scales</h3>
<p>We've seen that we can create an over-approximation of the reference graph by looking at just a small part of the code,
and then get a closer bound on the possible effects as needed
by expanding groups of values until we can prove the desired property.
For example, to prove that the application didn't modify <code>htdocs</code>, we followed <code>htdocs</code> by expanding <code>main</code> and then <code>static_files</code>.</p>
<p>Within a single process, a capability is a reference (pointer) to another value in the process's memory.
However, the diagrams also included arrows (capabilities) to things outside of the process, such as directories.
We can regard these as references to privileged proxy functions in the process that make calls to the OS kernel,
or (at a higher level of abstraction) we can consider them to be capabilities to the external resources themselves.</p>
<p>It is possible to build capability operating systems (in fact, this was the first use of capabilities).
Just as we needed to ban global variables to make a safe programming language,
we need to ban global namespaces to make a capability operating system.
For example, on FreeBSD this is done (on a per-process basis) by invoking the <a href="https://man.freebsd.org/cgi/man.cgi?query=cap_enter">cap_enter</a> system call.</p>
<p>We can zoom out even further, and consider a network of computers.
Here, an arrow between machines represents some kind of (unforgeable) network address or connection.
At the IP level, any process can connect to any address, but a capability system can be implemented on top.
<a href="http://www.erights.org/elib/distrib/captp/index.html">CapTP</a> (the Capability Transport Protocol) was an early system for this, but
<a href="https://capnproto.org/rpc.html">Cap'n Proto</a> (Capabilities and Protocols) is the modern way to do it.</p>
<p>So, thinking in terms of capabilities, we can zoom out to look at the security properties of the whole network,
yet still be able to expand groups as needed right down to the level of individual closures in a process.</p>
<h3 id="key-points">Key points</h3>
<ul>
<li>
<p>Library code can be imported and called without it getting access to any pre-existing state,
except that given to it explicitly. There is no &quot;ambient authority&quot; available to the library.</p>
</li>
<li>
<p>A function's side-effects are bounded by its arguments.
We can understand (get a bound on) the behaviour of a function call just by looking at it.</p>
</li>
<li>
<p>If <code>a</code> has access to <code>b</code> and to <code>c</code>, then <code>a</code> can introduce them (e.g. by performing the function call <code>b c</code>).
Note that there is no capability equivalent to making something &quot;world readable&quot;;
to perform an introduction,
you need access to both the resource being granted and to the recipient (&quot;only connectivity begets connectivity&quot;).</p>
</li>
<li>
<p>Instead of passing the <em>name</em> of a resource, we pass a capability reference (pointer) to it,
thereby proving that we have access to it and sharing that access (&quot;no designation without authority&quot;).</p>
</li>
<li>
<p>The caller of a function decides what it should access, and can provide restricted access by wrapping
another capability, or substituting something else entirely.</p>
<p>I am sometimes unable to install a messaging app on my phone because it requires me to grant it
access to my address book.
A capability system should never say &quot;This application requires access to the address book. Continue?&quot;;
it should say &quot;This application requires access to <em>an</em> address book; which would you like to use?&quot;.</p>
</li>
<li>
<p>A capability must behave the same way regardless of who uses it.
When we do <code>f x</code>, <code>f</code> can perform exactly the same operations on <code>x</code> that we can.</p>
<p>It is tempting to add a traditional policy language alongside capabilities for &quot;extra security&quot;,
saying e.g. &quot;<code>f</code> cannot write to <code>x</code>, even if it has a reference to it&quot;.
However, apart from being complicated and annoying,
this creates an incentive for <code>f</code> to smuggle <code>x</code> to another context with more powers.
This is the root cause of many real-world attacks, such as click-jacking or cross-site request forgery,
where a URL permits an attack if a victim visits it, but not if the attacker does.
One of the great benefits of capability systems is that you don't need to worry that someone is trying to trick you
into doing something that you can do but they can't,
because your ability to access the resource they give you comes entirely from them in the first place.</p>
</li>
</ul>
<p>All of the above follow naturally from using functions in the usual way, while avoiding global variables.</p>
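<p>The point about providing restricted access by wrapping another capability can be sketched in a few lines of plain OCaml.
The <code>counter</code> record and the <code>read_only</code> and <code>report</code> helpers are hypothetical names chosen for this example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* A capability to a counter: whoever holds it can read and increment. *)
type counter = { get : unit -&gt; int; incr : unit -&gt; unit }

let make_counter () =
  let n = ref 0 in
  { get = (fun () -&gt; !n); incr = (fun () -&gt; n := !n + 1) }

(* Attenuation: hand out only the read operation. *)
let read_only c = c.get

(* [report] is given only the reader, so it cannot change the counter. *)
let report read = Printf.sprintf &quot;count = %d&quot; (read ())

let () =
  let c = make_counter () in
  c.incr ();
  c.incr ();
  assert (report (read_only c) = &quot;count = 2&quot;)
</code></pre></td></tr></tbody></table></div></figure>
<p>The caller decides how much of <code>c</code> to share: <code>report</code> can perform every operation on the reader that the caller can, but the reader itself offers no way to increment.</p>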
<h2 id="practical-considerations">Practical considerations</h2>
<p>The above discussion argues that capabilities would have been a good way to build systems in an ideal world.
But given that most current operating systems and programming languages have not been designed this way,
how useful is this approach?
I'm currently working on <a href="https://github.com/ocaml-multicore/eio">Eio</a>, an IO library for OCaml, and using these principles to guide the design.
Here are a few thoughts about applying capabilities to a real system.</p>
<h3 id="plumbing-capabilities-everywhere">Plumbing capabilities everywhere</h3>
<p>A lot of people worry about cluttering up their code by having to pass things explicitly everywhere.
This is actually not much of a problem, for a couple of reasons:</p>
<ol>
<li>
<p>We already do this with most things anyway.
If your program uses a database, you probably establish a connection to it at the start and pass the connection around as needed.
You probably also pass around open file handles, configuration settings, HTTP connection pools, arrays, queues, ref-cells, etc.
Handling &quot;the file-system&quot; and &quot;the network&quot; the same way as everything else isn't a big deal.</p>
</li>
<li>
<p>You can often bundle up a capability with something else.
For example, a web-server will likely let the user decide which directory to serve,
so you're already passing around a pathname argument.
Passing a path capability instead is no extra work.</p>
</li>
</ol>
<p>Consider a request handler that takes the address of a Redis server:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="o">(</span><span class="n">handle_request</span> <span class="n">redis_url</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It might seem that by using capabilities we'd need to pass the network in here too:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="o">(</span><span class="n">handle_request</span> <span class="n">net</span> <span class="n">redis_url</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This is both messy and unnecessary.
Instead, <code>handle_request</code> can take a function for connecting to Redis:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="o">(</span><span class="n">handle_request</span> <span class="n">redis</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then there is only one argument to pass around again.
Instead of writing the connection logic in <code>handle_request</code>, we write the same logic outside and just pass in the function.
And now someone looking at the code can see &quot;the handler can connect to Redis&quot;,
rather than the less precise &quot;the handler accesses the network&quot;.
Of course, if Redis required more than one configuration setting then you'd probably already be doing it this way.</p>
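<p>One way this wiring might look, using only the standard library, is sketched below.
The <code>net</code> record, its <code>connect</code> field and the example URL are assumptions made up for illustration; a real network capability would open a socket rather than return a string:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical network capability: it can connect to any address. *)
type net = { connect : string -&gt; string }

(* A fake network for the sketch; it just describes the connection. *)
let net = { connect = (fun addr -&gt; &quot;connection to &quot; ^ addr) }

(* The handler receives only a way to reach Redis, not the whole network. *)
let handle_request redis request =
  let conn = redis () in
  Printf.sprintf &quot;%s served using %s&quot; request conn

let () =
  let redis_url = &quot;redis://localhost:6379&quot; in
  (* The connection logic lives outside the handler. *)
  let redis () = net.connect redis_url in
  let handler = handle_request redis in
  assert (handler &quot;GET /&quot;
          = &quot;GET / served using connection to redis://localhost:6379&quot;)
</code></pre></td></tr></tbody></table></div></figure>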
<p>The main problematic case is providing <em>defaults</em>.
For example, a TLS library might allow us to specify the location of the system's certificate store,
but it would like to provide a default (e.g. <code>/etc/ssl/certs/</code>).
This is particularly important if the default location varies by platform.
If the TLS library decides the location, then we must give it (read-only at least) access to the whole system!
We may just decide to trust the library, or we might separate out the default paths into a trusted package.</p>
<h3 id="levels-of-support">Levels of support</h3>
<p>Ideally, our programming language would provide a secure implementation of capabilities that we could depend on.
That would allow running untrusted code safely and protect us from compromised packages.
However, converting a non-capability language to a capability-secure one isn't easy,
and isn't likely to happen any time soon for OCaml
(but see <a href="https://www.hpl.hp.com/techreports/2006/HPL-2006-116.pdf">Emily</a> for an old proof-of-concept).</p>
<p>Even without that, though, capabilities help to protect non-malicious code from malicious inputs.
For example, the request handler above forgot to sanitise the URL path from the remote client,
but it still can't access anything outside of <code>htdocs</code>.</p>
<p>And even if we don't care about security at all, capabilities make it easy to see what a program does;
they make it easy to test programs by replacing OS resources with mocks;
and preventing access to globals helps to avoid race conditions,
since two functions that access the same resource must be explicitly introduced.</p>
<h3 id="running-on-a-traditional-os">Running on a traditional OS</h3>
<p>A capability OS would let us run a program's <code>main</code> function and provide the capabilities it wanted directly,
but most systems don't work like that.
Instead, each program requires a small trusted entrypoint that has the full privileges of the process.
In Eio, an application will typically start something like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Eio_main</span><span class="p">.</span><span class="n">run</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">env</span> <span class="o">-&gt;</span>
</span><span class="line"><span class="k">let</span> <span class="n">net</span> <span class="o">=</span> <span class="nn">Eio</span><span class="p">.</span><span class="nn">Stdenv</span><span class="p">.</span><span class="n">net</span> <span class="n">env</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">fs</span> <span class="o">=</span> <span class="nn">Eio</span><span class="p">.</span><span class="nn">Stdenv</span><span class="p">.</span><span class="n">fs</span> <span class="n">env</span> <span class="k">in</span>
</span><span class="line"><span class="nn">Eio</span><span class="p">.</span><span class="nn">Path</span><span class="p">.</span><span class="n">with_open_dir</span> <span class="o">(</span><span class="n">fs</span> <span class="o">/</span> <span class="s2">&quot;/srv/www&quot;</span><span class="o">)</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">htdocs</span> <span class="o">-&gt;</span>
</span><span class="line"><span class="n">main</span> <span class="n">net</span> <span class="n">htdocs</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Eio_main.run</code> starts the Eio event loop and then runs the callback.
The <code>env</code> argument gives full access to the process's environment.
Here, the callback extracts network and filesystem access from this,
gets access to just &quot;/srv/www&quot; from <code>fs</code>,
and then calls the <code>main</code> function as before.</p>
<p>Note that <code>Eio_main.run</code> itself is not a capability-safe function (it magics up <code>env</code> from nothing).
A capability-enforcing compiler would flag this bit up as needing to be audited manually.</p>
<h3 id="use-with-existing-security-mechanisms">Use with existing security mechanisms</h3>
<p>Maybe you're not convinced by all this capability stuff.
Traditional security systems are more widely available, better tested, and approved by your employer,
and you want to use those instead.
Still, to write the policy, you're going to need a list of resources the program might access.
Looking at the above code, we can immediately see that the policy need allow access only to the &quot;/srv/www&quot; directory,
and so we could call e.g. <a href="https://man.openbsd.org/unveil">unveil</a> here.
And if <code>main</code> later changes to use TLS,
the type-checker will let us know to update this code to provide the TLS configuration
and we'll know to update the policy at the same time.</p>
<p>If you want to drop privileges, such a program also makes it easy to see when it's safe to do that.
For example, looking at <code>main</code> we can see that <code>net</code> is never used after creating the socket,
so we don't need the <code>bind</code> system call after that,
and we never need <code>connect</code>.
We know, for instance, that this program isn't hiding an XML parser that needs to download schema files to validate documents.</p>
<h3 id="thread-local-storage">Thread-local storage</h3>
<p>In addition to global and local variables, systems often allow us to attach data to threads as a sort of middle ground.
This could allow unexpected interactions. For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="n">f</span> <span class="n">x</span><span class="o">;</span>
</span><span class="line"><span class="n">g</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, we'd expect that <code>g</code> doesn't have access to <code>x</code>, but <code>f</code> could pass it using thread-local storage.
To prevent that, Eio instead provides <a href="https://ocaml-multicore.github.io/eio/eio/Eio/Fiber/index.html#val-with_binding">Fiber.with_binding</a>,
which runs a function with a binding but then puts things back how they were before returning,
so <code>f</code> can't make changes that are still active when <code>g</code> runs.</p>
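<p>The scoping behaviour can be illustrated with a small single-threaded model.
This is a sketch of the idea only, not Eio's implementation: here a key is just a mutable cell,
and <code>create_key</code>, <code>get</code> and <code>with_binding</code> are local definitions that merely echo the names used by <code>Fiber</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* A single-threaded model of a scoped binding; not Eio's implementation. *)
type 'a key = 'a option ref

let create_key () : 'a key = ref None
let get (key : 'a key) = !key

(* Run [fn] with [key] bound to [v], then restore the previous binding,
   even if [fn] raises. *)
let with_binding key v fn =
  let previous = !key in
  key := Some v;
  Fun.protect ~finally:(fun () -&gt; key := previous) fn

let () =
  let key = create_key () in
  let f () = with_binding key 42 (fun () -&gt; assert (get key = Some 42)) in
  let g () = get key in
  f ();
  (* The binding made inside [f] is gone by the time [g] runs. *)
  assert (g () = None)
</code></pre></td></tr></tbody></table></div></figure>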
<p>This also allows people who don't want capabilities to disable the whole system easily:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">everything</span> <span class="o">=</span> <span class="nn">Fiber</span><span class="p">.</span><span class="n">create_key</span> <span class="bp">()</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">f</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">env</span> <span class="o">=</span> <span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="o">(</span><span class="nn">Fiber</span><span class="p">.</span><span class="n">get</span> <span class="n">everything</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">env</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Fiber</span><span class="p">.</span><span class="n">with_binding</span> <span class="n">everything</span> <span class="n">env</span> <span class="n">f</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It looks like <code>f ()</code> doesn't have access to anything, but in fact it can recover <code>env</code> and get access to everything!
However, anyone trying to understand the code will start following <code>env</code> from the main entrypoint
and will then see that it got put in fiber-local storage.
They then at least know that they must read all the code to understand anything about what it can do.</p>
<p>More usefully, this mechanism allows us to make just a few things ambiently available.
For example, we don't want to have to plumb stderr through to a function every time we want to do some <code>printf</code> debugging,
so it makes sense to provide a tracing function this way (and Eio does this by default).
Tracing allows all components to write debug messages, but it doesn't let them read them.
Therefore, it doesn't provide a way for components to communicate with each other.</p>
<p>It might be tempting to use <code>Fiber.with_binding</code> to restrict access to part of a program
(e.g. giving an HTTP server network access this way),
but note that this is a non-capability way to do things,
and suffers the same problems as traditional security systems,
separating designation from authority.
In particular, supposedly sandboxed code in other parts of the application
can try to escape by tricking the HTTP server part into running a callback function for them.
But fiber-local storage is fine for things to which you don't care to restrict access.</p>
<h3 id="symlinks">Symlinks</h3>
<p>Symlinks are a bit of a pain! If I have a capability reference to a directory, it's useful to know that I can only access things beneath that directory. But the directory may contain a symlink that points elsewhere.</p>
<p>One option would be to say that a symlink is a capability itself, but this means that you could only create symlinks to things you can access yourself, and this is quite a restriction. For example, you might be forbidden from extracting a tarball because <code>tar</code> didn't have permission to the target of a symlink it wanted to create.</p>
<p>The other option is to say that symlinks are just strings, and it's up to the user to interpret them.
This is the approach FreeBSD uses. When you use a system call like <code>openat</code>,
you pass a capability to a base directory and a string path relative to that.
In the case of our web-server, we'd use a capability for <code>htdocs</code>, but use strings to reference things inside it, allowing the server to follow symlinks within that sub-tree, but not outside.</p>
<p>The main problem is that it makes the API a bit confusing. Consider:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">save_to</span> <span class="o">(</span><span class="n">htdocs</span> <span class="o">/</span> <span class="s2">&quot;uploads&quot;</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It might look like <code>save_to</code> is only getting access to the &quot;uploads&quot; directory,
but in Eio it actually gets access to the whole of <code>htdocs</code>.
If you want to restrict access, you have to do that explicitly
(as we did when creating <code>htdocs</code> from <code>fs</code>).</p>
<p>The advantage, however, is that we don't break software that relies on symlinks.
Also, restricting access is quite expensive on some systems (FreeBSD has the handy <code>O_BENEATH</code> open flag,
and Linux has <code>RESOLVE_BENEATH</code>, but not all systems provide this), so it might not be a good default.
I'm not completely satisfied with the current API, though.</p>
<h3 id="time-and-randomness">Time and randomness</h3>
<p>It is also possible to use capabilities to restrict access to time and randomness.
The security benefits here are less clear.
Tracking access to time can be useful in preventing side-channel attacks that depend on measuring time accurately,
but controlling access to randomness makes it difficult to e.g. randomise hash functions to
help prevent denial-of-service attacks.</p>
<p>However, controlling access to these does have the advantage of making code deterministic by default,
which is a great benefit, especially for expect-style testing.
Your top-level test function is called with no arguments, and therefore has no access to non-determinism,
instead creating deterministic mocks to use with the code under test.
You can then just record a good trace of a test's operations and check that it doesn't change.</p>
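<p>For instance, a test might hand the code under test a mock clock, along these lines.
The <code>clock</code> record, <code>mock_clock</code> and <code>timed</code> are hypothetical names for this sketch and are not part of any particular library:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical clock capability: the only way the code can observe time. *)
type clock = { now : unit -&gt; float }

(* A mock clock that advances by one second on every reading. *)
let mock_clock () =
  let t = ref 0.0 in
  { now = (fun () -&gt; let v = !t in t := v +. 1.0; v) }

(* Code under test: times a function using whatever clock it was given. *)
let timed clock fn =
  let t0 = clock.now () in
  let result = fn () in
  let t1 = clock.now () in
  (result, t1 -. t0)

let () =
  let clock = mock_clock () in
  (* With the mock, the measured duration is the same on every run. *)
  assert (timed clock (fun () -&gt; 6 * 7) = (42, 1.0))
</code></pre></td></tr></tbody></table></div></figure>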
<h3 id="power-boxes">Power boxes</h3>
<p>Interactive applications that load and save files present a small problem:
since the user might load or save anywhere, it seems they need access to the whole file-system.
The solution is a &quot;powerbox&quot;.
The powerbox has access to the file-system and the rest of the application only has access to the powerbox.
When the application wants to save a file, it asks the powerbox, which pops up a GUI asking the user to choose the location.
Then it opens the file and passes that back to the application.</p>
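<p>A toy model of this arrangement is sketched below.
The file-system is just a hash table, the GUI is replaced by an <code>ask_user</code> function supplied by the caller,
and an open file is modelled as a function that writes to the chosen path; all of these names are assumptions made for the example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* A toy file-system: a mutable table from path to contents. *)
type fs = (string, string) Hashtbl.t

(* The powerbox holds [fs] and a way to ask the user for a location.
   It returns a writer for the chosen file, or [None] if the user cancels. *)
let make_powerbox (fs : fs) ~ask_user () =
  match ask_user () with
  | Some path -&gt; Some (fun contents -&gt; Hashtbl.replace fs path contents)
  | None -&gt; None

(* The application never sees [fs] and does not choose the path itself. *)
let app request_save =
  match request_save () with
  | Some write -&gt; write &quot;my document&quot;; true
  | None -&gt; false

let () =
  let fs : fs = Hashtbl.create 8 in
  (* Stand-in for the GUI chooser: the user always picks this path. *)
  let ask_user () = Some &quot;/home/user/notes.txt&quot; in
  let request_save = make_powerbox fs ~ask_user in
  assert (app request_save);
  assert (Hashtbl.find_opt fs &quot;/home/user/notes.txt&quot; = Some &quot;my document&quot;)
</code></pre></td></tr></tbody></table></div></figure>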
<h2 id="conclusions">Conclusions</h2>
<p>Currently popular security mechanisms are complex and have many shortcomings.
Yet, the lambda calculus already contains an excellent security mechanism,
and making use of it requires little more than avoiding global variables.</p>
<p>This is known as &quot;capability-based security&quot;.
The word &quot;capabilities&quot; has also been used for several unrelated concepts (such as &quot;POSIX capabilities&quot;),
and for clarity much of the community rebranded a while back as &quot;Object Capabilities&quot;,
but this can make it seem irrelevant to functional programmers.
In fact, I wrote this blog post because several OCaml programmers have asked me what the point of capabilities is.
I was expecting it to be quite short (basically: applying functions to arguments good, global variables bad),
but it's got quite long; it seems there is a fair bit that follows from this simple idea!</p>
<p>Instead of seeing security as an extra layer that runs separately from the code and tries to guess what it meant to do,
capabilities fit naturally into the language.
The key difference with traditional security is that
the ability to do something depends on the reference used to do it, not on the identity of the caller.
This way of thinking about security works not only for controlling access to resources within a single program,
but also for controlling interactions between processes running on a machine, and between machines on a network.
We can group together resources and zoom out to see the overall picture, or expand groups to zoom in and get a closer
bound on the behaviour.</p>
<p>Even ignoring security, a key question is: what can a function do?
Should a function call be able to do anything at all that the process can do,
or should its behaviour be bounded in some way that is obvious just by looking at it?
If we say that you must read the source code of a function to see what it does, then this applies recursively:
we must also read all the functions that it calls, and so on.
To understand the <code>main</code> function, we end up having to read the code of every library it uses!</p>
<p>If you want to read more,
the <a href="http://habitatchronicles.com/2017/05/what-are-capabilities/">What Are Capabilities?</a> blog post provides a good overview;
Part II of <a href="https://papers.agoric.com/papers/robust-composition/abstract/">Robust Composition</a> contains a longer explanation;
<a href="https://srl.cs.jhu.edu/pubs/SRL2003-02.pdf">Capability Myths Demolished</a> does a good job of enumerating security properties provided by capabilities;
my own <a href="https://roscidus.com/blog/about/#the-serscis-access-modeller">SERSCIS Access Modeller</a> paper shows how to analyse systems
where some components have unknown behaviour; and, for historical interest, see
Dennis and Van Horn's 1966 <a href="https://dl.acm.org/doi/pdf/10.1145/365230.365252">Programming Semantics for Multiprogrammed Computations</a>, which introduced the idea.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Isolating Xwayland in a VM</title>
    <link href="https://roscidus.com/blog/blog/2021/10/30/xwayland/"></link>
    <updated>2021-10-30T10:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2021/10/30/xwayland</id>
    <content type="html"><![CDATA[<p>In my last post, <a href="/blog/blog/2021/03/07/qubes-lite-with-kvm-and-wayland/">Qubes-lite with KVM and Wayland</a>, I described setting up a Qubes-inspired Linux system that runs applications in virtual machines. A Wayland proxy running in each VM connects its applications to the host Wayland compositor over virtwl, allowing them to appear on the desktop alongside normal host applications. In this post, I extend this to support X11 applications using Xwayland.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#overview">Overview</a>
</li>
<li><a href="#introduction-to-x11">Introduction to X11</a>
</li>
<li><a href="#running-xwayland">Running Xwayland</a>
</li>
<li><a href="#the-x11-protocol">The X11 protocol</a>
</li>
<li><a href="#initialising-the-window-manager">Initialising the window manager</a>
</li>
<li><a href="#windows">Windows</a>
</li>
<li><a href="#performance">Performance</a>
</li>
<li><a href="#pointer-events">Pointer events</a>
</li>
<li><a href="#keyboard-events">Keyboard events</a>
</li>
<li><a href="#pointer-cursor">Pointer cursor</a>
</li>
<li><a href="#selections">Selections</a>
</li>
<li><a href="#drag-and-drop">Drag-and-drop</a>
</li>
<li><a href="#bonus-features">Bonus features</a>
<ul>
<li><a href="#hidpi-works">HiDPI works</a>
</li>
<li><a href="#ring-buffer-logging">Ring-buffer logging</a>
</li>
<li><a href="#vim-windows-open-correctly">Vim windows open correctly</a>
</li>
<li><a href="#copy-and-paste-without-m-characters">Copy-and-paste without ^M characters</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=29645743">Hacker News</a>.)</p>
<h2 id="overview">Overview</h2>
<p>A graphical desktop typically allows running multiple applications on a single display
(e.g. by showing each application in a separate window).
Client applications connect to a server process (usually on the same machine) and ask it to display their windows.</p>
<p>Until recently, this service was an <em>X server</em>, and applications would communicate with it using the X11 protocol.
However, on newer systems the display is managed by a <em>Wayland compositor</em>, using the Wayland protocol.</p>
<p>Many older applications haven't been updated yet.
<a href="https://wayland.freedesktop.org/docs/html/ch05.html">Xwayland</a> can be used to allow unmodified X11 applications to run in a Wayland desktop environment.
However, setting this up wasn't as easy as I'd hoped.
Ideally, Xwayland would completely isolate the Wayland compositor from needing to know anything about X11:</p>
<p><span class="caption-wrapper center"><img src="/blog/images/xwayland/fantasy-xwayland.png" title="Fantasy Xwayland architecture" class="caption"/><span class="caption-text">Fantasy Xwayland architecture</span></span></p>
<p>However, it doesn't work like this.
Xwayland handles X11 drawing operations, but it doesn't handle lots of other details, including window management (e.g. telling the Wayland compositor what the window title should be), copy-and-paste, and selections.
Instead, the Wayland compositor is supposed to connect back to Xwayland over the X11 protocol and act as an X11 window manager to provide the missing features:</p>
<p><span class="caption-wrapper center"><img src="/blog/images/xwayland/real-xwayland.png" title="Actual Xwayland architecture" class="caption"/><span class="caption-text">Actual Xwayland architecture</span></span></p>
<p>This is a problem for several reasons:</p>
<ol>
<li>It means that every Wayland compositor has to implement not only the new Wayland protocol, but also the old X11 protocol.
</li>
<li>The compositor is part of the trusted computing base (it sees all your keystrokes and window contents),
and X11 support adds a whole load of legacy code that you'd need to audit to have confidence in it.
</li>
<li>It doesn't work when running applications in VMs,
because each VM needs its own Xwayland service and existing compositors can only manage one.
</li>
</ol>
<p>Because Wayland (unlike X11) doesn't allow applications to mess with other applications' windows,
we can't have a third-party application act as the X11 window manager.
It wouldn't have any way to ask the compositor to put Xwayland's surfaces into a window frame, because Xwayland is a separate application.</p>
<p>There is another way to do it, however.
As I mentioned in the last post,
I already had to write a Wayland proxy (<a href="https://github.com/talex5/wayland-proxy-virtwl">wayland-proxy-virtwl</a>) to run in each VM
and relay Wayland messages over virtwl, so I decided to extend it to handle Xwayland too.
As a bonus, the proxy can also be used even without VMs, avoiding the need for any X11 support in Wayland compositors at all.
In fact, I found that doing this avoided several bugs in Sway's built-in Xwayland support.</p>
<p><a href="https://chromium.googlesource.com/chromiumos/platform2/+/refs/heads/main/vm_tools/sommelier/">Sommelier</a> already has support for this, but it doesn't work for the applications I want to use.
For example, popup menus appear in the center of the screen, text selections don't work, and it generally crashes after a few seconds (often with the error <code>xdg_surface has never been configured</code>).
So instead I'd been using <code>ssh -Y vm</code> from the host to forward X11 connections to the host's Xwayland,
managed by Sway.
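That works, but it's not at all secure: <code>-Y</code> gives the VM's applications trusted access to the host's X server, where they can spy on other X11 applications.</p>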
<h2 id="introduction-to-x11">Introduction to X11</h2>
<p>Unlike Wayland, where applications are mostly unaware of each other, X is much more collaborative.
The X server maintains a tree of windows (rectangles) and the applications manipulate it.
The root of the tree is called the <em>root window</em> and fills the screen.
You can see the tree using the <code>xwininfo</code> command, like this:</p>
<pre><code>$ xwininfo -tree -root

xwininfo: Window id: 0x47 (the root window) (has no name)

  Root window id: 0x47 (the root window) (has no name)
  Parent window id: 0x0 (none)
     9 children:
     0x800112 &quot;~/Projects/wayland/wayland-proxy-virtwl&quot;: (&quot;ROX-Filer&quot; &quot;ROX-Filer&quot;)  2184x2076+0+0  +0+0
        1 child:
        0x800113 (has no name): ()  1x1+-1+-1  +-1+-1
     0x800123 (has no name): ()  1x1+-1+-1  +-1+-1
     0x800003 &quot;ROX-Filer&quot;: ()  10x10+-100+-100  +-100+-100
     0x800001 &quot;ROX-Filer&quot;: (&quot;ROX-Filer&quot; &quot;ROX-Filer&quot;)  10x10+10+10  +10+10
        1 child:
        0x800002 (has no name): ()  1x1+-1+-1  +9+9
     0x600002 &quot;main.ml (~/Projects/wayland/wayland-proxy-virtwl) - GVIM1&quot;: (&quot;gvim&quot; &quot;Gvim&quot;)  1648x1012+0+0  +0+0
        1 child:
        0x600003 (has no name): ()  1x1+-1+-1  +-1+-1
     0x600007 (has no name): ()  1x1+-1+-1  +-1+-1
     0x600001 &quot;Vim&quot;: (&quot;gvim&quot; &quot;Gvim&quot;)  10x10+10+10  +10+10
     0x200002 (has no name): ()  1x1+0+0  +0+0
     0x200001 (has no name): ()  1x1+0+0  +0+0
</code></pre>
<p>This tree shows the windows of two X11 applications, ROX-Filer and GVim,
as well as various invisible utility windows (mostly 1x1 or 10x10 pixels in size).</p>
<p>Applications can create, move, resize and destroy windows, draw into them, and request events from them.
The X server also allows arbitrary data to be attached to windows in <em>properties</em>.
You can see a window's properties with <code>xprop</code>. Here are some of the properties on the GVim window:</p>
<pre><code>$ xprop -id 0x600002
WM_HINTS(WM_HINTS):
		Client accepts input or input focus: True
		Initial state is Normal State.
		window id # of group leader: 0x600001
_NET_WM_WINDOW_TYPE(ATOM) = _NET_WM_WINDOW_TYPE_NORMAL
WM_NORMAL_HINTS(WM_SIZE_HINTS):
		program specified minimum size: 188 by 59
		program specified base size: 188 by 59
		window gravity: NorthWest
WM_CLASS(STRING) = &quot;gvim&quot;, &quot;Gvim&quot;
WM_NAME(STRING) = &quot;main.ml (~/Projects/wayland/wayland-proxy-virtwl) - GVIM1&quot;
...
</code></pre>
<p>The X server itself doesn't know anything about e.g. window title bars.
Instead, a <em>window manager</em> process connects and handles that.
A window manager is just another X11 application.
It asks to be notified when an application tries to show (&quot;map&quot;) a window inside the root,
and when that happens it typically creates a slightly larger window (with room for the title bar, etc.)
and moves the other application's window inside that.</p>
<p>This design gives X a lot of flexibility.
All kinds of window managers have been implemented, without needing to change the X server itself.
However, it is very bad for security. For example:</p>
<ol>
<li>Open an xterm.
</li>
<li>Use <code>xwininfo</code> to find its window ID (you need the nested child window, not the top-level one).
</li>
<li>Run <code>xev -id 0x80001b -event keyboard</code> in another window (using the ID you got above).
</li>
<li>Use <code>sudo</code> or similar inside <code>xterm</code> and enter a password.
</li>
</ol>
<p>As you type the password into <code>xterm</code>, you should see the characters being captured by <code>xev</code>.
An X application can easily spy on another application, send it synthetic events, etc.</p>
<h2 id="running-xwayland">Running Xwayland</h2>
<p>Xwayland is a version of the <a href="https://www.x.org/wiki/">xorg</a> X server that treats Wayland as its display hardware.
If you run it as e.g. <code>Xwayland :1</code> then it opens a single Wayland window corresponding to the X root window,
and you can use it as a nested desktop.
This isn't very useful, because the X11 windows are confined to that nested desktop and don't fit in with the rest of your desktop.
Instead, it is normally used in <em>rootless</em> mode, where each child of the X root window may have its own Wayland window.</p>
<pre><code>$ WAYLAND_DEBUG=1 Xwayland :1 -rootless
[3991465.523]  -&gt; wl_display@1.get_registry(new id wl_registry@2)
[3991465.531]  -&gt; wl_display@1.sync(new id wl_callback@3)
...
</code></pre>
<p>When run this way, however, no windows actually appear.
If we run <code>DISPLAY=:1 xterm</code> then we see Xwayland creating some buffers, but no surfaces:</p>
<pre><code>[4076460.506]  -&gt; wl_shm@4.create_pool(new id wl_shm_pool@15, fd 9, 540)
[4076460.520]  -&gt; wl_shm_pool@15.create_buffer(new id wl_buffer@24, 0, 9, 15, 36, 0)
[4076460.526]  -&gt; wl_shm_pool@15.destroy()
...
</code></pre>
<p>We need to run Xwayland as <code>Xwayland :1 -rootless -wm FD</code>, where FD is a socket we will use to speak the X11 protocol and act as a window manager.</p>
<p>It's a little hard to find information about Xwayland's rootless mode, because &quot;rootless&quot; has two separate common meanings in xorg:</p>
<ol>
<li>Running xorg without root privileges.
</li>
<li>Using xorg's miext/rootless extension to display application windows on some other desktop.
</li>
</ol>
<p>After a while, it became clear that Xwayland's rootless mode isn't either of these, but a third xorg feature also called &quot;rootless&quot;.</p>
<h2 id="the-x11-protocol">The X11 protocol</h2>
<p><a href="https://xcb.freedesktop.org/">libxcb</a> provides C bindings to the X11 protocol, but I wanted to program in OCaml.
Luckily, the <a href="https://www.x.org/releases/X11R7.7/doc/xproto/x11protocol.html">X11 protocol</a> is well documented, and generating the messages directly didn't look any harder than binding libxcb,
so I wrote a little OCaml library to do this (<a href="https://github.com/talex5/wayland-proxy-virtwl/blob/master/x11/x11.mli">ocaml-x11</a>).</p>
<p>At first, I hard-coded the messages. For example, here's the code to delete a property on a window:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Delete</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="o">[%%</span><span class="n">cstruct</span>
</span><span class="line">    <span class="k">type</span> <span class="n">req</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">      <span class="n">window</span> <span class="o">:</span> <span class="n">uint32_t</span><span class="o">;</span>
</span><span class="line">      <span class="n">property</span> <span class="o">:</span> <span class="n">uint32_t</span><span class="o">;</span>
</span><span class="line">    <span class="o">}</span> <span class="o">[@@</span><span class="n">little_endian</span><span class="o">]</span>
</span><span class="line">  <span class="o">]</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">send</span> <span class="n">t</span> <span class="n">window</span> <span class="n">property</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">Request</span><span class="p">.</span><span class="n">send_only</span> <span class="n">t</span> <span class="o">~</span><span class="n">major</span><span class="o">:</span><span class="mi">19</span> <span class="n">sizeof_req</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">r</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">set_req_window</span> <span class="n">r</span> <span class="n">window</span><span class="o">;</span>
</span><span class="line">    <span class="n">set_req_property</span> <span class="n">r</span> <span class="n">property</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I'm using the <a href="https://github.com/mirage/ocaml-cstruct">cstruct</a> syntax extension to let me define the exact layout of the message body.
Here, it generates <code>sizeof_req</code>, <code>set_req_window</code> and <code>set_req_property</code> automatically.</p>
<p>After a bit, I discovered that there are XML files in <a href="https://gitlab.freedesktop.org/xorg/proto/xcbproto">xcbproto</a> describing the X11 protocol.
The repository also provides a Python library for parsing the XML,
which you can use by writing a Python script for your language of choice.
For example, this <a href="https://gitlab.freedesktop.org/xorg/lib/libxcb/-/blob/master/src/c_client.py">glorious 3394-line Python script</a>
generates the C bindings.
After studying this script carefully, I decided that hard-coding everything wasn't so bad after all.</p>
<p>I ended up having to implement more messages than I expected,
including some surprising ones like <code>OpenFont</code> (see <a href="https://github.com/talex5/wayland-proxy-virtwl/blob/master/x11/x11.mli">x11.mli</a> for the final list).
My implementation came to 1754 lines of OCaml,
which is quite a bit shorter than the Python generator script,
so I guess I still came out ahead!</p>
<p>In the X11 protocol, client applications send <em>requests</em> and the server sends <em>replies</em>, <em>errors</em> and <em>events</em>.
Most requests don't produce replies, but can produce errors.
Replies and errors are returned in request order, so if you see a response to a later request, you know all previous ones succeeded.
If you care about whether a request succeeded, you may need to send a dummy message that generates a reply after it.
Since message sequence numbers are 16-bit, after sending 0xffff consecutive requests without replies,
you should send a dummy one with a reply to resynchronise
(but window management involves lots of round-trips, so this isn't likely to be a problem for us).
Events can be sent by the server at any time.</p>
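<p>To see why the 16-bit limit matters, here is a minimal sketch (not taken from ocaml-x11; <code>widen</code> is a hypothetical helper) of how a client might recover a full request number from the 16-bit value on the wire. It picks the first request at or after the last one known to be handled, which is only unambiguous if fewer than 0x10000 requests have gone by since then:</p>
<pre><code class="ocaml">(* [last] is the full number of the most recent request known to be handled.
   [wire] is the 16-bit sequence number from a reply, error or event.
   Returns the smallest full number that is at least [last] and whose
   low 16 bits match [wire]. *)
let widen ~last wire =
  last + ((wire - last) land 0xffff)
</code></pre>
<p>For example, <code>widen ~last:0x1fffe 0x0003</code> gives <code>0x20003</code>.</p>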
<p>Unlike Wayland, which is very regular, X11 has various quirks.
For example, every event has a sequence number at offset 2, except for <code>KeymapNotify</code>.</p>
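<p>As an illustration, a parser for the fixed 32-byte event format could handle this special case as below. This is only a sketch, not the code from ocaml-x11: the function names are made up, and it assumes a little-endian connection (as in the <code>Delete</code> example above) and that <code>KeymapNotify</code> is event code 11, as listed in the protocol document.</p>
<pre><code class="ocaml">(* Hypothetical helpers, not part of ocaml-x11. *)

(* Event code of KeymapNotify (assumed from the protocol document). *)
let keymap_notify = 11

(* The first byte of an event holds its code.
   The top bit is set for events generated by SendEvent, so mask it off. *)
let event_code ev = Char.code (Bytes.get ev 0) land 0x7f

(* The 16-bit sequence number at offset 2, if this kind of event has one. *)
let sequence_number ev =
  if event_code ev = keymap_notify then None
  else Some (Bytes.get_uint16_le ev 2)
</code></pre>
<p>Returning an <code>option</code> forces every caller to deal with the one event that has no sequence number.</p>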
<h2 id="initialising-the-window-manager">Initialising the window manager</h2>
<p>Using <code>Xwayland -wm FD</code> initially prevents any client applications from connecting,
because Xwayland waits for the window manager to be ready before accepting any client connections.</p>
<p>To fix that, we need to claim ownership of the <code>WM_S0</code> <em>selection</em>.
A &quot;selection&quot; is something that can be owned by only one application at a time.
Selections were originally used to track ownership of the currently-selected text, and later also used for the clipboard.
<code>WM_S0</code> means &quot;Window Manager for Screen 0&quot; (Xwayland only has one screen).</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(* Become the window manager. This allows other clients to connect. *)</span>
</span><span class="line"><span class="k">let</span><span class="o">*</span> <span class="n">wm_sn</span> <span class="o">=</span> <span class="n">intern</span> <span class="n">t</span> <span class="o">~</span><span class="n">only_if_exists</span><span class="o">:</span><span class="bp">false</span> <span class="o">(</span><span class="s2">&quot;WM_S&quot;</span> <span class="o">^</span> <span class="n">string_of_int</span> <span class="n">i</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line"><span class="nn">X11</span><span class="p">.</span><span class="nn">Selection</span><span class="p">.</span><span class="n">set_owner</span> <span class="n">x11</span> <span class="o">~</span><span class="n">owner</span><span class="o">:(</span><span class="nc">Some</span> <span class="n">root</span><span class="o">)</span> <span class="o">~</span><span class="n">timestamp</span><span class="o">:`</span><span class="nc">CurrentTime</span> <span class="n">wm_sn</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Instead of passing things like <code>WM_S0</code> as strings in each request, X11 requires us to first <em>intern</em> the string.
This returns a unique 32-bit ID for it, which we use in future messages.
Because <code>intern</code> may require a round-trip to the server, it returns a promise,
and so we use <code>let*</code> instead of <code>let</code> to wait for that to resolve before continuing.
<code>let*</code> is defined in the <code>Lwt.Syntax</code> module, as an alternative to the more traditional <code>&gt;&gt;=</code> notation.</p>
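<p>To show what the <code>let*</code> notation does without depending on Lwt, here is a small self-contained sketch that defines the same operator for OCaml's <code>option</code> type instead of for promises (<code>add_strings</code> is just an illustrative name). Each <code>let*</code> runs the rest of the function only if the value on its right-hand side is available:</p>
<pre><code class="ocaml">(* A stand-in for Lwt's operator: bind for options rather than promises. *)
let ( let* ) = Option.bind

(* Parse two integers and add them, giving up if either fails to parse. *)
let add_strings a b =
  let* x = int_of_string_opt a in
  let* y = int_of_string_opt b in
  Some (x + y)
</code></pre>
<p>Here, <code>add_strings &quot;2&quot; &quot;3&quot;</code> returns <code>Some 5</code>, while <code>add_strings &quot;2&quot; &quot;x&quot;</code> returns <code>None</code> without evaluating the addition. With Lwt the shape is the same, except that the rest of the function runs later, once the promise has resolved.</p>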
<p>Owning <code>WM_S0</code> lets our clients connect. However, Xwayland still isn't creating any Wayland surfaces.
By reading the Sommelier code and stepping through Xwayland with a debugger, I found that I needed to enable the <a href="https://www.x.org/wiki/guide/extensions/">Composite</a> extension.</p>
<p>Composite was originally intended to speed up redraw operations, by having the server keep a copy of every top-level window's pixels
(even when obscured), so that when you move a window it can draw it right away without asking the application for help.
The application's drawing operations go to the window's buffer, and then the buffer is copied to the screen, either automatically by the X server
or manually by the window manager.
Xwayland reuses this mechanism, by turning each window buffer into a Wayland surface.
We just need to turn that on:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span><span class="o">*</span> <span class="n">composite</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Composite</span><span class="p">.</span><span class="n">init</span> <span class="n">x11</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span><span class="o">*</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Composite</span><span class="p">.</span><span class="n">redirect_subwindows</span> <span class="n">composite</span> <span class="o">~</span><span class="n">window</span><span class="o">:</span><span class="n">root</span> <span class="o">~</span><span class="n">update</span><span class="o">:`</span><span class="nc">Manual</span> <span class="k">in</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that every child of the root window should use this system.
Finally, we see Xwayland creating Wayland surfaces:</p>
<pre><code>-&gt; wl_compositor@5.create_surface id:+28
</code></pre>
<p>Now we just need to make them appear on the screen!</p>
<h2 id="windows">Windows</h2>
<p>As usual for Wayland, we need to create a role object and attach it to the surface.
This tells Wayland whether the surface is a window or a dialog, for example, and lets us set the title, etc.</p>
<p>But first we have a problem: we need to know which X11 window corresponds to each Wayland surface.
For example, we need the title, which is stored in a property on the X11 window.
Xwayland tells us by sending the new window a <em>ClientMessage</em> event of type <code>WL_SURFACE_ID</code> containing the Wayland ID.
We don't get this message by default, but it seems that selecting <code>SubstructureRedirect</code> on the root does the trick.</p>
<p><code>SubstructureRedirect</code> is used by window managers to intercept attempts by other applications to change the children of the root window.
When an application asks the server to e.g. map a window, the server just forwards the request to the window manager.
Operations performed by the window manager itself do not get redirected, so it can just perform the same request the client wanted, or
make any changes it requires.</p>
<p>In our case, we don't actually need to modify the request, so we just re-perform the original <code>map</code> operation:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">event_handler</span> <span class="o">=</span> <span class="k">object</span> <span class="o">(_</span> <span class="o">:</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Event</span><span class="p">.</span><span class="n">handler</span><span class="o">)</span>
</span><span class="line">  <span class="k">method</span> <span class="n">map_request</span> <span class="o">~</span><span class="n">window</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">map</span> <span class="n">x11</span> <span class="n">window</span>
</span><span class="line">
</span><span class="line">  <span class="k">method</span> <span class="n">client_message</span> <span class="o">~</span><span class="n">window</span> <span class="o">~</span><span class="n">ty</span> <span class="n">body</span> <span class="o">=</span>
</span><span class="line">      <span class="k">if</span> <span class="n">ty</span> <span class="o">=</span> <span class="n">wl_surface_id</span> <span class="k">then</span> <span class="o">(</span>
</span><span class="line">        <span class="k">let</span> <span class="n">wayland_id</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="nn">LE</span><span class="p">.</span><span class="n">get_uint32</span> <span class="n">body</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line">        <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="s2">&quot;X window %a corresponds to Wayland surface %ld&quot;</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">pp</span> <span class="n">window</span> <span class="n">wayland_id</span><span class="o">);</span>
</span><span class="line">        <span class="n">pair_when_ready</span> <span class="o">~</span><span class="n">x11</span> <span class="n">t</span> <span class="n">window</span> <span class="n">wayland_id</span>
</span><span class="line">      <span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Having two separate connections to Xwayland is quite annoying, because messages can arrive in any order.
We might get the X11 <code>ClientMessage</code> first and need to wait for the Wayland <code>create_surface</code>, or we might get the <code>create_surface</code> first
and need to wait for the <code>ClientMessage</code>.</p>
<p>An added complication is that not all Wayland surfaces correspond to X11 windows.
For example, Xwayland also creates surfaces representing cursor shapes, and these don't have X11 windows.
However, when we get the <code>ClientMessage</code> we <em>can</em> be sure that a Wayland message is on the way,
so I just pause the X11 event handling until that has arrived:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(* We got an X11 message saying X11 [window] corresponds to Wayland surface [wayland_id].</span>
</span><span class="line"><span class="c">   Turn [wayland_id] into an xdg_surface. If we haven&#39;t seen that surface yet, wait until it appears</span>
</span><span class="line"><span class="c">   on the Wayland socket. *)</span>
</span><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">pair_when_ready</span> <span class="o">~</span><span class="n">x11</span> <span class="n">t</span> <span class="n">window</span> <span class="n">wayland_id</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="nn">Hashtbl</span><span class="p">.</span><span class="n">find_opt</span> <span class="n">t</span><span class="o">.</span><span class="n">unpaired</span> <span class="n">wayland_id</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="s2">&quot;Unknown Wayland object %ld; waiting for surface to be created...&quot;</span> <span class="n">wayland_id</span><span class="o">);</span>
</span><span class="line">    <span class="k">let</span><span class="o">*</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Lwt_condition</span><span class="p">.</span><span class="n">wait</span> <span class="n">t</span><span class="o">.</span><span class="n">unpaired_added</span> <span class="k">in</span>
</span><span class="line">    <span class="n">pair_when_ready</span> <span class="o">~</span><span class="n">x11</span> <span class="n">t</span> <span class="n">window</span> <span class="n">wayland_id</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="o">{</span> <span class="n">client_surface</span> <span class="o">=</span> <span class="o">_;</span> <span class="n">host_surface</span><span class="o">;</span> <span class="n">set_configured</span> <span class="o">}</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="s2">&quot;Setting up Wayland surface %ld using X11 window %a&quot;</span> <span class="n">wayland_id</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Xid</span><span class="p">.</span><span class="n">pp</span> <span class="n">window</span><span class="o">);</span>
</span><span class="line">    <span class="nn">Hashtbl</span><span class="p">.</span><span class="n">remove</span> <span class="n">t</span><span class="o">.</span><span class="n">unpaired</span> <span class="n">wayland_id</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Lwt</span><span class="p">.</span><span class="n">async</span> <span class="o">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="n">pair</span> <span class="n">t</span> <span class="o">~</span><span class="n">set_configured</span> <span class="o">~</span><span class="n">host_surface</span> <span class="n">window</span><span class="o">);</span>
</span><span class="line">    <span class="nn">Lwt</span><span class="p">.</span><span class="n">return_unit</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Another complication is that Wayland doesn't allow you to attach a buffer to a surface until the window has been &quot;configured&quot;.
Doing so is a protocol error, and Sway will disconnect us if we try!
But Xwayland likes to attach the buffer immediately after creating the surface.</p>
<p>To avoid this, I use a queue:</p>
<ol>
<li>Xwayland asks to create a surface.
</li>
<li>We forward this to Sway, add its ID to the <code>unpaired</code> map, and create a queue for further events.
</li>
<li>Xwayland asks us to attach a buffer, etc. We just queue these up.
</li>
<li>We get the <code>ClientMessage</code> over the X11 connection and create a role for the new surface.
</li>
<li>Sway sends us a <code>configure</code> event, confirming it's ready for the buffer.
</li>
<li>We forward the queued events.
</li>
</ol>
<p>However, this creates a new problem: if the surface isn't a window then the events will be queued forever.
To fix that, when we get a <code>create_surface</code> we also do a round-trip on the X11 connection.
If the surface is still unpaired when that returns then we know that no <code>ClientMessage</code> is coming, and we flush the queue.</p>
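<p>The queueing scheme above can be modelled in a few lines. This is a simplified sketch rather than the proxy's actual code: the type and function names are hypothetical, and messages are plain strings instead of Wayland requests.</p>
<pre><code class="ocaml">(* A hypothetical model of the per-surface queue. *)
type surface = {
  ready : bool ref;          (* true once it is safe to forward requests *)
  pending : string Queue.t;  (* requests held back until then *)
  sent : string Queue.t;     (* stands in for the connection to the compositor *)
}

let create () =
  { ready = ref false; pending = Queue.create (); sent = Queue.create () }

(* Called for each request Xwayland makes on this surface. *)
let request s msg =
  if !(s.ready) then Queue.add msg s.sent
  else Queue.add msg s.pending

(* Called when the compositor sends configure, or when the X11 round-trip
   returns and the surface is still unpaired. *)
let flush s =
  s.ready := true;
  Queue.transfer s.pending s.sent
</code></pre>
<p>Requests made before <code>flush</code> are delivered in their original order once it is called; later ones pass straight through.</p>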
<p>X applications like to create dummy windows for various purposes (e.g. receiving clipboard data),
and we need to avoid showing those.
They're normally set as <code>override_redirect</code> so the window manager doesn't handle them,
but Xwayland redirects them anyway (it needs to because otherwise e.g. tooltips wouldn't appear at all).
I'm trying various heuristics to detect these dummy windows, e.g. that <code>override_redirect</code> windows with a size of 1x1 shouldn't be shown.</p>
<p>If Sway asks us to close a window, we need to relay that to the X application using the <code>WM_DELETE_WINDOW</code> protocol,
if it supports that:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">toplevel</span> <span class="o">=</span> <span class="nn">Xdg_surface</span><span class="p">.</span><span class="n">get_toplevel</span> <span class="n">xdg_surface</span> <span class="o">@@</span> <span class="k">object</span>
</span><span class="line">    <span class="k">inherit</span> <span class="o">[_]</span> <span class="nn">Xdg_toplevel</span><span class="p">.</span><span class="n">v1</span>
</span><span class="line">
</span><span class="line">    <span class="k">method</span> <span class="n">on_close</span> <span class="o">_</span> <span class="o">=</span>
</span><span class="line">      <span class="nn">Lwt</span><span class="p">.</span><span class="n">async</span> <span class="o">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">          <span class="k">let</span><span class="o">*</span> <span class="n">x11</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">x11</span> <span class="k">in</span>
</span><span class="line">          <span class="k">let</span><span class="o">*</span> <span class="n">wm_protocols</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Atom</span><span class="p">.</span><span class="n">intern</span> <span class="n">x11</span> <span class="s2">&quot;WM_PROTOCOLS&quot;</span>
</span><span class="line">          <span class="ow">and</span><span class="o">*</span> <span class="n">wm_delete_window</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Atom</span><span class="p">.</span><span class="n">intern</span> <span class="n">x11</span> <span class="s2">&quot;WM_DELETE_WINDOW&quot;</span> <span class="k">in</span>
</span><span class="line">          <span class="k">let</span><span class="o">*</span> <span class="n">protocols</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Property</span><span class="p">.</span><span class="n">get_atoms</span> <span class="n">x11</span> <span class="n">window</span> <span class="n">wm_protocols</span> <span class="k">in</span>
</span><span class="line">          <span class="k">if</span> <span class="nn">List</span><span class="p">.</span><span class="n">mem</span> <span class="n">wm_delete_window</span> <span class="n">protocols</span> <span class="k">then</span> <span class="o">(</span>
</span><span class="line">            <span class="k">let</span> <span class="n">data</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">create</span> <span class="mi">8</span> <span class="k">in</span>
</span><span class="line">            <span class="nn">Cstruct</span><span class="p">.</span><span class="nn">LE</span><span class="p">.</span><span class="n">set_uint32</span> <span class="n">data</span> <span class="mi">0</span> <span class="o">(</span><span class="n">wm_delete_window</span> <span class="o">:&gt;</span> <span class="n">int32</span><span class="o">);</span>
</span><span class="line">            <span class="nn">Cstruct</span><span class="p">.</span><span class="nn">LE</span><span class="p">.</span><span class="n">set_uint32</span> <span class="n">data</span> <span class="mi">4</span> <span class="mi">0</span><span class="n">l</span><span class="o">;</span>
</span><span class="line">            <span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">send_client_message</span> <span class="n">x11</span> <span class="n">window</span> <span class="o">~</span><span class="n">fmt</span><span class="o">:</span><span class="mi">32</span> <span class="o">~</span><span class="n">propagate</span><span class="o">:</span><span class="bp">false</span> <span class="o">~</span><span class="n">event_mask</span><span class="o">:</span><span class="mi">0</span><span class="n">l</span> <span class="o">~</span><span class="n">ty</span><span class="o">:</span><span class="n">wm_protocols</span> <span class="n">data</span><span class="o">;</span>
</span><span class="line">          <span class="o">)</span> <span class="k">else</span> <span class="o">(</span>
</span><span class="line">            <span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">destroy</span> <span class="n">x11</span> <span class="n">window</span>
</span><span class="line">          <span class="o">)</span>
</span><span class="line">        <span class="o">)</span>
</span><span class="line">  <span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Wayland defaults to using client-side decorations (where the application draws its own window decorations).
X doesn't do that, so we need to turn it off (if the Wayland compositor supports the decoration manager extension):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">t</span><span class="o">.</span><span class="n">decor_mgr</span> <span class="o">|&gt;</span> <span class="nn">Option</span><span class="p">.</span><span class="n">iter</span> <span class="o">(</span><span class="k">fun</span> <span class="n">decor_mgr</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">decor</span> <span class="o">=</span> <span class="nn">Xdg_decor_mgr</span><span class="p">.</span><span class="n">get_toplevel_decoration</span> <span class="n">decor_mgr</span> <span class="o">~</span><span class="n">toplevel</span> <span class="o">@@</span> <span class="k">object</span>
</span><span class="line">        <span class="k">inherit</span> <span class="o">[_]</span> <span class="nn">Xdg_decoration</span><span class="p">.</span><span class="n">v1</span>
</span><span class="line">        <span class="k">method</span> <span class="n">on_configure</span> <span class="o">_</span> <span class="o">~</span><span class="n">mode</span><span class="o">:_</span> <span class="o">=</span> <span class="bp">()</span>
</span><span class="line">      <span class="k">end</span>
</span><span class="line">    <span class="k">in</span>
</span><span class="line">    <span class="nn">Xdg_decoration</span><span class="p">.</span><span class="n">set_mode</span> <span class="n">decor</span> <span class="o">~</span><span class="n">mode</span><span class="o">:</span><span class="nn">Xdg_decoration</span><span class="p">.</span><span class="nn">Mode</span><span class="p">.</span><span class="nc">Server_side</span>
</span><span class="line">  <span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Dialog boxes are more of a problem.
Wayland requires every dialog box to have a parent window, but X11 doesn't.
To handle that, the proxy tracks the last window the user interacted with and uses that as a fallback parent
if an X11 window with type <code>_NET_WM_WINDOW_TYPE_DIALOG</code> is created without setting <code>WM_TRANSIENT_FOR</code>.
That could be a problem if the application closes that window, but it seems to work.</p>
<h2 id="performance">Performance</h2>
<p>I noticed a strange problem: scrolling around in GVim had long pauses once a second or so,
corresponding to OCaml GC runs.
This was surprising, as OCaml has a fast incremental garbage collector, which is normally not a problem for interactive programs.
Besides, I'd been using the proxy with the (Wayland) Firefox and xfce4-terminal applications for 6 months without any similar problem.</p>
<p>Using <code>perf</code> showed that Linux was spending a huge amount of time in <code>release_pages</code>.
The problem is that Xwayland was sharing lots of short-lived memory pools with the proxy.
Each time it shares a pool, we have to ask the VM host for a chunk of memory of the same size.
We map both pools into our address space and then copy each frame across
(this is needed because we can't export guest memory to the host).</p>
<p>Normally, an application shares a single pool and just refers to regions within it, so we just map once at startup and unmap at exit.
But Xwayland was creating, sharing and discarding around 100 pools per second while scrolling in GVim!
Because these pools take up a lot of RAM, OCaml was (correctly) running the GC very fast, freeing them in batches of 100 or so each second.</p>
<p>First, I tried adding a cache of host memory, but that only solved half the problem: freeing the client pool was still slow.</p>
<p>Another option is to unmap the pools as soon as we get the destroy message, to spread the work out.
Annoyingly, OCaml's standard library doesn't let you free memory-mapped memory explicitly
(see the <a href="https://github.com/ocaml/ocaml/pull/389">Add BigArray.Genarray.free</a> PR for the current status),
but adding this myself with a bit of C code would have been easy enough.
We only touch the memory in one place (for the copy), so manually checking it hadn't been freed would have been pretty safe.</p>
<p>Then I noticed something interesting about the repeated log entries, which mostly looked like this:</p>
<pre><code>-&gt; wl_shm@4.create_pool id:+26 fd:(fd) size:8368360
-&gt; wl_shm_pool@26.create_buffer id:+28 offset:0 width:2090 height:1001 stride:8360 format:1
-&gt; wl_shm_pool@26.destroy 
&lt;- wl_display@1.delete_id id:26
-&gt; wl_buffer@28.destroy 
&lt;- wl_display@1.delete_id id:28
</code></pre>
<p>Xwayland creates a pool, allocates a buffer within it, destroys the pool (so it can't create more buffers), and then deletes the buffer.
But <em>it never uses the buffer for anything</em>!</p>
<p>So the solution was simple: I just made the host buffer allocation and the mapping operations lazy.
We force the mapping if a pool's buffer is ever attached to a surface, but if not we just close the FD and forget about it.
It would be more efficient if Xwayland only shared the pools when needed, though.</p>
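<p>The lazy approach can be sketched with OCaml's standard <code>Lazy</code> module.
This is only a minimal model: the <code>pool</code> record, <code>create_pool</code>, <code>attach</code> and <code>destroy</code> are hypothetical names,
and a plain <code>Bytes.create</code> stands in for the real host allocation and memory mapping:</p>

```ocaml
(* Hypothetical pool record: [mapping] stands in for the host allocation
   plus the mapping operations, and is only computed when forced. *)
type pool = {
  size : int;
  mapping : Bytes.t Lazy.t;
}

(* Counts how many times the expensive step actually ran. *)
let mappings_created = ref 0

let create_pool ~size =
  let mapping = lazy (
    incr mappings_created;
    Bytes.create size          (* placeholder for the real mapping *)
  ) in
  { size; mapping }

(* Called when one of the pool's buffers is attached to a surface. *)
let attach pool = Lazy.force pool.mapping

(* Called when the pool is destroyed: only a forced mapping needs cleanup. *)
let destroy pool =
  if Lazy.is_val pool.mapping then `Unmapped else `Never_mapped
```

<p>With this shape, a pool that is created and destroyed without ever being attached never runs the expensive step at all.</p>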
<h2 id="pointer-events">Pointer events</h2>
<p>Wayland delivers pointer events relative to a surface, so we simply forward these on to Xwayland unmodified and everything just works.</p>
<p>I'm kidding - this was the hardest bit! When Xwayland gets a pointer event on a window, it doesn't send it directly to that window.
Instead, it converts the location to screen coordinates and then pushes the event through the old X event handling mechanism, which looks at the X11 window stack to decide where to send it.</p>
<p>However, the X11 window stack (which we saw earlier with <code>xwininfo -tree -root</code>) doesn't correspond to the Wayland window layout at all.
In fact, Wayland doesn't provide us any way to know where our windows are, or how they are stacked.</p>
<p>Sway seems to handle this via a backdoor: X11 applications do get access to location information even though native Wayland clients don't.
This is one of the reasons I want to get X11 support out of the compositor - I want to make sure X11 apps don't have any special access.
Sommelier has a solution though: when the pointer enters a window we raise it to the top of the X11 stack. Since it's the topmost window, it will get the events.</p>
<p>Unfortunately, the raise request goes over the X11 connection while the pointer events go over the Wayland one.
We need to make sure that they arrive in the right order.
If the computer is running normally, this isn't much of a problem,
but if it's swapping or otherwise struggling it could result in events going to the wrong place
(I temporarily added a 2-second delay to test this).
This is what I ended up with:</p>
<ol>
<li>Get a Wayland pointer enter event from Sway.
</li>
<li>Pause event delivery from Sway.
</li>
<li>Flush any pending Wayland events we previously sent to Xwayland by doing a round-trip on the Wayland connection.
</li>
<li>Send a raise on the X11 connection.
</li>
<li>Do a round-trip on the X11 connection to ensure the raise has completed.
</li>
<li>Forward the enter event on the Wayland connection.
</li>
<li>Unpause the event stream from Sway.
</li>
</ol>
<p>At first I tried queuing up just the pointer events,
but that doesn't work because e.g. keyboard events need to be synchronised with pointer events.
Otherwise, if you e.g. Shift-click on something then the click gets delayed but the Shift doesn't and it can do the wrong thing.
Also, Xwayland might ask Sway to destroy the window while we're entering it, and Sway might confirm the deletion.
Pausing the whole event stream from Sway fixes all these problems.</p>
<p>The next problem was how to do the two round-trips.
For X11 we just send an <code>InternAtom</code> request after the raise and wait to get a reply to that.
Wayland provides the <code>wl_display.sync</code> method to clients, but we're acting as a Wayland server to Xwayland,
not a client.
I remembered that Wayland's xdg-shell extension provides a ping from the server to the client
(the compositor can use this to detect when an application is not responding).
Unfortunately, Xwayland has no reason to use this extension because it doesn't deal with window roles.
Luckily, it uses it anyway (it does need it for non-rootless mode and doesn't bother to check).</p>
<p><code>wl_display.sync</code> works by creating a fresh callback object, but xdg-shell's <code>ping</code> just sends a <code>pong</code> event to a fixed object,
so we also need a queue to keep track of pings in flight so we don't get confused between our pings and any pings we're relaying for Sway.
Also, xdg-shell's ping requires a serial number and we don't have one.
But since Xwayland is the only app this needs to support, and it doesn't look at that, I cheat and just send zero.</p>
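<p>A minimal sketch of such a queue, using only the standard library.
The <code>origin</code> type and the function names are invented for illustration,
and it assumes pongs come back in the same order the pings were sent:</p>

```ocaml
(* Who is waiting for each ping we have sent to Xwayland. *)
type origin =
  | Ours                 (* our own round-trip; the pong is consumed here *)
  | Relayed of int32     (* a ping from the compositor, with its serial *)

let in_flight : origin Queue.t = Queue.create ()

(* Record a ping just before sending it to Xwayland. *)
let sent_ping origin = Queue.add origin in_flight

(* Assuming pongs arrive in the order the pings were sent, the oldest
   entry in the queue says what to do with this one. *)
let on_pong () =
  match Queue.take_opt in_flight with
  | Some Ours -> `Round_trip_done
  | Some (Relayed serial) -> `Forward_to_compositor serial
  | None -> `Unexpected
```

<p>Because the reply is matched by position in the queue rather than by serial, sending zero as the serial for our own pings doesn't cause any confusion here.</p>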
<p>And that's how to get pointer events to go to the right window with Xwayland.</p>
<h2 id="keyboard-events">Keyboard events</h2>
<p>A very similar problem exists with the keyboard.
When Wayland says the focus has entered a window
we need to send a <code>SetInputFocus</code> over the X11 connection
and then send the keyboard events over the Wayland one,
requiring another two round-trips to synchronise the two connections.</p>
<h2 id="pointer-cursor">Pointer cursor</h2>
<p>Some applications set their own pointer shape, which works fine.
But others rely on the default and for some reason you get no cursor at all in that case.
To fix it, you need to set a cursor on the root window, which applications will then inherit by default.
Unlike Wayland, where every application provides its own cursor bitmaps,
X very sensibly provides a standard set of cursors, in a font called <code>cursor</code>
(this is why I had to implement <code>OpenFont</code>).
As cursors have two colours and a mask, each cursor is two glyphs: each even-numbered glyph is the image and the following glyph is its mask:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(* Load the default cursor image *)</span>
</span><span class="line"><span class="k">let</span><span class="o">*</span> <span class="n">cursor_font</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Font</span><span class="p">.</span><span class="n">open_font</span> <span class="n">x11</span> <span class="s2">&quot;cursor&quot;</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span><span class="o">*</span> <span class="n">default_cursor</span> <span class="o">=</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Font</span><span class="p">.</span><span class="n">create_glyph_cursor</span> <span class="n">x11</span>
</span><span class="line">    <span class="o">~</span><span class="n">source_font</span><span class="o">:</span><span class="n">cursor_font</span> <span class="o">~</span><span class="n">mask_font</span><span class="o">:</span><span class="n">cursor_font</span>
</span><span class="line">    <span class="o">~</span><span class="n">source_char</span><span class="o">:</span><span class="mi">68</span> <span class="o">~</span><span class="n">mask_char</span><span class="o">:</span><span class="mi">69</span>
</span><span class="line">    <span class="o">~</span><span class="n">bg</span><span class="o">:(</span><span class="mh">0xffff</span><span class="o">,</span> <span class="mh">0xffff</span><span class="o">,</span> <span class="mh">0xffff</span><span class="o">)</span>
</span><span class="line">    <span class="o">~</span><span class="n">fg</span><span class="o">:(</span><span class="mi">0</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="mi">0</span><span class="o">)</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">create_attributes</span> <span class="o">~</span><span class="n">cursor</span><span class="o">:</span><span class="n">default_cursor</span> <span class="bp">()</span>
</span><span class="line"><span class="o">|&gt;</span> <span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">change_attributes</span> <span class="n">x11</span> <span class="n">root</span>
</span></code></pre></td></tr></tbody></table></div></figure><h2 id="selections">Selections</h2>
<p>The next job was to get copying text between X and Wayland working.</p>
<p>In X11:</p>
<ul>
<li>When you select something, the application takes ownership of the <code>PRIMARY</code> selection.
</li>
<li>When you click the middle button or press Shift-Insert, the application requests <code>PRIMARY</code>.
</li>
<li>When you press Ctrl-C, the application takes ownership of the <code>CLIPBOARD</code> selection.
</li>
<li>When you press Ctrl-V it requests <code>CLIPBOARD</code>.
</li>
</ul>
<p>It's quite neat that adding support for a Windows-style clipboard didn't require changing the X server at all.
Good forward-thinking design there.</p>
<p>In Wayland, things are not so simple.
I have so far found no fewer than four separate Wayland protocols for copying text:</p>
<ol>
<li><code>gtk_primary_selection</code> supports copying the primary selection, but not the clipboard.
</li>
<li><code>wp_primary_selection_unstable_v1</code> is identical to <code>gtk_primary_selection</code> except that it renames everything.
</li>
<li><code>wl_data_device_manager</code> supports clipboard transfers but not the primary selection.
</li>
<li><code>zwlr_data_control_manager_v1</code> supports both, but it's for a &quot;privileged client&quot; to be a clipboard manager.
</li>
</ol>
<p><code>gtk_primary_selection</code> and <code>wl_data_device_manager</code> both say they're stable, while the other two are unstable.
However, Sway dropped support for <code>gtk_primary_selection</code> a while ago, breaking many applications
(luckily, I had a handy Wayland proxy and was able to add some adaptor code
to route <code>gtk_primary_selection</code> messages to the new &quot;unstable&quot; protocol).</p>
<p>For this project, I went with <code>wp_primary_selection_unstable_v1</code> and <code>wl_data_device_manager</code>.
On the Wayland side, everything has to be written twice for the two protocols, which are almost-but-not-quite the same.
In particular, <code>wl_data_device_manager</code> also has a load of drag-and-drop stuff you need to ignore.</p>
<p>For each selection (<code>PRIMARY</code> or <code>CLIPBOARD</code>), we can be in one of two states:</p>
<ul>
<li>An X11 client owns the selection (and we own the Wayland selection).
</li>
<li>A Wayland client owns the selection (and we own the X11 selection).
</li>
</ul>
<p>When we own a selection we proxy requests for it to the matching selection on the other protocol.</p>
<ul>
<li>At startup, we take ownership of the X11 selection, since there are no X11 apps running yet.
</li>
<li>When we lose the X11 selection it means that an X11 client now owns it and we take the Wayland selection.
</li>
<li>When we lose the Wayland selection it means that a Wayland client now owns it and we take the X11 selection.
</li>
</ul>
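<p>The ownership rules above amount to a two-state machine per selection.
Here is a small sketch of it; the type and constructor names are hypothetical, and the real proxy also has to do the actual protocol work for each transition:</p>

```ocaml
(* Which side currently holds the real selection owner. *)
type owner = X11_client | Wayland_client

(* Events the proxy reacts to (names are hypothetical). *)
type event = Lost_x11_selection | Lost_wayland_selection

(* At startup no X11 apps exist, so the proxy owns the X11 selection,
   which corresponds to the "Wayland client owns it" state. *)
let initial = Wayland_client

(* Returns the new state and the selection the proxy must now take.
   The result depends only on the event, so the old state is ignored. *)
let step _state = function
  | Lost_x11_selection -> (X11_client, `Take_wayland)
  | Lost_wayland_selection -> (Wayland_client, `Take_x11)
```

<p>One such state would be kept for <code>PRIMARY</code> and another for <code>CLIPBOARD</code>.</p>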
<p>One good thing about the Wayland protocols is that you send the data by writing it to a normal Unix pipe.
For X11, we need to write the data to a property on the requesting application's window and then notify it about the data.
And we may need to split it into multiple chunks if there's a lot of data to transfer.</p>
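<p>The splitting step on its own is straightforward. This sketch only shows cutting the data into bounded pieces;
<code>chunks</code> and <code>max_size</code> are illustrative names, and the X11 side of writing each piece to the property is not shown:</p>

```ocaml
(* Split [data] into pieces of at most [max_size] bytes, in order. *)
let chunks ~max_size data =
  if max_size < 1 then invalid_arg "chunks: max_size must be positive";
  let len = String.length data in
  let rec aux off acc =
    if off >= len then List.rev acc
    else begin
      let n = min max_size (len - off) in
      aux (off + n) (String.sub data off n :: acc)
    end
  in
  aux 0 []
```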
<p>A strange problem I had was that, while pasting into GVim worked fine, xterm would segfault shortly after trying to paste into it.
This turned out to be a bug in the way I was sending the notifications.
If an X11 application requests the special <code>TEXT</code> target, it means that the sender should choose the exact format.
You write the property with the chosen type (e.g. <code>UTF8_STRING</code>),
but you must still send the notification with the target <code>TEXT</code>.
xterm is a C application (thankfully no longer set-uid!) and seems to have a use-after-free bug in the timeout code.</p>
<h2 id="drag-and-drop">Drag-and-drop</h2>
<p>Sadly, I wasn't able to get this working at all.
X itself doesn't know anything about drag-and-drop and instead applications look at the window tree to decide where the user dropped things.
This doesn't work with the proxy, because Wayland doesn't tell us where the windows really are on the screen.</p>
<p>Even without any VMs or proxies, drag-and-drop from X applications to Wayland ones doesn't work,
because the X app can't see the Wayland window and the drop lands on the X window below (if any).</p>
<h2 id="bonus-features">Bonus features</h2>
<p>In the last post, I mentioned several other problems, which have also now been solved by the proxy:</p>
<h3 id="hidpi-works">HiDPI works</h3>
<p>Wayland's support for high resolution screens is a bit strange.
I would have thought that applications really only need to know two things:</p>
<ol>
<li>The size in pixels of the window.
</li>
<li>The size in pixels that some standard thing should be (e.g. a normal-sized letter M).
</li>
</ol>
<p>Some systems instead provide the size of the window and the DPI (dots-per-inch),
but this doesn't work well.
For example, a mobile phone might be high DPI but still want small text because you hold it close to your face,
while a display board will have very low DPI but want large text.</p>
<p>Wayland instead redefines the idea of a pixel to be a group of pixels corresponding to a single pixel on a typical 1990s display.
So if you set your scale factor to 2 then 1 Wayland pixel is a 2x2 grid of physical pixels.
If you have a 1000x1000 pixel window, Wayland will tell the application it is 500x500 but suggest a scale factor of 2.
If the application supports HiDPI mode, it will double all the numbers and render a 1000x1000 image and things work correctly.
If not, it will render a 500x500 pixel image and the compositor will scale it up.</p>
<p>Since Xwayland doesn't support this, it just draws everything too small and Sway scales it up,
creating a blurry and unusable mess.
This might be made worse by <a href="https://en.wikipedia.org/wiki/Subpixel_rendering">subpixel rendering</a>, which doesn't cope well with being scaled.</p>
<p>With the proxy, the solution is simple enough: when talking to Xwayland we just scale everything back up to the real dimensions,
scaling all coordinates as we relay them:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">scale_to_client</span> <span class="n">t</span> <span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">y</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="n">x</span> <span class="o">*</span> <span class="n">t</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">xunscale</span><span class="o">,</span>
</span><span class="line">  <span class="n">y</span> <span class="o">*</span> <span class="n">t</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">xunscale</span>
</span></code></pre></td></tr></tbody></table></div></figure><figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">method</span> <span class="n">on_configure</span> <span class="o">_</span> <span class="o">~</span><span class="n">width</span> <span class="o">~</span><span class="n">height</span> <span class="o">~</span><span class="n">states</span><span class="o">:_</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">width</span> <span class="o">=</span> <span class="nn">Int32</span><span class="p">.</span><span class="n">to_int</span> <span class="n">width</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">height</span> <span class="o">=</span> <span class="nn">Int32</span><span class="p">.</span><span class="n">to_int</span> <span class="n">height</span> <span class="k">in</span>
</span><span class="line">  <span class="k">if</span> <span class="n">width</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">height</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">then</span> <span class="o">(</span>
</span><span class="line">    <span class="nn">Lwt</span><span class="p">.</span><span class="n">async</span> <span class="o">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">        <span class="k">let</span> <span class="o">(</span><span class="n">width</span><span class="o">,</span> <span class="n">height</span><span class="o">)</span> <span class="o">=</span> <span class="n">scale_to_client</span> <span class="n">t</span> <span class="o">(</span><span class="n">width</span><span class="o">,</span> <span class="n">height</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">        <span class="nn">X11</span><span class="p">.</span><span class="nn">Window</span><span class="p">.</span><span class="n">configure</span> <span class="n">x11</span> <span class="n">window</span> <span class="o">~</span><span class="n">width</span> <span class="o">~</span><span class="n">height</span> <span class="o">~</span><span class="n">border_width</span><span class="o">:</span><span class="mi">0</span>
</span><span class="line">      <span class="o">)</span>
</span><span class="line">  <span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This will tend to make things sharp but too small, but X applications already have their own ways to handle high resolution screens.
For example, you can set <code>Xft.dpi</code> to make all the fonts bigger. I run this proxy like this, which works for me:</p>
<pre><code>wayland-proxy-virtwl --x-display=0 --xrdb Xft.dpi:150 --x-unscale=2
</code></pre>
<p>However, there is a problem.
The Wayland specification says:</p>
<blockquote>
<p>The new size of the surface is calculated based on the buffer
size transformed by the inverse buffer_transform and the
inverse buffer_scale. This means that at commit time the supplied
buffer size must be an integer multiple of the buffer_scale. If
that's not the case, an invalid_size error is sent.</p>
</blockquote>
<p>Let's say we have an X11 image viewer that wants to show a 1001-pixel-high image in a 1001-pixel-high window.
This isn't allowed by the spec, which can only handle even-sized windows when the scale factor is 2.
Regular Wayland applications already have to deal with that somehow, but for X11 applications it becomes our problem.</p>
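<p>To make the arithmetic concrete, here is a small sketch of the rule for one dimension, ignoring <code>buffer_transform</code>.
The function name and the <code>`Invalid_size</code> tag are just for illustration:</p>

```ocaml
(* Surface size implied by a buffer dimension, or an error when the
   buffer is not an integer multiple of the scale (ignoring transforms). *)
let surface_dim ~buffer_scale buffer_dim =
  if buffer_dim mod buffer_scale = 0 then Ok (buffer_dim / buffer_scale)
  else Error `Invalid_size
```

<p>With a scale of 2, a 1000-pixel buffer gives a 500-unit surface, but a 1001-pixel buffer has no valid surface size.</p>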
<p>I tried rounding down, but that has a bad side-effect: if GTK asks for a 1001-pixel-high menu and gets a 1000-pixel allocation,
it switches to squashed mode and draws two big bumper arrows at the top and bottom of the menu which you must use to scroll it.
It looks very silly.</p>
<p>I also tried rounding up, but tooltips look bad with any rounding. Either one border is missing, or it's double thickness.
Luckily, it seems that Sway doesn't actually enforce the rule about surfaces being a multiple of the scale factor.
So, I just let the application attach a buffer of whatever size it likes to the surface and it seems to work!</p>
<p>The only problem I had was that when using unscaling, the mouse pointer in GVim would get lost.
Vim hides it when you start typing, but it's supposed to come back when you move the mouse.
The problem seems to be that it hides it by creating a 1x1 pixel cursor.
Sway decides this isn't worth showing (maybe because it's 0x0 in Wayland-pixels?),
and sends Xwayland a leave event saying the cursor is no longer on the screen.
Then when Vim sets the cursor back, Xwayland doesn't bother updating it, since it's not on screen!</p>
<p>The solution was to stop applying unscaling to cursors.
They look better doubled in size, anyway.
True, this does mean that the sharpness of the cursor changes as you move between windows,
but you're unlikely to notice this
due to the far more jarring effect of Wayland cursors also changing size and shape at the same time.</p>
<h3 id="ring-buffer-logging">Ring-buffer logging</h3>
<p>Even without a proxy to complicate things, Wayland applications often have problems.
To make investigating this easier, I added a ring-buffer log feature.
When on, the proxy keeps the last 512K or so of log messages in memory, and will dump them out on demand.</p>
<p>To use it, you run the proxy with e.g. <code>-v --log-ring-path ~/wayland.log</code>.
When something odd happens (e.g. an application crashes, or opens its menus in the wrong place) you can
dump out the ring buffer and see what just happened with:</p>
<pre><code>echo dump-log &gt; /run/user/1000/wayland-1-ctl
</code></pre>
<p>I also added some filtering options (e.g. <code>--log-suppress motion,shm</code>) to suppress certain classes of noisy messages.</p>
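<p>The idea behind the ring buffer can be sketched in a few lines.
This is a simplified model rather than the proxy's implementation: it keeps whole messages in a queue and drops the oldest ones once a byte limit is exceeded,
and the <code>ring</code> type and function names are invented for the example:</p>

```ocaml
(* A byte-bounded log: old messages are dropped once [limit] is exceeded. *)
type ring = {
  limit : int;                 (* maximum total bytes to keep *)
  used : int ref;              (* bytes currently held *)
  messages : string Queue.t;
}

let create ~limit = { limit; used = ref 0; messages = Queue.create () }

let add ring msg =
  Queue.add msg ring.messages;
  ring.used := !(ring.used) + String.length msg;
  (* Evict the oldest entries until we are back under the limit. *)
  while !(ring.used) > ring.limit && not (Queue.is_empty ring.messages) do
    ring.used := !(ring.used) - String.length (Queue.take ring.messages)
  done

(* Oldest-first list of the retained messages, as a dump would write them. *)
let dump ring = List.of_seq (Queue.to_seq ring.messages)
```

<p>A buffer of roughly the size mentioned above would be made with <code>create ~limit:(512 * 1024)</code>.</p>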
<h3 id="vim-windows-open-correctly">Vim windows open correctly</h3>
<p>One annoyance with Sway is that Vim's window always appears blank (even when running on the host, without any proxy).
You have to resize it before you can see the text.</p>
<p>My proxy initially suffered from the same problem, although only intermittently.
It turned out to be because Vim sends a <code>ConfigureRequest</code> with its desired size and then waits for the confirmation message.
Since Sway is a tiling window manager, it ignores the new size and no event is generated.
In this case, an X11 window manager is supposed to send a synthetic <code>ConfigureNotify</code>,
so I just got the proxy to do that and the problem disappeared
(I confirmed this by adding a sleep to Vim's <code>gui_mch_update</code>).</p>
<p>By the way, the GVim start-up code is quite interesting.
The code path to opening the window goes through three separate functions which each define a
<code>static int recursive = 0</code> and then proceed to behave differently depending on how many times they've
been reentered - see <a href="https://github.com/vim/vim/blob/9cd063e3195a4c250c8016fa340922ab21fda252/src/gui.c#L489">gui_init</a> for an example!</p>
<h3 id="copy-and-paste-without-m-characters">Copy-and-paste without ^M characters</h3>
<p>The other major annoyance with Sway is that copy-and-paste doesn't work correctly (<a href="https://github.com/swaywm/wlroots/issues/1839">Sway bug #1839</a>).
Using the proxy avoids that problem completely.</p>
<h2 id="conclusions">Conclusions</h2>
<p>I'm not sure how I feel about this project.
It ended up taking a lot longer than I expected, and I could probably have ported several X11 applications to Wayland in the same time.
On the other hand, I now have working X support in the VMs with no need for <code>ssh -Y</code> from the host, plus support for HiDPI in Wayland, mouse cursors that are large enough to see easily, windows that open reliably, text pasting that works, and I can get logs whenever something misbehaves.</p>
<p>In fact, I'm now also running an instance of the proxy directly on the host to get the same benefits for host X11 applications.
Setting this up is actually a bit tricky:
you want to start Sway with <code>DISPLAY=:0</code> so that every application it spawns knows it has an X11 display,
but if you set that then Sway thinks you want it to run nested inside an X window provided by the proxy,
which doesn't end well (or, indeed, at all).</p>
<p>Having all the legacy X11 support in a separate binary should make it much easier to write new Wayland compositors,
which might be handy if I ever get some time to try that.
It also avoids having many thousands of lines of legacy C code in the highly-trusted compositor code.</p>
<p>If Wayland had an official protocol for letting applications know the window layout then I could make drag-and-drop between X11 applications within the same VM work, but it still wouldn't work between VMs or to Wayland applications, so it's probably not worth it.</p>
<p>Having two separate connections to Xwayland creates a lot of unnecessary race conditions.
A simple solution might be a Wayland extension that allows the Wayland server to say &quot;please read N bytes from the X11 socket now&quot;,
and likewise in the other direction.
Then messages would always arrive in the order in which they were sent.</p>
<p>The code is all available at <a href="https://github.com/talex5/wayland-proxy-virtwl">https://github.com/talex5/wayland-proxy-virtwl</a> if you want to try it.
It works with the applications I use when running under Sway,
but will probably require some tweaking for other programs or compositors.
Here's a screenshot of my desktop using it:</p>
<p><a href="/blog/images/xwayland/desktop.png"><span class="caption-wrapper center"><img src="/blog/images/xwayland/desktop.png" title="Screenshot of my desktop" class="caption"/><span class="caption-text">Screenshot of my desktop</span></span></a></p>
<p>The windows with <code>[dev]</code> in the title are from my Debian VM, while <code>[com]</code> is a SpectrumOS VM I use for email, etc.
Gitk, GVim and ROX-Filer are X11 applications using Xwayland,
while Firefox and xfce4-terminal are using plain Wayland proxying.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Qubes-lite with KVM and Wayland</title>
    <link href="https://roscidus.com/blog/blog/2021/03/07/qubes-lite-with-kvm-and-wayland/"></link>
    <updated>2021-03-07T15:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2021/03/07/qubes-lite-with-kvm-and-wayland</id>
    <content type="html"><![CDATA[<p>I've been running QubesOS as my main desktop since 2015.
It provides good security, by running applications in different Xen VMs.
However, it is also quite slow and has some hardware problems.
I've recently been trying out NixOS, KVM, Wayland and SpectrumOS,
and attempting to create something similar with more modern/compatible/faster technology.</p>
<p>This post gives my initial impressions of these tools and describes my current setup.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#qubesos">QubesOS</a>
</li>
<li><a href="#nixos">NixOS</a>
<ul>
<li><a href="#nix-store">nix-store</a>
</li>
<li><a href="#nix-instantiate">nix-instantiate</a>
</li>
<li><a href="#nix-pkgs">nix-pkgs</a>
</li>
<li><a href="#nix-env">nix-env</a>
</li>
<li><a href="#nixos-1">NixOS</a>
</li>
<li><a href="#installing-nixos">Installing NixOS</a>
</li>
<li><a href="#thoughts-on-nixos">Thoughts on NixOS</a>
</li>
</ul>
</li>
<li><a href="#why-use-virtual-machines">Why use virtual machines?</a>
</li>
<li><a href="#spectrumos">SpectrumOS</a>
</li>
<li><a href="#wayland">Wayland</a>
<ul>
<li><a href="#protocol">Protocol</a>
</li>
<li><a href="#copying-text">Copying text</a>
</li>
<li><a href="#security">Security</a>
</li>
</ul>
</li>
<li><a href="#future-work">Future work</a>
</li>
</ul>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=26378854">Hacker News</a> and
<a href="https://lobste.rs/s/cisgn2/qubes_lite_with_kvm_wayland">Lobsters</a>.)</p>
<h2 id="qubesos">QubesOS</h2>
<p><a href="https://www.qubes-os.org/">QubesOS</a> aims to provide &quot;a reasonably secure operating system&quot;.
It does this by running multiple virtual machines under the Xen hypervisor.
Each VM's windows have a different colour and tag, but they appear together as a single desktop.
The VMs I run include:</p>
<ul>
<li><code>com</code> for email and similar (the only VM that sees my email password).
</li>
<li><code>dev</code> for software development.
</li>
<li><code>shopping</code> (the only VM that sees my card number).
</li>
<li><code>personal</code> (with no Internet access)
</li>
<li><code>untrusted</code> (general browsing)
</li>
</ul>
<p>The desktop environment itself is another Linux VM (<code>dom0</code>), used for managing the other VMs.
Most of the VMs are running Fedora (the default for Qubes), although I run Debian in <code>dev</code>.
There are also a couple of system VMs: one for dealing with the network hardware,
and one providing a firewall between the VMs.</p>
<p>You can run <code>qvm-copy</code> in a VM to copy a file to another VM.
<code>dom0</code> pops up a dialog box asking which VM should receive the file, and it arrives there
as <code>~/QubesIncoming/$source_vm/$file</code>.
You can also press Ctrl-Shift-C to copy a VM's clipboard to the global clipboard, and then
press Ctrl-Shift-V in a window of the target VM to copy to that VM's clipboard,
ready for pasting into an application.</p>
<p>I think Qubes does a very good job at providing a secure environment.</p>
<p>However, it has poor hardware compatibility and it feels sluggish, even on a powerful machine.
I bought a new machine a while ago and found that the motherboard only provided a single video output, limited to 30Hz.
This meant I had to buy a discrete graphics card. With the card enabled, the machine <a href="https://github.com/QubesOS/qubes-issues/issues/5459">fails to resume from suspend</a>,
and locks up from time to time (it's completely stable with the card removed or disabled).
I spent some time trying to understand the driver code, but I didn't know enough about graphics, the Linux kernel, PCI suspend, or Xen to fix it.</p>
<p>I was also having some other problems with QubesOS:</p>
<ul>
<li>Graphics performance is terrible (especially on a 4k monitor).
Qubes disables graphics acceleration in VMs for security reasons, but it was slow even for software rendering.
</li>
<li>It recently started freezing for a couple of seconds from time to time - annoying when you're trying to type.
</li>
<li>It uses LVM thin-pools for VM storage, which I don't understand, and which sometimes need repairing (haven't lost any data, though).
</li>
<li>dom0 is out-of-date and generally not usable.
This is intentional (you should be using VMs),
but my security needs aren't that high and it would be nice to be able to do video conferencing these days.
Also, being able to print over USB and use bluetooth would be handy.
</li>
</ul>
<p>Anyway, I decided it was time to try something new.
Linux now has its own built-in hypervisor (KVM), and I thought that would probably work better with my hardware.
I was also keen to try out Wayland, which is built around shared memory and which I thought might therefore work better with VMs.
How easy would it be to recreate a Qubes-like environment directly on Linux?</p>
<h2 id="nixos">NixOS</h2>
<p>I've been meaning to try <a href="https://nixos.org/">NixOS</a> properly for some time. Ever since I started using Linux, its package management has struck me as absurd. On Debian, Fedora, etc, installing a package means letting it put files wherever it likes, which effectively gives the package author root on your system. Not a good base for sandboxing!</p>
<p>Also, they make it difficult to try out 3rd-party software, or to test newer versions of just some packages.</p>
<p>In 2003 I created <a href="https://0install.net/">0install</a> to address these problems, and Nix has very similar goals. I thought Nix was a few years younger, but looking at its Git history the first commit was on Mar 12, 2003. I announced the first preview of 0install just two days later, so both projects must have started writing code within a few days of each other!</p>
<p>NixOS is made up of quite a few components. Here is what I've learned so far:</p>
<h3 id="nix-store">nix-store</h3>
<p>The store holds the files of all the programs, and is the central component of the system.
Each version of a package goes in its own directory (or file), at <code>/nix/store/$HASH</code>.
You can add data to the store directly, like this:</p>
<pre><code>$ echo hello &gt; file

$ nix-store --add-fixed sha256 file
/nix/store/1vap48aqggkk52ijn2prxzxv7cnzvs0w-file

$ cat /nix/store/1vap48aqggkk52ijn2prxzxv7cnzvs0w-file
hello
</code></pre>
<p>Here, the store location is calculated from the hash of the contents of the file we added (as with <code>0install store add</code> or <code>git hash-object</code>).</p>
<p>However, you can also add things to the store by asking Nix to run a build script.
For example, to compile some source code:</p>
<ol>
<li>You add the source code and some build instructions (a &quot;derivation&quot; file) to the store.
</li>
<li>You ask the store to build the derivation. It runs your build script in a container sandbox.
</li>
<li>The results are added to the store, using the hash of the build instructions (not the hash of the result) as the directory name.
</li>
</ol>
<p>If a package in the store depends on another one (at build time or run time), it just refers to it by its full path.
For example, a bash script in the store will start something like:</p>
<pre><code>#! /nix/store/vnyfysaya7sblgdyvqjkrjbrb0cy11jf-bash-4.4-p23/bin/bash
...
</code></pre>
<p>If two users want to use the same build instructions, the second one will see that the hash already exists and can just reuse that.
This allows users to compile software from source and share the resulting binaries, without having to trust each other.</p>
<p>Ideally, builds should be reproducible.
To encourage this, builds which use the hash of the build instructions for the result path are built in a sandbox without network access.
So, you can't submit a build job like &quot;Download and compile whatever is the latest version of Vim&quot;.
But you can discover the latest version yourself and then submit two separate jobs to the store:</p>
<ol>
<li>&quot;Download Vim 8.2, with hash XXX&quot; (a fixed-output job, which therefore has network access)
</li>
<li>&quot;Build Vim from hash XXX&quot;
</li>
</ol>
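<p>In nixpkgs terms, the two jobs might look roughly like the sketch below.
The URL is a placeholder, the hash is left as XXX as above,
and a real Vim package needs more build inputs than are shown here:</p>
<pre><code>with import &lt;nixpkgs&gt; {};

stdenv.mkDerivation {
  name = &quot;vim-8.2&quot;;

  # Job 1: a fixed-output derivation. It is allowed to use the network,
  # but it fails unless the downloaded file matches the given hash.
  src = fetchurl {
    url = &quot;https://example.org/vim-8.2.tar.gz&quot;;  # placeholder URL
    sha256 = &quot;XXX&quot;;                               # hash of the tarball
  };

  # Job 2: everything else in this derivation, which is built from
  # the downloaded source in a sandbox without network access.
}
</code></pre>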
<p>You can run <code>nix-collect-garbage</code> to delete everything from the store that isn't reachable via the symlinks under <code>/nix/var/nix/gcroots/</code>.
Users can put symlinks to things they care about keeping in <code>/nix/var/nix/gcroots/per-user/$USER/</code>.</p>
<p>By default, the store is also configured with a trusted binary cache service,
and will try to download build results from there instead of compiling locally when possible.</p>
<h3 id="nix-instantiate">nix-instantiate</h3>
<p>Writing derivation files by hand is tedious, so Nix provides a templating language to create them easily.
The Nix language is dynamically typed and based around maps/dictionaries (which it confusingly refers to as &quot;sets&quot;).
<code>nix-instantiate file.nix</code> will generate a derivation from <code>file.nix</code> and add it to the store.</p>
<p>A Nix file looks like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="nix"><span class="line"><span class="nb">derivation</span> <span class="p">{</span> <span class="ss">system =</span> <span class="s2">&quot;x86_64-linux&quot;</span><span class="p">;</span> <span class="ss">builder =</span> <span class="o">.</span><span class="l">/myfile</span><span class="p">;</span> <span class="ss">name =</span> <span class="s2">&quot;foo&quot;</span><span class="p">;</span> <span class="p">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Running <code>nix-instantiate</code> on this will:</p>
<ol>
<li>Add <code>myfile</code> to the store.
</li>
<li>Add the generated <code>foo.drv</code> to the store, including the full store path of <code>myfile</code>.
</li>
</ol>
<h3 id="nix-pkgs">nix-pkgs</h3>
<p>Writing Nix expressions for every package you want would also be tedious.
The <a href="https://github.com/NixOS/nixpkgs">nixpkgs</a> Git repository contains a Nix expression that evaluates to a set of derivations,
one for each package in the distribution.
It also contains a library of useful helper functions for packages
(e.g. it knows how to handle GNU autoconf packages automatically).</p>
<p>Rather than evaluating the whole lot, you use <code>-A</code> to ask for a single package.
For example, you can use <code>nix-instantiate ./nixpkgs/default.nix -A firefox</code> to generate a derivation for Firefox.</p>
<p><code>nix-build</code> is a quick way to create a derivation with <code>nix-instantiate</code> and build it with <code>nix-store</code>.
It will also create a <code>./result</code> symlink pointing to its path in the store,
as well as registering <code>./result</code> with the garbage collector under <code>/nix/var/nix/gcroots/auto/</code>.
For example, to build and run Firefox:</p>
<pre><code>nix-build ./nixpkgs/default.nix -A firefox
./result/bin/firefox
</code></pre>
<p>If you use nixpkgs without making any changes, it will be able to download a pre-built binary from the cache service.</p>
<h3 id="nix-env">nix-env</h3>
<p>Keeping track of all these symlinks would be tedious too,
but you can collect them all together by making a package that depends on every application you want.
Its build script will produce a <code>bin</code> directory full of symlinks to the applications.
Then you could just point your <code>$PATH</code> variable at that <code>bin</code> directory in the store.</p>
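<p>As a rough sketch, such a package could be written with the <code>buildEnv</code> helper from nixpkgs.
The environment name and the choice of applications here are just examples:</p>
<pre><code>with import &lt;nixpkgs&gt; {};

buildEnv {
  name = &quot;my-environment&quot;;   # example name
  paths = [ firefox vim ];   # the applications you want on $PATH
}
</code></pre>
<p>Building this with <code>nix-build</code> should give a <code>./result/bin</code> directory containing symlinks to the programs from both packages.</p>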
<p>To make updating easier, you will actually add <code>~/.nix-profile/bin/</code> to <code>$PATH</code> and
update <code>.nix-profile</code> to point at the latest build of your environment package.</p>
<p>This is essentially what <code>nix-env</code> does, except with yet more symlinks to allow for
switching between multiple profiles, and to allow rolling back to previous environments
if something goes wrong.</p>
<p>For example, to install Firefox so you can run it via <code>$PATH</code>:</p>
<pre><code>nix-env -i firefox
</code></pre>
<h3 id="nixos-1">NixOS</h3>
<p>Finally, just as <code>nix-env</code> can create a user environment with <code>bin</code>, <code>man</code>, etc,
a similar process can create a root filesystem for a Linux distribution.</p>
<p><code>nixos-rebuild</code> reads the <code>/etc/nixos/configuration.nix</code> configuration file,
generates a system environment,
and then updates grub and the <code>/run/current-system</code> symlink to point to it.</p>
<p>In fact, it also lists previous versions of the system environment in the grub file, so
if you mess up the configuration you can just choose an earlier one from the boot
menu to return to that version.</p>
<h3 id="installing-nixos">Installing NixOS</h3>
<p>To install NixOS you boot one of the live images at <a href="https://nixos.org">https://nixos.org</a>.
Which one you use only affects the installation UI, not the system you end up with.</p>
<p>The manual walks you through the installation process, showing how to partition
the disk, format and mount the partitions, and how to edit the configuration file.
I like this style of installation, where it teaches you things instead of just doing it for you.
Most of the effort in switching to a new system is learning about it, so I'd rather
spend 3 hours learning stuff following an installation guide than use a 15-minute
single-click installer that teaches me nothing.</p>
<p>The configuration file (<code>/etc/nixos/configuration.nix</code>) is just another Nix expression.
Most things are set to off by default (I approve), but can be changed easily.
For example, if you want sound support you set <code>sound.enable = true</code>,
and if you also want to use PulseAudio then you set <code>hardware.pulseaudio.enable = true</code> too.</p>
<p>Every system service supported by NixOS is controlled from here,
with all kinds of options, from <code>programs.vim.defaultEditor = true</code> (so you don't get trapped in <code>nano</code>)
to <code>services.factorio.autosave-interval</code>.
Use <code>man configuration.nix</code> to see the available settings.</p>
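<p>Put together, the settings mentioned above would sit in <code>configuration.nix</code> something like this
(a fragment only; a real file has many more entries):</p>
<pre><code>{ config, pkgs, ... }:

{
  # Sound support, with PulseAudio on top
  sound.enable = true;
  hardware.pulseaudio.enable = true;

  # Make Vim the default editor
  programs.vim.defaultEditor = true;
}
</code></pre>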
<p>NixOS defaults to an X11 desktop, but I wanted to try Wayland (and <a href="https://github.com/swaywm/sway">Sway</a>).
Based on the <a href="https://nixos.wiki/wiki/Sway">NixOS wiki</a> instructions, I used this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="nix"><span class="line">  programs<span class="o">.</span><span class="ss">sway =</span> <span class="p">{</span>
</span><span class="line">    <span class="ss">enable =</span> <span class="no">true</span><span class="p">;</span>
</span><span class="line">    wrapperFeatures<span class="o">.</span><span class="ss">gtk =</span> <span class="no">true</span><span class="p">;</span> <span class="c1"># so that gtk works properly</span>
</span><span class="line">    <span class="ss">extraSessionCommands =</span> <span class="s2">&quot;export MOZ_ENABLE_WAYLAND=1&quot;</span><span class="p">;</span>
</span><span class="line">    <span class="ss">extraPackages =</span> <span class="k">with</span> pkgs<span class="p">;</span> <span class="p">[</span>
</span><span class="line">      swaylock
</span><span class="line">      swayidle
</span><span class="line">      xwayland
</span><span class="line">      wl-clipboard
</span><span class="line">      mako
</span><span class="line">      alacritty
</span><span class="line">      dmenu
</span><span class="line">    <span class="p">];</span>
</span><span class="line">  <span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>xwayland</code> bit is important; without that you can't run any X11 applications.</p>
<p>My only complaint with the NixOS installation instructions is that following them will leave you with an unencrypted system,
which isn't very useful.
When partitioning, you have to skip ahead to the LUKS section of the manual, which just gives some options but no firm advice.
I created two primary partitions: a 1G unencrypted <code>/boot</code>, and a LUKS partition for the rest of the disk.
Then I created an LVM volume group from the <code>/dev/mapper/crypted</code> device and added the other partitions in that.</p>
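<p>For reference, the part of the configuration that unlocks a LUKS partition like this at boot looks something like the following.
The device path is a placeholder; the name <code>crypted</code> is what gives the <code>/dev/mapper/crypted</code> device:</p>
<pre><code>  boot.initrd.luks.devices.crypted = {
    device = &quot;/dev/sda2&quot;;  # placeholder: the LUKS partition
    preLVM = true;         # unlock before scanning for LVM volumes
  };
</code></pre>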
<p>Once the partitions are mounted and the configuration file is complete,
<code>nixos-install</code> downloads everything and configures grub.
Then you reboot into the new system.</p>
<p>Once running the new system you can make further edits to the configuration file there in the same way,
and use <code>nixos-rebuild switch</code> to generate a new system.
It seems to be pretty good at updating the running system to the new settings, so you don't normally need to reboot
after making changes.</p>
<p>The big mistake I made was forgetting to add <code>/boot</code> to fstab.
When I ran <code>nixos-rebuild</code> it put all the grub configuration on the encrypted partition, rendering the system unbootable.
I fixed that with <code>chattr +i /boot</code> on the unmounted partition.
That way, trying to rebuild with <code>/boot</code> unmounted will just give an error message.</p>
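<p>The missing fstab entry corresponds to a <code>fileSystems</code> setting in the Nix configuration.
A sketch, assuming the partition has the label <code>boot</code> and is formatted as ext4
(neither detail is given above):</p>
<pre><code>  fileSystems.&quot;/boot&quot; = {
    device = &quot;/dev/disk/by-label/boot&quot;;  # assumed label
    fsType = &quot;ext4&quot;;                      # assumed filesystem type
  };
</code></pre>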
<h3 id="thoughts-on-nixos">Thoughts on NixOS</h3>
<p>I've been using the system for a few weeks now and I've had no problems with Nix so far.
Nix has been fast and reliable and there were fairly up-to-date packages for everything I wanted
(I'm using the stable release).
There is a lot to learn, but plenty of documentation.</p>
<p>When I wanted a newer package (<code>socat</code> with vsock support, only just released) I just told Nix to install it from the latest Git checkout of nixpkgs.
Unlike on Debian and similar systems, doing this doesn't interfere with any other packages (such as forcing a system-wide upgrade of libc).</p>
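<p>One way to do this from <code>configuration.nix</code> is sketched below.
The checkout path is an assumption, and this is not necessarily the exact method used here:</p>
<pre><code>{ config, pkgs, ... }:

let
  # Assumed location of a local Git checkout of nixpkgs
  newpkgs = import /home/tal/src/nixpkgs {};
in
{
  environment.systemPackages = [
    pkgs.vim        # from the stable release, as usual
    newpkgs.socat   # just this one package from the checkout
  ];
}
</code></pre>
<p>Because each package refers to its dependencies by full store path, the newer <code>socat</code> brings its own dependencies with it and leaves everything else alone.</p>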
<p>I think Nix does download more data than most other systems, but networks are fast enough now that it doesn't seem to matter.
For example, let's say you're running Python 3.9.0 and you want to update to 3.9.1:</p>
<ul>
<li>
<p>With <strong>Debian</strong>: <code>apt-get upgrade</code> downloads the new version, which gets unpacked over the old one.
As the files are unpacked, the system moves through an exciting series of intermediate states no-one has thought about.
Running programs may crash as they find their library versions changing under them (though it's usually OK).
Only root can update software.</p>
</li>
<li>
<p>With <strong>0install</strong>: <code>0install update</code> downloads the new version, unpacking it to a new directory.
Running programs continue to use the old version.
When a new program is started, 0install notices the update and runs the solver again.
If the program is compatible with the new Python then it uses that. If not, it continues with the old one.
You can run any previous version if there is a problem.</p>
</li>
<li>
<p>With <strong>Nix</strong>: <code>nix-env -u</code> downloads the new version, unpacking it to a new directory.
It also downloads (or rebuilds) every package depending on Python, creating new directories for each of them.
It then creates a new environment with symlinks to the latest version of everything.
Running programs continue to use the old version.
Starting a new program will use the new version.
You can revert the whole environment back to the previous version if there is a problem.</p>
</li>
<li>
<p>With <strong>Docker</strong>: <code>docker pull</code> downloads the new version of a single application,
downloading most or all of the application's packages, whether Python related or not.
Existing containers continue running with the old version.
New containers will default to using the new version.
You can specify which version to use when starting a program.
Other applications continue using the old version of Python until their authors update them
(you must update each application individually, rather than just updating Python itself).</p>
</li>
</ul>
<p>The main problem with NixOS is that it's quite different to other Linux systems, so there's a lot to relearn.
Also, existing knowledge about how to edit <code>fstab</code>, <code>sudoers</code>, etc, isn't so useful, as you have to provide all configuration in Nix syntax.
However, having a single (fairly sane) syntax for everything is a nice bonus, and being able to generate things using the templating language is useful.
For example, for my network setup I use a bunch of tap devices (one for each of my VMs).
It was easy to write a little Nix function (<code>mktap</code>) to generate them all from a simple list.
Here's that section of my <code>configuration.nix</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
</pre></td><td class="code"><pre><code class="nix"><span class="line">  <span class="ss">networking =</span> <span class="p">{</span>
</span><span class="line">    <span class="ss">useDHCP =</span> <span class="no">false</span><span class="p">;</span>
</span><span class="line">    <span class="ss">interfaces =</span>
</span><span class="line">      <span class="k">let</span> <span class="ss">mktap =</span> ip<span class="p">:</span> <span class="p">{</span>
</span><span class="line">          <span class="ss">virtual =</span> <span class="no">true</span><span class="p">;</span>
</span><span class="line">          <span class="ss">virtualOwner =</span> <span class="s2">&quot;tal&quot;</span><span class="p">;</span>
</span><span class="line">          ipv4<span class="o">.</span><span class="ss">addresses =</span> <span class="p">[</span>
</span><span class="line">            <span class="p">{</span> <span class="ss">address =</span> ip<span class="p">;</span> <span class="ss">prefixLength =</span> <span class="mi">31</span><span class="p">;</span> <span class="p">}</span>
</span><span class="line">          <span class="p">];</span>
</span><span class="line">        <span class="p">};</span>
</span><span class="line">      <span class="k">in</span>
</span><span class="line">      <span class="p">{</span>
</span><span class="line">        eno2<span class="o">.</span><span class="ss">useDHCP =</span> <span class="no">true</span><span class="p">;</span>
</span><span class="line">        wlo1<span class="o">.</span><span class="ss">useDHCP =</span> <span class="no">true</span><span class="p">;</span>
</span><span class="line">        <span class="ss">tapdev =</span> mktap <span class="s2">&quot;10.0.0.2&quot;</span><span class="p">;</span>
</span><span class="line">        <span class="ss">tapcom =</span> mktap <span class="s2">&quot;10.0.0.4&quot;</span><span class="p">;</span>
</span><span class="line">        <span class="ss">tapshopping =</span> mktap <span class="s2">&quot;10.0.0.6&quot;</span><span class="p">;</span>
</span><span class="line">        <span class="ss">tapbanking =</span> mktap <span class="s2">&quot;10.0.0.8&quot;</span><span class="p">;</span>
</span><span class="line">        <span class="ss">tapuntrusted =</span> mktap <span class="s2">&quot;10.0.0.10&quot;</span><span class="p">;</span>
</span><span class="line">      <span class="p">};</span>
</span><span class="line">    <span class="ss">nat =</span> <span class="p">{</span>
</span><span class="line">      <span class="ss">enable =</span> <span class="no">true</span><span class="p">;</span>
</span><span class="line">      <span class="ss">externalInterface =</span> <span class="s2">&quot;eno2&quot;</span><span class="p">;</span>
</span><span class="line">      <span class="ss">internalIPs =</span> <span class="p">[</span> <span class="s2">&quot;10.0.0.0/8&quot;</span> <span class="p">];</span>
</span><span class="line">    <span class="p">};</span>
</span><span class="line">  <span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Overall, I'm very happy with NixOS so far.</p>
<h2 id="why-use-virtual-machines">Why use virtual machines?</h2>
<p>With NixOS I had a nice host environment, but after using Qubes I wanted to run my applications in VMs.</p>
<p>The basic problem is that Linux is the only thing that knows how to drive all the hardware,
but Linux security is not ideal. There are several problems:</p>
<ol>
<li>Linux is written in C.
This makes security bugs rather common and, more importantly, means that a bug in one part of the code
can impact any other part of the code. Nothing is secure unless everything is secure.
</li>
<li>Linux has a rather large API (hundreds of syscalls).
</li>
<li>The Linux (Unix) design predates the Internet, and security has been somewhat bolted on afterwards.
</li>
</ol>
<p>For example, imagine that we want to run a program with access to the network, but not to the graphical display.
We can create a new Linux container for it using <a href="https://github.com/containers/bubblewrap">bubblewrap</a>, like this:</p>
<pre><code>$ ls -l /run/user/1000/wayland-0 /tmp/.X11-unix/X0
srwxr-xr-x 1 tal users 0 Feb 18 16:41 /run/user/1000/wayland-0
srwxr-xr-x 1 tal users 0 Feb 18 16:41 /tmp/.X11-unix/X0

$ bwrap \
    --ro-bind / / \
    --dev /dev \
    --tmpfs /home/tal \
    --tmpfs /run/user \
    --tmpfs /tmp \
    --unshare-all --share-net \
    bash

$ ls -l /run/user/1000/wayland-0 /tmp/.X11-unix/X0
ls: cannot access '/run/user/1000/wayland-0': No such file or directory
ls: cannot access '/tmp/.X11-unix/X0': No such file or directory
</code></pre>
<p>The container has an empty home directory, empty <code>/tmp</code>, and no access to the display sockets.
If we run Firefox in this environment then... it opens its window just fine!
How? <code>strace</code> shows what happened:</p>
<pre><code>connect(4, {sa_family=AF_UNIX, sun_path=&quot;/run/user/1000/wayland-0&quot;}, 27) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path=@&quot;/tmp/.X11-unix/X0&quot;}, 20) = 0
</code></pre>
<p>After failing to connect to Wayland, it then tried using X11 (via Xwayland) instead. Why did that work?
If the first byte of the socket pathname is <code>\0</code> then Linux instead interprets it as an &quot;abstract&quot; socket address,
not subject to the usual filesystem permission rules.</p>
<p>Trying to anticipate these kinds of special cases is just too much work.
Linux really wants everything on by default, and you have to find and disable every feature individually.
By contrast, virtual machines tend to have integrations with the host off by default.
They also tend to have much smaller APIs (e.g. just reading and writing disk blocks or network frames),
with the rich Unix API entirely inside the VM, provided by a separate instance of Linux.</p>
<h2 id="spectrumos">SpectrumOS</h2>
<p>I was able to set up a qemu guest and restore my <code>dev</code> Qubes VM in that, but it didn't integrate nicely with the rest of the desktop.
Installing ssh allowed me to connect in with <code>ssh -Y dev</code>, allowing apps in the VM to open an X connection to Xwayland on the host.
That was somewhat usable, but still a bit slower than Qubes had been (which was already a bit too slow).</p>
<p>Searching for a way to forward the Wayland connection directly, I came across the <a href="https://spectrum-os.org/">SpectrumOS</a> project.
SpectrumOS aims to use one virtual machine per application, using shared directories so that VM files are stored on the host,
simplifying management.
It uses <a href="https://chromium.googlesource.com/chromiumos/platform/crosvm/">crosvm</a> from the ChromiumOS project instead of qemu, because it has a driver that allows forwarding Wayland connections
(and also because it's written in Rust rather than C).
The project's single developer is currently taking a break from the project, and says &quot;I'm currently working towards a proof of concept&quot;.</p>
<p>However, there is some useful stuff in the <a href="https://spectrum-os.org/git/nixpkgs/">SpectrumOS repository</a> (which is a fork of nixpkgs).
In particular, it contains:</p>
<ul>
<li>A version of Linux with the <code>virtwl</code> kernel module, which connects to crosvm's Wayland driver.
</li>
<li>A package for <a href="https://chromium.googlesource.com/chromiumos/platform2/+/refs/heads/main/vm_tools/sommelier/">sommelier</a>, which connects applications to <code>virtwl</code>.
</li>
<li>A Nix expression to build a root filesystem for the VM.
</li>
</ul>
<p>Building that, I was able to run the project's demo, which runs the Wayfire compositor inside the VM, appearing in a window on the host.
Dragging the nested window around, the pixels flowed smoothly across my screen in exactly the way that pixels on QubesOS don't.</p>
<p>This was encouraging, but I didn't want to run a nested window manager.
I tried running Firefox directly (without Wayfire),
but it complained that sommelier didn't provide a new enough version of something, and
running weston-terminal immediately segfaulted sommelier.</p>
<p>Why do we need the sommelier process anyway?
The problem is that, while <code>virtwl</code> mostly proxies Wayland messages directly, it can't send arbitrary FDs to the host.
For example, if you want to forward a writable stream from an application to <code>virtwl</code>
you must first create a pipe from the host using a special <code>virtwl</code> ioctl,
then read from that and copy the data to the application's regular Linux pipe.</p>
<p>With <a href="https://spectrum-os.org/lists/hyperkitty/list/discuss@spectrum-os.org/thread/VP3KJV3JYWSLJTUKDT3MAKIABZGDCSPN/">help from the mailing list</a>, I managed to get it somewhat usable:</p>
<ul>
<li>I enabled <code>VIRTIO_FS</code>, allowing me to mount a host directory into the VM (for sharing files).
</li>
<li>I created some tap devices (as mentioned above) to get guest networking going.
</li>
<li>Adding ext4 to the kernel image allowed me to mount the VM's LVM partition.
</li>
<li>Setting <code>FONTCONFIG_FILE</code> got some usable fonts (otherwise, there was no monospace font for the terminal).
</li>
<li>I hacked sommelier to claim it supported the latest protocols, which got Firefox running.
</li>
<li>Configuring sommelier for Xwayland let X applications run.
</li>
<li>I replaced the non-interactive <code>bash</code> shell with <code>fish</code> so I could edit commands.
</li>
<li>I ran <code>(while true; do socat vsock-listen:5000 exec:dash; done)</code> at the end of the VM's boot script.
Then I could start e.g. the VM's Firefox with <code>echo 'firefox&amp;' | socat stdin vsock-connect:7:5000</code>
on the host, allowing me to add launchers for guest applications.
</li>
</ul>
<p>Making changes to the root filesystem was fairly easy once I'd read the Nix manuals.
To add an application (e.g. <code>libreoffice</code>), you import it at the start of <a href="https://spectrum-os.org/git/nixpkgs/tree/pkgs/os-specific/linux/spectrum/rootfs/default.nix">rootfs/default.nix</a> and add it to the <code>path</code> variable.
The Nix expression gets the transitive dependencies of <code>path</code> from the Nix store and packs them into a squashfs image.</p>
<p>True, my squashfs image is getting a bit big.
Maybe I should instead make a minimal squashfs boot image, plus a shared directory of hard links to the required files.
That would allow sharing the data with the host.
I could also just share the whole <code>/nix/store</code> directory, if I wanted to make all host software available to guests.</p>
<p>I made another Nix script to add various VM boot commands to my host environment.
For example, running <code>qvm-start-shopping</code> boots my shopping VM using crosvm,
with the appropriate LVM data partition, network settings, and shared host directory.</p>
<p>I think, ideally, this would be a systemd socket-activated user service rather than a shell script.
Then attempting to run Firefox by sending a command to the VM socket would cause systemd to boot the VM
(if not already running).
For now, I boot each VM manually in a terminal and then press Win-Shift-2 to banish it to workspace 2,
with all the other VM root consoles.</p>
<p>The <code>virtwl</code> Wayland forwarding feels pretty fast (much faster than Qubes' X graphics).</p>
<h2 id="wayland">Wayland</h2>
<p>I now had a mostly functional Qubes-like environment, running most of my applications in VMs,
with their windows appearing on the host desktop like any other application.
However, I also had some problems:</p>
<ul>
<li>A stated goal of Wayland is &quot;every frame is perfect&quot;. However, applications generally seemed to open at the wrong size and then jump to their correct size, which was a bit jarring.
</li>
<li>Vim opened its window with the scrollbar at the far left of the window, making the text invisible until you resized the window.
</li>
<li>Wayland is supposed to have better support for high-DPI displays.
However, this doesn't work with Xwayland, which turns everything blurry,
and the <a href="https://news.ycombinator.com/item?id=19360176">recommended work-around</a> is to use a scale-factor of 1
and configure each application to use bigger fonts.
This is easy enough with X applications (e.g. set <code>ft.dpi: 150</code> with <code>xrdb</code>), but Wayland apps must be configured individually.
</li>
<li>Wayland doesn't have cursor themes and you have to configure every application individually to use a larger cursor too.
</li>
<li>Copying text didn't seem to work reliably. Sometimes there would be a long delay, after which the text might or might not appear. More often, it would just paste something completely different and unexpected. Even when it did paste the right text, it would often have ^M characters inserted into it.
</li>
</ul>
<p>I decided it was time to learn more about Wayland.
I discovered <a href="https://wayland-book.com/">wayland-book.com</a>, which does a good job of introducing it
(though the book is only half finished at the moment).</p>
<h3 id="protocol">Protocol</h3>
<p>One very nice feature of Wayland is that you can run any Wayland application with <code>WAYLAND_DEBUG=1</code>
and it will display a fairly readable trace of all the Wayland messages it sends and receives.
Let's look at a simple application that just connects to the server (compositor) and opens a window:</p>
<pre><code>$ WAYLAND_DEBUG=1 test.exe
-&gt; wl_display@1.get_registry registry:+2
-&gt; wl_display@1.sync callback:+3
</code></pre>
<p>The client connects to the server's socket at <code>/run/user/1000/wayland-0</code> and sends two messages
to object 1 (of type <code>wl_display</code>), which is the only object available in a new connection.
The <code>get_registry</code> request asks the server to add the registry to the conversation and call it object 2.
The <code>sync</code> request just asks the server to confirm that it has received the requests sent so far, using a new callback object (with ID 3).</p>
<p>Both clients and servers can add objects to the conversation.
To avoid numbering conflicts, clients assign low numbers and servers pick high ones.</p>
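<p>As a rough sketch of that split: in the reference libwayland implementation, server-allocated IDs start at <code>0xff000000</code> and ID 0 is reserved for &quot;null&quot;. The function name below is made up for illustration, and the code assumes 64-bit OCaml integers.</p>
<pre><code class="ocaml">(* Object ID ranges: 0 is the null object, clients allocate from 1 upwards
   and servers allocate from 0xff000000 upwards.
   Assumes a 64-bit OCaml, where this literal fits in an int. *)
let server_id_start = 0xff000000

type allocator = Null | Client | Server

let allocated_by id =
  if id = 0 then Null
  else if id &lt; server_id_start then Client
  else Server

let () =
  assert (allocated_by 3 = Client);            (* e.g. the sync callback *)
  assert (allocated_by 0xff000000 = Server)
</code></pre>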
<p>On the wire, each message gives the object ID, the operation ID, the length in bytes, and then the arguments.
Objects are thought of as being at the server, so the client sends request messages <em>to</em> objects,
while the server emits event messages <em>from</em> objects.
At the wire level there's no difference though.</p>
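<p>For the curious, the header is two 32-bit words in the host's byte order: the object ID, then a word holding the message size in its upper 16 bits and the operation ID in its lower 16 bits. Here is a minimal sketch of encoding and decoding it, assuming a little-endian host and 64-bit OCaml integers (the function names are mine, not from any library):</p>
<pre><code class="ocaml">(* Wayland message header: two 32-bit words.
   Word 0: object ID.
   Word 1: total message size in bytes (upper 16 bits), opcode (lower 16 bits).
   This sketch assumes a little-endian host. *)
let header_size = 8

let encode_header ~obj ~opcode ~args_len =
  let size = header_size + args_len in
  let b = Bytes.create header_size in
  Bytes.set_int32_le b 0 (Int32.of_int obj);
  Bytes.set_int32_le b 4 (Int32.of_int ((size lsl 16) lor opcode));
  b

(* Returns (object ID, opcode, size). The masks undo Int32 sign extension. *)
let decode_header b =
  let obj = Int32.to_int (Bytes.get_int32_le b 0) land 0xffffffff in
  let word = Int32.to_int (Bytes.get_int32_le b 4) land 0xffffffff in
  (obj, word land 0xffff, word lsr 16)

let () =
  (* wl_display@1.get_registry: opcode 1, with one 4-byte new-ID argument. *)
  let h = encode_header ~obj:1 ~opcode:1 ~args_len:4 in
  let (obj, opcode, size) = decode_header h in
  Printf.printf &quot;object=%d opcode=%d size=%d\n&quot; obj opcode size
  (* prints: object=1 opcode=1 size=12 *)
</code></pre>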
<p>When the server gets the <code>get_registry</code> request it adds the registry,
which immediately emits one event for each available service, giving the maximum supported version.
The client receives these messages, followed by the callback notification from the <code>sync</code> message:</p>
<pre><code>&lt;- wl_registry@2.global name:0 interface:&quot;wl_compositor&quot; version:4
&lt;- wl_registry@2.global name:1 interface:&quot;wl_subcompositor&quot; version:1
&lt;- wl_registry@2.global name:2 interface:&quot;wl_shm&quot; version:1
&lt;- wl_registry@2.global name:3 interface:&quot;xdg_wm_base&quot; version:1
&lt;- wl_registry@2.global name:4 interface:&quot;wl_output&quot; version:2
&lt;- wl_registry@2.global name:5 interface:&quot;wl_data_device_manager&quot; version:3
&lt;- wl_registry@2.global name:6 interface:&quot;zxdg_output_manager_v1&quot; version:3
&lt;- wl_registry@2.global name:7 interface:&quot;gtk_primary_selection_device_manager&quot; version:1
&lt;- wl_registry@2.global name:8 interface:&quot;wl_seat&quot; version:5
&lt;- wl_callback@3.done callback_data:1129040
</code></pre>
<p>The callback tells the client it has seen all the available services, and so it now picks the ones it wants.
It has to choose a version no higher than the one offered by the server.
Protocols starting with <code>wl_</code> are from the core Wayland protocol; the others are extensions.
The leading <code>z</code> in <code>zxdg_output_manager_v1</code> indicates that the protocol is &quot;unstable&quot; (under development).</p>
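<p>The version rule amounts to taking the minimum of what each side supports. A small sketch, using three of the globals from the trace above and a made-up table of what a hypothetical client implements:</p>
<pre><code class="ocaml">(* Some globals advertised by the server: (name, interface, version). *)
let globals = [
  0, &quot;wl_compositor&quot;, 4;
  2, &quot;wl_shm&quot;, 1;
  3, &quot;xdg_wm_base&quot;, 1;
]

(* Highest version of each interface that this (hypothetical) client implements. *)
let client_supports = [ &quot;wl_compositor&quot;, 4; &quot;wl_shm&quot;, 1; &quot;xdg_wm_base&quot;, 5 ]

(* Bind each wanted interface at the highest version both sides understand. *)
let choose_bindings () =
  List.filter_map (fun (name, iface, server_version) -&gt;
      match List.assoc_opt iface client_supports with
      | None -&gt; None   (* the client doesn't use this interface *)
      | Some client_version -&gt; Some (name, iface, min client_version server_version)
    ) globals

let () =
  List.iter (fun (name, iface, v) -&gt; Printf.printf &quot;bind name:%d %s:v%d\n&quot; name iface v)
    (choose_bindings ())
</code></pre>
<p>Here the client could speak version 5 of <code>xdg_wm_base</code>, but binds version 1 because that is all the server offers.</p>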
<p>The protocols are defined in various XML files, which are scattered over the web.
The core protocol is defined in <a href="https://github.com/wayland-project/wayland/blob/master/protocol/wayland.xml">wayland.xml</a>.
These XML files can be used to generate typed bindings for your programming language of choice.</p>
<p>Here, the application picks <code>wl_compositor</code> (for managing drawing surfaces), <code>wl_shm</code> (for sharing memory with the server),
and <code>xdg_wm_base</code> (for desktop windows).</p>
<pre><code>-&gt; wl_registry@2.bind name:0 id:+4(wl_compositor:v4)
-&gt; wl_registry@2.bind name:2 id:+5(wl_shm:v1)
-&gt; wl_registry@2.bind name:3 id:+6(xdg_wm_base:v1)
</code></pre>
<p>The bind message is unusual in that the client gives the interface and version of the object it is creating.
For other messages, both sides know the type from the schema, and the version is always the same as the parent object.
Because the client chose the new IDs, it doesn't need to wait for the server;
it continues by using the new objects to create a top-level window:</p>
<pre><code>-&gt; wl_compositor@4.create_surface id:+7
-&gt; xdg_wm_base@6.get_xdg_surface id:+8 surface:7
-&gt; xdg_surface@8.get_toplevel id:+9
-&gt; xdg_toplevel@9.set_title title:&quot;example app&quot;
-&gt; wl_surface@7.commit 
</code></pre>
<p>This API is pretty strange.
The core Wayland protocol says how to make generic drawing surfaces, but not how to make windows,
so the application is using the <code>xdg_wm_base</code> extension to do that.
Logically, there's only one object here (a toplevel window),
but it ends up making three separate Wayland objects representing the different aspects of it.</p>
<p>The <code>commit</code> tells the server that the client has finished setting up the window and the server should
now do something with it.</p>
<p>The above was all in response to the callback firing.
The client now processes the last message in that batch, which is the server destroying the callback:</p>
<pre><code>&lt;- wl_display@1.delete_id id:3
</code></pre>
<p>Object destruction is a bit strange in Wayland.
Normally, clients ask for things to be destroyed (by sending a &quot;destructor&quot; message)
and the server confirms by sending <code>delete_id</code> from object 1.
But this isn't symmetrical: there is no standard way for a client to confirm deletion when the server calls
a destructor (such as the callback's <code>done</code>), so these have to be handled on a case-by-case basis.
Since callbacks don't accept any messages, there is no need for the client to confirm that it got the <code>done</code>
message and the server just sends a delete message immediately.</p>
<p>The client now waits for the server to respond to all the messages it sent about the new window,
and gets a bunch of replies:</p>
<pre><code>&lt;- wl_shm@5.format format:0
&lt;- wl_shm@5.format format:1
&lt;- wl_shm@5.format format:875709016
&lt;- wl_shm@5.format format:875708993
&lt;- xdg_wm_base@6.ping serial:1129043
-&gt; xdg_wm_base@6.pong serial:1129043
&lt;- xdg_toplevel@9.configure width:0 height:0 states:&quot;&quot;
&lt;- xdg_surface@8.configure serial:1129042
-&gt; xdg_surface@8.ack_configure serial:1129042
</code></pre>
<p>It gets some messages telling it what pixel formats are supported, a ping message (which the server sends from time to time to check the client is still alive),
and a configure message giving the size for the new window.
Oddly, Sway has set the size to 0x0, which means the client should choose whatever size it likes.</p>
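<p>The large format numbers look odd, but apart from the two special cases 0 (ARGB8888) and 1 (XRGB8888) they are four-character codes packed into a little-endian integer. A quick sketch to decode them (<code>format_name</code> is just an illustrative helper):</p>
<pre><code class="ocaml">(* Decode a wl_shm format code. 0 and 1 are special cases; other values
   are four ASCII characters packed into a little-endian 32-bit integer. *)
let format_name = function
  | 0 -&gt; &quot;ARGB8888&quot;
  | 1 -&gt; &quot;XRGB8888&quot;
  | n -&gt; String.init 4 (fun i -&gt; Char.chr ((n lsr (8 * i)) land 0xff))

let () =
  List.iter (fun f -&gt; Printf.printf &quot;%d = %s\n&quot; f (format_name f))
    [0; 1; 875709016; 875708993]
  (* The last two print as XB24 and AB24, i.e. XBGR8888 and ABGR8888. *)
</code></pre>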
<p>The client picks a suitable default size, allocates some shared memory (by opening a tmpfs file and immediately unlinking it),
shares the file descriptor with the server (<code>create_pool</code>), and then carves out a portion of the memory to use as a buffer for the pixel data:</p>
<pre><code>-&gt; wl_shm@5.create_pool id:+3 fd:(fd) size:1228800
-&gt; wl_shm_pool@3.create_buffer id:+10 offset:0 width:640 height:480 stride:2560 format:1
-&gt; wl_shm_pool@3.destroy 
</code></pre>
<p>In this case it used the whole memory region. It could also have allocated two buffers for double-buffering.
The client then draws whatever it wants into the buffer (mapping the file into its memory and writing to it directly),
attaches the buffer to the window's surface, marks the whole area as &quot;damaged&quot; (in need of being redrawn) and calls <code>commit</code>,
telling the server the surface is ready for display:</p>
<pre><code>-&gt; wl_surface@7.attach buffer:10 x:0 y:0
-&gt; wl_surface@7.damage x:0 y:0 width:2147483647 height:2147483647
-&gt; wl_surface@7.commit 
</code></pre>
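<p>The numbers in the <code>create_pool</code> and <code>create_buffer</code> requests follow from the window size. Format 1 (XRGB8888) uses 4 bytes per pixel, so a quick check of the arithmetic looks like this:</p>
<pre><code class="ocaml">let bytes_per_pixel = 4   (* XRGB8888 (format 1): 32 bits per pixel *)

(* Bytes per row, and bytes for the whole buffer. *)
let stride ~width = width * bytes_per_pixel
let buffer_size ~width ~height = (stride ~width) * height

let () =
  Printf.printf &quot;stride=%d size=%d\n&quot;
    (stride ~width:640) (buffer_size ~width:640 ~height:480);
  (* prints: stride=2560 size=1228800 *)
  (* Double-buffering would need a pool twice as large, with the second
     buffer starting at offset 1228800. *)
  Printf.printf &quot;double-buffered pool=%d\n&quot; (2 * buffer_size ~width:640 ~height:480)
  (* prints: double-buffered pool=2457600 *)
</code></pre>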
<p>At this point the window appears on the screen!
The server lets the client know it has finished with the buffer and the client destroys it:</p>
<pre><code>&lt;- wl_display@1.delete_id id:3
&lt;- wl_buffer@10.release 
-&gt; wl_buffer@10.destroy 
</code></pre>
<p>Although the window is visible, the content is the wrong size.
Sway now suddenly remembers that it's a tiling window manager.
It sends another <code>configure</code> event with the correct size, causing the client to allocate a fresh memory pool of the correct size,
allocate a fresh buffer from it, redraw everything at the new size, and tell the server to draw it.</p>
<pre><code>&lt;- xdg_toplevel@9.configure width:1534 height:1029 states:&quot;&quot;
...
</code></pre>
<p>This process of telling the client to pick a size and then overruling it explains why Firefox draws itself incorrectly at first and then flickers into position a moment later. It probably also explains why Vim tries to open a 0x0 window.</p>
<h3 id="copying-text">Copying text</h3>
<p>A bit of searching revealed that the <code>^M</code> problem is a known <a href="https://github.com/swaywm/wlroots/issues/1839">Sway bug</a>.</p>
<p>However, the main reason copying text wasn't working turned out to be a limitation in the design of the core <code>wl_data_device_manager</code> protocol.
The normal way to copy text on X11 is to select the text you want to copy,
then click the middle mouse button where you want it (or press Shift-Insert).</p>
<p>X also supports a clipboard mechanism, where you select text, then press Ctrl-C, then click at the destination, then press Ctrl-V.
The original Wayland protocol only supports the clipboard system, not the selection, and so Wayland compositors have added selection support through extensions.
Sommelier didn't proxy these extensions, leading to failure when copying in or out of VMs.</p>
<p>I also found that the reason weston-terminal wouldn't start was because I didn't have anything in my clipboard,
and sommelier was trying to dereference a null pointer.</p>
<p>One problem with the Wayland protocol is that it's very hard to proxy.
Although the wire protocol gives the length in bytes of each message, it doesn't say how many file descriptors it has.
This means that you can't just pass through messages you don't understand, because you don't know which FDs go with which message.
Also, the wire protocol doesn't give types for FDs (nor does the schema),
which is a problem for anything that needs to proxy across a VM boundary or over a network.</p>
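<p>To illustrate why a proxy needs the schema, here is a sketch of working out how many FDs belong to a message. The <code>schema</code> table is a tiny hand-written extract for illustration; a real proxy would generate it from the protocol XML files.</p>
<pre><code class="ocaml">(* Argument types that can appear in a Wayland message signature. *)
type arg = Int | Uint | Fixed | String | Object | New_id | Array | Fd

(* (interface, opcode) -&gt; argument types. Hand-written extract only. *)
let schema = [
  (&quot;wl_shm&quot;, 0), [New_id; Fd; Int];         (* create_pool *)
  (&quot;wl_surface&quot;, 1), [Object; Int; Int];    (* attach *)
]

(* Number of FDs to take from the socket's ancillary data for this message,
   or None if the message isn't in the schema, in which case we can't tell
   which of the queued FDs (if any) belong to it. *)
let fds_in_message ~interface ~opcode =
  List.assoc_opt (interface, opcode) schema
  |&gt; Option.map (fun args -&gt; List.length (List.filter (fun a -&gt; a = Fd) args))

let () =
  assert (fds_in_message ~interface:&quot;wl_shm&quot; ~opcode:0 = Some 1);
  assert (fds_in_message ~interface:&quot;wl_surface&quot; ~opcode:1 = Some 0);
  assert (fds_in_message ~interface:&quot;some_unknown_extension&quot; ~opcode:0 = None)
</code></pre>
<p>The <code>None</code> case is the problem: the header's length field says how many bytes to skip, but nothing says how many FDs to skip.</p>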
<p>This all meant that VMs could only use protocols explicitly supported by sommelier, and sommelier limited the version too.
So supporting extra extensions or new versions meant writing (and debugging) loads of C++ code.</p>
<p>I didn't have time to write and debug C++ code for every missing Wayland protocol, so I took a short-cut:
I wrote my own Wayland library, <a href="https://github.com/talex5/ocaml-wayland">ocaml-wayland</a>, and then used that to write my own version of sommelier.
With that, adding support for copying text was fairly easy.</p>
<p>For each Wayland interface we need to handle each incoming message from the client and forward it to the host,
and also forward each message from the host to the client.
Here's <a href="https://github.com/talex5/wayland-virtwl-proxy/blob/29333ac7e6071a1c08ece77b513f4b0ee3ee8f8e/relay.ml#L587">the code</a> to handle the &quot;selection&quot; event in OCaml,
which we receive from the host and send to the client (<code>c</code>):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">method</span> <span class="n">on_selection</span> <span class="o">_</span> <span class="n">offer</span> <span class="o">=</span> <span class="nn">C</span><span class="p">.</span><span class="nn">Wl_data_device</span><span class="p">.</span><span class="n">selection</span> <span class="n">c</span> <span class="o">(</span><span class="nn">Option</span><span class="p">.</span><span class="n">map</span> <span class="n">to_client</span> <span class="n">offer</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The host passes us an &quot;offer&quot; argument, which is a previously-created host offer object.
We look up the corresponding client object with <code>to_client</code> and pass that as the argument
to the client.</p>
<p>For comparison, here's <a href="https://chromium.googlesource.com/chromiumos/platform2/+/7ea49bbabed436e608a0b8974ec90366a787d841/vm_tools/sommelier/sommelier-data-device-manager.cc#492">sommelier's equivalent</a> to this line of code, in C++:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">static</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">sl_data_device_selection</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="w"> </span><span class="n">data</span><span class="p">,</span>
</span><span class="line"><span class="w">                                     </span><span class="k">struct</span><span class="w"> </span><span class="nc">wl_data_device</span><span class="o">*</span><span class="w"> </span><span class="n">data_device</span><span class="p">,</span>
</span><span class="line"><span class="w">                                     </span><span class="k">struct</span><span class="w"> </span><span class="nc">wl_data_offer</span><span class="o">*</span><span class="w"> </span><span class="n">data_offer</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="k">struct</span><span class="w"> </span><span class="nc">sl_host_data_device</span><span class="o">*</span><span class="w"> </span><span class="n">host</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">static_cast</span><span class="o">&lt;</span><span class="n">sl_host_data_device</span><span class="o">*&gt;</span><span class="p">(</span>
</span><span class="line"><span class="w">      </span><span class="n">wl_data_device_get_user_data</span><span class="p">(</span><span class="n">data_device</span><span class="p">));</span>
</span><span class="line"><span class="w">  </span><span class="k">struct</span><span class="w"> </span><span class="nc">sl_host_data_offer</span><span class="o">*</span><span class="w"> </span><span class="n">host_data_offer</span><span class="w"> </span><span class="o">=</span>
</span><span class="line"><span class="w">      </span><span class="n">static_cast</span><span class="o">&lt;</span><span class="n">sl_host_data_offer</span><span class="o">*&gt;</span><span class="p">(</span><span class="n">wl_data_offer_get_user_data</span><span class="p">(</span><span class="n">data_offer</span><span class="p">));</span>
</span><span class="line">
</span><span class="line"><span class="w">  </span><span class="n">wl_data_device_send_selection</span><span class="p">(</span><span class="n">host</span><span class="o">-&gt;</span><span class="n">resource</span><span class="p">,</span><span class="w"> </span><span class="n">host_data_offer</span><span class="o">-&gt;</span><span class="n">resource</span><span class="p">);</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I think this is a great demonstration of the difference between &quot;type safety&quot; and &quot;type ceremony&quot;.
The C++ code is covered in types, making the code very hard to read, yet it crashes at runtime because it
fails to consider that <code>data_offer</code> can be <code>NULL</code>.</p>
<p>By contrast, the OCaml version has no type annotations, but the compiler would reject the code if I forgot to handle this (with <code>Option.map</code>).</p>
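<p>Here is a self-contained sketch of the same idea, with hypothetical stand-in types (<code>host_offer</code>, <code>client_offer</code>) in place of the real proxy objects. The type annotations are only there for the reader; the compiler infers them.</p>
<pre><code class="ocaml">(* Hypothetical stand-ins for the proxy's host-side and client-side offers. *)
type host_offer = Host_offer of int
type client_offer = Client_offer of int

let to_client (Host_offer id) = Client_offer id

(* The argument is nullable in the schema, so the handler gets an option. *)
let forward_selection (offer : host_offer option) : client_offer option =
  Option.map to_client offer
  (* Writing [Some (to_client offer)] instead would be a compile-time type
     error: [offer] is a [host_offer option], not a [host_offer]. *)

let () =
  assert (forward_selection None = None);
  assert (forward_selection (Some (Host_offer 7)) = Some (Client_offer 7))
</code></pre>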
<h3 id="security">Security</h3>
<p>According to <a href="https://wiki.gnome.org/Initiatives/Wayland/PrimarySelection">the GNOME wiki</a>, the original justification for not supporting selection copies was
&quot;security concerns with unexpected data stealing if the mere act of selecting a text fragment makes it available to all running applications&quot;.
The implication is that it's OK for applications to steal data from the clipboard instead,
and that you should therefore never put anything confidential on the clipboard.</p>
<p>This seemed a bit odd, so I read the <a href="https://wayland.freedesktop.org/docs/html/ch04.html#sect-Protocol-Security-and-Authentication">security section</a> of the Wayland specification to learn more about its security model.
That section of the specification is fairly short, so I'll reproduce it here in full:</p>
<blockquote>
<p><strong>Security and Authentication</strong></p>
<ul>
<li>mostly about access to underlying buffers, need new drm auth mechanism (the grant-to ioctl idea), need to check the cmd stream?
</li>
<li>getting the server socket depends on the compositor type, could be a system wide name, through fd passing on the session dbus. or the client is forked by the compositor and the fd is already opened.
</li>
</ul>
</blockquote>
<p>It looks like implementations have to figure things out for themselves.</p>
<p>The main advantage of Wayland over X11 here is that Wayland mostly isolates applications from each other.
In X11 applications collaborate together to manage a tree of windows, and any application can access any window.
In the Wayland protocol, each application's connection only includes that application's objects.
Applications only get events relevant to their own windows
(for example, you only get pointer motion events while the pointer is over your window).
Communication between applications (e.g. copy-and-paste or drag-and-drop) is all handled through the compositor.</p>
<p>Also, to request the contents of the clipboard you need to quote the serial number of the mouse click or key press that triggered it.
If it's too far in the past, the compositor can ignore the request.</p>
<p>I've also heard people say that security is the reason you can't take screenshots with Wayland.
However, Sway lets you take screenshots, and this worked even from inside a VM through virtwl.
I didn't add screenshot support to the proxy, because I don't want VMs to be able to take screenshots,
but the proxy isn't a security tool (it runs inside the VM, which isn't trusted).</p>
<p>Clearly, the way to fix this was with a new compositor.
One that would offer a different Wayland socket to each VM, tag the windows with the VM name, colour the frames,
confirm copies across VM boundaries, and work with Vim.
Luckily, I already had a handy pure-OCaml Wayland protocol library available.
Unluckily, at this point I ran out of holiday.</p>
<h2 id="future-work">Future work</h2>
<p>There are quite a few things left to do here:</p>
<ul>
<li>
<p>One problem with <code>virtwl</code> is that, while we can receive shared memory FDs <em>from</em> the host, we can't export guest memory <em>to</em> the host.
This is unfortunate, because in Wayland the shared memory for window contents is allocated by the application from guest memory,
and the proxy therefore has to copy each frame. If the host provided the memory to the guest, this wouldn't be needed.
There is a <code>wl_drm</code> protocol for allocating video memory, which might help here, but I don't know how that works and,
like many Wayland specifications, it seems to be in the process of being replaced by something else.
Also, if we're going to copy the memory, we should at least only copy the damaged region, not the whole thing.
I only got this code working just far enough to run the Wayland applications I use (mainly Firefox and Evince).</p>
</li>
<li>
<p>I'm still using ssh to proxy X11 connections (mainly for Vim and gitk).
I'd prefer to run Xwayland in the VM, but it seems you need to provide a bit of <a href="https://wayland.freedesktop.org/docs/html/ch05.html">extra support</a> for that,
which I haven't implemented yet.
Sommelier can do this, but then copying doesn't work.</p>
</li>
<li>
<p>The host Wayland compositor needs to be aware of VMs, so it can colour the titles appropriately and
limit access to privileged operations.</p>
</li>
<li>
<p>For the full Qubes experience, the network card should be handled by a VM, with another VM managing the firewall.
Perhaps the <a href="https://github.com/mirage/qubes-mirage-firewall/">Mirage unikernel firewall</a> could be made to work on KVM too.
I'm not sure how guest-to-guest communication works with KVM.</p>
</li>
</ul>
<p>However, because the host NixOS environment is a fully-working Linux system,
I can always trade off some security to get things working
(e.g. by doing video conferencing directly on the host).</p>
<p>I hope the SpectrumOS project will resume at some point,
or that Qubes will find a solution to its hardware compatibility and performance problems.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">CI/CD pipelines: Monad, Arrow or Dart?</title>
    <link href="https://roscidus.com/blog/blog/2019/11/14/cicd-pipelines/"></link>
    <updated>2019-11-14T09:59:40+00:00</updated>
    <id>https://roscidus.com/blog/blog/2019/11/14/cicd-pipelines</id>
    <content type="html"><![CDATA[<p>In this post I describe three approaches to building a language for writing CI/CD pipelines. My first attempt used a <i>monad</i>, but this prevented static analysis of the pipelines. I then tried using an <i>arrow</i>, but found the syntax very difficult to use. Finally, I ended up using a light-weight alternative to arrows that I will refer to here as a <i>dart</i> (I don't know if this has a name already). This allows for static analysis like an arrow, but has a syntax even simpler than a monad.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
</li>
<li><a href="#attempt-one-a-monad">Attempt one: a monad</a>
</li>
<li><a href="#attempt-two-an-arrow">Attempt two: an arrow</a>
</li>
<li><a href="#attempt-three-a-dart">Attempt three: a dart</a>
</li>
<li><a href="#comparison-with-arrows">Comparison with arrows</a>
</li>
<li><a href="#larger-examples">Larger examples</a>
<ul>
<li><a href="#ocaml-docker-base-image-builder">OCaml Docker base image builder</a>
</li>
<li><a href="#ocaml-ci">OCaml CI</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>( this post also appeared on <a href="https://www.reddit.com/r/ocaml/comments/dwpxdj/cicd_pipelines_monad_arrow_or_dart/">Reddit</a>
and <a href="https://lobste.rs/s/u5i2t0/ci_cd_pipelines_monad_arrow_dart">Lobsters</a> )</p>
<h2 id="introduction">Introduction</h2>
<p>I was asked to build a system for creating CI/CD pipelines.
The initial use for it was to build a CI for testing OCaml projects on GitHub (testing each commit against multiple versions of the OCaml compiler and on multiple operating systems).
Here's a simple pipeline that gets the Git commit at the head of a branch, builds it,
and then runs the tests:</p>
<p><img src="/blog/images/cicd/example1.svg" class="center"/></p>
<p>The colour-scheme here is that green boxes are completed, orange ones are in progress and grey means the step can't be started yet.</p>
<p>Here's a slightly more complex example, which also downloads a Docker base image, builds the commit in parallel using two different versions of the OCaml compiler, and then tests the resulting images. Here the red box indicates that this step failed:</p>
<p><img src="/blog/images/cicd/example2.svg" class="center"/></p>
<p>A more complex example is testing the project itself and then searching for other projects that depend on it and testing those against the new version too:</p>
<p><img src="/blog/images/cicd/example3.svg" class="center"/></p>
<p>Here, the circle means that we should wait for the tests to pass before checking the reverse dependencies.</p>
<p>We could describe these pipelines using YAML or similar, but that would be very limiting.
Instead, I decided to use an Embedded Domain Specific Language, so that we can use the host
language's features for free (e.g. string manipulation, variables, functions, imports,
type-checking, etc).</p>
<p>The most obvious approach is making each box a regular function.
Then the first example above could be (here, using OCaml syntax):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example1</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="n">test</span> <span class="n">image</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The second could be:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example2</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">base</span> <span class="o">=</span> <span class="n">docker_pull</span> <span class="s2">&quot;ocaml/opam2&quot;</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">build</span> <span class="n">ocaml_version</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">dockerfile</span> <span class="o">=</span> <span class="n">make_dockerfile</span> <span class="o">~</span><span class="n">base</span> <span class="o">~</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="o">~</span><span class="n">dockerfile</span> <span class="n">src</span> <span class="o">~</span><span class="n">label</span><span class="o">:</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="n">test</span> <span class="n">image</span>
</span><span class="line">  <span class="k">in</span>
</span><span class="line">  <span class="n">build</span> <span class="s2">&quot;4.07&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="n">build</span> <span class="s2">&quot;4.08&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>And the third might look something like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example3</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="n">test</span> <span class="n">image</span><span class="o">;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">revdeps</span> <span class="o">=</span> <span class="n">get_revdeps</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">List</span><span class="p">.</span><span class="n">iter</span> <span class="n">example1</span> <span class="n">revdeps</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>However, we'd like to add some extras to the language:</p>
<ul>
<li>Pipeline steps should run in parallel when possible.
The <code>example2</code> function above would do the builds one at a time.
</li>
<li>Pipeline steps should be recalculated whenever their input changes.
For example, when a new commit is made we need to rebuild.
</li>
<li>The user should be able to view the progress of each step.
</li>
<li>The user should be able to trigger a rebuild for any step.
</li>
<li>We should be able to generate the diagrams automatically from the code,
so we can see what the pipeline will do before running it.
</li>
<li>The failure of one step shouldn't stop the whole pipeline.
</li>
</ul>
<p>The exact extras don't matter too much to this blog post,
so for simplicity I'll focus on just running steps concurrently.</p>
<h2 id="attempt-one-a-monad">Attempt one: a monad</h2>
<p>Without the extra features, we have functions like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">fetch</span> <span class="o">:</span> <span class="n">commit</span> <span class="o">-&gt;</span> <span class="n">source</span>
</span><span class="line"><span class="k">val</span> <span class="n">build</span> <span class="o">:</span> <span class="n">source</span> <span class="o">-&gt;</span> <span class="n">image</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You can read this as &quot;<code>build</code> is a function that takes a <code>source</code> value and returns a (Docker) <code>image</code>&quot;.</p>
<p>These functions compose together easily to make a larger function that will fetch a
commit and build it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">fab</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">c</span> <span class="k">in</span>
</span><span class="line">  <span class="n">build</span> <span class="n">src</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><img src="/blog/images/cicd/fetch_and_build.svg" class="center"/></p>
<p>We could also shorten this to <code>build (fetch c)</code> or to <code>fetch c |&gt; build</code>.
The <code>|&gt;</code> (pipe) operator in OCaml just calls the function on its right with the argument on its left.</p>
<p>To extend these functions to be concurrent, we can make them return promises, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">fetch</span> <span class="o">:</span> <span class="n">commit</span> <span class="o">-&gt;</span> <span class="n">source</span> <span class="n">promise</span>
</span><span class="line"><span class="k">val</span> <span class="n">build</span> <span class="o">:</span> <span class="n">source</span> <span class="o">-&gt;</span> <span class="n">image</span> <span class="n">promise</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>But now we can't compose them easily using <code>let</code> (or <code>|&gt;</code>), because the output type of <code>fetch</code> doesn't match the input of <code>build</code>.</p>
<p>However, we can define a similar operation, <code>let*</code> (or <code>&gt;&gt;=</code>) that works with promises. It immediately returns a promise for the final
result, and calls the body of the <code>let*</code> later, when the first promise is fulfilled. Then we have:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">fab</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span><span class="o">*</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">c</span> <span class="k">in</span>
</span><span class="line">  <span class="n">build</span> <span class="n">src</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>In other words, by sprinkling a few <code>*</code> characters around we can turn our plain old pipeline into a new concurrent one!
The rules for when you can compose promise-returning functions using <code>let*</code> are exactly the same as the rules about when
you can compose regular functions using <code>let</code>, so writing programs using promises is just as easy as writing regular programs.</p>
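<p>To make that concrete, here is a minimal sketch of how <code>let*</code> could be defined for a toy promise type (it needs OCaml 4.08 or later for the <code>let*</code> syntax).
The <code>state</code> type and the <code>make</code>, <code>fulfil</code> and <code>on_fulfilled</code> helpers are made up for this illustration (a real program would use a promise library instead),
and <code>fetch</code> and <code>build</code> are stand-ins that return already-fulfilled promises:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line">(* A toy promise: either fulfilled, or pending with callbacks to run later. *)
</span><span class="line">type &#39;a state =
</span><span class="line">  | Fulfilled of &#39;a
</span><span class="line">  | Pending of (&#39;a -&gt; unit) list
</span><span class="line">
</span><span class="line">type &#39;a promise = &#39;a state ref
</span><span class="line">
</span><span class="line">(* A promise that is already fulfilled. *)
</span><span class="line">let return x : &#39;a promise = ref (Fulfilled x)
</span><span class="line">
</span><span class="line">(* A fresh, unfulfilled promise. *)
</span><span class="line">let make () : &#39;a promise = ref (Pending [])
</span><span class="line">
</span><span class="line">(* Fulfil [p] with [v] and run any waiting callbacks in registration order. *)
</span><span class="line">let fulfil p v =
</span><span class="line">  match !p with
</span><span class="line">  | Fulfilled _ -&gt; invalid_arg &quot;promise already fulfilled&quot;
</span><span class="line">  | Pending callbacks -&gt;
</span><span class="line">    p := Fulfilled v;
</span><span class="line">    List.iter (fun f -&gt; f v) (List.rev callbacks)
</span><span class="line">
</span><span class="line">(* Call [f] with the value of [p], now or when it is fulfilled. *)
</span><span class="line">let on_fulfilled p f =
</span><span class="line">  match !p with
</span><span class="line">  | Fulfilled v -&gt; f v
</span><span class="line">  | Pending callbacks -&gt; p := Pending (f :: callbacks)
</span><span class="line">
</span><span class="line">(* Return a promise for the final result immediately;
</span><span class="line">   run [body] once [p] has a value. *)
</span><span class="line">let ( let* ) p body =
</span><span class="line">  let result = make () in
</span><span class="line">  on_fulfilled p (fun v -&gt; on_fulfilled (body v) (fulfil result));
</span><span class="line">  result
</span><span class="line">
</span><span class="line">(* Stand-in pipeline steps (assumptions for this sketch). *)
</span><span class="line">let fetch commit = return (&quot;source of &quot; ^ commit)
</span><span class="line">let build src = return (&quot;image of &quot; ^ src)
</span><span class="line">
</span><span class="line">let fab c =
</span><span class="line">  let* src = fetch c in
</span><span class="line">  build src
</span></code></pre></td></tr></tbody></table></div></figure><p>Because the stand-in steps complete immediately, <code>fab</code> returns a promise that is already fulfilled.
If <code>fetch</code> returned a pending promise instead, <code>fab</code> would still return straight away, and <code>build</code> would only be called once <code>fulfil</code> was called on that promise.</p>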
<p>Just using <code>let*</code> doesn't add any concurrency within our pipeline
(it just allows it to execute concurrently with other code).
But we can define extra functions for that, such as <code>all</code> to evaluate every promise in a list at once,
or an <code>and*</code> operator to indicate that two things should run in parallel:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example2</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="c">(* Fetch the source code and Docker base image in parallel: *)</span>
</span><span class="line">  <span class="k">let</span><span class="o">*</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span>
</span><span class="line">  <span class="ow">and</span><span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="n">docker_pull</span> <span class="s2">&quot;ocaml/opam2&quot;</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">build</span> <span class="n">ocaml_version</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">dockerfile</span> <span class="o">=</span> <span class="n">make_dockerfile</span> <span class="o">~</span><span class="n">base</span> <span class="o">~</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span><span class="o">*</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="o">~</span><span class="n">dockerfile</span> <span class="n">src</span> <span class="o">~</span><span class="n">label</span><span class="o">:</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="n">test</span> <span class="n">image</span>
</span><span class="line">  <span class="k">in</span>
</span><span class="line">  <span class="c">(* Build and test against each compiler version in parallel: *)</span>
</span><span class="line">  <span class="n">all</span> <span class="o">[</span>
</span><span class="line">    <span class="n">build</span> <span class="s2">&quot;4.07&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="n">build</span> <span class="s2">&quot;4.08&quot;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>As well as handling promises,
we could also define a <code>let*</code> for functions that might return errors (the body of the let is called only if the first value
is successful), or for live updates (the body is called each time the input changes), or for all of these things together.
This is the basic idea of a monad.</p>
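<p>For instance, a version of <code>let*</code> for steps that can fail might look like the sketch below.
It uses <code>Result.bind</code> from the OCaml standard library (4.08 or later); the <code>fetch</code> and <code>build</code> steps here are hypothetical stand-ins, not the real ones:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line">(* [let*] for results: run the body only if the first step succeeded. *)
</span><span class="line">let ( let* ) = Result.bind
</span><span class="line">
</span><span class="line">(* Hypothetical steps that can fail. *)
</span><span class="line">let fetch commit =
</span><span class="line">  if commit = &quot;&quot; then Error &quot;no commit given&quot;
</span><span class="line">  else Ok (&quot;source of &quot; ^ commit)
</span><span class="line">
</span><span class="line">let build src = Ok (&quot;image of &quot; ^ src)
</span><span class="line">
</span><span class="line">let fab c =
</span><span class="line">  let* src = fetch c in
</span><span class="line">  build src
</span></code></pre></td></tr></tbody></table></div></figure><p>The pipeline code is the same as before; only the definition of <code>let*</code> has changed.
If <code>fetch</code> returns an error then <code>build</code> is never called and the error becomes the result of <code>fab</code>.</p>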
<p>This actually works pretty well.
In 2016, I used this approach to make <a href="https://github.com/moby/datakit/tree/master/ci">DataKitCI</a>, which was used initially as the CI system for Docker-for-Mac.
Later, Anil Madhavapeddy used it to create <a href="https://github.com/avsm/mirage-ci">opam-repo-ci</a>, which is the CI system for <a href="https://github.com/ocaml/opam-repository">opam-repository</a>, OCaml's main package repository.
This checks each new PR to see what packages it adds or modifies,
tests each one against multiple OCaml compiler versions and Linux distributions (Debian, Ubuntu, Alpine, CentOS, Fedora and OpenSUSE),
and then finds all versions of all packages depending on the changed packages and tests those too.</p>
<p>The main problem with using a monad is that we can't statically analyse the pipeline.
Consider the <code>example2</code> function above. Until we have queried GitHub to get a commit to
test, we cannot run the function and therefore have no idea what it will do.
Once we have <code>commit</code> we can call <code>example2 commit</code>,
but until the <code>fetch</code> and <code>docker_pull</code> operations complete we cannot evaluate the body of the <code>let*</code> to find out what the pipeline will do next.</p>
<p>In other words, we can only draw diagrams showing the bits of the pipeline that have already
executed or are currently executing, and we must indicate opportunities for concurrency
manually using <code>and*</code>.</p>
<h2 id="attempt-two-an-arrow">Attempt two: an arrow</h2>
<p>An <a href="https://en.wikipedia.org/wiki/Arrow_(computer_science)">arrow</a> makes it possible to analyse pipelines statically.
Instead of our monadic functions:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">fetch</span> <span class="o">:</span> <span class="n">commit</span> <span class="o">-&gt;</span> <span class="n">source</span> <span class="n">promise</span>
</span><span class="line"><span class="k">val</span> <span class="n">build</span> <span class="o">:</span> <span class="n">source</span> <span class="o">-&gt;</span> <span class="n">image</span> <span class="n">promise</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>we can define an arrow type:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="n">arrow</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">fetch</span> <span class="o">:</span> <span class="o">(</span><span class="n">commit</span><span class="o">,</span> <span class="n">source</span><span class="o">)</span> <span class="n">arrow</span>
</span><span class="line"><span class="k">val</span> <span class="n">build</span> <span class="o">:</span> <span class="o">(</span><span class="n">source</span><span class="o">,</span> <span class="n">image</span><span class="o">)</span> <span class="n">arrow</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>An <code>('a, 'b) arrow</code> is a pipeline that takes an input of type <code>'a</code> and produces a result of type <code>'b</code>.
If we define <code>type ('a, 'b) arrow = 'a -&gt; 'b promise</code> then this is the same as the monadic version.
However, we can instead make the <code>arrow</code> type abstract and extend it to store whatever static information we require.
For example, we could label the arrows:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="n">arrow</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">  <span class="n">f</span> <span class="o">:</span> <span class="k">&#39;</span><span class="n">a</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">b</span> <span class="n">promise</span><span class="o">;</span>
</span><span class="line">  <span class="n">label</span> <span class="o">:</span> <span class="kt">string</span><span class="o">;</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, <code>arrow</code> is a record. <code>f</code> is the old monadic function and <code>label</code> is the &quot;static analysis&quot;.</p>
<p>Users can't see the internals of the <code>arrow</code> type, and must build up pipelines using functions provided by the arrow implementation.
There are three basic functions available:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">arr</span> <span class="o">:</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="n">arrow</span>
</span><span class="line"><span class="k">val</span> <span class="o">(</span> <span class="o">&gt;&gt;&gt;</span> <span class="o">)</span> <span class="o">:</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="n">arrow</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">b</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">c</span><span class="o">)</span> <span class="n">arrow</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">c</span><span class="o">)</span> <span class="n">arrow</span>
</span><span class="line"><span class="k">val</span> <span class="n">first</span> <span class="o">:</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="n">arrow</span> <span class="o">-&gt;</span> <span class="o">((</span><span class="k">&#39;</span><span class="n">a</span> <span class="o">*</span> <span class="k">&#39;</span><span class="n">c</span><span class="o">),</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">b</span> <span class="o">*</span> <span class="k">&#39;</span><span class="n">c</span><span class="o">))</span> <span class="n">arrow</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>arr</code> takes a pure function and gives the equivalent arrow.
For our promise example, that means the arrow returns a promise that is already fulfilled.
<code>&gt;&gt;&gt;</code> joins two arrows together.
<code>first</code> takes an arrow from <code>'a</code> to <code>'b</code> and makes it work on pairs instead.
The first element of the pair will be processed by the given arrow and
the second component is returned unchanged.</p>
<p>We can have these operations automatically create new arrows with appropriate <code>f</code> and <code>label</code> fields.
For example, in <code>a &gt;&gt;&gt; b</code>, the resulting label field could be the string <code>{a.label} &gt;&gt;&gt; {b.label}</code>.
This means that we can display the pipeline without having to run it first,
and we could easily replace <code>label</code> with something more structured if needed.</p>
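<p>As a rough sketch, the three operations could be implemented for the labelled record like this.
The promise type here is a trivial stand-in that is always already fulfilled, and the <code>step</code> helper and the labels chosen for <code>arr</code> and <code>first</code> are just assumptions for the illustration:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line">(* Stand-in promise that is always already fulfilled (to keep the sketch short). *)
</span><span class="line">type &#39;a promise = Fulfilled of &#39;a
</span><span class="line">let return x = Fulfilled x
</span><span class="line">let bind (Fulfilled x) f = f x
</span><span class="line">
</span><span class="line">type (&#39;a, &#39;b) arrow = {
</span><span class="line">  f : &#39;a -&gt; &#39;b promise;
</span><span class="line">  label : string;
</span><span class="line">}
</span><span class="line">
</span><span class="line">(* Lift a pure function; the result is available immediately. *)
</span><span class="line">let arr fn = { f = (fun x -&gt; return (fn x)); label = &quot;arr&quot; }
</span><span class="line">
</span><span class="line">(* Run [a], then feed its result to [b]. The label is built without running anything. *)
</span><span class="line">let ( &gt;&gt;&gt; ) a b = {
</span><span class="line">  f = (fun x -&gt; bind (a.f x) b.f);
</span><span class="line">  label = Printf.sprintf &quot;%s &gt;&gt;&gt; %s&quot; a.label b.label;
</span><span class="line">}
</span><span class="line">
</span><span class="line">(* Process the first component of a pair with [a]; pass the second through. *)
</span><span class="line">let first a = {
</span><span class="line">  f = (fun (x, c) -&gt; bind (a.f x) (fun y -&gt; return (y, c)));
</span><span class="line">  label = Printf.sprintf &quot;first (%s)&quot; a.label;
</span><span class="line">}
</span><span class="line">
</span><span class="line">(* Hypothetical helper for defining labelled primitive steps. *)
</span><span class="line">let step label fn = { f = (fun x -&gt; return (fn x)); label }
</span><span class="line">
</span><span class="line">let fetch = step &quot;fetch&quot; (fun commit -&gt; &quot;source of &quot; ^ commit)
</span><span class="line">let build = step &quot;build&quot; (fun src -&gt; &quot;image of &quot; ^ src)
</span><span class="line">let test = step &quot;test&quot; (fun image -&gt; &quot;results for &quot; ^ image)
</span><span class="line">
</span><span class="line">let example1 = fetch &gt;&gt;&gt; build &gt;&gt;&gt; test
</span></code></pre></td></tr></tbody></table></div></figure><p>Reading <code>example1.label</code> gives a description of the whole pipeline before any step has run, while <code>example1.f</code> runs it on a given commit.</p>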
<p>With this, our first example changes from:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example1</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="n">test</span> <span class="n">image</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>to</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example1</span> <span class="o">=</span>
</span><span class="line">  <span class="n">fetch</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">build</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">test</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That seems quite pleasant, although we did have to give up our variable names.
But things start to get complicated with larger examples. For <code>example2</code>, we
need to define a few standard combinators:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(** Process the second component of a tuple, leaving the first unchanged. *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">second</span> <span class="n">f</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">swap</span> <span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">y</span><span class="o">)</span> <span class="o">=</span> <span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="n">x</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="n">arr</span> <span class="n">swap</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">first</span> <span class="n">f</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">arr</span> <span class="n">swap</span>
</span><span class="line">
</span><span class="line"><span class="c">(** [f *** g] processes the first component of a pair with [f] and the second</span>
</span><span class="line"><span class="c">    with [g]. *)</span>
</span><span class="line"><span class="k">let</span> <span class="o">(</span> <span class="o">***</span> <span class="o">)</span> <span class="n">f</span> <span class="n">g</span> <span class="o">=</span>
</span><span class="line">  <span class="n">first</span> <span class="n">f</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">second</span> <span class="n">g</span>
</span><span class="line">
</span><span class="line"><span class="c">(** [f &amp;&amp;&amp; g] processes a single value with [f] and [g] in parallel and</span>
</span><span class="line"><span class="c">    returns a pair with the results. *)</span>
</span><span class="line"><span class="k">let</span> <span class="o">(</span> <span class="o">&amp;&amp;&amp;</span> <span class="o">)</span> <span class="n">f</span> <span class="n">g</span> <span class="o">=</span>
</span><span class="line">  <span class="n">arr</span> <span class="o">(</span><span class="k">fun</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">x</span><span class="o">))</span> <span class="o">&gt;&gt;&gt;</span> <span class="o">(</span><span class="n">f</span> <span class="o">***</span> <span class="n">g</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then, <code>example2</code> changes from:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example2</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">base</span> <span class="o">=</span> <span class="n">docker_pull</span> <span class="s2">&quot;ocaml/opam2&quot;</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">build</span> <span class="n">ocaml_version</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">dockerfile</span> <span class="o">=</span> <span class="n">make_dockerfile</span> <span class="o">~</span><span class="n">base</span> <span class="o">~</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="o">~</span><span class="n">dockerfile</span> <span class="n">src</span> <span class="o">~</span><span class="n">label</span><span class="o">:</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="n">test</span> <span class="n">image</span>
</span><span class="line">  <span class="k">in</span>
</span><span class="line">  <span class="n">build</span> <span class="s2">&quot;4.07&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="n">build</span> <span class="s2">&quot;4.08&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>to:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example2</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">build</span> <span class="n">ocaml_version</span> <span class="o">=</span>
</span><span class="line">    <span class="n">first</span> <span class="o">(</span><span class="n">arr</span> <span class="o">(</span><span class="k">fun</span> <span class="n">base</span> <span class="o">-&gt;</span> <span class="n">make_dockerfile</span> <span class="o">~</span><span class="n">base</span> <span class="o">~</span><span class="n">ocaml_version</span><span class="o">))</span>
</span><span class="line">    <span class="o">&gt;&gt;&gt;</span> <span class="n">build_with_dockerfile</span> <span class="o">~</span><span class="n">label</span><span class="o">:</span><span class="n">ocaml_version</span>
</span><span class="line">    <span class="o">&gt;&gt;&gt;</span> <span class="n">test</span>
</span><span class="line">  <span class="k">in</span>
</span><span class="line">  <span class="n">arr</span> <span class="o">(</span><span class="k">fun</span> <span class="n">c</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="bp">()</span><span class="o">,</span> <span class="n">c</span><span class="o">))</span>
</span><span class="line">  <span class="o">&gt;&gt;&gt;</span> <span class="o">(</span><span class="n">docker_pull</span> <span class="s2">&quot;ocaml/opam2&quot;</span> <span class="o">***</span> <span class="n">fetch</span><span class="o">)</span>
</span><span class="line">  <span class="o">&gt;&gt;&gt;</span> <span class="o">(</span><span class="n">build</span> <span class="s2">&quot;4.07&quot;</span> <span class="o">&amp;&amp;&amp;</span> <span class="n">build</span> <span class="s2">&quot;4.08&quot;</span><span class="o">)</span>
</span><span class="line">  <span class="o">&gt;&gt;&gt;</span> <span class="n">arr</span> <span class="o">(</span><span class="k">fun</span> <span class="o">(</span><span class="bp">()</span><span class="o">,</span> <span class="bp">()</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="bp">()</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We've lost most of the variable names and instead have to use tuples, remembering where our values are.
It's not <em>too</em> bad here with two values,
but it gets very difficult very quickly as more are added and we start nesting tuples.
We also lost the ability to use an optional labelled argument in <code>build ~dockerfile src</code>
and instead need to use a new operation that takes a tuple of the dockerfile and the source.</p>
<p>Imagine that running the tests now requires getting the test cases from the source code.
In the original code, we'd just change <code>test image</code> to <code>test image ~using:src</code>.
In the arrow version, we need to duplicate the source before the build step,
run the build with <code>first build_with_dockerfile</code>,
and make sure the arguments are the right way around for a new <code>test_using</code>.</p>
<h2 id="attempt-three-a-dart">Attempt three: a dart</h2>
<p>I started wondering whether there might be an easier way to achieve the same static analysis that you get with arrows,
but without the point-free syntax, and it seems that there is. Consider the monadic version of <code>example1</code>.
We had:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">build</span> <span class="o">:</span> <span class="n">source</span> <span class="o">-&gt;</span> <span class="n">image</span> <span class="n">promise</span>
</span><span class="line"><span class="k">val</span> <span class="n">test</span> <span class="o">:</span> <span class="n">image</span> <span class="o">-&gt;</span> <span class="n">results</span> <span class="n">promise</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">example1</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span><span class="o">*</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span><span class="o">*</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="n">test</span> <span class="n">image</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If you didn't know about monads, there is another way you might try to do this.
Instead of using <code>let*</code> to wait for the <code>fetch</code> to complete and then calling <code>build</code> with
the source, you might define <code>build</code> and <code>test</code> to take promises as inputs:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">build</span> <span class="o">:</span> <span class="n">source</span> <span class="n">promise</span> <span class="o">-&gt;</span> <span class="n">image</span> <span class="n">promise</span>
</span><span class="line"><span class="k">val</span> <span class="n">test</span> <span class="o">:</span> <span class="n">image</span> <span class="n">promise</span> <span class="o">-&gt;</span> <span class="n">results</span> <span class="n">promise</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>After all, fetching gives you a <code>source promise</code> and you want an <code>image promise</code>, so this seems very natural.
We could even have <code>example1</code> take a promise of the commit.
Then it looks like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example1</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="n">test</span> <span class="n">image</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That's good, because it's identical to the simple version we started with.
The problem is that it is inefficient:</p>
<ul>
<li>We call <code>example1</code> with the promise of the commit (we don't know what it is yet).
</li>
<li>Without waiting to find out which commit we're testing, we call <code>fetch</code>, getting back a promise of some source.
</li>
<li>Without waiting to get the source, we call <code>build</code>, getting a promise of an image.
</li>
<li>Without waiting for the build, we call <code>test</code>, getting a promise of the results.
</li>
</ul>
<p>We return the final promise of the test results immediately, but we haven't done any real work yet.
Instead, we've built up a long chain of promises, wasting memory.</p>
<p>However, in this situation what we want is to perform a static analysis.
That is, we want to build up in memory some data structure representing the pipeline...
and this is exactly what our &quot;inefficient&quot; use of the monad produces!</p>
<p>To make this useful, we need the primitive operations (such as <code>fetch</code>)
to provide some information (e.g. labels) for the static analysis.
OCaml's <code>let</code> syntax doesn't provide an obvious place for a label,
but I was able to define an operator (<code>let**</code>) that returns a function taking a label argument.
It can be used to build primitive operations like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="s2">&quot;fetch&quot;</span> <span class="o">|&gt;</span>
</span><span class="line">  <span class="k">let</span><span class="o">**</span> <span class="n">commit</span> <span class="o">=</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="c">(* (standard monadic implementation of fetch goes here) *)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>So, <code>fetch</code> takes a promise of a commit, does a monadic bind on it to wait for the actual commit and then proceeds as before,
but it labels the bind as a <code>fetch</code> operation.
If <code>fetch</code> took multiple arguments, it could use <code>and*</code> to wait for all of them in parallel.</p>
<p>In theory, the body of the <code>let**</code> in <code>fetch</code> could contain further binds.
In that case, we wouldn't be able to analyse the whole pipeline at the start.
But as long as the primitives wait for all their inputs at the start and don't do any binds internally,
we can discover the whole pipeline statically.</p>
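<p>As a rough illustration, here is a minimal sketch of how such a labelled bind could work.
The record type <code>t</code>, its <code>steps</code> and <code>run</code> fields, and the stub bodies of the primitives are all assumptions made up for this example
(the real system tracks much more than a list of labels); only the <code>&quot;fetch&quot; |&gt; let** ...</code> shape comes from the code above.
Each value carries a static list of labels alongside a deferred computation:</p>
<pre><code class="ocaml">(* Toy model: a static list of step labels plus a deferred computation.
   Both fields are hypothetical; they are not taken from the real library. *)
type 'a t = { steps : string list; run : unit -> 'a }

let return x = { steps = []; run = (fun () -> x) }

(* Labelled bind, used as [label |> let** x = input in body].
   The label is recorded immediately; [body] runs only when [run] is called. *)
let ( let** ) input body label =
  { steps = input.steps @ [ label ];
    run = (fun () -> (body (input.run ())).run ()) }

(* Stub primitives: each waits for its input at the start and does no
   further labelled binds internally. *)
let fetch commit =
  "fetch" |>
  let** commit = commit in
  return ("src:" ^ commit)

let build src =
  "build" |>
  let** src = src in
  return ("image:" ^ src)

let test image =
  "test" |>
  let** image = image in
  return ("tested " ^ image)

let example1 commit =
  let src = fetch commit in
  let image = build src in
  test image

let pipeline = example1 (return "abc123")

let () =
  (* Static analysis: the labels are known before anything has run. *)
  print_endline (String.concat ", " pipeline.steps);
  (* Only now is the work actually done. *)
  print_endline (pipeline.run ())
</code></pre>
<p>Building <code>pipeline</code> runs none of the bodies, yet <code>pipeline.steps</code> already lists <code>fetch</code>, <code>build</code> and <code>test</code> in order.
In this toy version any labels produced inside a body are simply discarded, which mirrors the point above: binds hidden inside a primitive are invisible to the analysis.</p>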
<p>We can choose whether to expose these bind operations to application code or not.
If <code>let*</code> (or <code>let**</code>) is exposed, then applications get to use all the expressive power of monads,
but there will be points where we cannot show the whole pipeline until some promise resolves.
If we hide them, then applications can only make static pipelines.</p>
<p>My approach so far has been to use <code>let*</code> as an escape hatch, so that any required pipeline can be built,
but I later replace any uses of it with more specialised operations. For example, I added:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">list_map</span> <span class="o">:</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">b</span> <span class="n">t</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">a</span> <span class="kt">list</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">b</span> <span class="kt">list</span> <span class="n">t</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This processes each item in a list that isn't known until runtime.
However, we can still know statically what pipeline we will apply to each item,
even though we don't know what the items themselves are.
<code>list_map</code> could have been implemented using <code>let*</code>, but then we wouldn't be able to see the pipeline statically.</p>
<p>Here are the other two examples, using the dart approach:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example2</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">base</span> <span class="o">=</span> <span class="n">docker_pull</span> <span class="s2">&quot;ocaml/opam2&quot;</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">build</span> <span class="n">ocaml_version</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">dockerfile</span> <span class="o">=</span>
</span><span class="line">      <span class="k">let</span><span class="o">+</span> <span class="n">base</span> <span class="o">=</span> <span class="n">base</span> <span class="k">in</span>
</span><span class="line">      <span class="n">make_dockerfile</span> <span class="o">~</span><span class="n">base</span> <span class="o">~</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="o">~</span><span class="n">dockerfile</span> <span class="n">src</span> <span class="o">~</span><span class="n">label</span><span class="o">:</span><span class="n">ocaml_version</span> <span class="k">in</span>
</span><span class="line">    <span class="n">test</span> <span class="n">image</span>
</span><span class="line">  <span class="k">in</span>
</span><span class="line">  <span class="n">all</span> <span class="o">[</span>
</span><span class="line">    <span class="n">build</span> <span class="s2">&quot;4.07&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="n">build</span> <span class="s2">&quot;4.08&quot;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Compared to the original, we have an <code>all</code> to combine the results, and there's an extra <code>let+ base = base</code> when calculating the dockerfile.
<code>let+</code> is just another syntax for <code>map</code>, used here because I chose not to change the signature of <code>make_dockerfile</code>.
Alternatively, we could have <code>make_dockerfile</code> take a promise of the base image and do the map inside it instead.
Because <code>map</code> takes a pure body (<code>make_dockerfile</code> just generates a string; there are no promises or errors) it doesn't need its own box
on the diagrams and we don't lose anything by allowing its use.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example3</span> <span class="n">commit</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="n">fetch</span> <span class="n">commit</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">image</span> <span class="o">=</span> <span class="n">build</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">ok</span> <span class="o">=</span> <span class="n">test</span> <span class="n">image</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">revdeps</span> <span class="o">=</span> <span class="n">get_revdeps</span> <span class="n">src</span> <span class="k">in</span>
</span><span class="line">  <span class="n">gate</span> <span class="n">revdeps</span> <span class="o">~</span><span class="n">on</span><span class="o">:</span><span class="n">ok</span> <span class="o">|&gt;</span>
</span><span class="line">  <span class="n">list_iter</span> <span class="o">~</span><span class="n">pp</span><span class="o">:</span><span class="nn">Fmt</span><span class="p">.</span><span class="n">string</span> <span class="n">example1</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This shows another custom operation: <code>gate revdeps ~on:ok</code> is a promise that only resolves once both <code>revdeps</code> and <code>ok</code> have resolved.
This prevents it from testing the library's revdeps until the library's own tests have passed,
even though it could do this in parallel if we wanted it to.
Whereas with a monad we have to enable concurrency explicitly where we want it (using <code>and*</code>),
with a dart we have to disable concurrency explicitly where we don't want it (using <code>gate</code>).</p>
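<p>One plausible way to build such an operation, assuming the dart type provides <code>let+</code> and <code>and+</code>, is to pair the two promises and keep only the first value.
This is a sketch rather than the real implementation: the <code>option</code>-based stand-in for a promise (where <code>None</code> means &quot;not resolved yet&quot;) is purely illustrative.</p>
<pre><code class="ocaml">(* Stand-in promise type for illustration only: None means not yet resolved. *)
type 'a t = 'a option

let map f x = Option.map f x

let pair a b =
  match a, b with
  | Some a, Some b -> Some (a, b)
  | _ -> None

let ( let+ ) x f = map f x
let ( and+ ) = pair

(* [gate x ~on] has the value of [x], but only once [on] has resolved too. *)
let gate x ~on =
  let+ x = x
  and+ _ = on in
  x

let () =
  let revdeps = Some [ "foo"; "bar" ] in
  (* The tests have not finished yet, so the gated value is still pending. *)
  assert (gate revdeps ~on:None = None);
  (* Once the tests have passed, the revdeps become available. *)
  assert (gate revdeps ~on:(Some ()) = Some [ "foo"; "bar" ])
</code></pre>
<p>Because <code>gate</code> only needs <code>map</code> and <code>pair</code>, it does not require a bind, so the pipeline remains fully static.</p>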
<p>I also added a <code>list_iter</code> convenience function,
and gave it a pretty-printer argument so that we can label the cases in the diagrams once the list inputs are known.</p>
<p>Finally, although I said that you can't use <code>let*</code> inside a primitive,
you can still use some other monad (that doesn't generate diagrams).
In fact, in the real system I used a separate <code>let&gt;</code> operator for primitives.
That expects a body using non-diagram-generating promises provided by the underlying promise library,
so you can't use <code>let*</code> (or <code>let&gt;</code>) inside the body of a primitive.</p>
<h2 id="comparison-with-arrows">Comparison with arrows</h2>
<p>Given a &quot;dart&quot; you can create an arrow interface from it easily by defining e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="o">(</span><span class="k">&#39;</span><span class="n">a</span><span class="o">,</span> <span class="k">&#39;</span><span class="n">b</span><span class="o">)</span> <span class="n">arrow</span> <span class="o">=</span> <span class="k">&#39;</span><span class="n">a</span> <span class="n">promise</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">b</span> <span class="n">promise</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then <code>arr</code> is just <code>map</code> and <code>f &gt;&gt;&gt; g</code> is just <code>fun x -&gt; g (f x)</code>. <code>first</code> can be defined easily too, assuming you have some kind
of function for doing two things in parallel (like our <code>and*</code> above).</p>
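<p>Spelled out, that construction might look like the following sketch.
The <code>DART</code> signature, the <code>Arrow_of_dart</code> functor and the <code>Opt</code> test instance are names invented for this example;
the only assumption about the dart itself is that it offers <code>map</code> and a <code>pair</code> function that waits for two values.</p>
<pre><code class="ocaml">(* The operations we assume the dart provides. *)
module type DART = sig
  type 'a t
  val map : ('a -> 'b) -> 'a t -> 'b t
  val pair : 'a t -> 'b t -> ('a * 'b) t  (* wait for both inputs *)
end

module Arrow_of_dart (P : DART) = struct
  type ('a, 'b) arrow = 'a P.t -> 'b P.t

  let arr f : ('a, 'b) arrow = P.map f

  let ( >>> ) (f : ('a, 'b) arrow) (g : ('b, 'c) arrow) : ('a, 'c) arrow =
    fun x -> g (f x)

  (* Run [f] on the first component, passing the second through unchanged. *)
  let first (f : ('a, 'b) arrow) : ('a * 'c, 'b * 'c) arrow =
    fun x -> P.pair (f (P.map fst x)) (P.map snd x)
end

(* A stand-in promise type for trying it out: None means not resolved. *)
module Opt = struct
  type 'a t = 'a option
  let map = Option.map
  let pair a b =
    match a, b with
    | Some a, Some b -> Some (a, b)
    | _ -> None
end

module A = Arrow_of_dart (Opt)

(* Increment the first component, then add the two together. *)
let incr_then_sum = A.(first (arr succ) >>> arr (fun (a, b) -> a + b))

let () = assert (incr_then_sum (Some (1, 10)) = Some 12)
</code></pre>
<p>Note that <code>first</code> splits the incoming pair with two uses of <code>map</code> and rejoins the results with <code>pair</code>, so nothing beyond the applicative operations is needed.</p>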
<p>So a dart API (even with <code>let*</code> hidden) is still enough to express any pipeline you can express using an arrow API.</p>
<p>The <a href="https://en.wikibooks.org/wiki/Haskell/Arrow_tutorial">Haskell Arrow tutorial</a> uses an example where an arrow is a stateful function.
For example, there is a <code>total</code> arrow that returns the sum of its input and every previous input it has been called with.
Calling it three times with inputs <code>1 2 3</code>, for instance, produces outputs <code>1 3 6</code>.
Running a pipeline on a sequence of inputs returns the sequence of outputs.</p>
<p>The tutorial uses <code>total</code> to define a <code>mean1</code> function like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="haskell"><span class="line"><span class="nf">mean1</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">&amp;&amp;&amp;</span><span class="w"> </span><span class="p">(</span><span class="n">arr</span><span class="w"> </span><span class="p">(</span><span class="n">const</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;&gt;&gt;</span><span class="w"> </span><span class="n">total</span><span class="p">))</span><span class="w"> </span><span class="o">&gt;&gt;&gt;</span><span class="w"> </span><span class="n">arr</span><span class="w"> </span><span class="p">(</span><span class="n">uncurry</span><span class="w"> </span><span class="p">(</span><span class="o">/</span><span class="p">))</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>So this pipeline duplicates each input number,
replaces the second one with <code>1</code>,
totals both streams, and then
replaces each pair with its ratio.
Each time you put another number into the pipeline, you get out the average of all values input so far.</p>
<p>The equivalent code using the dart style would be (OCaml uses <code>/.</code> for floating-point division):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">mean</span> <span class="n">values</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">t</span> <span class="o">=</span> <span class="n">total</span> <span class="n">values</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">n</span> <span class="o">=</span> <span class="n">total</span> <span class="o">(</span><span class="n">const</span> <span class="mi">1</span><span class="o">.</span><span class="mi">0</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="n">map</span> <span class="o">(</span><span class="n">uncurry</span> <span class="o">(/.))</span> <span class="o">(</span><span class="n">pair</span> <span class="n">t</span> <span class="n">n</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That seems more readable to me.
We can simplify the code slightly by defining the standard operators <code>let+</code> (for <code>map</code>) and <code>and+</code> (for <code>pair</code>):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="o">(</span><span class="k">let</span><span class="o">+)</span> <span class="n">x</span> <span class="n">f</span> <span class="o">=</span> <span class="n">map</span> <span class="n">f</span> <span class="n">x</span>
</span><span class="line"><span class="k">let</span> <span class="o">(</span><span class="ow">and</span><span class="o">+)</span> <span class="o">=</span> <span class="n">pair</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">mean</span> <span class="n">values</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span><span class="o">+</span> <span class="n">t</span> <span class="o">=</span> <span class="n">total</span> <span class="n">values</span>
</span><span class="line">  <span class="ow">and</span><span class="o">+</span> <span class="n">n</span> <span class="o">=</span> <span class="n">total</span> <span class="o">(</span><span class="n">const</span> <span class="mi">1</span><span class="o">.</span><span class="mi">0</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="n">t</span> <span class="o">/.</span> <span class="n">n</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This is not a great example of an arrow anyway,
because we don't use the output of one stateful function as the input to another,
so this is actually just a plain <a href="https://en.wikipedia.org/wiki/Applicative_functor">applicative</a>.</p>
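<p>To try the dart version of <code>mean</code> out, we need some concrete meaning for the wrapped values.
The sketch below assumes one possible model, in which a wrapped value is a lazy sequence (<code>Seq.t</code>) of the values it takes over time;
the definitions of <code>map</code>, <code>pair</code>, <code>const</code> and <code>total</code> are illustrative guesses written for this model, not code from the tutorial.</p>
<pre><code class="ocaml">(* Assumed model: a wrapped value is the sequence of values it takes over time. *)
type 'a t = 'a Seq.t

let map = Seq.map

(* Combine two streams element by element, stopping when either one ends. *)
let rec pair a b () =
  match a (), b () with
  | Seq.Cons (x, a'), Seq.Cons (y, b') -> Seq.Cons ((x, y), pair a' b')
  | _ -> Seq.Nil

(* An endless stream that always has the same value. *)
let rec const x () = Seq.Cons (x, const x)

(* Running total: each output is the sum of all inputs seen so far. *)
let total xs =
  let rec go acc xs () =
    match xs () with
    | Seq.Nil -> Seq.Nil
    | Seq.Cons (x, rest) ->
      let acc = acc +. x in
      Seq.Cons (acc, go acc rest)
  in
  go 0.0 xs

let ( let+ ) x f = map f x
let ( and+ ) = pair

let mean values =
  let+ t = total values
  and+ n = total (const 1.0) in
  t /. n

let () =
  let inputs = List.to_seq [ 1.0; 2.0; 3.0 ] in
  assert (List.of_seq (mean inputs) = [ 1.0; 1.5; 2.0 ])
</code></pre>
<p>Here the state of each <code>total</code> lives in its accumulator, and the two running totals never feed into each other, which is why <code>map</code> and <code>pair</code> are all that <code>mean</code> needs.</p>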
<p>We could easily extend the example pipeline with another stateful function though,
perhaps by adding some smoothing.
That would look like <code>mean1 &gt;&gt;&gt; smooth</code> in the arrow notation,
and <code>values |&gt; mean |&gt; smooth</code> (or <code>smooth (mean values)</code>) in the dart notation.</p>
<p>Note: Haskell does also have an <code>Arrows</code> syntax extension, which allows the Haskell code to be written as:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="haskell"><span class="line"><span class="nf">mean2</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="n">proc</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="ow">-&gt;</span><span class="w"> </span><span class="kr">do</span>
</span><span class="line"><span class="w">    </span><span class="n">t</span><span class="w"> </span><span class="ow">&lt;-</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">-&lt;</span><span class="w"> </span><span class="n">value</span>
</span><span class="line"><span class="w">    </span><span class="n">n</span><span class="w"> </span><span class="ow">&lt;-</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">-&lt;</span><span class="w"> </span><span class="mi">1</span>
</span><span class="line"><span class="w">    </span><span class="n">returnA</span><span class="w"> </span><span class="o">-&lt;</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">n</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That's more similar to the dart notation.</p>
<h2 id="larger-examples">Larger examples</h2>
<p>I've put up a library using a slightly extended version of these ideas at <a href="https://github.com/ocurrent/ocurrent">ocurrent/ocurrent</a>.
The <code>lib_term</code> subdirectory is the part relevant to this blog post, with the various combinators described in <a href="https://github.com/ocurrent/ocurrent/blob/00688f949f3cfbf3d599949f89ca71c8e9e536fc/lib_term/s.ml#L48">TERM</a>.
The other directories handle more concrete details, such as integration with the Lwt promise library,
and providing the admin web UI or the Cap'n Proto RPC interface, as well as plugins with primitives
for using Git, GitHub, Docker and Slack.</p>
<h3 id="ocaml-docker-base-image-builder">OCaml Docker base image builder</h3>
<p><a href="https://github.com/ocurrent/docker-base-images">ocurrent/docker-base-images</a> contains a pipeline that builds Docker base images for OCaml for various Linux distributions, CPU architectures, OCaml compiler versions and configuration options.
For example, to test OCaml 4.09 on Debian 10, you can do:</p>
<pre><code>$ docker run --rm -it ocurrent/opam:debian-10-ocaml-4.09

:~$ ocamlopt --version
4.09.0

:~$ opam depext -i utop
[...]

:~$ utop
----+-------------------------------------------------------------+------------------
    | Welcome to utop version 2.4.2 (using OCaml version 4.09.0)! |                   
    +-------------------------------------------------------------+                   

Type #utop_help for help about using utop.

-( 11:50:06 )-&lt; command 0 &gt;-------------------------------------------{ counter: 0 }-
utop # 
</code></pre>
<p>Here's what the pipeline looks like (click for full-size):</p>
<p><a href="/blog/images/cicd/docker-base-images.svg"><img src="/blog/images/cicd/docker-base-images-thumb.png" class="center"/></a></p>
<p>It pulls the latest Git commit of opam-repository each week, then builds base images containing that and the opam package manager for each distribution version, then builds one image for each supported compiler variant. Many of the images are built on multiple architectures (<code>amd64</code>, <code>arm32</code>, <code>arm64</code> and <code>ppc64</code>) and pushed to a staging area on Docker Hub. Then, the pipeline combines all the hashes to push a multi-arch manifest to Docker Hub. There are also some aliases (e.g. <code>debian</code> means <code>debian-10-ocaml-4.09</code> at the moment). Finally, if there is any problem then the pipeline sends the error to a Slack channel.</p>
<p>You might wonder whether we really need a pipeline for this, rather than a simple script run from a cron-job.
But having a pipeline allows us to see what the pipeline will do before running it, watch the pipeline's progress, restart failed jobs individually, etc, with almost the same code we would have written anyway.</p>
<p>You can read <a href="https://github.com/ocurrent/docker-base-images/blob/f65663b09d78bbb17c39aca97cbd9425c2e7816e/src/pipeline.ml">pipeline.ml</a> if you want to see the full pipeline.</p>
<h3 id="ocaml-ci">OCaml CI</h3>
<p><a href="https://github.com/ocurrent/ocaml-ci">ocurrent/ocaml-ci</a> is an (experimental) GitHub app for testing OCaml projects.
The pipeline gets the list of installations of the app,
gets the configured repositories for each installation,
gets the branches and PRs for each repository,
and then tests the head of each one against multiple Linux distributions and OCaml compiler versions.
If the project uses ocamlformat, it also checks that the commit is formatted exactly as ocamlformat would do it.</p>
<p><a href="/blog/images/cicd/ocaml-ci.svg"><img src="/blog/images/cicd/ocaml-ci-thumb.png" class="center"/></a></p>
<p>The results are pushed back to GitHub as the commit status, and also recorded in a local index for the web and tty UIs.
There's quite a lot of red here, mainly because if a project doesn't support a particular version of OCaml then the build is marked
as failed and shows up as red in the pipeline, although these failures are filtered out when making the GitHub status report.
We probably need a new colour for skipped stages.</p>
<h2 id="conclusions">Conclusions</h2>
<p>It's convenient to write CI/CD pipelines as if they were single-shot scripts
that run the steps once, in series, and always succeed,
and then with only minor changes have the pipeline run the steps whenever
the input changes, in parallel, with logging, error reporting, cancellation
and rebuild support.</p>
<p>Using a monad allows any program to be converted easily to have these features,
but, as with a regular program, we don't know what the program will do with some
data until we run it. In particular, we can only automatically generate diagrams
showing steps that have already started.</p>
<p>The traditional way to do static analysis is to use an arrow.
This is a little more limited than a monad, because the structure of the pipeline
can't change depending on the input data, although we can add limited flexibility
such as optional steps or a choice between two branches.
However, writing pipelines using arrow notation is difficult because we have to
program in a point-free style (without variables).</p>
<p>We can get the same benefits of static analysis by using a monad in an unusual way,
here referred to as a &quot;dart&quot;.
Instead of functions that take plain values and return wrapped values, our functions
both take and return wrapped values. This results in a syntax that looks
identical to plain programming, but allows static analysis (at the cost of not being
able to manipulate the wrapped values directly).</p>
<p>If we hide (or don't use) the monad's <code>let*</code> (bind) function then the pipelines we
create can always be determined statically. If we use a bind, then there will be holes
in the pipeline that may expand to more pipeline stages as the pipeline runs.</p>
<p>Primitive steps can be created by using a single &quot;labelled bind&quot;, where the label
provides the static analysis for the atomic component.</p>
<p>I haven't seen this pattern used before (or mentioned in the arrow documentation),
and it seems to provide exactly the same benefits as arrows with much less difficulty.
If this has a proper name, let me know!</p>
<p>This work was funded by OCaml Labs.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Using TLA+ to understand Xen vchan</title>
    <link href="https://roscidus.com/blog/blog/2019/01/01/using-tla-plus-to-understand-xen-vchan/"></link>
    <updated>2019-01-01T09:15:18+00:00</updated>
    <id>https://roscidus.com/blog/blog/2019/01/01/using-tla-plus-to-understand-xen-vchan</id>
    <content type="html"><![CDATA[<p>The vchan protocol is used to stream data between virtual machines on a Xen host without needing any locks.
It is largely undocumented.
The TLA Toolbox is a set of tools for writing and checking specifications.
In this post, I'll describe my experiences using these tools to understand how the vchan protocol works.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#background">Background</a>
<ul>
<li><a href="#qubes-and-the-vchan-protocol">Qubes and the vchan protocol</a>
</li>
<li><a href="#tla">TLA+</a>
</li>
</ul>
</li>
<li><a href="#is-tla-useful">Is TLA useful?</a>
</li>
<li><a href="#basic-tla-concepts">Basic TLA concepts</a>
<ul>
<li><a href="#variables-states-and-behaviour">Variables, states and behaviour</a>
</li>
<li><a href="#actions">Actions</a>
</li>
<li><a href="#correctness-of-spec">Correctness of Spec</a>
</li>
<li><a href="#the-model-checker">The model checker</a>
</li>
</ul>
</li>
<li><a href="#the-real-vchan">The real vchan</a>
<ul>
<li><a href="#the-algorithm">The algorithm</a>
</li>
<li><a href="#testing-the-full-spec">Testing the full spec</a>
</li>
<li><a href="#some-odd-things">Some odd things</a>
</li>
<li><a href="#why-does-vchan-work">Why does vchan work?</a>
</li>
<li><a href="#proving-integrity">Proving Integrity</a>
</li>
<li><a href="#availability">Availability</a>
</li>
</ul>
</li>
<li><a href="#experiences-with-tlaps">Experiences with TLAPS</a>
</li>
<li><a href="#the-final-specification">The final specification</a>
</li>
<li><a href="#the-original-bug">The original bug</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>(This post also appeared on <a href="https://www.reddit.com/r/tlaplus/comments/abi3oz/using_tla_to_understand_xen_vchan/">Reddit</a>, <a href="https://news.ycombinator.com/item?id=18814350">Hacker News</a>
and <a href="https://lobste.rs/s/a5zer2/using_tla_understand_xen_vchan">Lobsters</a>.)</p>
<h2 id="background">Background</h2>
<h3 id="qubes-and-the-vchan-protocol">Qubes and the vchan protocol</h3>
<p>I run <a href="https://www.qubes-os.org/">QubesOS</a> on my laptop.
A QubesOS desktop environment is made up of multiple virtual machines.
A privileged VM, called dom0, provides the desktop environment and coordinates the other VMs.
dom0 doesn't have network access, so you have to use other VMs for doing actual work.
For example, I use one VM for email and another for development work (these are called &quot;application VMs&quot;).
There is another VM (called sys-net) that connects to the physical network, and
yet another VM (sys-firewall) that connects the application VMs to sys-net.</p>
<p><span class="caption-wrapper center"><img src="/blog/images/qubes/qubes-desktop.png" title="My QubesOS desktop. The windows with blue borders are from my Debian development VM, while the green one is from a Fedora VM, etc." class="caption"/><span class="caption-text">My QubesOS desktop. The windows with blue borders are from my Debian development VM, while the green one is from a Fedora VM, etc.</span></span></p>
<p>The default sys-firewall is based on Fedora Linux.
A few years ago, <a href="https://roscidus.com/blog/blog/2016/01/01/a-unikernel-firewall-for-qubesos/">I replaced sys-firewall with a MirageOS unikernel</a>.
MirageOS is written in OCaml, and has very little C code (unlike Linux).
It boots much faster and uses much less RAM than the Fedora-based VM.
But recently, a user reported that <a href="https://github.com/mirage/mirage-qubes/issues/25">restarting mirage-firewall was taking a very long time</a>.
The problem seemed to be that it was taking several minutes to transfer the information about the network configuration to the firewall.
This is sent over vchan.
The user reported that stracing the QubesDB process in dom0 revealed that it was sleeping for 10 seconds
between sending the records, suggesting that a wakeup event was missing.</p>
<p>The lead developer of QubesOS said:</p>
<blockquote>
<p>I'd guess missing evtchn trigger after reading/writing data in vchan.</p>
</blockquote>
<p>Perhaps <a href="https://github.com/mirage/ocaml-vchan">ocaml-vchan</a>, the OCaml implementation of vchan, wasn't implementing the vchan specification correctly?
I wanted to check, but there was a problem: there was no vchan specification.</p>
<p>The Xen wiki lists vchan under <a href="https://wiki.xenproject.org/wiki/Xen_Document_Days/TODO#Documentation_on_lib.28xen.29vchan">Xen Document Days/TODO</a>.
The <a href="https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=1a16a3351ff2f2cf9f0cc0a27c89a0652eb8dfb4">initial Git commit</a> on 2011-10-06 said:</p>
<blockquote>
<p>libvchan: interdomain communications library</p>
<p>This library implements a bidirectional communication interface between
applications in different domains, similar to unix sockets. Data can be
sent using the byte-oriented <code>libvchan_read</code>/<code>libvchan_write</code> or the
packet-oriented <code>libvchan_recv</code>/<code>libvchan_send</code>.</p>
<p>Channel setup is done using a client-server model; domain IDs and a port
number must be negotiated prior to initialization. The server allocates
memory for the shared pages and determines the sizes of the
communication rings (which may span multiple pages, although the default
places rings and control within a single page).</p>
<p>With properly sized rings, testing has shown that this interface
provides speed comparable to pipes within a single Linux domain; it is
significantly faster than network-based communication.</p>
</blockquote>
<p>I looked in the xen-devel mailing list around this period in case the reviewers had asked about how it worked.</p>
<p>One reviewer <a href="https://lists.xenproject.org/archives/html/xen-devel/2011-08/msg00874.html">suggested</a>:</p>
<blockquote>
<p>Please could you say a few words about the functionality this new
library enables and perhaps the design etc? In particular a protocol
spec would be useful for anyone who wanted to reimplement for another
guest OS etc. [...]
I think it would be appropriate to add protocol.txt at the same time as
checking in the library.</p>
</blockquote>
<p>However, the submitter pointed out that this was unnecessary, saying:</p>
<blockquote>
<p>The comments in the shared header file explain the layout of the shared
memory regions; any other parts of the protocol are application-defined.</p>
</blockquote>
<p>Now, ordinarily, I wouldn't be much interested in spending my free time
tracking down race conditions in 3rd-party libraries for the benefit of
strangers on the Internet. However, I did want to have another play with TLA...</p>
<h3 id="tla">TLA+</h3>
<p><a href="https://lamport.azurewebsites.net/tla/tla.html">TLA+</a> is a language for specifying algorithms.
It can be used for many things, but it is particularly designed for stateful parallel algorithms.</p>
<p>I learned about TLA while working at Docker.
Docker EE provides software for managing large clusters of machines.
It includes various orchestrators (SwarmKit, Kubernetes and Swarm Classic) and
a web UI.
Ensuring that everything works properly is very important, and to this end
a large collection of tests had been produced.
Part of my job was to run these tests.
You take a test from a list in a web UI and click whatever buttons it tells you to click,
wait for some period of time,
and then check that what you see matches what the test says you should see.
There were a lot of these tests, and they all had to be repeated on every
supported platform, and for every release, release candidate or preview release.
There was a lot of waiting involved and not much thinking required, so to keep
my mind occupied, I started reading the TLA documentation.</p>
<p>I read <a href="https://lamport.azurewebsites.net/tla/hyperbook.html">The TLA+ Hyperbook</a> and <a href="https://lamport.azurewebsites.net/tla/book.html">Specifying Systems</a>.
Both are by Leslie Lamport (the creator of TLA), and are freely available online.
They're both very easy to read.
The hyperbook introduces the tools right away so you can start playing, while
Specifying Systems starts with more theory and discusses the tools later.
I think it's worth reading both.</p>
<p>Once Docker EE 2.0 was released,
we engineers were allowed to spend a week on whatever fun (Docker-related) project we wanted.
I used the time to read the SwarmKit design documents and make a TLA model of that.
I felt that using TLA prompted useful discussions with the SwarmKit developers
(which can be seen in the <a href="https://github.com/docker/swarmkit/pull/2613">pull request</a> comments).</p>
<p>A specification document can answer questions such as:</p>
<ol>
<li>What does it do? (requirements / properties)
</li>
<li>How does it do it? (the algorithm)
</li>
<li>Does it work? (model checking)
</li>
<li>Why does it work? (inductive invariant)
</li>
<li>Does it <em>really</em> work? (proofs)
</li>
</ol>
<p>You don't have to answer all of them to have a useful document,
but I will try to answer each of them for vchan.</p>
<h2 id="is-tla-useful">Is TLA useful?</h2>
<p>In my (limited) experience with TLA, whenever I have reached the end of a specification
(whether reading it or writing it), I always find myself thinking &quot;Well, that was obvious.
It hardly seems worth writing a spec for that!&quot;.
You might feel the same after reading this blog post.</p>
<p>To judge whether TLA is useful, I suggest you take a few minutes to look at the code.
If you are good at reading C code then you might find, like the Xen reviewers,
that it is quite obvious what it does, how it works, and why it is correct.
Or, like me, you might find you'd prefer a little help.
You might want to jot down some notes about it now, to see whether you learn anything new.</p>
<p>To give the big picture:</p>
<ol>
<li>Two VMs decide to communicate over vchan. One will be the server and the other the client.
</li>
<li>The server allocates three chunks of memory: one to hold data in transit from the client to
the server, one for data going from server to client, and the third to track information about
the state of the system. This includes counters saying how much data has been written and how
much read, in each direction.
</li>
<li>The server tells Xen to grant the client access to this memory.
</li>
<li>The client asks Xen to map the memory into its address space.
Now client and server can both access it at once.
There are no locks in the protocol, so be careful!
</li>
<li>Either end sends data by writing it into the appropriate buffer and updating the appropriate counter
in the shared block. The buffers are <a href="https://en.wikipedia.org/wiki/Circular_buffer">ring buffers</a>, so after getting to the end, you
start again from the beginning.
</li>
<li>The data-written (producer) counter and the data-read (consumer) counter together
tell you how much data is in the buffer, and where it is.
When the difference is zero, the reader must stop reading and wait for more data.
When the difference is the size of the buffer, the writer must stop writing and wait for more space.
</li>
<li>When one end is waiting, the other can signal it using a <a href="https://wiki.xen.org/wiki/Event_Channel_Internals">Xen event channel</a>.
This essentially sets a pending flag to true at the other end, and wakes the VM if it is sleeping.
If a VM tries to sleep while it has an event pending, it will immediately wake up again.
Sending an event when one is already pending has no effect.
</li>
</ol>
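<p>As a rough illustration of steps 5 to 7, here is a minimal Python sketch of one direction of such a channel.
The names <code>Ring</code> and <code>EventChannel</code> are made up for this post and are not part of the vchan API.
The sketch uses unbounded integers for the counters (the real ones are <code>uint32_t</code> and wrap around) and, being a single Python process, it says nothing about the races the real protocol has to cope with.</p>

```python
class Ring:
    """One direction of a vchan-like channel: a ring buffer plus two counters."""

    def __init__(self, size: int):
        self.size = size
        self.data = bytearray(size)
        self.prod = 0  # total bytes ever written (producer counter)
        self.cons = 0  # total bytes ever read (consumer counter)

    def used(self) -> int:
        """Bytes waiting to be read; zero means the reader must wait."""
        return self.prod - self.cons

    def free(self) -> int:
        """Space left for the writer; zero means the writer must wait."""
        return self.size - self.used()

    def write(self, msg: bytes) -> int:
        """Copy in as much of msg as fits and return the number of bytes accepted."""
        n = min(len(msg), self.free())
        for i in range(n):
            self.data[(self.prod + i) % self.size] = msg[i]
        self.prod += n
        return n

    def read(self) -> bytes:
        """Take everything currently in the buffer."""
        n = self.used()
        out = bytes(self.data[(self.cons + i) % self.size] for i in range(n))
        self.cons += n
        return out


class EventChannel:
    """A pending flag: notifying twice before a wait has the same effect as once."""

    def __init__(self):
        self.pending = False

    def notify(self):
        self.pending = True

    def wait_would_block(self) -> bool:
        """Return True if the VM would sleep; otherwise consume the pending event."""
        if self.pending:
            self.pending = False
            return False
        return True


ring = Ring(4)
accepted = ring.write(b"Hello")  # only 4 bytes fit
first = ring.read()
ring.write(b"o!")                # wraps round to the start of the buffer
second = ring.read()
print(accepted, first, second)   # 4 b'Hell' b'o!'
```

<p>The counters only ever increase; the position in the buffer is the counter modulo the buffer size.</p>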
<p>The <a href="http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/libxenvchan.h;h=44284f437ab30f01049f280035dbb711103ca9b0;hb=HEAD">public/io/libxenvchan.h</a> header file provides some information,
including the shared structures and comments about them:</p>
<figure class="code"><figcaption><span>xen/include/public/io/libxenvchan.h</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">struct</span><span class="w"> </span><span class="nc">ring_shared</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="kt">uint32_t</span><span class="w"> </span><span class="n">cons</span><span class="p">,</span><span class="w"> </span><span class="n">prod</span><span class="p">;</span>
</span><span class="line"><span class="p">};</span>
</span><span class="line">
</span><span class="line"><span class="cp">#define VCHAN_NOTIFY_WRITE 0x1</span>
</span><span class="line"><span class="cp">#define VCHAN_NOTIFY_READ 0x2</span>
</span><span class="line">
</span><span class="line"><span class="cm">/**</span>
</span><span class="line"><span class="cm"> * vchan_interface: primary shared data structure</span>
</span><span class="line"><span class="cm"> */</span>
</span><span class="line"><span class="k">struct</span><span class="w"> </span><span class="nc">vchan_interface</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="cm">/**</span>
</span><span class="line"><span class="cm">     * Standard consumer/producer interface, one pair per buffer</span>
</span><span class="line"><span class="cm">     * left is client write, server read</span>
</span><span class="line"><span class="cm">     * right is client read, server write</span>
</span><span class="line"><span class="cm">     */</span>
</span><span class="line"><span class="w">    </span><span class="k">struct</span><span class="w"> </span><span class="nc">ring_shared</span><span class="w"> </span><span class="n">left</span><span class="p">,</span><span class="w"> </span><span class="n">right</span><span class="p">;</span>
</span><span class="line"><span class="w">    </span><span class="cm">/**</span>
</span><span class="line"><span class="cm">     * size of the rings, which determines their location</span>
</span><span class="line"><span class="cm">     * 10   - at offset 1024 in ring&#39;s page</span>
</span><span class="line"><span class="cm">     * 11   - at offset 2048 in ring&#39;s page</span>
</span><span class="line"><span class="cm">     * 12+  - uses 2^(N-12) grants to describe the multi-page ring</span>
</span><span class="line"><span class="cm">     * These should remain constant once the page is shared.</span>
</span><span class="line"><span class="cm">     * Only one of the two orders can be 10 (or 11).</span>
</span><span class="line"><span class="cm">     */</span>
</span><span class="line"><span class="w">    </span><span class="kt">uint16_t</span><span class="w"> </span><span class="n">left_order</span><span class="p">,</span><span class="w"> </span><span class="n">right_order</span><span class="p">;</span>
</span><span class="line"><span class="w">    </span><span class="cm">/**</span>
</span><span class="line"><span class="cm">     * Shutdown detection:</span>
</span><span class="line"><span class="cm">     *  0: client (or server) has exited</span>
</span><span class="line"><span class="cm">     *  1: client (or server) is connected</span>
</span><span class="line"><span class="cm">     *  2: client has not yet connected</span>
</span><span class="line"><span class="cm">     */</span>
</span><span class="line"><span class="w">    </span><span class="kt">uint8_t</span><span class="w"> </span><span class="n">cli_live</span><span class="p">,</span><span class="w"> </span><span class="n">srv_live</span><span class="p">;</span>
</span><span class="line"><span class="w">    </span><span class="cm">/**</span>
</span><span class="line"><span class="cm">     * Notification bits:</span>
</span><span class="line"><span class="cm">     *  VCHAN_NOTIFY_WRITE: send notify when data is written</span>
</span><span class="line"><span class="cm">     *  VCHAN_NOTIFY_READ: send notify when data is read (consumed)</span>
</span><span class="line"><span class="cm">     * cli_notify is used for the client to inform the server of its action</span>
</span><span class="line"><span class="cm">     */</span>
</span><span class="line"><span class="w">    </span><span class="kt">uint8_t</span><span class="w"> </span><span class="n">cli_notify</span><span class="p">,</span><span class="w"> </span><span class="n">srv_notify</span><span class="p">;</span>
</span><span class="line"><span class="w">    </span><span class="cm">/**</span>
</span><span class="line"><span class="cm">     * Grant list: ordering is left, right. Must not extend into actual ring</span>
</span><span class="line"><span class="cm">     * or grow beyond the end of the initial shared page.</span>
</span><span class="line"><span class="cm">     * These should remain constant once the page is shared, to allow</span>
</span><span class="line"><span class="cm">     * for possible remapping by a client that restarts.</span>
</span><span class="line"><span class="cm">     */</span>
</span><span class="line"><span class="w">    </span><span class="kt">uint32_t</span><span class="w"> </span><span class="n">grants</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
</span><span class="line"><span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You might also like to look at <a href="http://xenbits.xen.org/gitweb/?p=xen.git;a=tree;f=tools/libvchan;h=44e5af5adacc92511f29d1ab3e1c1037c7ea60fa;hb=HEAD">the vchan source code</a>.
Note that the <code>libxenvchan.h</code> file in this directory includes and extends
the above header file (with the same name).</p>
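<p>The comment about <code>left_order</code> and <code>right_order</code> is rather terse, so here is my reading of it as a small Python sketch.
It assumes that a ring of order N holds 2^N bytes and that pages are 4096 bytes (so each grant covers one page); the function names are hypothetical and exist only for this example.</p>

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, so an order-12 ring fills exactly one page


def ring_size(order: int) -> int:
    """Size of the ring in bytes, assuming an order-N ring holds 2^N bytes."""
    return 1 << order


def ring_location(order: int) -> str:
    """Where a ring of the given order lives, following the header comment."""
    if order == 10:
        return "offset 1024 in the ring's page"
    if order == 11:
        return "offset 2048 in the ring's page"
    if order >= 12:
        pages = 1 << (order - 12)
        return str(pages) + " granted page(s)"
    raise ValueError("the header comment does not describe orders below 10")


for order in (10, 11, 12, 14):
    print(order, ring_size(order), ring_location(order))
```

<p>Under these assumptions the two small sizes share a page with the control structure, while larger rings get pages of their own.</p>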
<p>For this blog post, we will ignore the Xen-specific business of sharing the memory
and telling the client where it is, and assume that the client has mapped the
memory and is ready to go.</p>
<h2 id="basic-tla-concepts">Basic TLA concepts</h2>
<p>We'll take a first look at TLA concepts and notation using a simplified version of vchan.
TLA comes with excellent documentation, so I won't try to make this a full tutorial,
but hopefully you will be able to follow the rest of this blog post after reading it.
We will just consider a single direction of the channel (e.g. client-to-server) here.</p>
<h3 id="variables-states-and-behaviour">Variables, states and behaviour</h3>
<p>A <em>variable</em> in TLA is just what a programmer expects: something that changes over time.
For example, I'll use <code>Buffer</code> to represent the data currently being transmitted.</p>
<p>We can also add variables that are just useful for the specification.
I use <code>Sent</code> to represent everything the sender-side application asked the vchan library to transmit,
and <code>Got</code> for everything the receiving application has received:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="kn">VARIABLES</span> <span class="n">Got</span><span class="p">,</span> <span class="n">Buffer</span><span class="p">,</span> <span class="n">Sent</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A <em>state</em> in TLA represents a snapshot of the world at some point.
It gives a value for each variable.
For example, <code>{ Got: &quot;H&quot;, Buffer: &quot;i&quot;, Sent: &quot;Hi&quot;, ... }</code> is a state.
The <code>...</code> is just a reminder that a state also includes everything else in the world,
not just the variables we care about.</p>
<p>Here are some more states:</p>
<table class="table"><thead><tr><th> State </th><th> Got </th><th> Buffer </th><th> Sent </th></tr></thead><tbody><tr><td> s0    </td><td> </td><td> </td><td> </td></tr><tr><td> s1    </td><td> </td><td> H      </td><td> H    </td></tr><tr><td> s2    </td><td> H   </td><td> </td><td> H    </td></tr><tr><td> s3    </td><td> H   </td><td> i      </td><td> Hi   </td></tr><tr><td> s4    </td><td> Hi  </td><td> </td><td> Hi   </td></tr><tr><td> s5    </td><td> iH  </td><td> </td><td> Hi   </td></tr></tbody></table><p>A <em>behaviour</em> is a sequence of states, representing some possible history of the world.
For example, <code>&lt;&lt; s0, s1, s2, s3, s4 &gt;&gt;</code> is a behaviour.
So is <code>&lt;&lt; s0, s1, s5 &gt;&gt;</code>, but not one we want.
The basic idea in TLA is to specify precisely which behaviours we want and which we don't want.</p>
<p>A <em>state expression</em> is an expression that can be evaluated in the context of some state.
For example, this defines <code>Integrity</code> to be a state expression that is true whenever what we have got
so far matches what we wanted to send:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* Take(m, i) is just the first i elements of message m. *)</span>
</span><span class="line"><span class="n">Take</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="ni">==</span> <span class="n">SubSeq</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="cm">(* Everything except the first i elements of message m. *)</span>
</span><span class="line"><span class="n">Drop</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="ni">==</span> <span class="n">SubSeq</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">i</span> <span class="o">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">Len</span><span class="p">(</span><span class="n">m</span><span class="p">))</span>
</span><span class="line">
</span><span class="line"><span class="n">Integrity</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">Take</span><span class="p">(</span><span class="n">Sent</span><span class="p">,</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">))</span> <span class="ni">=</span> <span class="n">Got</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Integrity</code> is true for all the states above except for <code>s5</code>.
I added some helper operators <code>Take</code> and <code>Drop</code> here.
Sequences in TLA+ can be confusing because they are indexed from 1 rather than from 0,
so it is easy to make off-by-one errors.
These operators just use lengths, which we can all agree on.
In Python syntax, <code>Integrity</code> would be written something like:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="k">def</span> <span class="nf">Integrity</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span><span class="line">    <span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">Sent</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">Got</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A <em>temporal formula</em> is an expression that is evaluated in the context of a complete behaviour.
It can use the temporal operators, which include:</p>
<ul>
<li><code>[]</code> (that's supposed to look like a square) : &quot;always&quot;
</li>
<li><code>&lt;&gt;</code> (that's supposed to look like a diamond) : &quot;eventually&quot;
</li>
</ul>
<p><code>[] F</code> is true if the expression <code>F</code> is true at <em>every</em> point in the behaviour.
<code>&lt;&gt; F</code> is true if the expression <code>F</code> is true at <em>any</em> point in the behaviour.</p>
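<p>To tie these ideas together, here is a small Python sketch that encodes the states from the table above and evaluates <code>Integrity</code>, <code>[]</code> and <code>&lt;&gt;</code> over them.
It is only an approximation: a real TLA behaviour is an infinite sequence of states, whereas this sketch works on finite lists, and the lowercase function names are my own.</p>

```python
from collections import namedtuple

# A state gives a value to each variable we care about.
State = namedtuple("State", ["Got", "Buffer", "Sent"])

s0 = State("", "", "")
s1 = State("", "H", "H")
s2 = State("H", "", "H")
s3 = State("H", "i", "Hi")
s4 = State("Hi", "", "Hi")
s5 = State("iH", "", "Hi")


def take(m, i):
    """The first i elements of m."""
    return m[:i]


def drop(m, i):
    """Everything except the first i elements of m."""
    return m[i:]


def integrity(s):
    """State expression: what we have got so far is a prefix of what was sent."""
    return take(s.Sent, len(s.Got)) == s.Got


def always(f, behaviour):
    """[] F, restricted to a finite list of states."""
    return all(f(s) for s in behaviour)


def eventually(f, behaviour):
    """Diamond F, restricted to a finite list of states."""
    return any(f(s) for s in behaviour)


good = [s0, s1, s2, s3, s4]
bad = [s0, s1, s5]

print(always(integrity, good))                    # True
print(always(integrity, bad))                     # False
print(eventually(lambda s: s.Got == "Hi", good))  # True
```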
<p>Messages we send should eventually arrive.
Here's one way to express that:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>
</span><span class="line">    <span class="p">[]</span> <span class="p">(</span><span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="ni">=</span> <span class="n">x</span> <span class="ni">=&gt;</span> <span class="ni">&lt;&gt;</span> <span class="p">(</span><span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span><span class="p">)</span> <span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>TLA syntax is a bit odd. It's rather like LaTeX (which is not surprising: Lamport is also the &quot;La&quot; in LaTeX).
<code>\A</code> means &quot;for all&quot; (rendered as an upside-down A).
So this says that for every number <code>x</code>, it is always true that if we have sent <code>x</code> bytes then
eventually we will have received at least <code>x</code> bytes.</p>
<p>This pattern of <code>[] (F =&gt; &lt;&gt;G)</code> is common enough that it has a shorter notation of <code>F ~&gt; G</code>, which
is read as &quot;F (always) leads to G&quot;. So, <code>Availability</code> can also be written as:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span> <span class="p">:</span>
</span><span class="line">    <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="ni">=</span> <span class="n">x</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We're only checking the lengths in <code>Availability</code>, but combined with <code>Integrity</code> that's enough to ensure
that we eventually receive what we want.
So ideally, we'd like to ensure that every possible behaviour of the vchan library will satisfy
the temporal formula <code>Properties</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Properties</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">Availability</span> <span class="o">/\</span> <span class="p">[]</span><span class="n">Integrity</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That <code>/\</code> is &quot;and&quot; by the way, and <code>\/</code> is &quot;or&quot;.
I did eventually start to be able to tell one from the other, though I still think <code>&amp;&amp;</code> and <code>||</code> would be easier.
In case I forget to explain some syntax, <a href="https://lamport.azurewebsites.net/tla/summary.pdf">A Summary of TLA</a> lists most of it.</p>
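<p>Here is a rough Python rendering of <code>~&gt;</code>, <code>Availability</code> and <code>Properties</code>, checked against a few hand-written behaviours.
As a simplifying assumption, each finite list stands for the infinite behaviour in which the final state repeats forever; the function names are again my own.</p>

```python
from collections import namedtuple

State = namedtuple("State", ["Got", "Buffer", "Sent"])


def integrity(s):
    """What we have got so far is a prefix of what was sent."""
    return s.Sent[:len(s.Got)] == s.Got


def leads_to(f, g, behaviour):
    """F ~> G: whenever f holds, g holds in that state or in some later one."""
    return all(
        any(g(later) for later in behaviour[i:])
        for i, s in enumerate(behaviour)
        if f(s)
    )


def availability(behaviour):
    # Lengths of Sent that never occur make the implication vacuously true,
    # so it is enough to check the values of x that do occur.
    return all(
        leads_to(lambda s, x=x: len(s.Sent) == x,
                 lambda s, x=x: len(s.Got) >= x,
                 behaviour)
        for x in {len(s.Sent) for s in behaviour}
    )


def properties(behaviour):
    """Availability and []Integrity."""
    return availability(behaviour) and all(integrity(s) for s in behaviour)


s0 = State("", "", "")
s1 = State("", "H", "H")
s2 = State("H", "", "H")
s3 = State("H", "i", "Hi")
s4 = State("Hi", "", "Hi")
s5 = State("iH", "", "Hi")

delivered = [s0, s1, s2, s3, s4]
stalled = [s0, s1, s2, s3]  # the "i" stays in the buffer forever
reordered = [s0, s1, s5]

print(properties(delivered))                           # True
print(availability(stalled))                           # False
print(availability(reordered), properties(reordered))  # True False
```

<p>The reordered behaviour satisfies <code>Availability</code> (the lengths match up) but not <code>Integrity</code>, which is why <code>Properties</code> needs both.</p>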
<h3 id="actions">Actions</h3>
<p>It is hopefully easy to see that <code>Properties</code> defines properties we want.
A user of vchan would be happy to see that these are things they can rely on.
But they don't provide much help to someone trying to implement vchan.
For that, TLA provides another way to specify behaviours.</p>
<p>An <em>action</em> in TLA is an expression that is evaluated in the context of a pair of states,
representing a single atomic step of the system.
For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Read</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">&gt;</span> <span class="m">0</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Got</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Buffer</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Buffer</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">Sent</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>Read</code> action is true of a step if that step transfers all the data from <code>Buffer</code> to <code>Got</code>.
Unprimed variables (e.g. <code>Buffer</code>) refer to the current state and primed ones (e.g. <code>Buffer'</code>)
refer to the next state.
There's some more strange notation here too:</p>
<ul>
<li>We're using <code>/\</code> to form a bulleted list here rather than as an infix operator.
This is indentation-sensitive. TLA also supports <code>\/</code> lists in the same way.
</li>
<li><code>\o</code> is sequence concatenation (<code>+</code> in Python).
</li>
<li><code>&lt;&lt; &gt;&gt;</code> is the empty sequence (<code>[ ]</code> in Python).
</li>
<li><code>UNCHANGED Sent</code> means <code>Sent' = Sent</code>.
</li>
</ul>
<p>In Python, it might look like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="k">def</span> <span class="nf">Read</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="nb">next</span><span class="p">):</span>
</span><span class="line">  <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">current</span><span class="o">.</span><span class="n">Buffer</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span> \
</span><span class="line">     <span class="ow">and</span> <span class="nb">next</span><span class="o">.</span><span class="n">Got</span> <span class="o">==</span> <span class="n">current</span><span class="o">.</span><span class="n">Got</span> <span class="o">+</span> <span class="n">current</span><span class="o">.</span><span class="n">Buffer</span> \
</span><span class="line">     <span class="ow">and</span> <span class="nb">next</span><span class="o">.</span><span class="n">Buffer</span> <span class="o">==</span> <span class="p">[]</span> \
</span><span class="line">     <span class="ow">and</span> <span class="nb">next</span><span class="o">.</span><span class="n">Sent</span> <span class="o">==</span> <span class="n">current</span><span class="o">.</span><span class="n">Sent</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Actions correspond more closely to code than temporal formulas,
because they only talk about how the next state is related to the current one.</p>
<p>This action only allows one thing: reading the whole buffer at once.
In the C implementation of vchan, the receiving application can provide a buffer of any size
and the library will read at most enough bytes to fill the buffer.
To model that, we will need a slightly more flexible version:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Read</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\E</span> <span class="n">n</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="p">:</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Got</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">Buffer</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Buffer</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">Buffer</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">Sent</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that a step is a <code>Read</code> step if there is any <code>n</code> (in the range 1 to the length of the buffer)
such that we transferred <code>n</code> bytes from the buffer. <code>\E</code> means &quot;there exists ...&quot;.</p>
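<p>As a rough Python sketch of the same idea (an illustration only: the <code>State</code> record and the use of tuples for sequences are assumptions, not part of the spec), the existential quantifier becomes a search over the possible values of <code>n</code>:</p>

```python
from collections import namedtuple

# Hypothetical state record; the field names mirror the spec's variables.
State = namedtuple("State", ["Got", "Buffer", "Sent"])

def Read(current, next):
    # True if some n in 1..Len(Buffer) explains the step (the \E quantifier).
    return any(
        next.Got == current.Got + current.Buffer[:n]   # Take(Buffer, n)
        and next.Buffer == current.Buffer[n:]          # Drop(Buffer, n)
        and next.Sent == current.Sent                  # UNCHANGED Sent
        for n in range(1, len(current.Buffer) + 1)
    )

s0 = State(Got=(), Buffer=(1, 2), Sent=(1, 2))
s1 = State(Got=(1,), Buffer=(2,), Sent=(1, 2))
s2 = State(Got=(1, 2), Buffer=(), Sent=(1, 2))
```

<p>Here both <code>Read(s0, s1)</code> and <code>Read(s0, s2)</code> hold, while <code>Read(s2, s2)</code> does not, because with an empty buffer there is no <code>n</code> to choose.</p>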
<p>A <code>Write</code> action can be defined in a similar way:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="kn">CONSTANT</span> <span class="n">BufferSize</span>
</span><span class="line"><span class="n">Byte</span> <span class="ni">==</span> <span class="m">0</span><span class="o">..</span><span class="m">255</span>
</span><span class="line">
</span><span class="line"><span class="n">Write</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\E</span> <span class="n">m</span> <span class="s">\in</span> <span class="n">Seq</span><span class="p">(</span><span class="n">Byte</span><span class="p">)</span> <span class="o">\</span> <span class="ni">{&lt;&lt;</span> <span class="ni">&gt;&gt;}</span> <span class="p">:</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Buffer</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Buffer</span> <span class="nb">\o</span> <span class="n">m</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="err">&#39;</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="n">BufferSize</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">Sent</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="n">Sent</span> <span class="nb">\o</span> <span class="n">m</span>
</span><span class="line">    <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">Got</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A <code>CONSTANT</code> defines a parameter (input) of the specification
(it's constant in the sense that it doesn't change between states).
A <code>Write</code> operation adds some message <code>m</code> to the buffer, and also adds a copy of it to <code>Sent</code>
so we can talk about what the system is doing.
<code>Seq(Byte)</code> is the set of all possible sequences of bytes,
and <code>\ {&lt;&lt; &gt;&gt;}</code> just excludes the empty sequence.</p>
<p>A step of the combined system is either a <code>Read</code> step or a <code>Write</code> step:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Next</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">Read</span> <span class="o">\/</span> <span class="n">Write</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We also need to define what a valid starting state for the algorithm looks like:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Init</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Buffer</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Got</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Finally, we can put all this together to get a temporal formula for the algorithm:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">vars</span> <span class="ni">==</span> <span class="ni">&lt;&lt;</span> <span class="n">Got</span><span class="p">,</span> <span class="n">Buffer</span><span class="p">,</span> <span class="n">Sent</span> <span class="ni">&gt;&gt;</span> 
</span><span class="line">
</span><span class="line"><span class="n">Spec</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">Init</span> <span class="o">/\</span> <span class="p">[][</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Some more notation here:</p>
<ul>
<li><code>[Next]_vars</code> (that's <code>Next</code> in brackets with a subscript <code>vars</code>) means
<code>Next \/ UNCHANGED vars</code>.
</li>
<li>Using <code>Init</code> (a state expression) in a temporal formula means it must be
true for the <em>first</em> state of the behaviour.
</li>
<li><code>[][Action]_vars</code> means that <code>[Action]_vars</code> must be true for each step.
</li>
</ul>
<p>TLA syntax requires the <code>_vars</code> subscript here.
This is because other things can be going on in the world besides our algorithm,
so it must always be possible to take a step without our algorithm doing anything.</p>
<p><code>Spec</code> defines behaviours just like <code>Properties</code> does,
but in a way that makes it more obvious how to implement the protocol.</p>
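<p>To make the shape of <code>Spec</code> concrete, here is a hedged Python sketch that checks a <em>finite prefix</em> of a behaviour against <code>Init /\ [][Next]_vars</code>. Real behaviours are infinite, so this is only an approximation; the <code>State</code> record, the tuple encoding of sequences and the value chosen for <code>BUFFER_SIZE</code> are all assumptions made for the example:</p>

```python
from collections import namedtuple

State = namedtuple("State", ["Got", "Buffer", "Sent"])
BUFFER_SIZE = 2  # assumed value for the BufferSize constant

def Read(cur, nxt):
    return any(
        nxt.Got == cur.Got + cur.Buffer[:n]
        and nxt.Buffer == cur.Buffer[n:]
        and nxt.Sent == cur.Sent
        for n in range(1, len(cur.Buffer) + 1)
    )

def Write(cur, nxt):
    # The message m is whatever was appended to Buffer. It must be a
    # non-empty sequence of bytes, fit in the buffer, and be added to Sent.
    k = len(cur.Buffer)
    m = nxt.Buffer[k:]
    return (
        len(m) > 0
        and all(b in range(256) for b in m)
        and nxt.Buffer[:k] == cur.Buffer
        and BUFFER_SIZE >= len(nxt.Buffer)
        and nxt.Sent == cur.Sent + m
        and nxt.Got == cur.Got
    )

def Init(s):
    return s == State((), (), ())

def Next(cur, nxt):
    return Read(cur, nxt) or Write(cur, nxt)

def satisfies_spec_prefix(behaviour):
    # Init holds in the first state, and every step is either a Next step
    # or a stuttering step that leaves all the variables unchanged.
    steps = zip(behaviour, behaviour[1:])
    return Init(behaviour[0]) and all(Next(a, b) or a == b for a, b in steps)

ok = [
    State((), (), ()),
    State((), (1, 2), (1, 2)),      # Write
    State((), (1, 2), (1, 2)),      # stuttering step
    State((1,), (2,), (1, 2)),      # Read one byte
    State((1, 2), (), (1, 2)),      # Read the other
]
jump = [State((), (), ()), State((1, 2), (), (1, 2))]
```

<p><code>satisfies_spec_prefix(ok)</code> is true, stuttering step included, whereas <code>jump</code> is rejected: no single <code>Read</code> or <code>Write</code> step moves data straight into <code>Got</code>.</p>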
<h3 id="correctness-of-spec">Correctness of Spec</h3>
<p>Now we have definitions of <code>Spec</code> and <code>Properties</code>,
it makes sense to check that every behaviour of <code>Spec</code> satisfies <code>Properties</code>.
In Python terms, we want to check that all behaviours <code>b</code> satisfy this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="k">def</span> <span class="nf">SpecOK</span><span class="p">(</span><span class="n">b</span><span class="p">):</span>
</span><span class="line">  <span class="k">return</span> <span class="ow">not</span> <span class="n">Spec</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="ow">or</span> <span class="n">Properties</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>i.e. either <code>b</code> isn't a behaviour that could result from the actions of our algorithm or,
if it is, it satisfies <code>Properties</code>. In TLA notation, we write this as:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">SpecOK</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">Spec</span> <span class="ni">=&gt;</span> <span class="n">Properties</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It's OK if a behaviour is allowed by <code>Properties</code> but not by <code>Spec</code>.
For example, the behaviour which goes straight from <code>Got=&quot;&quot;, Sent=&quot;&quot;</code> to
<code>Got=&quot;Hi&quot;, Sent=&quot;Hi&quot;</code> in one step meets our requirements, but it's not a
behaviour of <code>Spec</code>.</p>
<p>The real implementation may itself further restrict <code>Spec</code>.
For example, consider the behaviour <code>&lt;&lt; s0, s1, s2 &gt;&gt;</code>:</p>
<table class="table"><thead><tr><th> State </th><th> Got </th><th> Buffer </th><th> Sent </th></tr></thead><tbody><tr><td> s0    </td><td> </td><td> Hi     </td><td> Hi   </td></tr><tr><td> s1    </td><td> H   </td><td> i      </td><td> Hi   </td></tr><tr><td> s2    </td><td> Hi  </td><td> </td><td> Hi   </td></tr></tbody></table><p>The sender sends two bytes at once, but the reader reads them one at a time.
This <em>is</em> a behaviour of the C implementation,
because the reading application can ask the library to read into a 1-byte buffer.
However, it is <em>not</em> a behaviour of the OCaml implementation,
which gets to choose how much data to return to the application and will return both bytes together.</p>
<p>That's fine.
We just need to show that <code>OCamlImpl =&gt; Spec</code> and <code>Spec =&gt; Properties</code> and we can deduce that
<code>OCamlImpl =&gt; Properties</code>.
This is, of course, the key purpose of a specification:
we only need to check that each implementation implements the specification,
not that each implementation directly provides the desired properties.</p>
<p>It might seem strange that an implementation doesn't have to allow all the specified behaviours.
In fact, even the trivial specification <code>Spec == FALSE</code> is considered to be a correct implementation of <code>Properties</code>,
because it has no bad behaviours (no behaviours at all).
But that's OK.
Once the algorithm is running, it must have <em>some</em> behaviour, even if that behaviour is to do nothing.
As the user of the library, you are responsible for checking that you can use it
(e.g. by ensuring that the <code>Init</code> conditions are met).
An algorithm without any behaviours corresponds to a library you could never use,
not to one that goes wrong once it is running.</p>
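<p>One way to picture this, as a loose sketch rather than anything TLA itself does, is to treat each formula as the set of behaviours it allows. Then <code>A =&gt; B</code> holding for all behaviours is just the subset relation. The behaviour labels below are invented for the illustration:</p>

```python
# Each formula is modelled as the set of (labelled) behaviours it allows.
# The labels are hypothetical stand-ins for whole behaviours.
properties = {"both-bytes-at-once", "one-byte-at-a-time", "jump-in-one-step"}
spec = {"both-bytes-at-once", "one-byte-at-a-time"}
ocaml_impl = {"both-bytes-at-once"}
false_spec = set()  # Spec == FALSE allows no behaviours at all

def implies(a, b):
    # "a => b" is valid when every behaviour allowed by a is allowed by b.
    return a.issubset(b)
```

<p>With these sets, <code>implies(ocaml_impl, spec)</code> and <code>implies(spec, properties)</code> give <code>implies(ocaml_impl, properties)</code> by transitivity of the subset relation, and <code>implies(false_spec, properties)</code> holds trivially because the empty set is a subset of everything.</p>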
<h3 id="the-model-checker">The model checker</h3>
<p>Now comes the fun part: we can ask TLC (the TLA model checker) to check that <code>Spec =&gt; Properties</code>.
You do this by asking the toolbox to create a new model (I called mine <code>SpecOK</code>) and setting <code>Spec</code> as the
&quot;behaviour spec&quot;. It will prompt for a value for <code>BufferSize</code>. I used <code>2</code>.
There will be various things to fix up:</p>
<ul>
<li>To check <code>Write</code>, TLC first tries to enumerate every element of <code>Seq(Byte)</code>, which is an infinite set.
I defined <code>MSG == Seq(Byte)</code> and changed <code>Write</code> to use <code>MSG</code>.
I then added an alternative definition for <code>MSG</code> in the model so that we only send messages of limited length.
In fact, my replacement <code>MSG</code> ensures that <code>Sent</code> will always just be an incrementing sequence (<code>&lt;&lt; 1, 2, 3, ... &gt;&gt;</code>).
That's enough to check <code>Properties</code>, and much quicker than checking every possible message.
</li>
<li>The system can keep sending forever. I added a state constraint to the model: <code>Len(Sent) &lt; 4</code>.
This tells TLC to stop considering any execution once this becomes false.
</li>
</ul>
<p>With that, the model runs successfully.
This is a nice feature of TLA: instead of changing our specification to make it testable,
we keep the specification correct and just override some aspects of it in the model.
So, the specification says we can send any message, but the model only checks a few of them.</p>
<p>Now we can add <code>Integrity</code> as an invariant to check.
That passes, but it's good to double-check by changing the algorithm.
I changed <code>Read</code> so that it doesn't clear the buffer, using <code>Buffer' = Drop(Buffer, 0)</code>
(with <code>0</code> instead of <code>n</code>).
Then TLC reports a counter-example (&quot;Invariant Integrity is violated&quot;):</p>
<ol>
<li>The sender writes <code>&lt;&lt; 1, 2 &gt;&gt;</code> to <code>Buffer</code>.
</li>
<li>The reader reads one byte, to give <code>Got=1, Buffer=12, Sent=12</code>.
</li>
<li>The reader reads another byte, to give <code>Got=11, Buffer=12, Sent=12</code>.
</li>
</ol>
<p>Looks like it really was checking what we wanted.
It's good to be careful. If we'd accidentally added <code>Integrity</code> as a &quot;property&quot; to check rather than
as an &quot;invariant&quot; then it would have interpreted it as a temporal formula and reported success just because
it <em>is</em> true in the <em>initial</em> state.</p>
<p>One really nice feature of TLC is that (unlike a fuzz tester) it does a breadth-first search and therefore
finds minimal counter-examples for invariants.
The example above is therefore the quickest way to violate <code>Integrity</code>.</p>
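<p>The breadth-first idea can be sketched in a few lines of Python. This is a toy explicit-state search, not how TLC is implemented; it assumes that <code>Integrity</code> means &quot;<code>Got</code> is a prefix of <code>Sent</code>&quot;, uses a buffer size of 2 and the incrementing-message trick described above, and the <code>broken_read</code> flag reproduces the deliberately broken <code>Read</code>:</p>

```python
from collections import deque, namedtuple

State = namedtuple("State", ["Got", "Buffer", "Sent"])
BUFFER_SIZE = 2   # same value as used for the model in the text
MAX_SENT = 4      # state constraint: stop exploring once Len(Sent) reaches this

def successors(s, broken_read=False):
    # Read: move the first n bytes of Buffer to Got.
    for n in range(1, len(s.Buffer) + 1):
        rest = s.Buffer if broken_read else s.Buffer[n:]
        yield State(s.Got + s.Buffer[:n], rest, s.Sent)
    # Write: append the next few numbers, so Sent is always 1, 2, 3, ...
    for size in range(1, BUFFER_SIZE - len(s.Buffer) + 1):
        start = len(s.Sent) + 1
        m = tuple(range(start, start + size))
        yield State(s.Got, s.Buffer + m, s.Sent + m)

def integrity(s):
    # Assumed meaning of Integrity: Got is a prefix of Sent.
    return s.Sent[:len(s.Got)] == s.Got

def check(invariant, broken_read=False):
    init = State((), (), ())
    parent = {init: None}          # also serves as the set of seen states
    queue = deque([init])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []             # walk back to the initial state
            while s is not None:
                trace.append(s)
                s = parent[s]
            return trace[::-1]
        if len(s.Sent) >= MAX_SENT:
            continue               # state constraint: do not explore further
        for t in successors(s, broken_read):
            if t not in parent:
                parent[t] = s
                queue.append(t)
    return None                    # no reachable state violates the invariant
```

<p><code>check(integrity)</code> returns <code>None</code>, while <code>check(integrity, broken_read=True)</code> returns a three-step trace. Because states are visited in order of distance from the initial state, no shorter violation exists, although which of several equally short traces is reported depends on the order in which successors are generated.</p>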
<p>Checking <code>Availability</code> complains because of the use of <code>Nat</code> (we're asking it to check for every possible
length).
I replaced the <code>Nat</code> with <code>AvailabilityNat</code> and overrode that to be <code>0..4</code> in the model.
It then complains &quot;Temporal properties were violated&quot; and shows an example where the sender wrote
some data and the reader never read it.</p>
<p>The problem is, <code>[Next]_vars</code> always allows us to do nothing.
To fix this, we can specify a &quot;weak fairness&quot; constraint.
<code>WF_vars(action)</code> says that we can't just stop forever with <code>action</code> being always possible but never happening.
I updated <code>Spec</code> to require the <code>Read</code> action to be fair:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Spec</span> <span class="ni">==</span> <span class="n">Init</span> <span class="o">/\</span> <span class="p">[][</span><span class="n">Next</span><span class="p">]</span><span class="n">_vars</span> <span class="o">/\</span> <span class="n">WF_vars</span><span class="p">(</span><span class="n">Read</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Again, care is needed here.
If we had specified <code>WF_vars(Next)</code> then we would be forcing the sender to keep sending forever, which users of vchan are not required to do.
Worse, this would mean that every possible behaviour of the system would result in <code>Sent</code> growing forever.
Every behaviour would therefore hit our <code>Len(Sent) &lt; 4</code> constraint and
TLC wouldn't consider it further.
That means that TLC would <em>never</em> check any actual behaviour against <code>Availability</code>,
and its reports of success would be meaningless!
Changing <code>Read</code> to require <code>n \in 2..Len(Buffer)</code> is a quick way to see that TLC is actually checking <code>Availability</code>.</p>
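<p>For intuition, weak fairness can be checked by hand on a behaviour that eventually repeats a fixed cycle of states forever. The following Python sketch is an assumption-laden illustration (the cycle representation and helper names are mine, and it is not how TLC evaluates fairness): <code>WF_vars(Read)</code> holds if, somewhere in the cycle, <code>Read</code> is either disabled or actually taken.</p>

```python
from collections import namedtuple

State = namedtuple("State", ["Got", "Buffer", "Sent"])

def read_enabled(s):
    # Read is possible whenever there is at least one byte in the buffer.
    return len(s.Buffer) > 0

def is_read_step(cur, nxt):
    return any(
        nxt == State(cur.Got + cur.Buffer[:n], cur.Buffer[n:], cur.Sent)
        for n in range(1, len(cur.Buffer) + 1)
    )

def weakly_fair_read(cycle):
    # cycle: the list of states repeated forever (the last wraps to the first).
    # Only the cycle matters, since fairness is about what happens infinitely often.
    steps = zip(cycle, cycle[1:] + cycle[:1])
    sometimes_disabled = any(not read_enabled(s) for s in cycle)
    sometimes_taken = any(is_read_step(a, b) for a, b in steps)
    return sometimes_disabled or sometimes_taken

stuck = [State((), (1,), (1,))]   # data is waiting but nobody ever reads it
idle = [State((1,), (), (1,))]    # everything has been read; nothing left to do
```

<p><code>weakly_fair_read(stuck)</code> is false, so the fairness condition rules out the behaviour where the reader never reads, whereas <code>weakly_fair_read(idle)</code> is true: stopping is fine once there is nothing left to read.</p>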
<p>Here's the complete spec so far: <a href="/blog/images/tla/vchan1.pdf">vchan1.pdf</a> (<a href="https://github.com/talex5/spec-vchan/commit/75a846d5c83d86ba7be42b5c3b9f98635bcc544d">source</a>)</p>
<h2 id="the-real-vchan">The real vchan</h2>
<p>The simple <code>Spec</code> algorithm above has some limitations.
One obvious simplification is that <code>Buffer</code> is just the sequence of bytes in transit, whereas in the real system it is a ring buffer, made up of an array of bytes along with the producer and consumer counters.
We could replace it with three separate variables to make that explicit.
However, ring buffers in Xen are well understood and I don't feel that it would make the specification any clearer
to include that.</p>
<p>A more serious problem is that <code>Spec</code> assumes that there is a way to perform the <code>Read</code> and <code>Write</code> operations atomically.
Otherwise the real system would have behaviours not covered by the spec.
To implement the above <code>Spec</code> correctly, you'd need some kind of lock.
The real vchan protocol is more complicated than <code>Spec</code>, but avoids the need for a lock.</p>
<p>The real system has more shared state than just <code>Buffer</code>.
I added extra variables to the spec for each item of shared state in the C code, along with its initial value:</p>
<ul>
<li><code>SenderLive = TRUE</code> (sender sets to FALSE to close connection)
</li>
<li><code>ReceiverLive = TRUE</code> (receiver sets to FALSE to close connection)
</li>
<li><code>NotifyWrite = TRUE</code> (receiver wants to be notified of next write)
</li>
<li><code>DataReadyInt = FALSE</code> (sender has signalled receiver over event channel)
</li>
<li><code>NotifyRead = FALSE</code> (sender wants to be notified of next read)
</li>
<li><code>SpaceAvailableInt = FALSE</code> (receiver has notified sender over event channel)
</li>
</ul>
<p><code>DataReadyInt</code> represents the state of the receiver's event port.
The sender can make a Xen hypercall to set this and wake (or interrupt) the receiver.
I guess sending these events is somewhat slow,
because the <code>NotifyWrite</code> system is used to avoid sending events unnecessarily.
Likewise, <code>SpaceAvailableInt</code> is the sender's event port.</p>
<h3 id="the-algorithm">The algorithm</h3>
<p>Here is my understanding of the protocol. On the sending side:</p>
<ol>
<li>The sending application asks to send some bytes.<br/>
We check whether the receiver has closed the channel and abort if so.
</li>
<li>We check the amount of buffer space available.
</li>
<li>If there isn't enough, we set <code>NotifyRead</code> so the receiver will notify us when there is more.<br/>
We also check the space again after this, in case it changed while setting the flag.
</li>
<li>If there is any space:
<ul>
<li>We write as much data as we can to the buffer.
</li>
<li>If the <code>NotifyWrite</code> flag is set, we clear it and notify the receiver of the write.
</li>
</ul>
</li>
<li>If we wrote everything, we return success.
</li>
<li>Otherwise, we wait to be notified of more space.
</li>
<li>We check whether the receiver has closed the channel.<br/>
If so we abort. Otherwise, we go back to step 2.
</li>
</ol>
<p>On the receiving side:</p>
<ol>
<li>The receiving application asks us to read up to some amount of data.
</li>
<li>We check the amount of data available in the buffer.
</li>
<li>If there isn't as much as requested, we set <code>NotifyWrite</code> so the sender will notify us when there is.<br/>
We also check the amount of data again after this, in case it changed while setting the flag.
</li>
<li>If there is any data, we read up to the amount requested.<br/>
If the <code>NotifyRead</code> flag is set, we clear it and notify the sender of the new space.<br/>
We return success to the application (even if we didn't get as much as requested).
</li>
<li>Otherwise (if there was no data), we check whether the sender has closed the connection.
</li>
<li>If not (if the connection is still open), we wait to be notified of more data,
and then go back to step 2.
</li>
</ol>
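<p>The receiving steps can be sketched as straight-line Python. This is only a reading aid under strong assumptions: it runs in a single thread, so it shows the control flow but none of the concurrency that makes the protocol interesting, and the dictionary of shared state and the <code>receiver_try_read</code> name are hypothetical.</p>

```python
def receiver_try_read(shared, want):
    """One pass through receiver steps 2 to 6; returns (status, data)."""
    available = len(shared["Buffer"])               # step 2
    if want > available:
        shared["NotifyWrite"] = True                # step 3: ask to be notified
        available = len(shared["Buffer"])           # re-check after setting the flag
    if available > 0:                               # step 4
        n = min(want, available)
        data = shared["Buffer"][:n]
        shared["Buffer"] = shared["Buffer"][n:]
        if shared["NotifyRead"]:
            shared["NotifyRead"] = False
            shared["SpaceAvailableInt"] = True      # notify the sender of new space
        return ("ok", data)
    if not shared["SenderLive"]:                    # step 5
        return ("closed", [])
    return ("wait", [])                             # step 6: caller waits, then retries

shared = {
    "Buffer": [1, 2, 3],
    "NotifyWrite": False,
    "NotifyRead": True,
    "SpaceAvailableInt": False,
    "SenderLive": True,
}
first = receiver_try_read(shared, 2)
second = receiver_try_read(shared, 5)
third = receiver_try_read(shared, 1)
```

<p>The first call returns two bytes and signals the sender, the second returns the single remaining byte even though five were requested, and the third finds nothing and reports that the caller should wait.</p>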
<p>Either side can close the connection by clearing their &quot;live&quot; flag and signalling
the other side. I assumed there is also some process-local way that the close operation
can notify its own side if it's currently blocked.</p>
<p>To make expressing this kind of step-by-step algorithm easier,
TLA+ provides a programming-language-like syntax called PlusCal.
It then translates PlusCal into TLA actions.</p>
<p>Confusingly, there are two different syntaxes for PlusCal: Pascal style and C style.
This means that, when you search for examples on the web,
there is a 50% chance they won't work because they're using the other flavour.
I started with the Pascal one because that was the first example I found, but switched to C-style later because it was more compact.</p>
<p>Here is my attempt at describing the sender algorithm above in PlusCal:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line">  <span class="n">fair</span> <span class="n">process</span> <span class="p">(</span><span class="n">SenderWrite</span> <span class="ni">=</span> <span class="n">SenderWriteID</span><span class="p">)</span>
</span><span class="line">  <span class="n">variables</span> <span class="n">free</span> <span class="ni">=</span> <span class="m">0</span><span class="p">,</span>     <span class="c">\* Our idea of how much free space is available.</span>
</span><span class="line">            <span class="n">msg</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span><span class="p">,</span>  <span class="c">\* The data we haven&#39;t sent yet.</span>
</span><span class="line">            <span class="n">Sent</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span><span class="p">;</span> <span class="c">\* Everything we were asked to send.</span>
</span><span class="line">  <span class="ni">{</span>
</span><span class="line"><span class="n">sender_ready</span><span class="p">:</span><span class="o">-</span>        <span class="n">while</span> <span class="p">(</span><span class="bp">TRUE</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                        <span class="n">if</span> <span class="p">(</span><span class="o">~</span><span class="n">SenderLive</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span><span class="p">)</span> <span class="n">goto</span> <span class="n">Done</span>
</span><span class="line">                        <span class="n">else</span> <span class="ni">{</span>
</span><span class="line">                          <span class="n">with</span> <span class="p">(</span><span class="n">m</span> <span class="s">\in</span> <span class="n">MSG</span><span class="p">)</span> <span class="ni">{</span> <span class="n">msg</span> <span class="o">:=</span> <span class="n">m</span> <span class="ni">}</span><span class="p">;</span>
</span><span class="line">                          <span class="n">Sent</span> <span class="o">:=</span> <span class="n">Sent</span> <span class="nb">\o</span> <span class="n">msg</span><span class="p">;</span>    <span class="c">\* Remember we wanted to send this</span>
</span><span class="line">                        <span class="ni">}</span><span class="p">;</span>
</span><span class="line"><span class="n">sender_write</span><span class="p">:</span>           <span class="n">while</span> <span class="p">(</span><span class="bp">TRUE</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                          <span class="n">free</span> <span class="o">:=</span> <span class="n">BufferSize</span> <span class="o">-</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">);</span>
</span><span class="line"><span class="n">sender_request_notify</span><span class="p">:</span>    <span class="n">if</span> <span class="p">(</span><span class="n">free</span> <span class="o">&gt;=</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">))</span> <span class="n">goto</span> <span class="n">sender_write_data</span>
</span><span class="line">                          <span class="n">else</span> <span class="n">NotifyRead</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>
</span><span class="line"><span class="n">sender_recheck_len</span><span class="p">:</span>       <span class="n">free</span> <span class="o">:=</span> <span class="n">BufferSize</span> <span class="o">-</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">);</span>
</span><span class="line"><span class="n">sender_write_data</span><span class="p">:</span>        <span class="n">if</span> <span class="p">(</span><span class="n">free</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                            <span class="n">Buffer</span> <span class="o">:=</span> <span class="n">Buffer</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="n">Min</span><span class="p">(</span><span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">),</span> <span class="n">free</span><span class="p">));</span>
</span><span class="line">                            <span class="n">msg</span> <span class="o">:=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="n">Min</span><span class="p">(</span><span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">),</span> <span class="n">free</span><span class="p">));</span>
</span><span class="line">                            <span class="n">free</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span>
</span><span class="line"><span class="n">sender_check_notify_data</span><span class="p">:</span>   <span class="n">if</span> <span class="p">(</span><span class="n">NotifyWrite</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                              <span class="n">NotifyWrite</span> <span class="o">:=</span> <span class="bp">FALSE</span><span class="p">;</span>   <span class="c">\* Atomic test-and-clear</span>
</span><span class="line"><span class="n">sender_notify_data</span><span class="p">:</span>           <span class="n">DataReadyInt</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>   <span class="c">\* Signal receiver</span>
</span><span class="line">                              <span class="n">if</span> <span class="p">(</span><span class="n">msg</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span><span class="p">)</span> <span class="n">goto</span> <span class="n">sender_ready</span>
</span><span class="line">                            <span class="ni">}</span> <span class="n">else</span> <span class="n">if</span> <span class="p">(</span><span class="n">msg</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span><span class="p">)</span> <span class="n">goto</span> <span class="n">sender_ready</span>
</span><span class="line">                          <span class="ni">}</span><span class="p">;</span>
</span><span class="line"><span class="n">sender_blocked</span><span class="p">:</span>           <span class="n">await</span> <span class="n">SpaceAvailableInt</span> <span class="o">\/</span> <span class="o">~</span><span class="n">SenderLive</span><span class="p">;</span>
</span><span class="line">                          <span class="n">if</span> <span class="p">(</span><span class="o">~</span><span class="n">SenderLive</span><span class="p">)</span> <span class="n">goto</span> <span class="n">Done</span><span class="p">;</span>
</span><span class="line">                          <span class="n">else</span> <span class="n">SpaceAvailableInt</span> <span class="o">:=</span> <span class="bp">FALSE</span><span class="p">;</span>
</span><span class="line"><span class="n">sender_check_recv_live</span><span class="p">:</span>   <span class="n">if</span> <span class="p">(</span><span class="o">~</span><span class="n">ReceiverLive</span><span class="p">)</span> <span class="n">goto</span> <span class="n">Done</span><span class="p">;</span>
</span><span class="line">                        <span class="ni">}</span>
</span><span class="line">                      <span class="ni">}</span>
</span><span class="line">  <span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The labels (e.g. <code>sender_request_notify:</code>) represent points in the program where other actions can happen.
Everything between two labels is considered to be atomic.
I <a href="https://github.com/talex5/spec-vchan/blob/d6e1c803820c952c53314da47270812e2fe88e79/vchan.tla#L654-L692">checked</a> that every block of code between labels accesses only one shared variable.
This means that the real system can't see any states that we don't consider.
The toolbox doesn't provide any help with this; you just have to check manually.</p>
<p>The <code>sender_ready</code> label represents a state where the client application hasn't yet decided to send any data.
Its label is tagged with <code>-</code> to indicate that fairness doesn't apply here, because the protocol doesn't
require applications to keep sending more data forever.
The other steps are fair, because once we've decided to send something we should keep going.</p>
<p>Taking a step from <code>sender_ready</code> to <code>sender_write</code> corresponds to the vchan library's write function
being called with some argument <code>m</code>.
The <code>with (m \in MSG)</code> says that <code>m</code> could be any message from the set <code>MSG</code>.
TLA also contains a <code>CHOOSE</code> operator that looks like it might do the same thing, but it doesn't.
When you use <code>with</code>, you are saying that TLC should check <em>all</em> possible messages.
When you use <code>CHOOSE</code>, you are saying that it doesn't matter which message TLC tries (and it will always try the
same one).
Or, in terms of the specification, a <code>CHOOSE</code> would say that applications can only ever send one particular message, without telling you what that message is.</p>
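<p>As a rough sketch of the difference, here is how the two might look as TLA actions.
The action names <code>SendAny</code> and <code>SendOne</code> are made up for this illustration, and I'm assuming the body of the <code>with</code> just assigns <code>m</code> to <code>msg</code>
(the real translation also updates <code>pc</code> and lists the unchanged variables):</p>
<figure class="code"><div class="highlight"><pre><code class="tla">\* with (m \in MSG): some message from MSG is sent, and TLC explores every choice.
SendAny == \E m \in MSG : msg' = m

\* CHOOSE: always the same (unspecified) element of MSG, so TLC explores only one.
SendOne == msg' = CHOOSE m \in MSG : TRUE
</code></pre></div></figure>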
<p>In <code>sender_write_data</code>, we set <code>free := 0</code> for no obvious reason.
This is just to reduce the number of states that the model checker needs to explore,
since we don't care about its value after this point.</p>
<p>Some of the code is a little awkward because I had to put things in <code>else</code> branches that would more naturally go after the whole <code>if</code> block, but the translator wouldn't let me do that.
The use of semi-colons is also a bit confusing: the PlusCal-to-TLA translator requires them after a closing brace in some places, but the PDF generator messes up the indentation if you include them.</p>
<p>Here's how the code block starting at <code>sender_request_notify</code> gets translated into a TLA action:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">sender_request_notify</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_request_notify&quot;</span>
</span><span class="line">  <span class="o">/\</span> <span class="k k-Conditional">IF</span> <span class="n">free</span> <span class="o">&gt;=</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</span><span class="line">        <span class="k k-Conditional">THEN</span> <span class="o">/\</span> <span class="n">pc</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="p">[</span><span class="n">pc</span> <span class="n">EXCEPT</span> <span class="err">!</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_write_data&quot;</span><span class="p">]</span>
</span><span class="line">             <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="n">NotifyRead</span>
</span><span class="line">        <span class="k k-Conditional">ELSE</span> <span class="o">/\</span> <span class="n">NotifyRead</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="bp">TRUE</span>
</span><span class="line">             <span class="o">/\</span> <span class="n">pc</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="p">[</span><span class="n">pc</span> <span class="n">EXCEPT</span> <span class="err">!</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_recheck_len&quot;</span><span class="p">]</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="ni">&lt;&lt;</span> <span class="n">SenderLive</span><span class="p">,</span> <span class="n">ReceiverLive</span><span class="p">,</span> <span class="n">Buffer</span><span class="p">,</span> 
</span><span class="line">                  <span class="n">NotifyWrite</span><span class="p">,</span> <span class="n">DataReadyInt</span><span class="p">,</span> 
</span><span class="line">                  <span class="n">SpaceAvailableInt</span><span class="p">,</span> <span class="n">free</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">Sent</span><span class="p">,</span> 
</span><span class="line">                  <span class="n">have</span><span class="p">,</span> <span class="n">want</span><span class="p">,</span> <span class="n">Got</span> <span class="ni">&gt;&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>pc</code> is a mapping from process ID to the label where that process is currently executing.
So <code>sender_request_notify</code> can only be performed when the SenderWriteID process is
at the <code>sender_request_notify</code> label.
Afterwards <code>pc[SenderWriteID]</code> will either be at <code>sender_write_data</code> or <code>sender_recheck_len</code>
(if there wasn't enough space for the whole message).</p>
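<p>The translator also combines the actions for all of a process's labels into a single action named after the process.
For the sender, that should look roughly like this (a sketch assembled from the labels used above, not copied from the generated file, so the order may differ):</p>
<figure class="code"><div class="highlight"><pre><code class="tla">SenderWrite == sender_ready \/ sender_write \/ sender_request_notify
                 \/ sender_recheck_len \/ sender_write_data
                 \/ sender_check_notify_data \/ sender_notify_data
                 \/ sender_blocked \/ sender_check_recv_live
</code></pre></div></figure>
<p>Each disjunct starts with a test on <code>pc[SenderWriteID]</code>, so at most one of them can be enabled in any given state.</p>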
<p>Here's the code for the receiver:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line">  <span class="n">fair</span> <span class="n">process</span> <span class="p">(</span><span class="n">ReceiverRead</span> <span class="ni">=</span> <span class="n">ReceiverReadID</span><span class="p">)</span>
</span><span class="line">  <span class="n">variables</span> <span class="n">have</span> <span class="ni">=</span> <span class="m">0</span><span class="p">,</span>     <span class="c">\* The amount of data we think the buffer contains.</span>
</span><span class="line">            <span class="n">want</span> <span class="ni">=</span> <span class="m">0</span><span class="p">,</span>     <span class="c">\* The amount of data the user wants us to read.</span>
</span><span class="line">            <span class="n">Got</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span><span class="p">;</span>  <span class="c">\* Pseudo-variable recording all data ever received by receiver.</span>
</span><span class="line">  <span class="ni">{</span>
</span><span class="line"><span class="n">recv_ready</span><span class="p">:</span>         <span class="n">while</span> <span class="p">(</span><span class="n">ReceiverLive</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                      <span class="n">with</span> <span class="p">(</span><span class="n">n</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">MaxReadLen</span><span class="p">)</span> <span class="n">want</span> <span class="o">:=</span> <span class="n">n</span><span class="p">;</span>
</span><span class="line"><span class="n">recv_reading</span><span class="p">:</span>         <span class="n">while</span> <span class="p">(</span><span class="bp">TRUE</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                        <span class="n">have</span> <span class="o">:=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">);</span>
</span><span class="line"><span class="n">recv_got_len</span><span class="p">:</span>           <span class="n">if</span> <span class="p">(</span><span class="n">have</span> <span class="o">&gt;=</span> <span class="n">want</span><span class="p">)</span> <span class="n">goto</span> <span class="n">recv_read_data</span>
</span><span class="line">                        <span class="n">else</span> <span class="n">NotifyWrite</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>
</span><span class="line"><span class="n">recv_recheck_len</span><span class="p">:</span>       <span class="n">have</span> <span class="o">:=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">);</span>
</span><span class="line"><span class="n">recv_read_data</span><span class="p">:</span>         <span class="n">if</span> <span class="p">(</span><span class="n">have</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                          <span class="n">Got</span> <span class="o">:=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">Buffer</span><span class="p">,</span> <span class="n">Min</span><span class="p">(</span><span class="n">want</span><span class="p">,</span> <span class="n">have</span><span class="p">));</span>
</span><span class="line">                          <span class="n">Buffer</span> <span class="o">:=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">Buffer</span><span class="p">,</span> <span class="n">Min</span><span class="p">(</span><span class="n">want</span><span class="p">,</span> <span class="n">have</span><span class="p">));</span>
</span><span class="line">                          <span class="n">want</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span>
</span><span class="line">                          <span class="n">have</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span>
</span><span class="line"><span class="n">recv_check_notify_read</span><span class="p">:</span>   <span class="n">if</span> <span class="p">(</span><span class="n">NotifyRead</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                            <span class="n">NotifyRead</span> <span class="o">:=</span> <span class="bp">FALSE</span><span class="p">;</span>      <span class="c">\* (atomic test-and-clear)</span>
</span><span class="line"><span class="n">recv_notify_read</span><span class="p">:</span>           <span class="n">SpaceAvailableInt</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>
</span><span class="line">                            <span class="n">goto</span> <span class="n">recv_ready</span><span class="p">;</span>          <span class="c">\* Return success</span>
</span><span class="line">                          <span class="ni">}</span> <span class="n">else</span> <span class="n">goto</span> <span class="n">recv_ready</span><span class="p">;</span>     <span class="c">\* Return success</span>
</span><span class="line">                        <span class="ni">}</span> <span class="n">else</span> <span class="n">if</span> <span class="p">(</span><span class="o">~</span><span class="n">SenderLive</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                          <span class="n">goto</span> <span class="n">Done</span><span class="p">;</span>
</span><span class="line">                        <span class="ni">}</span><span class="p">;</span>
</span><span class="line"><span class="n">recv_await_data</span><span class="p">:</span>        <span class="n">await</span> <span class="n">DataReadyInt</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span><span class="p">;</span>
</span><span class="line">                        <span class="n">if</span> <span class="p">(</span><span class="o">~</span><span class="n">ReceiverLive</span><span class="p">)</span> <span class="ni">{</span> <span class="n">want</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">goto</span> <span class="n">Done</span> <span class="ni">}</span>
</span><span class="line">                        <span class="n">else</span> <span class="n">DataReadyInt</span> <span class="o">:=</span> <span class="bp">FALSE</span><span class="p">;</span>
</span><span class="line">                      <span class="ni">}</span>
</span><span class="line">                    <span class="ni">}</span>
</span><span class="line">  <span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It's quite similar to before.
<code>recv_ready</code> corresponds to a state where the application hasn't yet called <code>read</code>.
When it does, we take <code>n</code> (the maximum number of bytes to read) as an argument and
store it in the local variable <code>want</code>.</p>
<p>Note: you can use the C library in blocking or non-blocking mode.
In blocking mode, a <code>write</code> (or <code>read</code>) waits until data is sent (or received).
In non-blocking mode, it returns a special code to the application indicating that it needs to wait.
The application then does the waiting itself and then calls the library again.
I think the specification above covers both cases, depending on whether you think of
<code>sender_blocked</code> and <code>recv_await_data</code> as representing code inside or outside of the library.</p>
<p>We also need a way to close the channel.
It wasn't clear to me, from looking at the C headers, when exactly you're allowed to do that.
I <em>think</em> that if you had a multi-threaded program and you called the close function while the write
function was blocked, it would unblock and return.
But if you happened to call it at the wrong time, it would try to use a closed file descriptor and fail
(or read from the wrong one).
So I guess the library is meant to be used from a single thread, and you should use the non-blocking mode if you want to cancel things.</p>
<p>That means that the sender can close only when it is at <code>sender_ready</code> or <code>sender_blocked</code>,
and similarly for the receiver.
The situation with the OCaml code is the same, because it is cooperatively threaded and so the close
operation can only be called while blocked or idle.
However, I decided to make the specification more general and allow for closing at any point
by modelling closing as separate processes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line">  <span class="n">fair</span> <span class="n">process</span> <span class="p">(</span><span class="n">SenderClose</span> <span class="ni">=</span> <span class="n">SenderCloseID</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">    <span class="n">sender_open</span><span class="p">:</span><span class="o">-</span>         <span class="n">SenderLive</span> <span class="o">:=</span> <span class="bp">FALSE</span><span class="p">;</span>  <span class="c">\* Clear liveness flag</span>
</span><span class="line">    <span class="n">sender_notify_closed</span><span class="p">:</span> <span class="n">DataReadyInt</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span> <span class="c">\* Signal receiver</span>
</span><span class="line">  <span class="ni">}</span>
</span><span class="line">
</span><span class="line">  <span class="n">fair</span> <span class="n">process</span> <span class="p">(</span><span class="n">ReceiverClose</span> <span class="ni">=</span> <span class="n">ReceiverCloseID</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">    <span class="n">recv_open</span><span class="p">:</span><span class="o">-</span>         <span class="n">ReceiverLive</span> <span class="o">:=</span> <span class="bp">FALSE</span><span class="p">;</span>      <span class="c">\* Clear liveness flag</span>
</span><span class="line">    <span class="n">recv_notify_closed</span><span class="p">:</span> <span class="n">SpaceAvailableInt</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>  <span class="c">\* Signal sender</span>
</span><span class="line">  <span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Again, the processes are &quot;fair&quot; because once we start closing we should finish,
but the initial labels are tagged with &quot;-&quot; to disable fairness there: it's OK if
you keep a vchan open forever.</p>
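<p>In the generated TLA, this shows up as fairness conditions in the overall specification.
Here is a sketch of what I'd expect the translator to produce for the two close processes.
<code>CloseFairness</code> is just a name for this illustration (the translator puts these conjuncts directly in <code>Spec</code>), and <code>vars</code> is the tuple of all variables that the translator defines:</p>
<figure class="code"><div class="highlight"><pre><code class="tla">CloseFairness ==
  /\ WF_vars((pc[SenderCloseID] # &quot;sender_open&quot;) /\ SenderClose)
  /\ WF_vars((pc[ReceiverCloseID] # &quot;recv_open&quot;) /\ ReceiverClose)
</code></pre></div></figure>
<p>That is, weak fairness applies to every step of the process except the one taken from the label tagged with &quot;-&quot;.</p>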
<p>There's a slight naming problem here.
The PlusCal translator names the actions it generates after the <em>starting</em> state of the action.
So <em>sender_open</em> is the action that moves <em>from</em> the <em>sender_open</em> label.
That is, the <em>sender_open</em> action actually closes the connection!</p>
<p>Finally, we share the event channel with the buffer going in the other direction, so we might
get notifications that have nothing to do with us.
To ensure we handle that, I added another process that can send events at any time:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line">  <span class="n">process</span> <span class="p">(</span><span class="n">SpuriousInterrupts</span> <span class="ni">=</span> <span class="n">SpuriousID</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">    <span class="n">spurious</span><span class="p">:</span> <span class="n">while</span> <span class="p">(</span><span class="bp">TRUE</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line">                <span class="n">either</span> <span class="n">SpaceAvailableInt</span> <span class="o">:=</span> <span class="bp">TRUE</span>
</span><span class="line">                <span class="n">or</span>     <span class="n">DataReadyInt</span> <span class="o">:=</span> <span class="bp">TRUE</span>
</span><span class="line">              <span class="ni">}</span>
</span><span class="line">  <span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>either/or</code> says that we need to consider both possibilities.
This process isn't marked fair, because we can't rely on these interrupts coming.
But we do have to handle them when they happen.</p>
<h3 id="testing-the-full-spec">Testing the full spec</h3>
<p>PlusCal code is written in a specially-formatted comment block, and you have to press Ctrl-T to
generate (or update) the TLA translation before running the model checker.</p>
<p>Be aware that the TLA Toolbox is a bit unreliable about keyboard shortcuts.
While typing into the editor always works, shortcuts such as Ctrl-S (save) sometimes get disconnected.
So you think you're doing &quot;edit/save/translate/save/check&quot; cycles, but really you're just checking some old version over and over again.
You can avoid this by always running the model checker with the keyboard shortcut too, since that always seems to fail at the same time as the others.
Focussing a different part of the GUI and then clicking back in the editor again fixes everything for a while.</p>
<p>Anyway, running our model on the new spec shows that <code>Integrity</code> is still OK.
However, the <code>Availability</code> check fails with the following counter-example:</p>
<ol>
<li>The sender writes <code>&lt;&lt; 1 &gt;&gt;</code> to <code>Buffer</code>.
</li>
<li>The sender closes the connection.
</li>
<li>The receiver closes the connection.
</li>
<li>All processes come to a stop, but the data never arrived.
</li>
</ol>
<p>We need to update <code>Availability</code> to consider the effects of closing connections.
And at this point, I'm very unsure what vchan is intended to do.
We could say:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">AvailabilityNat</span> <span class="p">:</span>
</span><span class="line">    <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="ni">=</span> <span class="n">x</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="o">\/</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span>
</span><span class="line">                     <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span>
</span><span class="line">                     <span class="o">\/</span> <span class="o">~</span><span class="n">SenderLive</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That passes.
But vchan describes itself as being like a Unix socket.
If you write to a Unix socket and then close it, you still expect the data to be delivered.
So actually I tried this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Availability</span> <span class="ni">==</span>
</span><span class="line">  <span class="s">\A</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">AvailabilityNat</span> <span class="p">:</span>
</span><span class="line">    <span class="n">x</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Sent</span><span class="p">)</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span> <span class="o">~</span><span class="ni">&gt;</span>
</span><span class="line">         <span class="o">\/</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">x</span>
</span><span class="line">         <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that if a sender write operation completes successfully (we're back at <code>sender_ready</code>)
and at that point the sender hasn't closed the connection, then the receiver will eventually receive
the data (or close its end).</p>
<p>That is how I would expect it to behave.
But TLC reports that the new spec does <em>not</em> satisfy this, giving this example (simplified - there are 16 steps in total):</p>
<ol>
<li>The receiver starts reading. It finds that the buffer is empty.
</li>
<li>The sender writes some data to <code>Buffer</code> and returns to <code>sender_ready</code>.
</li>
<li>The sender closes the channel.
</li>
<li>The receiver sees that the connection is closed and stops.
</li>
</ol>
<p>Is this a bug? Without a specification, it's impossible to say.
Maybe vchan was never intended to ensure delivery once the sender has closed its end.
But this case only happens if you're very unlucky about the scheduling.
If the receiving application calls <code>read</code> when the sender has closed the connection but there is data
available, then the C code <em>does</em> return the data.
It's only if the sender happens to close the connection just after the receiver has checked the buffer and just before it checks the close flag that this happens.</p>
<p>It's also easy to fix.
I changed the code in the receiver to do a final check on the buffer before giving up:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line">                        <span class="ni">}</span> <span class="n">else</span> <span class="n">if</span> <span class="p">(</span><span class="o">~</span><span class="n">SenderLive</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span><span class="p">)</span> <span class="ni">{</span>
</span><span class="line"><span class="n">recv_final_check</span><span class="p">:</span>         <span class="n">if</span> <span class="p">(</span><span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">=</span> <span class="m">0</span><span class="p">)</span> <span class="ni">{</span> <span class="n">want</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">goto</span> <span class="n">Done</span> <span class="ni">}</span>
</span><span class="line">                          <span class="n">else</span> <span class="n">goto</span> <span class="n">recv_reading</span><span class="p">;</span>
</span><span class="line">                        <span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>With that change, we can be sure that data sent while the connection is open will always be delivered
(provided only that the receiver doesn't close the connection itself).
If you spotted this issue yourself while you were reviewing the code earlier, then well done!</p>
<p>Note that when TLC finds a problem with a temporal property (such as <code>Availability</code>),
it does not necessarily find the shortest example first.
I changed the limit on <code>Sent</code> to <code>Len(Sent) &lt; 2</code> and added an action constraint of <code>~SpuriousInterrupts</code>
to get a simpler example, with only 1 byte being sent and no spurious interrupts.</p>
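<p>For reference, here is a sketch of how those two restrictions could be written down outside the Toolbox GUI, assuming the limit on <code>Sent</code> is a TLC state constraint.
The operator names <code>ShortSent</code> and <code>NoSpurious</code> are hypothetical; <code>SpuriousInterrupts</code> is the action that the translator generates for the process of that name:</p>
<figure class="code"><div class="highlight"><pre><code class="tla">\* In the specification (or a module extending it):
ShortSent  == Len(Sent) &lt; 2          \* State constraint: TLC skips states where this is false
NoSpurious == ~SpuriousInterrupts    \* Action constraint: TLC skips spurious-interrupt steps

\* In the TLC configuration file:
CONSTRAINT ShortSent
ACTION_CONSTRAINT NoSpurious
</code></pre></div></figure>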
<h3 id="some-odd-things">Some odd things</h3>
<p>I noticed a couple of other odd things, which I thought I'd mention.</p>
<p>First, <code>NotifyWrite</code> is initialised to <code>TRUE</code>, which seemed unnecessary.
We can initialise it to <code>FALSE</code> instead and everything still works.
We can even initialise it with <code>NotifyWrite \in {TRUE, FALSE}</code> to allow either behaviour,
and thus test that old programs that followed the original version of the spec still work
with either behaviour.</p>
<p>That's a nice advantage of using a specification language.
Saying &quot;the code is the spec&quot; becomes less useful as you build up more and more versions of the code!</p>
<p>However, because there was no spec before, we can't be sure that existing programs do follow it.
And, in fact, I found that QubesDB uses the vchan library in a different and unexpected way.
Instead of calling read, and then waiting if libvchan says to, QubesDB blocks first in all cases, and
then calls the read function once it gets an event.</p>
<p>We can document that by adding an extra step at the start of ReceiverRead:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">recv_init</span><span class="p">:</span>          <span class="n">either</span> <span class="n">goto</span> <span class="n">recv_ready</span>        <span class="c">\* (recommended)</span>
</span><span class="line">                    <span class="n">or</span> <span class="ni">{</span>    <span class="c">\* (QubesDB does this)</span>
</span><span class="line">                      <span class="n">with</span> <span class="p">(</span><span class="n">n</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">MaxReadLen</span><span class="p">)</span> <span class="n">want</span> <span class="o">:=</span> <span class="n">n</span><span class="p">;</span>
</span><span class="line">                      <span class="n">goto</span> <span class="n">recv_await_data</span><span class="p">;</span>
</span><span class="line">                    <span class="ni">}</span><span class="p">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then TLC shows that <code>NotifyWrite</code> cannot start as <code>FALSE</code>.</p>
<p>The second odd thing is that the receiver sets <code>NotifyWrite</code> whenever there isn't enough data available
to fill the application's buffer completely.
But usually when you do a read operation you just provide a buffer large enough for the largest likely message.
It would probably make more sense to set <code>NotifyWrite</code> only when the buffer is completely empty.
After checking the current version of the algorithm, I changed the specification to allow either behaviour.</p>
<h3 id="why-does-vchan-work">Why does vchan work?</h3>
<p>At this point, we have specified what vchan should do and how it does it.
We have also checked that it does do this, at least for messages up to 3 bytes long with a buffer size of 2.
That doesn't sound like much, but we still checked 79,288 distinct states, with behaviours up to 38 steps long.
This would be a perfectly reasonable place to declare the specification (and blog post) finished.</p>
<p>However, TLA has some other interesting abilities.
In particular, it provides a useful technique to help discover <em>why</em> the algorithm works.</p>
<p>We'll start with <code>Integrity</code>.
We would like to argue as follows:</p>
<ol>
<li><code>Integrity</code> is true in any initial state (i.e. <code>Init =&gt; Integrity</code>).
</li>
<li>Any <code>Next</code> step preserves <code>Integrity</code> (i.e. <code>Integrity /\ Next =&gt; Integrity'</code>).
</li>
</ol>
<p>Then it would just be a matter of looking at each possible action that makes up <code>Next</code> and
checking that each one individually preserves <code>Integrity</code>.
However, we can't do this with <code>Integrity</code> because (2) isn't true.
For example, the state <code>{ Got: &quot;&quot;, Buffer: &quot;21&quot;, Sent: &quot;12&quot; }</code> satisfies <code>Integrity</code>,
but if we take a read step then the new state won't.
Instead, we have to argue &quot;If we take a <code>Next</code> step in any reachable state then <code>Integrity'</code>&quot;,
but that's very difficult because how do we know whether a state is reachable without searching them all?</p>
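<p>To make that counter-example concrete, here is a minimal sketch that TLC can evaluate.
It assumes a module that extends <code>Naturals</code> and <code>Sequences</code>, writes the bytes as numbers rather than characters,
and uses a hypothetical <code>IsPrefix</code> helper as a stand-in for the real definition of <code>Integrity</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Hypothetical helper (not from the spec): s is a prefix of t.
IsPrefix(s, t) == Len(s) &lt;= Len(t) /\ SubSeq(t, 1, Len(s)) = s

\* The unreachable state from the text, with bytes written as numbers,
\* and the state after the receiver reads one byte from Buffer.
BadState  == [Got |-&gt; &lt;&lt; &gt;&gt;, Buffer |-&gt; &lt;&lt;2, 1&gt;&gt;, Sent |-&gt; &lt;&lt;1, 2&gt;&gt;]
AfterRead == [Got |-&gt; &lt;&lt;2&gt;&gt;, Buffer |-&gt; &lt;&lt;1&gt;&gt;, Sent |-&gt; &lt;&lt;1, 2&gt;&gt;]

ASSUME IsPrefix(BadState.Got, BadState.Sent)        \* the prefix property holds before the read
ASSUME ~IsPrefix(AfterRead.Got, AfterRead.Sent)     \* and is lost after it
</code></pre></td></tr></tbody></table></div></figure>
<p>The prefix property alone says nothing about what is in <code>Buffer</code>, which is why it can't be preserved by every step from every state that satisfies it.</p>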
<p>So the idea is to make a stronger version of <code>Integrity</code>, called <code>IntegrityI</code>, which does what we want.
<code>IntegrityI</code> is called an <em>inductive invariant</em>.
The first step is fairly obvious - I began with:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">IntegrityI</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">Sent</span> <span class="ni">=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Buffer</span> <span class="nb">\o</span> <span class="n">msg</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Integrity</code> just said that <code>Got</code> is a prefix of <code>Sent</code>.
This says specifically that the rest is <code>Buffer \o msg</code> - the data currently being transmitted and the data yet to be transmitted.</p>
<p>We can ask TLC to check <code>Init /\ [][Next]_vars =&gt; []IntegrityI</code> to check that it is an invariant, as before.
It does that by finding all the <code>Init</code> states and then taking <code>Next</code> steps to find all reachable states.
But we can also ask it to check <code>IntegrityI /\ [][Next]_vars =&gt; []IntegrityI</code>.
That is, the same thing but starting from any state matching <code>IntegrityI</code> instead of <code>Init</code>.</p>
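<p>Written out as definitions, the second check might look roughly like this.
The names <code>IndInit</code> and <code>IndSpec</code> are made up for this sketch;
in the Toolbox you can get the same effect by using <code>IntegrityI</code> as the model's initial predicate:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Hypothetical names, used only for the inductive check.
IndInit == IntegrityI                   \* start from every state satisfying the invariant
IndSpec == IndInit /\ [][Next]_vars     \* then take ordinary Next steps

\* Ask TLC to check IntegrityI as an invariant of IndSpec.
\* Any violation it reports is a single step leading out of IntegrityI.
</code></pre></td></tr></tbody></table></div></figure>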
<p>I created a new model (<code>IntegrityI</code>) to do that.
It reports a few technical problems at the start because it doesn't know the types of anything.
For example, it can't choose initial values for <code>SenderLive</code> without knowing that <code>SenderLive</code> is a boolean.
I added a <code>TypeOK</code> state expression that gives the expected type of every variable:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">MESSAGE</span> <span class="ni">==</span> <span class="n">Seq</span><span class="p">(</span><span class="n">Byte</span><span class="p">)</span>
</span><span class="line"><span class="n">FINITE_MESSAGE</span><span class="p">(</span><span class="n">L</span><span class="p">)</span> <span class="ni">==</span> <span class="n">UNION</span> <span class="p">(</span> <span class="ni">{</span> <span class="p">[</span> <span class="m">1</span><span class="o">..</span><span class="n">N</span> <span class="o">-&gt;</span> <span class="n">Byte</span> <span class="p">]</span> <span class="p">:</span> <span class="n">N</span> <span class="s">\in</span> <span class="m">0</span><span class="o">..</span><span class="n">L</span> <span class="ni">}</span> <span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="n">TypeOK</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Got</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Buffer</span> <span class="s">\in</span> <span class="n">FINITE_MESSAGE</span><span class="p">(</span><span class="n">BufferSize</span><span class="p">)</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">SenderLive</span> <span class="s">\in</span> <span class="bp">BOOLEAN</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">ReceiverLive</span> <span class="s">\in</span> <span class="bp">BOOLEAN</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">NotifyWrite</span> <span class="s">\in</span> <span class="bp">BOOLEAN</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">DataReadyInt</span> <span class="s">\in</span> <span class="bp">BOOLEAN</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">NotifyRead</span> <span class="s">\in</span> <span class="bp">BOOLEAN</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">SpaceAvailableInt</span> <span class="s">\in</span> <span class="bp">BOOLEAN</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">free</span> <span class="s">\in</span> <span class="m">0</span><span class="o">..</span><span class="n">BufferSize</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">msg</span> <span class="s">\in</span> <span class="n">FINITE_MESSAGE</span><span class="p">(</span><span class="n">MaxWriteLen</span><span class="p">)</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">want</span> <span class="s">\in</span> <span class="m">0</span><span class="o">..</span><span class="n">MaxReadLen</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">have</span> <span class="s">\in</span> <span class="m">0</span><span class="o">..</span><span class="n">BufferSize</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We also need to tell it all the possible states of <code>pc</code> (which says which label each process is at):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">PCOK</span> <span class="ni">==</span> <span class="n">pc</span> <span class="s">\in</span> <span class="p">[</span>
</span><span class="line">    <span class="n">SW</span><span class="p">:</span> <span class="ni">{</span><span class="s">&quot;sender_ready&quot;</span><span class="p">,</span> <span class="s">&quot;sender_write&quot;</span><span class="p">,</span> <span class="s">&quot;sender_request_notify&quot;</span><span class="p">,</span> <span class="s">&quot;sender_recheck_len&quot;</span><span class="p">,</span>
</span><span class="line">         <span class="s">&quot;sender_write_data&quot;</span><span class="p">,</span> <span class="s">&quot;sender_blocked&quot;</span><span class="p">,</span> <span class="s">&quot;sender_check_notify_data&quot;</span><span class="p">,</span>
</span><span class="line">         <span class="s">&quot;sender_notify_data&quot;</span><span class="p">,</span> <span class="s">&quot;sender_check_recv_live&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span><span class="p">,</span>
</span><span class="line">    <span class="n">SC</span><span class="p">:</span> <span class="ni">{</span><span class="s">&quot;sender_open&quot;</span><span class="p">,</span> <span class="s">&quot;sender_notify_closed&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span><span class="p">,</span>
</span><span class="line">    <span class="n">RR</span><span class="p">:</span> <span class="ni">{</span><span class="s">&quot;recv_init&quot;</span><span class="p">,</span> <span class="s">&quot;recv_ready&quot;</span><span class="p">,</span> <span class="s">&quot;recv_reading&quot;</span><span class="p">,</span> <span class="s">&quot;recv_got_len&quot;</span><span class="p">,</span> <span class="s">&quot;recv_recheck_len&quot;</span><span class="p">,</span>
</span><span class="line">         <span class="s">&quot;recv_read_data&quot;</span><span class="p">,</span> <span class="s">&quot;recv_final_check&quot;</span><span class="p">,</span> <span class="s">&quot;recv_await_data&quot;</span><span class="p">,</span>
</span><span class="line">         <span class="s">&quot;recv_check_notify_read&quot;</span><span class="p">,</span> <span class="s">&quot;recv_notify_read&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span><span class="p">,</span>
</span><span class="line">    <span class="n">RC</span><span class="p">:</span> <span class="ni">{</span><span class="s">&quot;recv_open&quot;</span><span class="p">,</span> <span class="s">&quot;recv_notify_closed&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span><span class="p">,</span>
</span><span class="line">    <span class="n">SP</span><span class="p">:</span> <span class="ni">{</span><span class="s">&quot;spurious&quot;</span><span class="ni">}</span>
</span><span class="line"><span class="p">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You might imagine that the PlusCal translator would generate that for you, but it doesn't.
We also need to override <code>MESSAGE</code> with <code>FINITE_MESSAGE(n)</code> for some <code>n</code> (I used <code>2</code>).
Otherwise, it can't enumerate all possible messages.
Now we have:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">IntegrityI</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">TypeOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">PCOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Buffer</span> <span class="nb">\o</span> <span class="n">msg</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>With that out of the way, TLC starts finding real problems
(that is, examples showing that <code>IntegrityI /\ Next =&gt; IntegrityI'</code> isn't true).
First, <code>recv_read_data</code> would do an out-of-bounds read if <code>have = 1</code> and <code>Buffer = &lt;&lt; &gt;&gt;</code>.
Our job is to explain why that isn't a valid state.
We can fix it with an extra constraint:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">IntegrityI</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">TypeOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">PCOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Buffer</span> <span class="nb">\o</span> <span class="n">msg</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_read_data&quot;</span> <span class="ni">=&gt;</span> <span class="n">have</span> <span class="o">&lt;=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>(note: that <code>=&gt;</code> is &quot;implies&quot;, while the <code>&lt;=</code> is &quot;less-than-or-equal-to&quot;)</p>
<p>Now it complains that if we do <code>recv_got_len</code> with <code>Buffer = &lt;&lt; &gt;&gt;, have = 1, want = 0</code> then we end up in <code>recv_read_data</code> with
<code>Buffer = &lt;&lt; &gt;&gt;, have = 1</code>, and we have to explain why <em>that</em> can't happen and so on.</p>
<p>Because TLC searches breadth-first, the examples it finds never have more than 2 states.
You just have to explain why the first state can't happen in the real system.
Eventually, you get a big ugly pile of constraints, which you then think about for a bit and simplify.
I ended up with:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">IntegrityI</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">TypeOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">PCOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Sent</span> <span class="ni">=</span> <span class="n">Got</span> <span class="nb">\o</span> <span class="n">Buffer</span> <span class="nb">\o</span> <span class="n">msg</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">have</span> <span class="o">&lt;=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">free</span> <span class="o">&lt;=</span> <span class="n">BufferSize</span> <span class="o">-</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_write&quot;</span><span class="p">,</span> <span class="s">&quot;sender_request_notify&quot;</span><span class="p">,</span> <span class="s">&quot;sender_recheck_len&quot;</span><span class="p">,</span>
</span><span class="line">                            <span class="s">&quot;sender_write_data&quot;</span><span class="p">,</span> <span class="s">&quot;sender_blocked&quot;</span><span class="p">,</span> <span class="s">&quot;sender_check_recv_live&quot;</span><span class="ni">}</span>
</span><span class="line">     <span class="ni">=&gt;</span> <span class="n">msg</span> <span class="o">/</span><span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_ready&quot;</span><span class="ni">}</span> <span class="ni">=&gt;</span> <span class="n">msg</span> <span class="ni">=</span> <span class="ni">&lt;&lt;</span> <span class="ni">&gt;&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It's a good idea to check the final <code>IntegrityI</code> with the original <code>SpecOK</code> model,
just to confirm that it really is an invariant.</p>
<p>So, in summary, <code>Integrity</code> is always true because:</p>
<ul>
<li>
<p><code>Sent</code> is always the concatenation of <code>Got</code>, <code>Buffer</code> and <code>msg</code>.
That's fairly obvious, because <code>sender_ready</code> sets <code>msg</code> and appends the same thing to <code>Sent</code>,
and the other steps (<code>sender_write_data</code> and <code>recv_read_data</code>) just transfer some bytes from
the start of one variable to the end of another.</p>
</li>
<li>
<p>Although, like all local information, the receiver's <code>have</code> variable might be out-of-date,
there must be <em>at least</em> that much data in the buffer, because the sender process will only
have added more, not removed any. This is sufficient to ensure that we never do an
out-of-range read.</p>
</li>
<li>
<p>Likewise, the sender's <code>free</code> variable is a lower bound on the true amount of free space,
because the receiver only ever creates more space. We will therefore never write beyond the
free space.</p>
</li>
</ul>
<p>I think this ability to explain why an algorithm works, by being shown examples where the inductive property
doesn't hold, is a really nice feature of TLA.
Inductive invariants are useful as a first step towards writing a proof,
but I think they're valuable even on their own.
If you're documenting your own algorithm,
this process will get you to explain your own reasons for believing it works
(I <a href="https://github.com/mirage/capnp-rpc/pull/149">tried it</a> on a simple algorithm in my own code and it seemed helpful).</p>
<p>Some notes:</p>
<ul>
<li>
<p>Originally, I had the <code>free</code> and <code>have</code> constraints depending on <code>pc</code>.
However, the algorithm sets them to zero when not in use so it turns out they're always true.</p>
</li>
<li>
<p><code>IntegrityI</code> matches 532,224 states, even with a maximum <code>Sent</code> length of 1, but it passes!
There are some games you can play to speed things up;
see <a href="https://lamport.azurewebsites.net/tla/inductive-invariant.pdf">Using TLC to Check Inductive Invariance</a> for some suggestions
(I only discovered that while writing this up).</p>
</li>
</ul>
<h3 id="proving-integrity">Proving Integrity</h3>
<p>TLA provides a syntax for writing proofs,
and integrates with <a href="https://tla.msr-inria.inria.fr/tlaps/content/Home.html">TLAPS</a> (the <em>TLA+ Proof System</em>) to allow them to be checked automatically.</p>
<p>Proving <code>IntegrityI</code> is just a matter of showing that <code>Init =&gt; IntegrityI</code> and that it is preserved
by any possible <code>[Next]_vars</code> step.
To do that, we consider each action of <code>Next</code> individually, which is long but simple enough.</p>
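<p>The overall shape of such a proof is roughly the following.
This is only a skeleton, not the proof from the spec: the theorem name is made up,
the two real obligations are left as <code>OMITTED</code>,
and it assumes the module extends <code>TLAPS</code> so that the <code>PTL</code> backend is available for the final temporal step:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Skeleton only; IntegrityIInvariant is a hypothetical name.
THEOREM IntegrityIInvariant == Init /\ [][Next]_vars =&gt; []IntegrityI
&lt;1&gt;1. Init =&gt; IntegrityI
  OMITTED   \* placeholder: expand Init and IntegrityI
&lt;1&gt;2. IntegrityI /\ [Next]_vars =&gt; IntegrityI'
  OMITTED   \* placeholder: one case per action of Next, plus the UNCHANGED vars case
&lt;1&gt;3. QED
  BY &lt;1&gt;1, &lt;1&gt;2, PTL   \* temporal reasoning combines the two steps
</code></pre></td></tr></tbody></table></div></figure>
<p>Almost all of the effort goes into step <code>&lt;1&gt;2</code>, which is where each action has to be considered in turn.</p>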
<p>I was able to prove it, but the <code>recv_read_data</code> action was a little difficult
because we don't know that <code>want &gt; 0</code> at that point, so we have to do some extra work
to prove that transferring 0 bytes works, even though the real system never does that.</p>
<p>I therefore added an extra condition to <code>IntegrityI</code> that <code>want</code> is non-zero whenever it's in use,
and also conditions about <code>have</code> and <code>free</code> being 0 when not in use, for completeness:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">IntegrityI</span> <span class="ni">==</span>
</span><span class="line">  <span class="p">[</span><span class="o">...</span><span class="p">]</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">want</span> <span class="ni">=</span> <span class="m">0</span> <span class="o">&lt;=</span><span class="ni">&gt;</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;recv_check_notify_read&quot;</span><span class="p">,</span> <span class="s">&quot;recv_notify_read&quot;</span><span class="p">,</span>
</span><span class="line">                                          <span class="s">&quot;recv_init&quot;</span><span class="p">,</span> <span class="s">&quot;recv_ready&quot;</span><span class="p">,</span>
</span><span class="line">                                          <span class="s">&quot;Done&quot;</span><span class="ni">}</span>
</span><span class="line">  <span class="o">/\</span> <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;recv_got_len&quot;</span><span class="p">,</span> <span class="s">&quot;recv_recheck_len&quot;</span><span class="p">,</span> <span class="s">&quot;recv_read_data&quot;</span><span class="ni">}</span>
</span><span class="line">     <span class="o">\/</span> <span class="n">have</span> <span class="ni">=</span> <span class="m">0</span>
</span><span class="line">  <span class="o">/\</span> <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_write&quot;</span><span class="p">,</span> <span class="s">&quot;sender_request_notify&quot;</span><span class="p">,</span>
</span><span class="line">                               <span class="s">&quot;sender_recheck_len&quot;</span><span class="p">,</span> <span class="s">&quot;sender_write_data&quot;</span><span class="ni">}</span>
</span><span class="line">     <span class="o">\/</span> <span class="n">free</span> <span class="ni">=</span> <span class="m">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="availability">Availability</h3>
<p><code>Integrity</code> was quite easy to prove, but I had more trouble trying to explain <code>Availability</code>.
One way to start would be to add <code>Availability</code> as a property to check to the <code>IntegrityI</code> model.
However, it takes a while to check properties as it does them at the end, and the examples
it finds may have several steps (it took 1m15s to find a counter-example for me).</p>
<p>Here's a faster way (37s).
The algorithm will deadlock if both sender and receiver are in their blocked states and neither
interrupt is pending, so I made a new invariant, <code>I</code>, which says that deadlock can't happen:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">I</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">IntegrityI</span>
</span><span class="line">  <span class="o">/\</span> <span class="o">~</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_blocked&quot;</span>
</span><span class="line">       <span class="o">/\</span> <span class="o">~</span><span class="n">SpaceAvailableInt</span>
</span><span class="line">       <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_await_data&quot;</span>
</span><span class="line">       <span class="o">/\</span> <span class="o">~</span><span class="n">DataReadyInt</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I discovered some obvious facts about closing the connection.
For example, the <code>SenderLive</code> flag is set if and only if the sender's close thread hasn't done anything.
I've put them all together in <code>CloseOK</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* Some obvious facts about shutting down connections. *)</span>
</span><span class="line"><span class="n">CloseOK</span> <span class="ni">==</span>
</span><span class="line">  <span class="c">\* An endpoint is live iff its close thread hasn&#39;t done anything:</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_open&quot;</span> <span class="o">&lt;=</span><span class="ni">&gt;</span> <span class="n">SenderLive</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_open&quot;</span> <span class="o">&lt;=</span><span class="ni">&gt;</span> <span class="n">ReceiverLive</span>
</span><span class="line">  <span class="c">\* The send and receive loops don&#39;t terminate unless someone has closed the connection:</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;recv_final_check&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span> <span class="ni">=&gt;</span> <span class="o">~</span><span class="n">ReceiverLive</span> <span class="o">\/</span> <span class="o">~</span><span class="n">SenderLive</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;Done&quot;</span><span class="ni">}</span> <span class="ni">=&gt;</span> <span class="o">~</span><span class="n">ReceiverLive</span> <span class="o">\/</span> <span class="o">~</span><span class="n">SenderLive</span>
</span><span class="line">  <span class="c">\* If the receiver closed the connection then we will get (or have got) the signal:</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;Done&quot;</span> <span class="ni">=&gt;</span>
</span><span class="line">          <span class="o">\/</span> <span class="n">SpaceAvailableInt</span>
</span><span class="line">          <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_check_recv_live&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>But I had problems with other examples TLC showed me, and
I realised that I didn't actually know why this algorithm doesn't deadlock.</p>
<p>Intuitively it seems clear enough:
the sender puts data in the buffer when there's space and notifies the receiver,
and the receiver reads it and notifies the sender.
What could go wrong?
But both processes are working with information that can be out-of-date.
By the time the sender decides to block because the buffer looked full, the buffer might be empty.
And by the time the receiver decides to block because it looked empty, it might be full.</p>
<p>Maybe you already saw why it works from the C code, or the algorithm above,
but it took me a while to figure it out!
I eventually ended up with an invariant of the form:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">I</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">..</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">SendMayBlock</span>    <span class="ni">=&gt;</span> <span class="n">SpaceWakeupComing</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">ReceiveMayBlock</span> <span class="ni">=&gt;</span> <span class="n">DataWakeupComing</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>SendMayBlock</code> is <code>TRUE</code> if we're in a state that may lead to being blocked without checking the
buffer's free space again. Likewise, <code>ReceiveMayBlock</code> indicates that the receiver might block.
<code>SpaceWakeupComing</code> and <code>DataWakeupComing</code> predict whether we're going to get an interrupt.
The idea is that if we're going to block, we need to be sure we'll be woken up.
It's a bit ugly, though, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">DataWakeupComing</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">DataReadyInt</span> <span class="c">\* Event sent</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_notify_data&quot;</span>     <span class="c">\* Event being sent</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_notify_closed&quot;</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_notify_closed&quot;</span>
</span><span class="line">  <span class="o">\/</span> <span class="o">/\</span> <span class="n">NotifyWrite</span>   <span class="c">\* Event requested and ...</span>
</span><span class="line">     <span class="o">/\</span> <span class="n">ReceiverLive</span>  <span class="c">\* Sender can see receiver is still alive and ...</span>
</span><span class="line">     <span class="o">/\</span> <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_write_data&quot;</span> <span class="o">/\</span> <span class="n">free</span> <span class="ni">&gt;</span> <span class="m">0</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_check_notify_data&quot;</span> 
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_recheck_len&quot;</span> <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">&lt;</span> <span class="n">BufferSize</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">&lt;</span> <span class="n">BufferSize</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_write&quot;</span> <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">&lt;</span> <span class="n">BufferSize</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_request_notify&quot;</span> <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">&lt;</span> <span class="n">BufferSize</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">SpaceWakeupComing</span> <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">&lt;</span> <span class="n">BufferSize</span> <span class="o">/\</span> <span class="n">SenderLive</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It did pass my model that tested sending one byte, and I decided to try a proof.
Well, it didn't work.
The problem seems to be that <code>DataWakeupComing</code> and <code>SpaceWakeupComing</code> are really mutually recursive.
The reader will wake up if the sender wakes it, but the sender might be blocked, or about to block.
That's OK though, as long as the receiver will wake it, which it will do, once the sender wakes it...</p>
<p>You've probably already figured it out, but I thought I'd document my confusion.
It occurred to me that although each process might have out-of-date information,
that could be fine as long as at any one moment one of them was right.
The last process to update the buffer must know how full it is,
so one of them must have correct information at any given time, and that should be enough to avoid deadlock.</p>
<p>That didn't work either.
When you're at a proof step and can't see why it's correct, you can ask TLC to show you an example.
For example, if you're stuck trying to prove that <code>sender_request_notify</code> preserves <code>I</code> when the
receiver is at <code>recv_ready</code>, the buffer is full, and <code>ReceiverLive = FALSE</code>,
you can ask for an example of that:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">Example</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">PCOK</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_request_notify&quot;</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_ready&quot;</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">ReceiverLive</span> <span class="ni">=</span> <span class="bp">FALSE</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">I</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">=</span> <span class="n">BufferSize</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You then create a new model that searches <code>Example /\ [][Next]_vars</code> and tests <code>I</code>.
As long as <code>Example</code> has several constraints, you can use a much larger model for this.
I also ask it to check the property <code>[][FALSE]_vars</code>, which means it will show any step starting from <code>Example</code>.</p>
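<p>If you run TLC from a configuration file rather than from a Toolbox model, the same search could be set up roughly like this.
It is only a sketch: <code>ExampleSpec</code> and <code>AnyStep</code> are hypothetical names (TLC needs named definitions to refer to),
and the constant assignments are left out.
Every step that changes <code>vars</code> violates <code>[][FALSE]_vars</code>, so TLC reports such a step from an <code>Example</code> state as a "counter-example".</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Hypothetical names, added to the end of the spec for illustration.
\* Only Example, Next, vars and I come from the text above.
ExampleSpec == Example /\ [][Next]_vars   \* start from any state matching Example
AnyStep     == [][FALSE]_vars             \* violated by every step that changes vars

(* Matching entries in the TLC configuration file:

     SPECIFICATION ExampleSpec
     INVARIANT     I
     PROPERTY      AnyStep
*)
</code></pre></td></tr></tbody></table></div></figure>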
<p>It quickly became clear what was wrong: it is quite possible that neither process is up-to-date.
If both processes see that the buffer contains <code>X</code> bytes of data, and then the sender sends <code>Y</code> bytes and the receiver reads <code>Z</code> bytes, the sender will think there are <code>X + Y</code> bytes in the buffer and the receiver will think there are <code>X - Z</code> bytes. Neither is correct: it actually holds <code>X + Y - Z</code>.
My original 1-byte buffer was just too small to find a counter-example.</p>
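<p>To see why the buffer size matters, here is a deliberately simplified picture.
It is not part of the vchan spec, and <code>BothStale</code> is a hypothetical helper:
it just asks whether there is any starting level at which the sender has room to add at least one byte
while the receiver has at least one byte to take.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">---- MODULE StaleViews ----
EXTENDS Naturals

\* Hypothetical helper, not part of the vchan spec.
\* Both processes saw X bytes in a buffer of size B. The sender then adds
\* Y &gt;= 1 bytes and the receiver removes Z &gt;= 1 bytes. Because Y and Z are
\* both non-zero, each process's idea of the buffer level is now wrong.
BothStale(B) ==
  \E X \in 0..B, Y \in 1..B, Z \in 1..B :
    /\ X + Y &lt;= B    \* the new data fits in the space the sender saw
    /\ Z &lt;= X        \* the receiver only takes data it saw
====
</code></pre></td></tr></tbody></table></div></figure>
<p>Evaluating <code>BothStale(1)</code> as a constant expression gives <code>FALSE</code>:
the sender needs <code>X = 0</code> to have any room, but the receiver needs <code>X &gt;= 1</code> to have anything to read.
<code>BothStale(2)</code> is <code>TRUE</code>, e.g. with <code>X = Y = Z = 1</code>.</p>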
<p>The real reason why vchan works is actually rather obvious.
I don't know why I didn't see it earlier.
But eventually it occurred to me that I could make use of <code>Got</code> and <code>Sent</code>.
I defined <code>WriteLimit</code> to be the total number of bytes that the sender would write before blocking,
if the receiver never did anything further.
And I defined <code>ReadLimit</code> to be the total number of bytes that the receiver would read if the sender
never did anything else.</p>
<p>Did I define these limits correctly?
It's easy to ask TLC to check some extra properties while it's running.
For example, I used this to check that <code>ReadLimit</code> behaves sensibly:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">ReadLimitCorrect</span> <span class="ni">==</span>
</span><span class="line">  <span class="c">\* We will eventually receive what ReadLimit promises:</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">WF_vars</span><span class="p">(</span><span class="n">ReceiverRead</span><span class="p">)</span> <span class="ni">=&gt;</span>
</span><span class="line">      <span class="s">\A</span> <span class="n">i</span> <span class="s">\in</span> <span class="n">AvailabilityNat</span> <span class="p">:</span>
</span><span class="line">        <span class="n">ReadLimit</span> <span class="ni">=</span> <span class="n">i</span> <span class="o">~</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">i</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span>
</span><span class="line">  <span class="c">\* ReadLimit can only decrease if we decide to shut down:</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[][</span><span class="n">ReadLimit</span><span class="err">&#39;</span> <span class="o">&gt;=</span> <span class="n">ReadLimit</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span><span class="p">]</span><span class="n">_vars</span>
</span><span class="line">  <span class="c">\* ReceiverRead steps don&#39;t change the read limit:</span>
</span><span class="line">  <span class="o">/\</span> <span class="p">[][</span><span class="n">ReceiverRead</span> <span class="ni">=&gt;</span> <span class="n">UNCHANGED</span> <span class="n">ReadLimit</span> <span class="o">\/</span> <span class="o">~</span><span class="n">ReceiverLive</span><span class="p">]</span><span class="n">_vars</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Because <code>ReadLimit</code> is defined in terms of what it does when no other processes run,
this property should ideally be tested in a model without the fairness conditions
(i.e. just <code>Init /\ [][Next]_vars</code>).
Otherwise, fairness may force the sender to perform a step.
We still want to allow other steps, though, to show that <code>ReadLimit</code> is a lower bound.</p>
<p>With this, we can argue that e.g. a 2-byte buffer will eventually transfer 3 bytes:</p>
<ol>
<li>The receiver will eventually read 3 bytes as long as the sender eventually sends 3 bytes.
</li>
<li>The sender will eventually send 3, if the receiver reads at least 1.
</li>
<li>The receiver will read 1 if the sender sends at least 1.
</li>
<li>The sender will send 1 if the reader has read at least 0 bytes, which is always true.
</li>
</ol>
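<p>The chain of steps above can be sketched as a tiny stand-alone spec.
This is an idealised model, not the vchan protocol: each process jumps straight to its limit in a single step,
and <code>Total</code>, <code>sent</code>, <code>got</code> and <code>Min</code> are names invented for the illustration.
The sender's limit is taken to be <code>got + BufferSize</code> (capped at <code>Total</code>) and the receiver's limit is <code>sent</code>.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">---- MODULE LimitRatchet ----
EXTENDS Naturals

CONSTANTS BufferSize, Total      \* e.g. BufferSize = 2, Total = 3
ASSUME BufferSize \in Nat /\ BufferSize &gt; 0 /\ Total \in Nat

VARIABLES sent, got              \* bytes written and bytes read so far
vars == &lt;&lt;sent, got&gt;&gt;

Min(a, b) == IF a &lt; b THEN a ELSE b

Init == sent = 0 /\ got = 0

\* The sender can run ahead of the receiver by at most BufferSize bytes:
Send == /\ sent' = Min(Total, got + BufferSize)
        /\ sent' &gt; sent
        /\ UNCHANGED got

\* The receiver can catch up with everything sent so far:
Recv == /\ got' = sent
        /\ got' &gt; got
        /\ UNCHANGED sent

\* Stutter once everything has arrived, so TLC doesn't report a deadlock:
Done == got = Total /\ UNCHANGED vars

Next == Send \/ Recv \/ Done
Spec == Init /\ [][Next]_vars /\ WF_vars(Send) /\ WF_vars(Recv)

AllDelivered == &lt;&gt;(got = Total)
====
</code></pre></td></tr></tbody></table></div></figure>
<p>With <code>BufferSize = 2</code> and <code>Total = 3</code>, the only behaviour (ignoring stuttering) takes <code>sent/got</code> through
0/0, 2/0, 2/2, 3/2, 3/3, so <code>AllDelivered</code> holds.
Each limit only ever moves forward, and each increase in one unlocks an increase in the other.</p>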
<p>By this point, I was learning to be more cautious before trying a proof,
so I added some new models to check this idea further.
One prevents the sender from ever closing the connection and the other prevents the receiver from ever closing.
That reduces the number of states to consider and I was able to check a slightly larger model.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">I</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">IntegrityI</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">CloseOK</span>
</span><span class="line">  <span class="c">\* If the reader is stuck, but data is available, the sender will unblock it:</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">ReaderShouldBeUnblocked</span>
</span><span class="line">     <span class="ni">=&gt;</span> <span class="c">\* The sender is going to write more:</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">WriteLimit</span> <span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="ni">&gt;</span> <span class="m">0</span> <span class="o">/\</span> <span class="n">SenderLive</span>
</span><span class="line">        <span class="c">\* The sender is about to increase ReadLimit:</span>
</span><span class="line">        <span class="o">\/</span> <span class="p">(</span><span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_check_notify_data&quot;</span> <span class="o">/\</span> <span class="n">NotifyWrite</span>
</span><span class="line">            <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_notify_data&quot;</span><span class="p">)</span> <span class="o">/\</span> <span class="n">ReadLimit</span> <span class="ni">&lt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span>
</span><span class="line">        <span class="c">\* The sender is about to notify us of shutdown:</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderCloseID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_notify_closed&quot;</span><span class="ni">}</span>
</span><span class="line">  <span class="c">\* If the writer is stuck, but there is now space available, the receiver will unblock it:</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">WriterShouldBeUnblocked</span>
</span><span class="line">     <span class="ni">=&gt;</span> <span class="c">\* The reader is going to read more:</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">ReadLimit</span> <span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">/\</span> <span class="n">ReceiverLive</span>
</span><span class="line">        <span class="c">\* The reader is about to increase WriteLimit:</span>
</span><span class="line">        <span class="o">\/</span> <span class="p">(</span><span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_check_notify_read&quot;</span> <span class="o">/\</span> <span class="n">NotifyRead</span>
</span><span class="line">            <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_notify_read&quot;</span><span class="p">)</span> <span class="o">/\</span> <span class="n">WriteLimit</span> <span class="ni">&lt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">BufferSize</span>
</span><span class="line">        <span class="c">\* The receiver is about to notify us of shutdown:</span>
</span><span class="line">        <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverCloseID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;recv_notify_closed&quot;</span><span class="ni">}</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">NotifyFlagsCorrect</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If a process is on a path to being blocked then it must have set its notify flag.
<code>NotifyFlagsCorrect</code> says that in that case, the flag is still set, or the interrupt has been sent,
or the other process is just about to trigger the interrupt.</p>
<p>I managed to use that to prove that the sender's steps preserved <code>I</code>,
but I needed a little extra to finish the receiver proof.
At this point, I finally spotted the obvious invariant (which you, no doubt, saw all along):
whenever <code>NotifyRead</code> is still set, the sender has accurate information about the buffer.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="o">/\</span> <span class="n">NotifyRead</span> <span class="ni">=&gt;</span>
</span><span class="line">      <span class="c">\* The sender has accurate information about the buffer:</span>
</span><span class="line">      <span class="o">\/</span> <span class="n">WriteLimit</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">BufferSize</span>
</span><span class="line">      <span class="c">\* Or the flag is being cleared right now:</span>
</span><span class="line">      <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_check_notify_read&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That's pretty obvious, isn't it?
The sender checks the buffer after setting the flag, so it must have accurate information at that point.
The receiver clears the flag after reading from the buffer (which invalidates the sender's information).</p>
<p>Now I had a dilemma.
There was obviously going to be a matching property about <code>NotifyWrite</code>.
Should I add that, or continue with just this?
I was nearly done, so I continued and finished off the proofs.</p>
<p>With <code>I</code> proved, I was able to prove some other nice things quite easily:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> 
</span><span class="line">  <span class="o">/\</span> <span class="n">I</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">ReceiverLive</span>
</span><span class="line">  <span class="o">/\</span> <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span>
</span><span class="line">     <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_blocked&quot;</span> <span class="o">/\</span> <span class="o">~</span><span class="n">SpaceAvailableInt</span>
</span><span class="line">  <span class="ni">=&gt;</span> <span class="n">ReadLimit</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That says that, whenever the sender is idle or blocked, the receiver will read everything sent so far,
without any further help from the sender. And:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> 
</span><span class="line">  <span class="o">/\</span> <span class="n">I</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">ReceiverLive</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;recv_await_data&quot;</span><span class="ni">}</span> <span class="o">/\</span> <span class="o">~</span><span class="n">DataReadyInt</span>
</span><span class="line">  <span class="ni">=&gt;</span> <span class="n">WriteLimit</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">BufferSize</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That says that whenever the receiver is blocked, the sender can fill the buffer.
That's pretty nice.
It would be possible to make a vchan system that e.g. could only send 1 byte at a time and still
prove it couldn't deadlock and would always deliver data,
but here we have shown that the algorithm can use the whole buffer.
At least, that's what these theorems say as long as you believe that <code>ReadLimit</code> and <code>WriteLimit</code> are defined correctly.</p>
<p>With the proof complete, I then went back and deleted all the stuff about <code>ReadLimit</code> and <code>WriteLimit</code> from <code>I</code>
and started again with just the new rules about <code>NotifyRead</code> and <code>NotifyWrite</code>.
Instead of using <code>WriteLimit = Len(Got) + BufferSize</code> to indicate that the sender has accurate information,
I made a new <code>SenderInfoAccurate</code> that just returns <code>TRUE</code> whenever the sender will fill the buffer without further help.
That avoids some unnecessary arithmetic, which TLAPS needs a lot of help with.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* The sender&#39;s information is accurate if whenever it is going to block, the buffer</span>
</span><span class="line"><span class="cm">   really is full. *)</span>
</span><span class="line"><span class="n">SenderInfoAccurate</span> <span class="ni">==</span>
</span><span class="line">  <span class="c">\* We have accurate information:</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="o">+</span> <span class="n">free</span> <span class="ni">=</span> <span class="n">BufferSize</span>
</span><span class="line">  <span class="c">\* In these states, we&#39;re going to check the buffer before blocking:</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_ready&quot;</span><span class="p">,</span> <span class="s">&quot;sender_request_notify&quot;</span><span class="p">,</span> <span class="s">&quot;sender_write&quot;</span><span class="p">,</span>
</span><span class="line">                            <span class="s">&quot;sender_recheck_len&quot;</span><span class="p">,</span> <span class="s">&quot;sender_check_recv_live&quot;</span><span class="p">,</span> <span class="s">&quot;Done&quot;</span><span class="ni">}</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_request_notify&quot;</span><span class="ni">}</span> <span class="o">/\</span> <span class="n">free</span> <span class="ni">&lt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</span><span class="line">  <span class="c">\* If we&#39;ve been signalled, we&#39;ll immediately wake next time we try to block:</span>
</span><span class="line">  <span class="o">\/</span> <span class="n">SpaceAvailableInt</span>
</span><span class="line">  <span class="c">\* We&#39;re about to write some data:</span>
</span><span class="line">  <span class="o">\/</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_write_data&quot;</span><span class="ni">}</span>
</span><span class="line">     <span class="o">/\</span> <span class="n">free</span> <span class="o">&gt;=</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>                <span class="c">\* But we won&#39;t need to block</span>
</span><span class="line">  <span class="c">\* If we wrote all the data we intended to, we&#39;ll return without blocking:</span>
</span><span class="line">  <span class="o">\/</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;sender_check_notify_data&quot;</span><span class="p">,</span> <span class="s">&quot;sender_notify_data&quot;</span><span class="ni">}</span>
</span><span class="line">     <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="ni">=</span> <span class="m">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>By talking about accuracy instead of the write limit, I was also able to include &quot;Done&quot; in with
the other happy cases.
Before, that had to be treated as a possible problem because the sender can't use the full buffer when it's Done.</p>
<p>With this change, the proof of <code>Spec =&gt; []I</code> became much simpler (384 lines shorter).
And most of the remaining steps were trivial.</p>
<p>The <code>ReadLimit</code> and <code>WriteLimit</code> idea still seemed useful,
and I found I was able to prove the same things from the simpler <code>I</code>.
For example, we can still conclude this, even though <code>I</code> no longer mentions <code>WriteLimit</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> 
</span><span class="line">  <span class="o">/\</span> <span class="n">I</span> <span class="o">/\</span> <span class="n">SenderLive</span> <span class="o">/\</span> <span class="n">ReceiverLive</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="s">\in</span> <span class="ni">{</span><span class="s">&quot;recv_await_data&quot;</span><span class="ni">}</span> <span class="o">/\</span> <span class="o">~</span><span class="n">DataReadyInt</span>
</span><span class="line">  <span class="ni">=&gt;</span> <span class="n">WriteLimit</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">Got</span><span class="p">)</span> <span class="o">+</span> <span class="n">BufferSize</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That's nice, because it keeps the invariant and its proofs simple,
but we still get the same result in the end.</p>
<p>I initially defined <code>WriteLimit</code> to be the number of bytes the sender <em>could</em> write if
the sending application wanted to send enough data,
but I later changed it to be the actual number of bytes it <em>would</em> write if the application didn't
try to send any more.
This is because otherwise, with packet-based sends
(where we only write when the buffer has enough space for the whole message at once)
<code>WriteLimit</code> could go down.
For example, we think we can write another 3 bytes,
but then the application decides to write 10 bytes and now we can't write anything more.</p>
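<p>As a rough illustration of why the <em>could</em> reading is not monotonic for packet-based sends, here is a minimal sketch.
The operators <code>PacketWritable</code>, <code>CouldWrite</code> and <code>WouldWrite</code> are hypothetical names made up for this example.
They are not part of the vchan specification, and the real <code>WriteLimit</code> is defined in terms of the protocol state rather than these helpers.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">---------------------------- MODULE PacketLimitSketch ----------------------------
EXTENDS Naturals

\* Hypothetical helper: a packet-based sender writes a message only if the
\* whole message fits in the free space; otherwise it writes nothing.
PacketWritable(free, msgLen) == IF msgLen &lt;= free THEN msgLen ELSE 0

\* &quot;Could&quot; reading: assume the application picks the most favourable message,
\* which is one of exactly `free` bytes.
CouldWrite(free) == PacketWritable(free, free)

\* &quot;Would&quot; reading: only count the message the application has actually
\* committed to (msgLen = 0 if it has not asked to send anything).
WouldWrite(free, msgLen) == PacketWritable(free, msgLen)

\* With 3 bytes free and nothing queued:
\*   CouldWrite(3) = 3        WouldWrite(3, 0) = 0
\* Once the application commits to a 10-byte message, the &quot;could&quot; reading
\* has to drop from 3 to PacketWritable(3, 10) = 0, whereas the &quot;would&quot;
\* reading goes from 0 to WouldWrite(3, 10) = 0 and so has not decreased.
===================================================================================
</code></pre></td></tr></tbody></table></div></figure>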
<p>The limit theorems above are useful properties,
but it would be good to have more confidence that <code>ReadLimit</code> and <code>WriteLimit</code> are correct.
I was able to prove some useful lemmas here.</p>
<p>First, <code>ReceiverRead</code> steps don't change <code>ReadLimit</code> (as long as the receiver hasn't closed
the connection):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">ReceiverReadPreservesReadLimit</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">ReceiverLive</span><span class="p">,</span> <span class="n">ReceiverRead</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">UNCHANGED</span> <span class="n">ReadLimit</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This gives us a good reason to think that <code>ReadLimit</code> is correct:</p>
<ul>
<li>When the receiver is blocked, it cannot read any more than it already has without help.
</li>
<li><code>ReadLimit</code> is defined to be <code>Len(Got)</code> then, so <code>ReadLimit</code> is obviously correct for this case.
</li>
<li>Since read steps preserve <code>ReadLimit</code>, this shows that <code>ReadLimit</code> is correct in all cases.
</li>
</ul>
<p>For example, if <code>ReadLimit = 5</code> and no other processes do anything,
then we will end up in a state with the receiver blocked and <code>ReadLimit = Len(Got) = 5</code>,
so we really did read a total of 5 bytes.</p>
<p>I was also able to prove that <code>ReadLimit</code> never decreases (unless the receiver closes the connection):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span> <span class="n">ReadLimitMonotonic</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">Next</span><span class="p">,</span> <span class="n">ReceiverLive</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">ReadLimit</span><span class="err">&#39;</span> <span class="o">&gt;=</span> <span class="n">ReadLimit</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>So, if <code>ReadLimit = n</code> then it will always be at least <code>n</code>,
and if the receiver ever blocks then it will have read at least <code>n</code> bytes.</p>
<p>I was able to prove similar properties about <code>WriteLimit</code>.
So, I feel reasonably confident that these limit predictions are correct.</p>
<p>Disappointingly, we can't actually prove <code>Availability</code> using TLAPS,
because currently it understands very little temporal logic (see <a href="https://github.com/tlaplus/v2-tlapm/blob/c0ea83d8481e9dffbcbc5b54822c0e235ff59153/library/TLAPS.tla#L312">TLAPS limitations</a>).
However, I could show that the system can't deadlock while there's data to be transmitted:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="cm">(* We can&#39;t get into a state where the sender and receiver are both blocked</span>
</span><span class="line"><span class="cm">   and there is no wakeup pending: *)</span>
</span><span class="line"><span class="n">THEOREM</span> <span class="n">DeadlockFree1</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="o">~</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_blocked&quot;</span>
</span><span class="line">           <span class="o">/\</span> <span class="o">~</span><span class="n">SpaceAvailableInt</span> <span class="o">/\</span> <span class="n">SenderLive</span>
</span><span class="line">           <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_await_data&quot;</span>
</span><span class="line">           <span class="o">/\</span> <span class="o">~</span><span class="n">DataReadyInt</span> <span class="o">/\</span> <span class="n">ReceiverLive</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">ASSUME</span> <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_blocked&quot;</span>
</span><span class="line">                    <span class="o">/\</span> <span class="o">~</span><span class="n">SpaceAvailableInt</span> <span class="o">/\</span> <span class="n">SenderLive</span>
</span><span class="line">                    <span class="o">/\</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_await_data&quot;</span>
</span><span class="line">                    <span class="o">/\</span> <span class="o">~</span><span class="n">DataReadyInt</span> <span class="o">/\</span> <span class="n">ReceiverLive</span>
</span><span class="line">             <span class="n">PROVE</span>  <span class="bp">FALSE</span>
</span><span class="line">    <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">NotifyFlagsCorrect</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">NotifyRead</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">NotifyFlagsCorrect</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">NotifyWrite</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">have</span> <span class="ni">=</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">IntegrityI</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line">    <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">NotifyFlagsCorrect</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SenderInfoAccurate</span> <span class="o">/\</span> <span class="n">ReaderInfoAccurate</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">free</span> <span class="ni">=</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">IntegrityI</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">=</span> <span class="n">BufferSize</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">SenderInfoAccurate</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">=</span> <span class="m">0</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">ReaderInfoAccurate</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">BufferSizeType</span>
</span><span class="line">
</span><span class="line"><span class="cm">(* We can&#39;t get into a state where the sender is idle and the receiver is blocked</span>
</span><span class="line"><span class="cm">   unless the buffer is empty (all data sent has been consumed): *)</span>
</span><span class="line"><span class="n">THEOREM</span> <span class="n">DeadlockFree2</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_ready&quot;</span><span class="p">,</span> <span class="n">SenderLive</span><span class="p">,</span>
</span><span class="line">         <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;recv_await_data&quot;</span><span class="p">,</span> <span class="n">ReceiverLive</span><span class="p">,</span>
</span><span class="line">         <span class="o">~</span><span class="n">DataReadyInt</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Len</span><span class="p">(</span><span class="n">Buffer</span><span class="p">)</span> <span class="ni">=</span> <span class="m">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I've included the proof of <code>DeadlockFree1</code> above:</p>
<ul>
<li>To show deadlock can't happen, it suffices to assume it has happened and show a contradiction.
</li>
<li>If both processes are blocked then <code>NotifyRead</code> and <code>NotifyWrite</code> must both be set
(because processes don't block without setting them,
and if they'd been unset then an interrupt would now be pending and we wouldn't be blocked).
</li>
<li>Since <code>NotifyRead</code> is still set,
the sender is correct in thinking that the buffer is still full.
</li>
<li>Since <code>NotifyWrite</code> is still set,
the receiver is correct in thinking that the buffer is still empty.
</li>
<li>That would be a contradiction, since <code>BufferSize</code> isn't zero.
</li>
</ul>
<p>If it doesn't deadlock, then some process must keep getting woken up by interrupts,
which means that interrupts keep being sent.
We only send interrupts after making progress (writing to the buffer or reading from it),
so we must keep making progress.
We'll have to content ourselves with that argument.</p>
<h2 id="experiences-with-tlaps">Experiences with TLAPS</h2>
<p>The toolbox doesn't come with the proof system, so you need to install it separately.
The instructions are out-of-date and have a lot of broken links.
In May, I turned the steps into a Dockerfile, which got it partly installed, and asked on the TLA group for help,
but no-one else seemed to know how to install it either.
By looking at the error messages and searching the web for programs with the same names, I finally managed to get it working in December.
If you have trouble installing it too, try using <a href="https://github.com/talex5/tla">my Docker image</a>.</p>
<p>Once installed, you can write a proof in the toolbox and then press Ctrl-G, Ctrl-G to check it.
On success, the proof turns green. On failure, the failing step turns red.
You can also do the Ctrl-G, Ctrl-G combination on a single step to check just that step.
That's useful, because it's pretty slow.
It takes more than 10 minutes to check the complete specification.</p>
<p>TLA proofs are done in the mathematical style,
which is to write a set of propositions and vaguely suggest that thinking about these will lead you to the proof.
This is good for building intuition, but bad for reproducibility.
A mathematical proof is considered correct if the reader is convinced by it, which depends on the reader.
In this case, the &quot;reader&quot; is a collection of automated theorem-provers with various timeouts.
This means that whether a proof is correct or not depends on how fast your computer is,
how many programs are currently running, etc.
A proof might pass one day and fail the next.
Some proof steps consistently pass when you try them individually,
but consistently fail when checked as part of the whole proof.
If a step fails, you need to break it down into smaller steps.</p>
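<p>As a sketch of what breaking a step down looks like, here is a deliberately tiny, made-up theorem (it is not from the vchan proof, and the module and theorem names are invented for illustration).
Something this small should not need splitting at all, but the shape is the same in a real proof:
state the fact you were hoping the provers would find for themselves as its own step, then cite it by name.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">---------------------------- MODULE SplitSketch ----------------------------
EXTENDS Naturals

\* One-shot version: the back-end provers must find the whole argument
\* within their timeout.
THEOREM OneShot ==
  ASSUME NEW a \in Nat, NEW b \in Nat, NEW c \in Nat,
         a &lt;= b, b &lt; c
  PROVE  a + 1 &lt;= c
OBVIOUS

\* Broken-down version: each step is a smaller obligation, and the QED
\* step only has to combine the named fact with the types of a and c.
THEOREM Split ==
  ASSUME NEW a \in Nat, NEW b \in Nat, NEW c \in Nat,
         a &lt;= b, b &lt; c
  PROVE  a + 1 &lt;= c
&lt;1&gt;1. a &lt; c OBVIOUS
&lt;1&gt;2. QED BY &lt;1&gt;1
=============================================================================
</code></pre></td></tr></tbody></table></div></figure>
<p>Each numbered step can then be checked on its own, which also makes it easier to see which part of the argument the provers are struggling with.</p>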
<p>Sometimes the proof system is very clever, and immediately solves complex steps.
For example, here is the proof that the <code>SenderClose</code> process (which represents the sender closing the channel)
preserves the invariant <code>I</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">LEMMA</span> <span class="n">SenderClosePreservesI</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">I</span> <span class="o">/\</span> <span class="n">SenderClose</span> <span class="ni">=&gt;</span> <span class="n">I</span><span class="err">&#39;</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">SUFFICES</span> <span class="n">ASSUME</span> <span class="n">I</span><span class="p">,</span> <span class="n">SenderClose</span>
</span><span class="line">             <span class="n">PROVE</span>  <span class="n">I</span><span class="err">&#39;</span>
</span><span class="line">    <span class="n">OBVIOUS</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">IntegrityI</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">I</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">TypeOK</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">IntegrityI</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span> <span class="n">PCOK</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">IntegrityI</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">1</span><span class="o">.</span> <span class="k k-Conditional">CASE</span> <span class="n">sender_open</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">USE</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">1</span> <span class="n">DEF</span> <span class="n">sender_open</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">UNCHANGED</span> <span class="ni">&lt;&lt;</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">],</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">],</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverCloseID</span><span class="p">]</span> <span class="ni">&gt;&gt;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">PCOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">pc</span><span class="err">&#39;</span><span class="p">[</span><span class="n">SenderCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;sender_notify_closed&quot;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">PCOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">TypeOK</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">TypeOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">PCOK</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">PCOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">IntegrityI</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">IntegrityI</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">NotifyFlagsCorrect</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">NotifyFlagsCorrect</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">I</span><span class="p">,</span> <span class="n">SenderInfoAccurate</span><span class="p">,</span> <span class="n">ReaderInfoAccurate</span><span class="p">,</span> <span class="n">CloseOK</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">2</span><span class="o">.</span> <span class="k k-Conditional">CASE</span> <span class="n">sender_notify_closed</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">USE</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">2</span> <span class="n">DEF</span> <span class="n">sender_notify_closed</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">UNCHANGED</span> <span class="ni">&lt;&lt;</span> <span class="n">pc</span><span class="p">[</span><span class="n">SenderWriteID</span><span class="p">],</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverReadID</span><span class="p">],</span> <span class="n">pc</span><span class="p">[</span><span class="n">ReceiverCloseID</span><span class="p">]</span> <span class="ni">&gt;&gt;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">PCOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">pc</span><span class="err">&#39;</span><span class="p">[</span><span class="n">SenderCloseID</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;Done&quot;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">PCOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">TypeOK</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">TypeOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">PCOK</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">PCOK</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">IntegrityI</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">IntegrityI</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">NotifyFlagsCorrect</span><span class="err">&#39;</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">NotifyFlagsCorrect</span><span class="p">,</span> <span class="n">I</span>
</span><span class="line">      <span class="ni">&lt;</span><span class="m">2</span><span class="ni">&gt;</span> <span class="n">QED</span> <span class="n">BY</span> <span class="n">DEF</span> <span class="n">I</span><span class="p">,</span> <span class="n">SenderInfoAccurate</span><span class="p">,</span> <span class="n">ReaderInfoAccurate</span><span class="p">,</span> <span class="n">CloseOK</span>
</span><span class="line"><span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">3</span><span class="o">.</span> <span class="n">QED</span>
</span><span class="line">  <span class="n">BY</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">1</span><span class="p">,</span> <span class="ni">&lt;</span><span class="m">1</span><span class="ni">&gt;</span><span class="m">2</span> <span class="n">DEF</span> <span class="n">SenderClose</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A step such as <code>IntegrityI' BY DEF IntegrityI</code> says
&quot;You can see that <code>IntegrityI</code> will be true in the next step just by looking at its definition&quot;.
So this whole lemma is really just saying &quot;it's obvious&quot;.
And TLAPS agrees.</p>
<p>At other times, TLAPS can be maddeningly stupid.
And it can't tell you what the problem is - it can only make things go red.</p>
<p>For example, this fails:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">pc</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="p">[</span><span class="n">pc</span> <span class="n">EXCEPT</span> <span class="err">!</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;l2&quot;</span><span class="p">],</span>
</span><span class="line">         <span class="n">pc</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;l1&quot;</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">pc</span><span class="err">&#39;</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;l1&quot;</span>
</span><span class="line"><span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We're trying to say that <code>pc[2]</code> is unchanged, given that <code>pc'</code> is the same as <code>pc</code> except that we changed <code>pc[1]</code>.
The problem is that TLA is an untyped language.
Even though we know we did a mapping update to <code>pc</code>,
that isn't enough (apparently) to conclude that <code>pc</code> is in fact a mapping.
To fix it, you need:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">pc</span> <span class="s">\in</span> <span class="p">[</span><span class="n">Nat</span> <span class="o">-&gt;</span> <span class="n">STRING</span><span class="p">],</span>
</span><span class="line">         <span class="n">pc</span><span class="err">&#39;</span> <span class="ni">=</span> <span class="p">[</span><span class="n">pc</span> <span class="n">EXCEPT</span> <span class="err">!</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;l2&quot;</span><span class="p">],</span>
</span><span class="line">         <span class="n">pc</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;l1&quot;</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">pc</span><span class="err">&#39;</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="ni">=</span> <span class="s">&quot;l1&quot;</span>
</span><span class="line"><span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The extra <code>pc \in [Nat -&gt; STRING]</code> tells TLA the type of the <code>pc</code> variable.
I found missing type information to be the biggest problem when doing proofs,
because you just automatically assume that the computer will know the types of things.
Another example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">y</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span>
</span><span class="line">         <span class="n">x</span> <span class="o">+</span> <span class="n">Min</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="m">10</span><span class="p">)</span> <span class="ni">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Min</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="m">10</span><span class="p">)</span> <span class="ni">=</span> <span class="n">y</span>
</span><span class="line"><span class="n">OBVIOUS</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We're just trying to remove the <code>x + ...</code> from both sides of the equation.
The problem is, TLA doesn't know that <code>Min(y, 10)</code> is a number,
so it doesn't know whether the normal laws of addition apply in this case.
It can't tell you that, though - it can only go red.
Here's the solution:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">THEOREM</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">x</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">y</span> <span class="s">\in</span> <span class="n">Nat</span><span class="p">,</span>
</span><span class="line">         <span class="n">x</span> <span class="o">+</span> <span class="n">Min</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="m">10</span><span class="p">)</span> <span class="ni">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
</span><span class="line">  <span class="n">PROVE</span>  <span class="n">Min</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="m">10</span><span class="p">)</span> <span class="ni">=</span> <span class="n">y</span>
</span><span class="line"><span class="n">BY</span> <span class="n">DEF</span> <span class="n">Min</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>BY DEF Min</code> tells TLAPS to share the definition of <code>Min</code> with the solvers.
Then they can see that <code>Min(y, 10)</code> must be a natural number too and everything works.</p>
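<p>If several steps need this fact, one option is to state it once as a small typing lemma and cite that, rather than expanding <code>Min</code> each time.
This is only a sketch: the name <code>MinInNat</code> is made up for illustration, and it assumes <code>Min</code> has the usual definition shown in the comment.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">\* Assumed definition (not shown above): Min(a, b) == IF a &lt; b THEN a ELSE b

\* Hypothetical helper lemma: the minimum of two naturals is a natural.
LEMMA MinInNat ==
  ASSUME NEW a \in Nat, NEW b \in Nat
  PROVE  Min(a, b) \in Nat
BY DEF Min
</code></pre></td></tr></tbody></table></div></figure>
<p>A step that only needs to know that <code>Min(y, 10)</code> is a number could then try <code>BY MinInNat</code>, which keeps the body of <code>Min</code> hidden from the solvers.</p>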
<p>Another annoyance is that sometimes it can't find the right lemma to use,
even when you tell it exactly what it needs.
Here's an extreme case:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">LEMMA</span> <span class="n">TransferFacts</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">src</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">src2</span><span class="p">,</span>   <span class="c">\* (TLAPS doesn&#39;t cope with &quot;NEW VARIABLE src&quot;)</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">dst</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">dst2</span><span class="p">,</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">i</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">),</span>
</span><span class="line">         <span class="n">src</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst2</span> <span class="ni">=</span> <span class="n">dst</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
</span><span class="line">         <span class="n">src2</span> <span class="ni">=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"> <span class="n">PROVE</span>  <span class="o">/\</span> <span class="n">src2</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">src2</span><span class="p">)</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">)</span> <span class="o">-</span> <span class="n">i</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">dst2</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">dst2</span><span class="p">)</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">dst</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="p">(</span><span class="n">dst</span> <span class="nb">\o</span> <span class="n">src</span><span class="p">)</span>
</span><span class="line"><span class="n">PROOF</span> <span class="n">OMITTED</span>
</span><span class="line">
</span><span class="line"><span class="n">LEMMA</span> <span class="n">SameAgain</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">src</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">src2</span><span class="p">,</span>   <span class="c">\* (TLAPS doesn&#39;t cope with &quot;NEW VARIABLE src&quot;)</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">dst</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">dst2</span><span class="p">,</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">i</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">),</span>
</span><span class="line">         <span class="n">src</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst2</span> <span class="ni">=</span> <span class="n">dst</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
</span><span class="line">         <span class="n">src2</span> <span class="ni">=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"> <span class="n">PROVE</span>  <span class="o">/\</span> <span class="n">src2</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">src2</span><span class="p">)</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">)</span> <span class="o">-</span> <span class="n">i</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">dst2</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">dst2</span><span class="p">)</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">dst</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span>
</span><span class="line">        <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="p">(</span><span class="n">dst</span> <span class="nb">\o</span> <span class="n">src</span><span class="p">)</span>
</span><span class="line"><span class="n">BY</span> <span class="n">TransferFacts</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>TransferFacts</code> states some useful facts about transferring data between two variables.
You can prove that quite easily.
<code>SameAgain</code> is identical in every way, and just refers to <code>TransferFacts</code> for the proof.
But even with only one lemma to consider - one that matches all the assumptions and conclusions perfectly -
none of the solvers could figure this one out!</p>
<p>My eventual solution was to name the bundle of results.
This works:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">TransferResults</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src2</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">dst2</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="ni">==</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">src2</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">src2</span><span class="p">)</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">)</span> <span class="o">-</span> <span class="n">i</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">dst2</span> <span class="s">\in</span> <span class="n">MESSAGE</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">Len</span><span class="p">(</span><span class="n">dst2</span><span class="p">)</span> <span class="ni">=</span> <span class="n">Len</span><span class="p">(</span><span class="n">dst</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span>
</span><span class="line">  <span class="o">/\</span> <span class="n">UNCHANGED</span> <span class="p">(</span><span class="n">dst</span> <span class="nb">\o</span> <span class="n">src</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="n">LEMMA</span> <span class="n">TransferFacts</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">src</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">src2</span><span class="p">,</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">dst</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">dst2</span><span class="p">,</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">i</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">),</span>
</span><span class="line">         <span class="n">src</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst2</span> <span class="ni">=</span> <span class="n">dst</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
</span><span class="line">         <span class="n">src2</span> <span class="ni">=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"> <span class="n">PROVE</span>   <span class="n">TransferResults</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src2</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">dst2</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"><span class="n">PROOF</span> <span class="n">OMITTED</span>
</span><span class="line">
</span><span class="line"><span class="n">LEMMA</span> <span class="n">SameAgain</span> <span class="ni">==</span>
</span><span class="line">  <span class="n">ASSUME</span> <span class="n">NEW</span> <span class="n">src</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">src2</span><span class="p">,</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">dst</span><span class="p">,</span> <span class="n">NEW</span> <span class="n">dst2</span><span class="p">,</span>
</span><span class="line">         <span class="n">NEW</span> <span class="n">i</span> <span class="s">\in</span> <span class="m">1</span><span class="o">..</span><span class="n">Len</span><span class="p">(</span><span class="n">src</span><span class="p">),</span>
</span><span class="line">         <span class="n">src</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst</span> <span class="s">\in</span> <span class="n">MESSAGE</span><span class="p">,</span>
</span><span class="line">         <span class="n">dst2</span> <span class="ni">=</span> <span class="n">dst</span> <span class="nb">\o</span> <span class="n">Take</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">),</span>
</span><span class="line">         <span class="n">src2</span> <span class="ni">=</span> <span class="n">Drop</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"> <span class="n">PROVE</span>   <span class="n">TransferResults</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src2</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">dst2</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span><span class="line"><span class="n">BY</span> <span class="n">TransferFacts</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Most of the art of using TLAPS is in controlling how much information to share with the provers.
Too little (such as failing to provide the definition of <code>Min</code>) and they don't have enough information to find the proof.
Too much (such as providing the definition of <code>TransferResults</code>) and they get overwhelmed and fail to find the proof.</p>
<p>It's all a bit frustrating, but it does work,
and being machine checked does give you some confidence that your proofs are actually correct.</p>
<p>Another, perhaps more important, benefit of machine checked proofs is that
when you decide to change something in the specification you can just ask it to re-check everything.
Go and have a cup of tea, and when you come back it will have highlighted in red any steps that need to be updated.
I made a lot of changes, and this worked very well.</p>
<p>The TLAPS philosophy is that</p>
<blockquote>
<p>If you are concerned with an algorithm or system, you should not be spending your time proving basic mathematical facts.
Instead, you should assert the mathematical theorems you need as assumptions or theorems.</p>
</blockquote>
<p>So even if you can't find a formal proof of every step, you can still use TLAPS to break it down into steps that you
can either prove or that you think are obvious enough not to require a proof.
However, I was able to prove everything I needed for the vchan specification within TLAPS.</p>
<h2 id="the-final-specification">The final specification</h2>
<p>I did a little bit of tidying up at the end.
In particular, I removed the <code>want</code> variable from the specification.
I didn't like it because it doesn't correspond to anything in the OCaml implementation,
and the only place the algorithm uses it is to decide whether to set <code>NotifyWrite</code>,
which I thought might be wrong anyway.</p>
<p>I changed this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">recv_got_len</span><span class="p">:</span>           <span class="n">if</span> <span class="p">(</span><span class="n">have</span> <span class="o">&gt;=</span> <span class="n">want</span><span class="p">)</span> <span class="n">goto</span> <span class="n">recv_read_data</span>
</span><span class="line">                        <span class="n">else</span> <span class="n">NotifyWrite</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>to:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="tla"><span class="line"><span class="n">recv_got_len</span><span class="p">:</span>           <span class="n">either</span> <span class="ni">{</span>
</span><span class="line">                          <span class="n">if</span> <span class="p">(</span><span class="n">have</span> <span class="ni">&gt;</span> <span class="m">0</span><span class="p">)</span> <span class="n">goto</span> <span class="n">recv_read_data</span>
</span><span class="line">                          <span class="n">else</span> <span class="n">NotifyWrite</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>
</span><span class="line">                        <span class="ni">}</span> <span class="n">or</span> <span class="ni">{</span>
</span><span class="line">                          <span class="n">NotifyWrite</span> <span class="o">:=</span> <span class="bp">TRUE</span><span class="p">;</span>
</span><span class="line">                        <span class="ni">}</span><span class="p">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That always allows an implementation to set <code>NotifyWrite</code> if it wants to,
or to skip that step just as long as <code>have &gt; 0</code>.
That covers the current C behaviour, my proposed C behaviour, and the OCaml implementation.
It also simplifies the invariant, and even made the proofs shorter!</p>
<p>I put the final specification online at <a href="https://github.com/talex5/spec-vchan">spec-vchan</a>.
I also configured Travis CI to check all the models and verify all the proofs.
That's useful because sometimes I'm too impatient to recheck everything on my laptop before pushing updates.</p>
<p>You can generate a PDF version of the specification with <code>make pdfs</code>.
Expressions there can be a little easier to read because they use proper symbols, but
it also breaks things up into pages, which is highly annoying.
It would be nice if it could omit the proofs too, as they're really only useful if you're trying to edit them.
I'd rather just see the statement of each theorem.</p>
<h2 id="the-original-bug">The original bug</h2>
<p>With my new understanding of vchan, I couldn't see anything obviously wrong with the C code
(at least, as long as you keep the connection open, which the firewall does).</p>
<p>I then took a look at <a href="https://github.com/mirage/ocaml-vchan">ocaml-vchan</a>.
The first thing I noticed was that someone had commented out all the memory barriers,
noting in the Git log that they weren't needed on x86.
I am using x86, so that's not it, but I filed a bug about it anyway: <a href="https://github.com/mirage/ocaml-vchan/issues/122">Missing memory barriers</a>.</p>
<p>The other strange thing I saw was the behaviour of the <code>read</code> function.
It claims to implement the Mirage <code>FLOW</code> interface, which says that <code>read</code>
&quot;blocks until some data is available and returns a fresh buffer containing it&quot;.
However, looking at the code, what it actually does is to return a pointer directly into the shared buffer.
It then delays updating the consumer counter until the <em>next</em> call to <code>read</code>.
That's rather dangerous, and I filed another bug about that: <a href="https://github.com/mirage/ocaml-vchan/issues/119">Read has very surprising behaviour</a>.
However, when I checked the <code>mirage-qubes</code> code, it just takes this buffer and <a href="https://github.com/mirage/mirage-qubes/blob/ea900d5ac93278a43150cd21ced407806416681c/lib/msg_chan.ml#L34">makes a copy of it</a> immediately.
So that's not the bug either.</p>
<p>Also, the original bug report mentioned a 10 second timeout,
and neither the C implementation nor the OCaml one had any timeouts.
Time to look at QubesDB itself.</p>
<p>QubesDB accepts messages from either the guest VM (the firewall) or from local clients connected over Unix domain sockets.
The basic structure is:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
</span><span class="line">  <span class="k">await</span> <span class="n">vchan</span> <span class="n">event</span><span class="p">,</span> <span class="n">local</span> <span class="n">client</span> <span class="n">data</span><span class="p">,</span> <span class="ow">or</span> <span class="mi">10</span> <span class="n">second</span> <span class="n">timeout</span>
</span><span class="line">  <span class="k">while</span> <span class="n">vchan</span><span class="o">.</span><span class="n">receive_buffer</span> <span class="n">non</span><span class="o">-</span><span class="n">empty</span><span class="p">:</span>
</span><span class="line">    <span class="n">handle_vchan_data</span><span class="p">()</span>
</span><span class="line">  <span class="k">for</span> <span class="n">each</span> <span class="n">ready</span> <span class="n">client</span><span class="p">:</span>
</span><span class="line">    <span class="n">handle_client_data</span><span class="p">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The suspicion was that QubesDB was missing a vchan event,
and only discovering the data in the buffer later, when the timeout woke it up.
Looking at the code, it does seem to me that there is a possible race condition here:</p>
<ol>
<li>A local client asks to send some data.
</li>
<li><code>handle_client_data</code> sends the data to the firewall using a blocking write.
</li>
<li>The firewall sends a message to QubesDB at the same time and signals an event because the firewall-to-db buffer has data.
</li>
<li>QubesDB gets the event but ignores it because it's doing a blocking write and there's still no space in the db-to-firewall direction.
</li>
<li>The firewall updates its consumer counter and signals another event, because the buffer now has space.
</li>
<li>The blocking write completes and QubesDB returns to the main loop.
</li>
<li>QubesDB goes to sleep for 10 seconds, without checking the buffer.
</li>
</ol>
<p>I don't think this is the cause of the bug though,
because the only messages the firewall might be sending here are <code>QDB_RESP_OK</code> messages,
and QubesDB just discards such messages.</p>
<p>I managed to reproduce the problem myself,
and saw that the 10 second timeout isn't actually what lets QubesDB make progress.
It just tries to go back to sleep for another 10 seconds and then
immediately gets woken up by a message from a local client.
So, it looks like QubesDB is only sending updates every 10 seconds because its client, <code>qubesd</code>,
is only asking it to send updates every 10 seconds!
And looking at the <code>qubesd</code> logs, I saw stacktraces about libvirt failing to attach network devices, so
I read the Xen network device attachment specification to check that the firewall implemented that correctly.</p>
<p>I'm kidding, of course.
There isn't any such specification.
But maybe this blog post will inspire someone to write one...</p>
<h2 id="conclusions">Conclusions</h2>
<p>As users of open source software, we're encouraged to look at the source code and check that it's correct ourselves.
But that's pretty difficult without a specification saying what things are <em>supposed</em> to do.
Often I deal with this by learning just enough to fix whatever bug I'm working on,
but this time I decided to try making a proper specification instead.
Making the TLA specification took rather a long time, but it was quite pleasant.
Hopefully the next person who needs to know about vchan will appreciate it.</p>
<p>A TLA specification generally defines two sets of behaviours.
The first is the set of desirable behaviours (e.g. those where the data is delivered correctly).
This definition should clearly explain what users can expect from the system.
The second defines the behaviours of a particular algorithm.
This definition should make it easy to see how to implement the algorithm.
The TLC model checker can check that the algorithm's behaviours are all acceptable,
at least within some defined limits.</p>
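<p>To make the two definitions concrete, here is a deliberately tiny toy module, unrelated to vchan.
Everything in it (the module name, <code>count</code>, <code>Safe</code>) is invented for illustration; it only shows the shape of "desirable behaviours", "algorithm", and the claim connecting them.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">---- MODULE TwoSets ----
EXTENDS Naturals
VARIABLE count

\* Desirable behaviours: those in which the counter always stays within 0..3.
Safe == count \in 0..3

\* The algorithm: start at 0 and step up by one until reaching 3.
Init == count = 0
Next == count &lt; 3 /\ count' = count + 1
Spec == Init /\ [][Next]_count

\* The claim: every behaviour of the algorithm is a desirable one.
THEOREM Spec =&gt; []Safe
====
</code></pre></td></tr></tbody></table></div></figure>
<p>With TLC, a claim like this would be checked by giving <code>Spec</code> as the behaviour specification and <code>Safe</code> as an invariant
(this toy stops at 3 on purpose, so TLC's deadlock check would need to be turned off).</p>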
<p>Writing a specification using the TLA notation forces us to be precise about what we mean.
For example, in a prose specification we might say &quot;data sent will eventually arrive&quot;, but in an
executable TLA specification we're forced to clarify what happens if the connection is closed.
I would have expected that if a sender writes some data and then closes the connection then the data would still arrive,
but the C implementation of vchan does not always ensure that.
The TLC model checker can find a counter-example showing how this can fail in under a minute.</p>
<p>To explain why the algorithm always works, we need to find an inductive invariant.
The TLC model checker can help with this,
by presenting examples of unreachable states that satisfy the invariant but don't preserve it after taking a step.
We must add constraints to explain why these states are invalid.
This was easy for the <code>Integrity</code> invariant, which explains why we never receive incorrect data, but
I found it much harder to prove that the system cannot deadlock.
I suspect that the original designer of a system would find this step easy, as presumably they already know why it works.</p>
<p>Once we have found an inductive invariant, we can write a formal machine-checked proof that the invariant is always true.
Although TLAPS doesn't allow us to prove liveness properties directly,
I was able to prove various interesting things about the algorithm: it doesn't deadlock; when the sender is blocked, the receiver can read everything that has been sent; and when the receiver is blocked, the sender can fill the entire buffer.</p>
<p>Writing formal proofs is a little tedious, largely because TLA is an untyped language.
However, there is nothing particularly difficult about it,
once you know how to work around various limitations of the proof checkers.</p>
<p>You might imagine that TLA would only work on very small programs like libvchan, but this is not the case.
It's just a matter of deciding what to specify in detail.
For example, in this specification I didn't give any details about how ring buffers work,
but instead used a single <code>Buffer</code> variable to represent them.
For a specification of a larger system using vchan, I would model each channel using just <code>Sent</code> and <code>Got</code>
and an action that transferred some of the difference on each step.</p>
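<p>As a rough illustration of that idea, a channel abstracted to just <code>Sent</code> and <code>Got</code> might look something like this.
It is a sketch, not part of the vchan specification: the module name, the <code>Byte</code> constant and the action names are all made up here, and it assumes <code>Got</code> is meant to be a prefix of <code>Sent</code>.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="tla">---- MODULE ChannelSketch ----
EXTENDS Integers, Sequences

CONSTANT Byte            \* hypothetical: the set of values that can be sent
VARIABLES Sent, Got      \* everything sent so far / everything received so far

Init == Sent = &lt;&lt; &gt;&gt; /\ Got = &lt;&lt; &gt;&gt;

\* The sender appends a non-empty message to the stream.
Send ==
  \E m \in Seq(Byte) \ { &lt;&lt; &gt;&gt; } :
    /\ Sent' = Sent \o m
    /\ UNCHANGED Got

\* The channel delivers some (not necessarily all) of the outstanding data.
\* When nothing is outstanding, the range is empty and the action is disabled.
Transfer ==
  \E n \in 1..(Len(Sent) - Len(Got)) :
    /\ Got' = Got \o SubSeq(Sent, Len(Got) + 1, Len(Got) + n)
    /\ UNCHANGED Sent

Next == Send \/ Transfer

\* What a user of the channel relies on: the data received is a prefix of the data sent.
Integrity == SubSeq(Sent, 1, Len(Got)) = Got
====
</code></pre></td></tr></tbody></table></div></figure>
<p>To explore a model like this with TLC you would need to bound <code>Send</code> (for example by limiting the message length and the total amount sent), since <code>Seq(Byte)</code> is infinite.</p>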
<p>The TLA Toolbox has some rough edges.
The ones I found most troublesome were: the keyboard shortcuts frequently stop working;
when a temporal property is violated, it doesn't tell you which one it was; and
the model explorer tooltips appear right under the mouse pointer,
preventing you from scrolling with the mouse wheel.
It also likes to check its &quot;news feed&quot; on a regular basis.
It can't seem to do this at the same time as other operations,
and if you're in the middle of a particularly complex proof checking operation,
it will sometimes suddenly pop up a box suggesting that you cancel your job,
so that it can get back to reading the news.</p>
<p>However, it is improving.
In the latest versions, when you get a syntax error, it now tells you where in the file the error is.
And pressing Delete or Backspace while editing no longer causes it to crash and lose all unsaved data.
In general I feel that the TLA Toolbox is quite usable now.
If I were designing a new protocol, I would certainly use TLA to help with the design.</p>
<p>TLA does not integrate with any language type systems, so even after you have a specification
you still need to check manually that your code matches the spec.
It would be nice if you could check this automatically, somehow.</p>
<p>One final problem is that whenever I write a TLA specification, I feel the need to explain first what TLA is.
Hopefully it will become more popular and that problem will go away.</p>
<p>Update 2019-01-10: Marek Marczykowski-Górecki told me that the state model for network devices is the same as
the one for block devices, which is documented in the <code>blkif.h</code> block device header file, and provided libvirt debugging help -
so the bug is <a href="https://github.com/mirage/mirage-qubes/issues/25#issuecomment-452921207">now fixed</a>!</p>
]]></content>
  </entry>
  <entry>
    <title type="html">A Unikernel Firewall for QubesOS</title>
    <link href="https://roscidus.com/blog/blog/2016/01/01/a-unikernel-firewall-for-qubesos/"></link>
    <updated>2016-01-01T17:55:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2016/01/01/a-unikernel-firewall-for-qubesos</id>
    <content type="html"><![CDATA[<p>QubesOS provides a desktop operating system made up of multiple virtual machines, running under Xen.
To protect against buggy network drivers, the physical network hardware is accessed only by a dedicated (and untrusted) &quot;NetVM&quot;, which is connected to the rest of the system via a separate (trusted) &quot;FirewallVM&quot;.
This firewall VM runs Linux, processing network traffic with code written in C.</p>
<p>In this blog post, I replace the Linux firewall VM with a <a href="https://mirage.io/">MirageOS</a> unikernel.
The resulting VM uses safe (bounds-checked, type-checked) OCaml code to process network traffic,
uses less than a tenth of the memory of the default FirewallVM, boots several times faster,
and should be much simpler to audit or extend.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#qubes">Qubes</a>
<ul>
<li><a href="#qubes-networking">Qubes networking</a>
</li>
<li><a href="#problems-with-firewallvm">Problems with FirewallVM</a>
</li>
</ul>
</li>
<li><a href="#a-unikernel-firewall">A Unikernel Firewall</a>
<ul>
<li><a href="#booting-a-unikernel-on-qubes">Booting a Unikernel on Qubes</a>
</li>
<li><a href="#networking">Networking</a>
</li>
<li><a href="#the-xen-virtual-network-layer">The Xen virtual network layer</a>
</li>
<li><a href="#the-ethernet-layer">The Ethernet layer</a>
</li>
<li><a href="#the-ip-layer">The IP layer</a>
</li>
<li><a href="#evaluation">Evaluation</a>
</li>
<li><a href="#exercises">Exercises</a>
</li>
</ul>
</li>
<li><a href="#summary">Summary</a>
</li>
</ul>
<p>( this post also appeared on <a href="https://www.reddit.com/r/programming/comments/3z1n44/a_unikernel_firewall_for_qubesos/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=10822880">Hacker News</a> )</p>
<h2 id="qubes">Qubes</h2>
<p><a href="https://www.qubes-os.org/">QubesOS</a> is a security-focused desktop operating system that uses virtual machines to isolate applications from each other. The screenshot below shows my current desktop. The windows with green borders are running Fedora in my &quot;comms&quot; VM, which I use for Gmail and similar trusted sites (with NoScript). The blue windows are from a Debian VM which I use for software development. The red windows are from another Fedora VM, which I use for general browsing (with Flash, etc) and running various untrusted applications:</p>
<p><img src="/blog/images/qubes/qubes-desktop.png" class="center"/></p>
<p>Another Fedora VM (&quot;dom0&quot;) runs the window manager and drives most of the physical hardware (mouse, keyboard, screen, disks, etc).</p>
<p>Networking is a particularly dangerous activity, since attacks can come from anywhere in the world and handling network hardware and traffic is complex.
Qubes therefore uses two extra VMs for networking:</p>
<ul>
<li>
<p>NetVM drives the physical network device directly. It runs network-manager and provides the system tray applet for configuring the network.</p>
</li>
<li>
<p>FirewallVM sits between the application VMs and NetVM. It implements a firewall and router.</p>
</li>
</ul>
<p>The full system looks something like this:</p>
<p><img src="/blog/images/qubes/vms.png" class="center"/></p>
<p>The lines between VMs in the diagram above represent network connections.
If NetVM is compromised (e.g. by exploiting a bug in the kernel module driving the wifi card) then the system as a whole can still be considered secure - the attacker is still outside the firewall.</p>
<p>Besides traditional networking, all VMs can communicate with dom0 via some Qubes-specific protocols.
These are used to display window contents, tell VMs about their configuration, and provide direct channels between VMs where appropriate.</p>
<h3 id="qubes-networking">Qubes networking</h3>
<p>There are three IP networks in the default configuration:</p>
<p><img src="/blog/images/qubes/qubes-net.png" class="center"/></p>
<ul>
<li><code>192.168.1.*</code> is the external network (to my house router).
</li>
<li><code>10.137.1.*</code> is a virtual network connecting NetVM to the firewalls (you can have multiple firewall VMs).
</li>
<li><code>10.137.2.*</code> connects the app VMs to the default FirewallVM.
</li>
</ul>
<p>Both NetVM and FirewallVM perform <a href="https://en.wikipedia.org/wiki/Network_address_translation">NAT</a>, so packets from &quot;comms&quot; appear to NetVM to have been sent by the firewall, and packets from the firewall appear to my house router to have come from NetVM.</p>
<p>Each of the AppVMs is configured to use the firewall (<code>10.137.2.1</code>) as its DNS resolver.
FirewallVM uses an iptables rule to forward DNS traffic to its resolver, which is NetVM.</p>
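<p>To make the addressing concrete, here is a minimal sketch that works out which of the three networks an address belongs to. It is for illustration only and is not part of Qubes or the firewall: the names <code>ip_of_string</code>, <code>networks</code> and <code>network_of</code> are made up, it assumes each network is a /24 (as the <code>*</code> patterns above suggest), and it does not range-check the individual octets.</p>
<pre><code class="ocaml">(* Parse a dotted-quad IPv4 address into an int (assumes a 64-bit system). *)
let ip_of_string s =
  Scanf.sscanf s &quot;%d.%d.%d.%d%!&quot; (fun a b c d -&gt;
    (a lsl 24) lor (b lsl 16) lor (c lsl 8) lor d)

(* The three networks listed above, each assumed to be a /24. *)
let networks = [
  &quot;external&quot;, ip_of_string &quot;192.168.1.0&quot;;
  &quot;firewalls&quot;, ip_of_string &quot;10.137.1.0&quot;;
  &quot;clients&quot;, ip_of_string &quot;10.137.2.0&quot;;
]

let mask24 = 0xffffff00

(* Return the name of the network containing [addr], if any. *)
let network_of addr =
  let ip = ip_of_string addr in
  List.find_opt (fun (_, base) -&gt; ip land mask24 = base) networks
  |&gt; Option.map fst
</code></pre>
<p>With these definitions, <code>network_of &quot;10.137.2.1&quot;</code> (the firewall's address on the client network) gives <code>Some &quot;clients&quot;</code>.</p>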
<h3 id="problems-with-firewallvm">Problems with FirewallVM</h3>
<p>After using Qubes for a while, I've found a number of things about the default FirewallVM that I'm unhappy about:</p>
<ul>
<li>It runs a full Linux system, which uses at least 300 MB of RAM. This seems excessive.
</li>
<li>It takes several seconds to boot.
</li>
<li>There is <a href="https://groups.google.com/forum/#!searchin/qubes-users/resolve.conf/qubes-users/3SVsCdiFvwA/SHBvvbUICQAJ">a race somewhere</a> setting up the DNS redirection. Adding some debug to track down the bug made it disappear.
</li>
<li>The iptables configuration is huge and hard to understand.
</li>
</ul>
<p>There is another, more serious, problem.
Xen virtual network devices are implemented as a client (&quot;netfront&quot;) and a server (&quot;netback&quot;), which are Linux kernel modules in sys-firewall.
In a traditional Xen system, the netback driver runs in dom0 and is fully trusted. It is coded to protect itself against misbehaving client VMs. Netfront, by contrast, assumes that netback is trustworthy.
The Xen developers only consider bugs in netback to be security critical.</p>
<p>In Qubes, NetVM acts as netback to FirewallVM, which acts as a netback in turn to its clients.
But in Qubes, NetVM is supposed to be untrusted! So, we have code running in kernel mode in the (trusted) FirewallVM that is talking to and trusting the (untrusted) NetVM!</p>
<p>For example, as the Qubes developers point out in <a href="https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-023-2015.txt">Qubes Security Bulletin #23</a>, the netfront code that processes responses from netback uses the request ID quoted by netback as an index into an array without even checking if it's in range (they have fixed this in their fork).</p>
<p>What can an attacker do once they've exploited FirewallVM's trusting netfront driver?
Presumably they now have complete control of FirewallVM.
At this point, they can simply reuse the same exploit to take control of the client VMs, which are running the same trusting netfront code!</p>
<h2 id="a-unikernel-firewall">A Unikernel Firewall</h2>
<p>I decided to see whether I could replace the default firewall (&quot;sys-firewall&quot;) with a <a href="https://mirage.io/">MirageOS</a> unikernel.
A Mirage unikernel is an OCaml program compiled to run as an operating system kernel.
It pulls in just the code it needs, as libraries.
For example, my firewall doesn't require or use a hard disk, so it doesn't contain any code for dealing with block devices.</p>
<p>If you want to follow along, my code is on GitHub in my <a href="https://github.com/talex5/qubes-mirage-firewall">qubes-mirage-firewall</a> repository.
The README explains how to build it from source.
For testing, you can also just download the <a href="https://github.com/talex5/qubes-mirage-firewall/releases/download/0.1/mirage-firewall-bin-0.1.tar.bz2">mirage-firewall-bin-0.1.tar.bz2</a> binary kernel tarball.
dom0 doesn't have network access, but you can proxy the download through another VM:</p>
<pre><code>[tal@dom0 ~]$ cd /tmp
[tal@dom0 tmp]$ qvm-run -p sys-net 'wget -O - https://github.com/talex5/qubes-mirage-firewall/releases/download/0.1/mirage-firewall-bin-0.1.tar.bz2' &gt; mirage-firewall-bin-0.1.tar.bz2
[tal@dom0 tmp]$ tar tf mirage-firewall-bin-0.1.tar.bz2 
mirage-firewall/
mirage-firewall/vmlinuz
mirage-firewall/initramfs
mirage-firewall/modules.img
[tal@dom0 ~]$ cd /var/lib/qubes/vm-kernels/
[tal@dom0 vm-kernels]$ tar xf /tmp/mirage-firewall-bin-0.1.tar.bz2
</code></pre>
<p>The tarball contains <code>vmlinuz</code>, which is the unikernel itself, plus a couple of dummy files that Qubes requires to recognise it as a kernel (<code>modules.img</code> and <code>initramfs</code>).</p>
<p>Create a new ProxyVM named &quot;mirage-firewall&quot; to run the unikernel:</p>
<p><img src="/blog/images/qubes/create-mirage-firewall.png" class="center"/></p>
<ol>
<li>You can use any template, and make it standalone or not. It doesn't matter, since we don't use the hard disk.
</li>
<li>Set the type to <code>ProxyVM</code>.
</li>
<li>Select <code>sys-net</code> for networking (not <code>sys-firewall</code>).
</li>
<li>Click <code>OK</code> to create the VM.
</li>
<li>Go to the VM settings, and look in the &quot;Advanced&quot; tab.
<ul>
<li>Set the kernel to <code>mirage-firewall</code>.
</li>
<li>Turn off memory balancing and set the memory to 32 MB or so (you might have to fight a bit with the Qubes GUI to get it this low).
</li>
<li>Set VCPUs (number of virtual CPUs) to 1.
</li>
</ul>
</li>
</ol>
<p>(this installation mechanism is obviously not ideal; hopefully future versions of Qubes will be more unikernel-friendly)</p>
<p>You can run mirage-firewall alongside your existing sys-firewall and you can choose which AppVMs use which firewall using the GUI.
For example, to configure &quot;untrusted&quot; to use mirage-firewall:</p>
<p><img src="/blog/images/qubes/mirage-firewall.png" class="center"/></p>
<p>You can view the unikernel's log output from the GUI, or with <code>sudo xl console mirage-firewall</code> in dom0 if you want to see live updates.</p>
<p>If you want to explore <a href="https://github.com/talex5/qubes-mirage-firewall">the code</a> but don't know OCaml, a good tip is that most modules (<code>.ml</code> files) have a corresponding <code>.mli</code> interface file which describes the module's public API (a bit like a <code>.h</code> file in C).
It's usually worth reading those interface files first.</p>
<p>I tested initially with Qubes 3.0 and have just upgraded to the 3.1 alpha. Both seem to work.</p>
<h3 id="booting-a-unikernel-on-qubes">Booting a Unikernel on Qubes</h3>
<p>Qubes runs on Xen and a Mirage application can be compiled to a Xen kernel image using <code>mirage configure --xen</code>.
However, Qubes expects a VM to provide three Qubes-specific services and doesn't consider the VM to be running until it has connected to each of them. They are <a href="https://www.qubes-os.org/doc/qrexec3/">qrexec</a> (remote command execution), <a href="https://www.qubes-os.org/doc/gui/">gui</a> (displaying windows on the dom0 desktop) and QubesDB (a key-value store).</p>
<p>I wrote a little library, <a href="https://github.com/talex5/mirage-qubes">mirage-qubes</a>, to implement enough of these three protocols for the firewall (the GUI agent does nothing except handshake with dom0, since the firewall has no GUI).</p>
<p>Here's the full boot code in my firewall, showing how to connect the agents:</p>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">start</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">start_time</span> <span class="o">=</span> <span class="nn">Clock</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Log_reporter</span><span class="p">.</span><span class="n">init_logging</span> <span class="bp">()</span><span class="o">;</span>
</span><span class="line">  <span class="c">(* Start qrexec agent, GUI agent and QubesDB agent in parallel *)</span>
</span><span class="line">  <span class="k">let</span> <span class="n">qrexec</span> <span class="o">=</span> <span class="nn">RExec</span><span class="p">.</span><span class="n">connect</span> <span class="o">~</span><span class="n">domid</span><span class="o">:</span><span class="mi">0</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">gui</span> <span class="o">=</span> <span class="nn">GUI</span><span class="p">.</span><span class="n">connect</span> <span class="o">~</span><span class="n">domid</span><span class="o">:</span><span class="mi">0</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">qubesDB</span> <span class="o">=</span> <span class="nn">DB</span><span class="p">.</span><span class="n">connect</span> <span class="o">~</span><span class="n">domid</span><span class="o">:</span><span class="mi">0</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="c">(* Wait for clients to connect *)</span>
</span><span class="line">  <span class="n">qrexec</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">qrexec</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">agent_listener</span> <span class="o">=</span> <span class="nn">RExec</span><span class="p">.</span><span class="n">listen</span> <span class="n">qrexec</span> <span class="nn">Command</span><span class="p">.</span><span class="n">handler</span> <span class="k">in</span>
</span><span class="line">  <span class="n">gui</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">gui</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Lwt</span><span class="p">.</span><span class="n">async</span> <span class="o">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="nn">GUI</span><span class="p">.</span><span class="n">listen</span> <span class="n">gui</span><span class="o">);</span>
</span><span class="line">  <span class="n">qubesDB</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">qubesDB</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;agents connected in %.3f s (CPU time used since boot: %.3f s)&quot;</span>
</span><span class="line">    <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="o">(</span><span class="nn">Clock</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span> <span class="o">-.</span> <span class="n">start_time</span><span class="o">)</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span><span class="o">));</span>
</span><span class="line">  <span class="c">(* Watch for shutdown requests from Qubes *)</span>
</span><span class="line">  <span class="k">let</span> <span class="n">shutdown_rq</span> <span class="o">=</span> <span class="nn">OS</span><span class="p">.</span><span class="nn">Lifecycle</span><span class="p">.</span><span class="n">await_shutdown</span> <span class="bp">()</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="o">(`</span><span class="nc">Poweroff</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Reboot</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="c">(* Set up networking *)</span>
</span><span class="line">  <span class="k">let</span> <span class="n">net_listener</span> <span class="o">=</span> <span class="n">network</span> <span class="n">qubesDB</span> <span class="k">in</span>
</span><span class="line">  <span class="c">(* Run until something fails or we get a shutdown request. *)</span>
</span><span class="line">  <span class="nn">Lwt</span><span class="p">.</span><span class="n">choose</span> <span class="o">[</span><span class="n">agent_listener</span><span class="o">;</span> <span class="n">net_listener</span><span class="o">;</span> <span class="n">shutdown_rq</span><span class="o">]</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="c">(* Give the console daemon time to show any final log messages. *)</span>
</span><span class="line">  <span class="nn">OS</span><span class="p">.</span><span class="nn">Time</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">1</span><span class="o">.</span><span class="mi">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>After connecting the agents, we start a thread watching for shutdown requests (which arrive via XenStore, a second database) and then configure networking.</p>
<p><strong>Tips on reading OCaml</strong></p>
<ul>
<li><code>let x = ...</code> defines a variable.
</li>
<li><code>let fn args = ...</code> defines a function.
</li>
<li><code>Clock.time</code> is the <code>time</code> function in the <code>Clock</code> module.
</li>
<li><code>()</code> is the empty tuple (called &quot;unit&quot;). It's used for functions that don't take arguments, or return nothing useful.
</li>
<li><code>~foo</code> is a named argument. <code>connect ~domid:0</code> is like <code>connect(domid = 0)</code> in Python.
</li>
<li><code>promise &gt;&gt;= f</code> calls function <code>f</code> when the promise resolves. It's like <code>promise.then(f)</code> in JavaScript.
</li>
<li><code>foo () &gt;&gt;= fun result -&gt;</code> is the asynchronous version of <code>let result = foo () in</code>.
</li>
<li><code>return x</code> creates an already-resolved promise (it does <em>not</em> make the function return).
</li>
</ul>
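<p>The tips above can be tried out without any libraries. The sketch below uses a toy promise type that is always already resolved, as a stand-in for Lwt (real Lwt promises may still be pending, so this only illustrates the syntax). The <code>promise</code> type and this version of <code>connect</code> are invented for the example.</p>
<pre><code class="ocaml">(* A toy promise that is always already resolved (a stand-in for Lwt). *)
type 'a promise = Resolved of 'a

(* [return x] creates an already-resolved promise. *)
let return x = Resolved x

(* [promise &gt;&gt;= f] passes the promise's result to [f]. *)
let ( &gt;&gt;= ) (Resolved x) f = f x

(* A function taking a named argument and then unit. *)
let connect ~domid () =
  return (Printf.sprintf &quot;connected to dom%d&quot; domid)

let start () =
  (* Start connecting, but don't wait for the result yet. *)
  let conn = connect ~domid:0 () in
  (* Now wait for the result and bind it to [msg]. *)
  conn &gt;&gt;= fun msg -&gt;
  return (String.length msg)
</code></pre>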
<h3 id="networking">Networking</h3>
<p>The general setup is simple enough: we read various configuration settings (IP addresses, netmasks, etc) from QubesDB,
set up our two networks (the client-side one and the one with NetVM), and configure a router to send packets between them:</p>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">  <span class="c">(* Set up networking and listen for incoming packets. *)</span>
</span><span class="line">  <span class="k">let</span> <span class="n">network</span> <span class="n">qubesDB</span> <span class="o">=</span>
</span><span class="line">    <span class="c">(* Read configuration from QubesDB *)</span>
</span><span class="line">    <span class="k">let</span> <span class="n">config</span> <span class="o">=</span> <span class="nn">Dao</span><span class="p">.</span><span class="n">read_network_config</span> <span class="n">qubesDB</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">Logs</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;Client (internal) network is %a&quot;</span>
</span><span class="line">      <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="nn">Prefix</span><span class="p">.</span><span class="n">pp_hum</span> <span class="n">config</span><span class="o">.</span><span class="nn">Dao</span><span class="p">.</span><span class="n">clients_prefix</span><span class="o">);</span>
</span><span class="line">    <span class="c">(* Initialise connection to NetVM *)</span>
</span><span class="line">    <span class="nn">Uplink</span><span class="p">.</span><span class="n">connect</span> <span class="n">config</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">uplink</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="c">(* Report success *)</span>
</span><span class="line">    <span class="nn">Dao</span><span class="p">.</span><span class="n">set_iptables_error</span> <span class="n">qubesDB</span> <span class="s2">&quot;&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="c">(* Set up client-side networking *)</span>
</span><span class="line">    <span class="k">let</span> <span class="n">client_eth</span> <span class="o">=</span> <span class="nn">Client_eth</span><span class="p">.</span><span class="n">create</span>
</span><span class="line">      <span class="o">~</span><span class="n">client_gw</span><span class="o">:</span><span class="n">config</span><span class="o">.</span><span class="nn">Dao</span><span class="p">.</span><span class="n">clients_our_ip</span>
</span><span class="line">      <span class="o">~</span><span class="n">prefix</span><span class="o">:</span><span class="n">config</span><span class="o">.</span><span class="nn">Dao</span><span class="p">.</span><span class="n">clients_prefix</span> <span class="k">in</span>
</span><span class="line">    <span class="c">(* Set up routing between networks and hosts *)</span>
</span><span class="line">    <span class="k">let</span> <span class="n">router</span> <span class="o">=</span> <span class="nn">Router</span><span class="p">.</span><span class="n">create</span>
</span><span class="line">      <span class="o">~</span><span class="n">client_eth</span>
</span><span class="line">      <span class="o">~</span><span class="n">uplink</span><span class="o">:(</span><span class="nn">Uplink</span><span class="p">.</span><span class="n">interface</span> <span class="n">uplink</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">    <span class="c">(* Handle packets from both networks *)</span>
</span><span class="line">    <span class="nn">Lwt</span><span class="p">.</span><span class="n">join</span> <span class="o">[</span>
</span><span class="line">      <span class="nn">Client_net</span><span class="p">.</span><span class="n">listen</span> <span class="n">router</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Uplink</span><span class="p">.</span><span class="n">listen</span> <span class="n">uplink</span> <span class="n">router</span>
</span><span class="line">    <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><strong>OCaml notes</strong></p>
<ul>
<li><code>config.Dao.clients_our_ip</code> means the <code>clients_our_ip</code> field of the <code>config</code> record, as defined in the <code>Dao</code> module.
</li>
<li><code>~client_eth</code> is short for <code>~client_eth:client_eth</code> - i.e. pass the value of the <code>client_eth</code> variable as a parameter also named <code>client_eth</code>.
</li>
</ul>
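<p>Here is a small self-contained sketch of those two notes. The <code>Dao</code> module here is a cut-down imitation that uses plain strings for the addresses (an assumption made to avoid depending on any libraries), and <code>describe</code> is a made-up function.</p>
<pre><code class="ocaml">module Dao = struct
  (* A record type defined inside a module. *)
  type network_config = {
    clients_our_ip : string;
    clients_prefix : string;
  }
end

(* A function taking two named arguments. *)
let describe ~client_gw ~prefix = client_gw ^ &quot; on &quot; ^ prefix

let config = {
  Dao.clients_our_ip = &quot;10.137.2.1&quot;;
  clients_prefix = &quot;10.137.2.0/24&quot;;
}

(* Field access, qualified with the module that defines the record. *)
let client_gw = config.Dao.clients_our_ip
let prefix = config.Dao.clients_prefix

(* [~client_gw ~prefix] is short for [~client_gw:client_gw ~prefix:prefix]. *)
let summary = describe ~client_gw ~prefix
</code></pre>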
<h3 id="the-xen-virtual-network-layer">The Xen virtual network layer</h3>
<p>At the lowest level, networking requires the ability to send a blob of data from one VM to another.
This is the job of the Xen netback/netfront protocol.</p>
<p>For example, consider the case of a new AppVM (Xen domain ID 5) being connected to FirewallVM (4).
First, dom0 updates its XenStore database (which is shared with the VMs). It creates two directories:</p>
<ul>
<li><code>/local/domain/4/backend/vif/5/0/</code>
</li>
<li><code>/local/domain/5/device/vif/0/</code>
</li>
</ul>
<p>Each directory contains a <code>state</code> file (set to <code>1</code>, which means <code>initialising</code>) and information about the other end.</p>
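<p>The layout of these two paths can be written as a pair of functions. This is only a sketch of the naming scheme shown above; <code>backend_dir</code> and <code>frontend_dir</code> are hypothetical names, not functions from any Xen or Mirage library.</p>
<pre><code class="ocaml">(* The directory watched by the backend (the firewall). *)
let backend_dir ~backend ~frontend ~devid =
  Printf.sprintf &quot;/local/domain/%d/backend/vif/%d/%d/&quot; backend frontend devid

(* The directory seen by the frontend (the new client VM). *)
let frontend_dir ~frontend ~devid =
  Printf.sprintf &quot;/local/domain/%d/device/vif/%d/&quot; frontend devid
</code></pre>
<p>For the example above, <code>backend_dir ~backend:4 ~frontend:5 ~devid:0</code> and <code>frontend_dir ~frontend:5 ~devid:0</code> produce the two directories listed.</p>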
<p>The first directory is monitored by the firewall (domain 4).
When it sees the new entry, it knows it has a new network connection to domain 5, interface 0.
It writes to the directory information about what features it supports and sets the state to 2 (<code>init-wait</code>).</p>
<p>The second directory will be seen by the new domain 5 when it boots.
It tells it that it has a network connection to dom 4.
The client looks in dom 4's backend directory and waits for the state to change to <code>init-wait</code>, then checks the supported features.
It allocates memory to share with the firewall, tells Xen to grant access to dom 4, and writes the ID for the grant to the XenStore directory.
It sets its own state to 4 (<code>connected</code>).</p>
<p>When the firewall sees the client is connected, it reads the grant refs, tells Xen to map those pages of memory into its own address space, and sets its own state to <code>connected</code> too.
The two VMs can now use the shared memory to exchange messages (blocks of data up to 64 KB).</p>
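<p>The order of the state changes can be modelled as a tiny state machine. This is a simplified sketch, not the mirage-net-xen implementation: it only covers the three states mentioned above, leaves out the feature negotiation and grant references, and the <code>link</code>, <code>step</code> and <code>run</code> names are invented.</p>
<pre><code class="ocaml">type state = Initialising | Init_wait | Connected

(* The numbers written to the [state] file in XenStore. *)
let code = function
  | Initialising -&gt; 1
  | Init_wait -&gt; 2
  | Connected -&gt; 4

(* The states of the two ends of one virtual network link. *)
type link = { backend : state; frontend : state }

(* Perform the next step of the handshake, or return [None] if it is done. *)
let step = function
  | { backend = Initialising; _ } as l -&gt;
    (* The firewall sees the new entry and advertises its features. *)
    Some { l with backend = Init_wait }
  | { backend = Init_wait; frontend = Initialising } as l -&gt;
    (* The client shares its memory and declares itself connected. *)
    Some { l with frontend = Connected }
  | { backend = Init_wait; frontend = Connected } as l -&gt;
    (* The firewall maps the shared pages and connects too. *)
    Some { l with backend = Connected }
  | _ -&gt; None

(* Run the handshake to completion, returning every state visited. *)
let rec run l trace =
  match step l with
  | None -&gt; List.rev (l :: trace)
  | Some l' -&gt; run l' (l :: trace)
</code></pre>
<p>Starting from <code>{ backend = Initialising; frontend = Initialising }</code>, the (backend, frontend) codes go through (1, 1), (2, 1), (2, 4) and finally (4, 4).</p>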
<p>The reason I had to find out about all this is that the <a href="https://github.com/mirage/mirage-net-xen">mirage-net-xen</a> library only implemented the netfront side of the protocol.
Luckily, Dave Scott had already started adding support for netback and I was able to complete that work.</p>
<p>Getting this working with a Mirage client was fairly easy, but I spent a long time trying to figure out why my code was making Linux VMs kernel panic.
It turned out to be an <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2015-12/msg00079.html">amusing bug</a> in my netback serialisation code, which only worked with Mirage by pure luck.</p>
<p>However, this did alert me to a second bug in the Linux netfront driver: even if the ID netback sends is within the array bounds, that entry isn't necessarily valid.
Sending an unused ID would cause netfront to try to unmap someone else's grant-ref.
Not exploitable, perhaps, but another good reason to replace this code!</p>
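<p>As an illustration of the two checks a frontend needs before trusting an ID quoted by the other side, here is a minimal sketch. It is not the Linux or Mirage code: the <code>take</code> function and the representation of pending requests as an <code>option array</code> are assumptions made for the example. OCaml's array accesses are bounds-checked by default anyway, but checking explicitly lets us report the problem as an ordinary error value.</p>
<pre><code class="ocaml">(* [pending] has one slot per possible request ID.
   [None] means that ID is not currently in use. *)
let take pending id =
  if id &lt; 0 || id &gt;= Array.length pending then
    (* First check: the ID must be inside the array. *)
    Error `Out_of_range
  else
    match pending.(id) with
    | None -&gt;
      (* Second check: the slot must hold an outstanding request. *)
      Error `Unused_id
    | Some req -&gt;
      (* Mark the slot as free so the same ID can't be answered twice. *)
      pending.(id) &lt;- None;
      Ok req
</code></pre>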
<h3 id="the-ethernet-layer">The Ethernet layer</h3>
<p>It might seem like we're nearly done: we want to send IP (Internet Protocol) packets between VMs, and we have a way to send blocks of data.
However, we must now take a little detour down Legacy Lane...</p>
<p>Operating systems don't expect to send IP packets directly.
Instead, they expect to be connected to an Ethernet network, which requires each IP packet to be wrapped in an Ethernet &quot;frame&quot;.
Our virtual network needs to emulate an Ethernet network.</p>
<p>In an Ethernet network, each network interface device has a unique &quot;MAC address&quot; (e.g. <code>01:23:45:67:89:ab</code>).
An <a href="https://en.wikipedia.org/wiki/Ethernet_frame">Ethernet frame</a> contains source and destination MAC addresses, plus a type (e.g. &quot;IPv4 packet&quot;).</p>
<p>When a client VM wants to send an IP packet, it first broadcasts an Ethernet <a href="https://en.wikipedia.org/wiki/Address_Resolution_Protocol">ARP</a> request, asking for the MAC address of the target machine.
The target machine responds with its MAC address.
The client then transmits an Ethernet frame addressed to this MAC address, containing the IP packet inside.</p>
<p>If we were building our system out of physical machines, we'd connect everything via an Ethernet switch, like this:</p>
<p><img src="/blog/images/qubes/ethernet-bad.png" class="center"/></p>
<p>This layout isn't very good for us, though, because it means the VMs can talk to each other directly.
Normally you might trust all the machines behind the firewall, but the point of Qubes is to isolate the VMs from each other.</p>
<p>Instead, we want a separate Ethernet network for each client VM:</p>
<p><img src="/blog/images/qubes/ethernet.png" class="center"/></p>
<p>In this layout, the Ethernet addressing is completely pointless - a frame simply goes to the machine at the other end of the link.
But we still have to add an Ethernet frame whenever we send a packet and remove it when we receive one.</p>
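<p>Adding and removing the frame is mostly a matter of handling a 14-byte header: the destination MAC address, the source MAC address, and a two-byte type. The sketch below shows the idea using plain strings. It is not the code used by the firewall, <code>add_header</code> and <code>remove_header</code> are hypothetical names, and MAC addresses are assumed to be passed as 6 raw bytes.</p>
<pre><code class="ocaml">(* The Ethernet type value for an IPv4 packet. *)
let ethertype_ipv4 = 0x0800

(* Prepend an Ethernet header to [payload].
   [dst] and [src] are MAC addresses, as 6 raw bytes each. *)
let add_header ~dst ~src ~ethertype payload =
  assert (String.length dst = 6 &amp;&amp; String.length src = 6);
  let len = String.length payload in
  let b = Bytes.create (14 + len) in
  Bytes.blit_string dst 0 b 0 6;
  Bytes.blit_string src 0 b 6 6;
  (* The type is stored big-endian (most significant byte first). *)
  Bytes.set b 12 (Char.chr ((ethertype lsr 8) land 0xff));
  Bytes.set b 13 (Char.chr (ethertype land 0xff));
  Bytes.blit_string payload 0 b 14 len;
  Bytes.to_string b

(* Remove the header, returning the type and the payload,
   or [None] if the frame is too short to contain a header. *)
let remove_header frame =
  if String.length frame &lt; 14 then None
  else
    let ethertype = (Char.code frame.[12] lsl 8) lor Char.code frame.[13] in
    Some (ethertype, String.sub frame 14 (String.length frame - 14))
</code></pre>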
<p>And we still have to implement the ARP protocol for looking up MAC addresses.
That's the job of the <a href="https://github.com/talex5/qubes-mirage-firewall/blob/master/client_eth.ml">Client_eth</a> module (dom0 puts the addresses in XenStore for us).</p>
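<p>Conceptually, answering ARP queries only needs a fixed table from IP address to MAC address. The following is a rough sketch of that idea and not the real <code>Client_eth</code> module: it uses strings for the addresses, and the <code>create</code> and <code>answer_query</code> functions are invented for the example. Because the map is immutable, nothing a client sends can change it.</p>
<pre><code class="ocaml">module IpMap = Map.Make (String)

(* A fixed table mapping IP addresses to MAC addresses. *)
type t = { macs : string IpMap.t }

let create entries =
  let add m (ip, mac) = IpMap.add ip mac m in
  { macs = List.fold_left add IpMap.empty entries }

(* Find the MAC address for [ip], if we know it. *)
let lookup t ip = IpMap.find_opt ip t.macs

(* Answer an ARP &quot;who has [ip]?&quot; query for a known address. *)
let answer_query t ip =
  Option.map (fun mac -&gt; Printf.sprintf &quot;%s is at %s&quot; ip mac) (lookup t ip)
</code></pre>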
<p>As well as sending queries, a VM can also broadcast a &quot;gratuitous ARP&quot; to tell other VMs its address without being asked.
Receivers of a gratuitous ARP may then update their ARP cache, although FirewallVM is configured not to do this (see <code>/proc/sys/net/ipv4/conf/all/arp_accept</code>).
For mirage-firewall, I just log what the client requested but don't let it update anything:</p>
<figure class="code"><figcaption><span>client_eth.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">input_gratuitous</span> <span class="n">t</span> <span class="n">frame</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="k">open</span> <span class="nc">Arpv4_wire</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">spa</span> <span class="o">=</span> <span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">of_int32</span> <span class="o">(</span><span class="n">get_arp_spa</span> <span class="n">frame</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">sha</span> <span class="o">=</span> <span class="nn">Macaddr</span><span class="p">.</span><span class="n">of_bytes_exn</span> <span class="o">(</span><span class="n">copy_arp_sha</span> <span class="n">frame</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">match</span> <span class="n">lookup</span> <span class="n">t</span> <span class="n">spa</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">real_mac</span> <span class="k">when</span> <span class="nn">Macaddr</span><span class="p">.</span><span class="n">compare</span> <span class="n">sha</span> <span class="n">real_mac</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;client suggests updating %s -&gt; %s (as expected)&quot;</span>
</span><span class="line">	<span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="o">(</span><span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">to_string</span> <span class="n">spa</span><span class="o">)</span> <span class="o">(</span><span class="nn">Macaddr</span><span class="p">.</span><span class="n">to_string</span> <span class="n">sha</span><span class="o">));</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">other_mac</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="nn">Log</span><span class="p">.</span><span class="n">warn</span> <span class="s2">&quot;client suggests incorrect update %s -&gt; %s (should be %s)&quot;</span>
</span><span class="line">	<span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="o">(</span><span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">to_string</span> <span class="n">spa</span><span class="o">)</span> <span class="o">(</span><span class="nn">Macaddr</span><span class="p">.</span><span class="n">to_string</span> <span class="n">sha</span><span class="o">)</span> <span class="o">(</span><span class="nn">Macaddr</span><span class="p">.</span><span class="n">to_string</span> <span class="n">other_mac</span><span class="o">));</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="nn">Log</span><span class="p">.</span><span class="n">warn</span> <span class="s2">&quot;client suggests incorrect update %s -&gt; %s (unexpected IP)&quot;</span>
</span><span class="line">	<span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="o">(</span><span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">to_string</span> <span class="n">spa</span><span class="o">)</span> <span class="o">(</span><span class="nn">Macaddr</span><span class="p">.</span><span class="n">to_string</span> <span class="n">sha</span><span class="o">))</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I'm not sure whether or not Qubes expects one client VM to be able to look up another one's MAC address.
It sets <code>/qubes-netmask</code> in QubesDB to <code>255.255.255.0</code>, indicating that all clients are on the same Ethernet network.
Therefore, I wrote my ARP responder to respond on behalf of the other clients to maintain this illusion.
However, it appears that my Linux VMs have ignored the QubesDB setting and used a netmask of <code>255.255.255.255</code>. Puzzling, but it should work either way.</p>
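<p>To make the difference between the two netmasks concrete, here is a small stand-alone sketch in plain OCaml. It is not part of the firewall: <code>ip</code>, <code>same_network</code> and the sample addresses are made up for illustration, and it assumes a 64-bit platform so that an IPv4 address fits in an <code>int</code>. As far as I understand it, a client only sends ARP queries for addresses it considers on-link, which would explain why both settings work.</p>

```ocaml
(* Illustrative only: pack a dotted quad into an int (assumes 64-bit ints). *)
let ip a b c d = (a lsl 24) lor (b lsl 16) lor (c lsl 8) lor d

(* Two addresses are on the same network if they agree on every bit
   selected by the netmask. *)
let same_network ~netmask x y = (x land netmask) = (y land netmask)

let () =
  let client_a = ip 10 137 2 11 in
  let client_b = ip 10 137 2 12 in
  let mask24 = ip 255 255 255 0 in
  let mask32 = ip 255 255 255 255 in
  (* With 255.255.255.0 the other client looks directly reachable... *)
  Printf.printf "/24: on-link = %b\n" (same_network ~netmask:mask24 client_a client_b);
  (* ...with 255.255.255.255 it does not, so traffic goes via the gateway. *)
  Printf.printf "/32: on-link = %b\n" (same_network ~netmask:mask32 client_a client_b)
```

<p>This prints <code>true</code> for the first case and <code>false</code> for the second.</p>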
<p>Here's the code that connects a new client virtual interface (vif) to our router (in <a href="https://github.com/talex5/qubes-mirage-firewall/blob/master/client_net.ml">Client_net</a>):</p>
<figure class="code"><figcaption><span>client_net.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(** Connect to a new client&#39;s interface and listen for incoming frames. *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">add_vif</span> <span class="o">{</span> <span class="nn">Dao</span><span class="p">.</span><span class="n">domid</span><span class="o">;</span> <span class="n">device_id</span><span class="o">;</span> <span class="n">client_ip</span> <span class="o">}</span> <span class="o">~</span><span class="n">router</span> <span class="o">~</span><span class="n">cleanup_tasks</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Netback</span><span class="p">.</span><span class="n">make</span> <span class="o">~</span><span class="n">domid</span> <span class="o">~</span><span class="n">device_id</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">backend</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;Client %d (IP: %s) ready&quot;</span> <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">f</span> <span class="n">domid</span> <span class="o">(</span><span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">to_string</span> <span class="n">client_ip</span><span class="o">));</span>
</span><span class="line">  <span class="nn">ClientEth</span><span class="p">.</span><span class="n">connect</span> <span class="n">backend</span> <span class="o">&gt;&gt;=</span> <span class="n">or_fail</span> <span class="s2">&quot;Can&#39;t make Ethernet device&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">eth</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">client_mac</span> <span class="o">=</span> <span class="nn">Netback</span><span class="p">.</span><span class="n">mac</span> <span class="n">backend</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">iface</span> <span class="o">=</span> <span class="k">new</span> <span class="n">client_iface</span> <span class="n">eth</span> <span class="n">client_ip</span> <span class="n">client_mac</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Router</span><span class="p">.</span><span class="n">add_client</span> <span class="n">router</span> <span class="n">iface</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Cleanup</span><span class="p">.</span><span class="n">on_cleanup</span> <span class="n">cleanup_tasks</span> <span class="o">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="nn">Router</span><span class="p">.</span><span class="n">remove_client</span> <span class="n">router</span> <span class="n">iface</span><span class="o">);</span>
</span><span class="line">  <span class="k">let</span> <span class="n">fixed_arp</span> <span class="o">=</span> <span class="nn">Client_eth</span><span class="p">.</span><span class="nn">ARP</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">net</span><span class="o">:</span><span class="n">router</span><span class="o">.</span><span class="nn">Router</span><span class="p">.</span><span class="n">client_eth</span> <span class="n">iface</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Netback</span><span class="p">.</span><span class="n">listen</span> <span class="n">backend</span> <span class="o">(</span><span class="k">fun</span> <span class="n">frame</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">match</span> <span class="nn">Wire_structs</span><span class="p">.</span><span class="n">parse_ethernet_frame</span> <span class="n">frame</span> <span class="k">with</span>
</span><span class="line">    <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="nn">Log</span><span class="p">.</span><span class="n">warn</span> <span class="s2">&quot;Invalid Ethernet frame&quot;</span> <span class="nn">Logs</span><span class="p">.</span><span class="n">unit</span><span class="o">;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">|</span> <span class="nc">Some</span> <span class="o">(</span><span class="n">typ</span><span class="o">,</span> <span class="o">_</span><span class="n">destination</span><span class="o">,</span> <span class="n">payload</span><span class="o">)</span> <span class="o">-&gt;</span>
</span><span class="line">        <span class="k">match</span> <span class="n">typ</span> <span class="k">with</span>
</span><span class="line">        <span class="o">|</span> <span class="nc">Some</span> <span class="nn">Wire_structs</span><span class="p">.</span><span class="nc">ARP</span> <span class="o">-&gt;</span> <span class="n">input_arp</span> <span class="o">~</span><span class="n">fixed_arp</span> <span class="o">~</span><span class="n">eth</span> <span class="n">payload</span>
</span><span class="line">        <span class="o">|</span> <span class="nc">Some</span> <span class="nn">Wire_structs</span><span class="p">.</span><span class="nc">IPv4</span> <span class="o">-&gt;</span> <span class="n">input_ipv4</span> <span class="o">~</span><span class="n">client_ip</span> <span class="o">~</span><span class="n">router</span> <span class="n">frame</span> <span class="n">payload</span>
</span><span class="line">        <span class="o">|</span> <span class="nc">Some</span> <span class="nn">Wire_structs</span><span class="p">.</span><span class="nc">IPv6</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">        <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="nn">Log</span><span class="p">.</span><span class="n">warn</span> <span class="s2">&quot;Unknown Ethernet type&quot;</span> <span class="nn">Logs</span><span class="p">.</span><span class="n">unit</span><span class="o">;</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">return_unit</span>
</span><span class="line">  <span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><strong>OCaml note:</strong> <code>{ x = 1; y = 2 }</code> is a record (struct). <code>{ x = x; y = y }</code> can be abbreviated to just <code>{ x; y }</code>. Here we pattern-match on a <code>Dao.client_vif</code> record passed to the function to extract the fields.</p>
<p>The <code>Netback.listen</code> at the end runs a loop that communicates with the netfront driver in the client.
Each time a frame arrives, we check the type and dispatch to either the ARP handler or, for IPv4 packets,
the firewall code.
We don't support IPv6, since Qubes doesn't either.</p>
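<p>If you want to see what "check the type and dispatch" means at the byte level, here is a minimal sketch using only the standard library (OCaml 4.08 or later for <code>Bytes.get_uint16_be</code>). The <code>action</code> type and the <code>dispatch</code> function are hypothetical names for this example; the real code uses <code>Wire_structs.parse_ethernet_frame</code> as shown above.</p>

```ocaml
(* What to do with a frame received from a client. *)
type action = To_arp | To_firewall | Ignore | Unknown of int | Too_short

(* An Ethernet header is 14 bytes: destination MAC (6), source MAC (6),
   then a big-endian 16-bit EtherType. *)
let dispatch (frame : Bytes.t) : action =
  if Bytes.length frame < 14 then Too_short
  else
    match Bytes.get_uint16_be frame 12 with
    | 0x0806 -> To_arp        (* ARP *)
    | 0x0800 -> To_firewall   (* IPv4 *)
    | 0x86dd -> Ignore        (* IPv6: silently ignored *)
    | other -> Unknown other

let () =
  let frame = Bytes.make 14 '\000' in
  Bytes.set_uint16_be frame 12 0x0806;
  assert (dispatch frame = To_arp);
  assert (dispatch (Bytes.create 3) = Too_short)
```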
<figure class="code"><figcaption><span>client_net.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">input_arp</span> <span class="o">~</span><span class="n">fixed_arp</span> <span class="o">~</span><span class="n">eth</span> <span class="n">request</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="nn">Client_eth</span><span class="p">.</span><span class="nn">ARP</span><span class="p">.</span><span class="n">input</span> <span class="n">fixed_arp</span> <span class="n">request</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">response</span> <span class="o">-&gt;</span> <span class="nn">ClientEth</span><span class="p">.</span><span class="n">write</span> <span class="n">eth</span> <span class="n">response</span>
</span><span class="line">
</span><span class="line"><span class="c">(** Handle an IPv4 packet from the client. *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">input_ipv4</span> <span class="o">~</span><span class="n">client_ip</span> <span class="o">~</span><span class="n">router</span> <span class="n">frame</span> <span class="n">packet</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src</span> <span class="o">=</span> <span class="nn">Wire_structs</span><span class="p">.</span><span class="nn">Ipv4_wire</span><span class="p">.</span><span class="n">get_ipv4_src</span> <span class="n">packet</span> <span class="o">|&gt;</span> <span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">of_int32</span> <span class="k">in</span>
</span><span class="line">  <span class="k">if</span> <span class="n">src</span> <span class="o">=</span> <span class="n">client_ip</span> <span class="k">then</span> <span class="nn">Firewall</span><span class="p">.</span><span class="n">ipv4_from_client</span> <span class="n">router</span> <span class="n">frame</span>
</span><span class="line">  <span class="k">else</span> <span class="o">(</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">warn</span> <span class="s2">&quot;Incorrect source IP %a in IP packet from %a (dropping)&quot;</span>
</span><span class="line">      <span class="o">(</span><span class="k">fun</span> <span class="n">f</span> <span class="o">-&gt;</span> <span class="n">f</span> <span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">pp_hum</span> <span class="n">src</span> <span class="nn">Ipaddr</span><span class="p">.</span><span class="nn">V4</span><span class="p">.</span><span class="n">pp_hum</span> <span class="n">client_ip</span><span class="o">);</span>
</span><span class="line">    <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">  <span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><strong>OCaml note:</strong> <code>|&gt;</code> is the &quot;pipe&quot; operator. <code>x |&gt; fn</code> is the same as <code>fn x</code>, but sometimes it reads better to have the values flowing left-to-right. You can also think of it as the synchronous version of <code>&gt;&gt;=</code>.</p>
<p>Notice that we check the source IP address is the one we expect.
This means that our firewall rules can rely on client addresses.</p>
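<p>The same check can be sketched without any MirageOS libraries. This is only an illustration: <code>ipv4_src</code>, <code>to_dotted</code> and <code>source_ok</code> are names I made up for the example, addresses are plain <code>int32</code> values, and it needs OCaml 4.08 or later for <code>Bytes.get_int32_be</code>. It also shows <code>|&gt;</code> chaining two functions.</p>

```ocaml
(* In an IPv4 header the source address occupies bytes 12-15, big-endian. *)
let ipv4_src (packet : Bytes.t) : int32 = Bytes.get_int32_be packet 12

(* Format an address as a dotted quad, for log messages. *)
let to_dotted (a : int32) : string =
  let byte n = Int32.(to_int (logand (shift_right_logical a n) 0xffl)) in
  Printf.sprintf "%d.%d.%d.%d" (byte 24) (byte 16) (byte 8) (byte 0)

(* Accept the packet only if it can hold a minimal 20-byte header and its
   source address is the one assigned to this client. *)
let source_ok ~(client_ip : int32) (packet : Bytes.t) : bool =
  Bytes.length packet >= 20 && Int32.equal (ipv4_src packet) client_ip

let () =
  let client_ip = 0x0a89020bl in                (* 10.137.2.11 *)
  let packet = Bytes.make 20 '\000' in
  Bytes.set_int32_be packet 12 0x0a89020cl;     (* spoofed: 10.137.2.12 *)
  if not (source_ok ~client_ip packet) then
    Printf.printf "Incorrect source IP %s in IP packet from %s (dropping)\n"
      (packet |> ipv4_src |> to_dotted) (to_dotted client_ip)
```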
<p>There is similar code in <a href="https://github.com/talex5/qubes-mirage-firewall/blob/master/uplink.ml">Uplink</a>, which handles the NetVM side of things:</p>
<figure class="code"><figcaption><span>uplink.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">connect</span> <span class="n">config</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">ip</span> <span class="o">=</span> <span class="n">config</span><span class="o">.</span><span class="nn">Dao</span><span class="p">.</span><span class="n">uplink_our_ip</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Netif</span><span class="p">.</span><span class="n">connect</span> <span class="s2">&quot;tap0&quot;</span> <span class="o">&gt;&gt;=</span> <span class="n">or_fail</span> <span class="s2">&quot;Can&#39;t connect uplink device&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">net</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Eth</span><span class="p">.</span><span class="n">connect</span> <span class="n">net</span> <span class="o">&gt;&gt;=</span> <span class="n">or_fail</span> <span class="s2">&quot;Can&#39;t make Ethernet device for tap&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">eth</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Arp</span><span class="p">.</span><span class="n">connect</span> <span class="n">eth</span> <span class="o">&gt;&gt;=</span> <span class="n">or_fail</span> <span class="s2">&quot;Can&#39;t add ARP&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">arp</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Arp</span><span class="p">.</span><span class="n">add_ip</span> <span class="n">arp</span> <span class="n">ip</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">netvm_mac</span> <span class="o">=</span> <span class="nn">Arp</span><span class="p">.</span><span class="n">query</span> <span class="n">arp</span> <span class="n">config</span><span class="o">.</span><span class="nn">Dao</span><span class="p">.</span><span class="n">uplink_netvm_ip</span> <span class="o">&gt;|=</span> <span class="k">function</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Timeout</span> <span class="o">-&gt;</span> <span class="n">failwith</span> <span class="s2">&quot;ARP timeout getting MAC of our NetVM&quot;</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Ok</span> <span class="n">netvm_mac</span> <span class="o">-&gt;</span> <span class="n">netvm_mac</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">my_ip</span> <span class="o">=</span> <span class="nn">Ipaddr</span><span class="p">.</span><span class="nc">V4</span> <span class="n">ip</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">interface</span> <span class="o">=</span> <span class="k">new</span> <span class="n">netvm_iface</span> <span class="n">eth</span> <span class="n">netvm_mac</span> <span class="n">config</span><span class="o">.</span><span class="nn">Dao</span><span class="p">.</span><span class="n">uplink_netvm_ip</span> <span class="k">in</span>
</span><span class="line">  <span class="n">return</span> <span class="o">{</span> <span class="n">net</span><span class="o">;</span> <span class="n">eth</span><span class="o">;</span> <span class="n">arp</span><span class="o">;</span> <span class="n">interface</span><span class="o">;</span> <span class="n">my_ip</span> <span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">listen</span> <span class="n">t</span> <span class="n">router</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Netif</span><span class="p">.</span><span class="n">listen</span> <span class="n">t</span><span class="o">.</span><span class="n">net</span> <span class="o">(</span><span class="k">fun</span> <span class="n">frame</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="c">(* Handle one Ethernet frame from NetVM *)</span>
</span><span class="line">    <span class="nn">Eth</span><span class="p">.</span><span class="n">input</span> <span class="n">t</span><span class="o">.</span><span class="n">eth</span>
</span><span class="line">      <span class="o">~</span><span class="n">arpv4</span><span class="o">:(</span><span class="nn">Arp</span><span class="p">.</span><span class="n">input</span> <span class="n">t</span><span class="o">.</span><span class="n">arp</span><span class="o">)</span>
</span><span class="line">      <span class="o">~</span><span class="n">ipv4</span><span class="o">:(</span><span class="k">fun</span> <span class="o">_</span><span class="n">ip</span> <span class="o">-&gt;</span> <span class="nn">Firewall</span><span class="p">.</span><span class="n">ipv4_from_netvm</span> <span class="n">router</span> <span class="n">frame</span><span class="o">)</span>
</span><span class="line">      <span class="o">~</span><span class="n">ipv6</span><span class="o">:(</span><span class="k">fun</span> <span class="o">_</span><span class="n">ip</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span><span class="o">)</span>
</span><span class="line">      <span class="n">frame</span>
</span><span class="line">  <span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><strong>OCaml note:</strong> <code>Arp.input t.arp</code> is a partially-applied function. It's short for <code>fun x -&gt; Arp.input t.arp x</code>.</p>
<p>Here we just use the standard <code>Eth.input</code> code to dispatch on the frame.
It checks that the destination MAC matches ours and dispatches based on type.
We couldn't use it for the client code above because there we also want to
handle frames addressed to other clients, which <code>Eth.input</code> would discard.</p>
<p><code>Eth.input</code> extracts the IP packet from the Ethernet frame and passes that to our callback,
but the NAT library I used likes to work on whole Ethernet frames, so I ignore the IP packet
(<code>_ip</code>) and send the frame instead.</p>
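<p>Here is a rough, self-contained sketch of that callback style. The <code>input</code> function below is a hypothetical stand-in for <code>Eth.input</code>, written only to show the shape of the API as described above (the real one handles more cases); <code>handle_arp</code> and the sample MAC are also invented. It shows both a partially-applied handler and a callback that ignores the payload in favour of the whole frame.</p>

```ocaml
(* Hypothetical stand-in for Eth.input, for illustration only: frames whose
   destination MAC (bytes 0-5) is not [mac] are dropped; otherwise the
   payload after the 14-byte header goes to the callback for its EtherType. *)
let input ~mac ~arpv4 ~ipv4 (frame : Bytes.t) : unit =
  if Bytes.length frame < 14 || Bytes.sub_string frame 0 6 <> mac then ()
  else begin
    let payload = Bytes.sub frame 14 (Bytes.length frame - 14) in
    match Bytes.get_uint16_be frame 12 with
    | 0x0806 -> arpv4 payload
    | 0x0800 -> ipv4 payload
    | _ -> ()
  end

(* A two-argument handler; [handle_arp log] below is a partial application. *)
let handle_arp log payload =
  log := Printf.sprintf "arp payload: %d bytes" (Bytes.length payload) :: !log

let () =
  let mac = "\x02\x00\x00\x00\x00\x01" in
  let log = ref [] in
  let frame = Bytes.make 20 '\000' in
  Bytes.blit_string mac 0 frame 0 6;
  Bytes.set_uint16_be frame 12 0x0800;
  input ~mac
    ~arpv4:(handle_arp log)   (* same as: fun p -> handle_arp log p *)
    ~ipv4:(fun _ip ->         (* ignore the payload, use the whole frame *)
      log := Printf.sprintf "ipv4 frame: %d bytes" (Bytes.length frame) :: !log)
    frame;
  assert (!log = ["ipv4 frame: 20 bytes"])
```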
<h3 id="the-ip-layer">The IP layer</h3>
<p>Once an IP packet has been received, it is sent to the <a href="https://github.com/talex5/qubes-mirage-firewall/blob/master/firewall.ml">Firewall</a> module
(either <code>ipv4_from_netvm</code> or <code>ipv4_from_client</code>, depending on where it came from).</p>
<p>The process is similar in each case:</p>
<ul>
<li>
<p>Check if we have an existing NAT entry for this packet. If so, it's part of a conversation we've already approved, so perform the translation and send it on its way. NAT support is provided by the handy <a href="https://github.com/yomimono/mirage-nat">mirage-nat</a> library.</p>
</li>
<li>
<p>If not, collect useful information about the packet (source, destination, protocol, ports) and check against the user's firewall rules, then take whatever action they request.</p>
</li>
</ul>
<p>Here's the code that takes a client IPv4 frame and applies the firewall rules:</p>
<figure class="code"><figcaption><span>firewall.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">ipv4_from_client</span> <span class="n">t</span> <span class="n">frame</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="nn">Memory_pressure</span><span class="p">.</span><span class="n">status</span> <span class="bp">()</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Memory_critical</span> <span class="o">-&gt;</span> <span class="c">(* TODO: should happen before copying and async *)</span>
</span><span class="line">      <span class="nn">Log</span><span class="p">.</span><span class="n">warn</span> <span class="s2">&quot;Memory low - dropping packet&quot;</span> <span class="nn">Logs</span><span class="p">.</span><span class="n">unit</span><span class="o">;</span>
</span><span class="line">      <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Ok</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="c">(* Check for existing NAT entry for this packet *)</span>
</span><span class="line">  <span class="k">match</span> <span class="n">translate</span> <span class="n">t</span> <span class="n">frame</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">frame</span> <span class="o">-&gt;</span> <span class="n">forward_ipv4</span> <span class="n">t</span> <span class="n">frame</span>  <span class="c">(* Some existing connection or redirect *)</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="c">(* No existing NAT entry. Check the firewall rules. *)</span>
</span><span class="line">  <span class="k">match</span> <span class="n">classify</span> <span class="n">t</span> <span class="n">frame</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">info</span> <span class="o">-&gt;</span> <span class="n">apply_rules</span> <span class="n">t</span> <span class="nn">Rules</span><span class="p">.</span><span class="n">from_client</span> <span class="n">info</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Qubes provides a GUI that lets the user specify firewall rules.
It then encodes these as Linux iptables rules and puts them in QubesDB.
This isn't a very friendly format for non-Linux systems, so I ignore this and hard-code the rules in OCaml instead, in the <a href="https://github.com/talex5/qubes-mirage-firewall/blob/master/rules.ml">Rules</a> module:</p>
<figure class="code"><figcaption><span>rules.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(** Decide what to do with a packet from a client VM.</span>
</span><span class="line"><span class="c">    Note: If the packet matched an existing NAT rule then this isn&#39;t called. *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">from_client</span> <span class="o">=</span> <span class="k">function</span>
</span><span class="line">  <span class="o">|</span> <span class="o">{</span> <span class="n">dst</span> <span class="o">=</span> <span class="o">(`</span><span class="nc">External</span> <span class="o">_</span> <span class="o">|</span> <span class="o">`</span><span class="nc">NetVM</span><span class="o">)</span> <span class="o">}</span> <span class="o">-&gt;</span> <span class="o">`</span><span class="nc">NAT</span>
</span><span class="line">  <span class="o">|</span> <span class="o">{</span> <span class="n">dst</span> <span class="o">=</span> <span class="o">`</span><span class="nc">Client_gateway</span><span class="o">;</span> <span class="n">proto</span> <span class="o">=</span> <span class="o">`</span><span class="nc">UDP</span> <span class="o">{</span> <span class="n">dport</span> <span class="o">=</span> <span class="mi">53</span> <span class="o">}</span> <span class="o">}</span> <span class="o">-&gt;</span> <span class="o">`</span><span class="nc">NAT_to</span> <span class="o">(`</span><span class="nc">NetVM</span><span class="o">,</span> <span class="mi">53</span><span class="o">)</span>
</span><span class="line">  <span class="o">|</span> <span class="o">{</span> <span class="n">dst</span> <span class="o">=</span> <span class="o">(`</span><span class="nc">Client_gateway</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Firewall_uplink</span><span class="o">)</span> <span class="o">}</span> <span class="o">-&gt;</span> <span class="o">`</span><span class="nc">Drop</span> <span class="s2">&quot;packet addressed to firewall itself&quot;</span>
</span><span class="line">  <span class="o">|</span> <span class="o">{</span> <span class="n">dst</span> <span class="o">=</span> <span class="o">`</span><span class="nc">Client</span> <span class="o">_</span> <span class="o">}</span> <span class="o">-&gt;</span> <span class="o">`</span><span class="nc">Drop</span> <span class="s2">&quot;prevent communication between client VMs&quot;</span>
</span><span class="line">  <span class="o">|</span> <span class="o">{</span> <span class="n">dst</span> <span class="o">=</span> <span class="o">`</span><span class="nc">Unknown_client</span> <span class="o">_</span> <span class="o">}</span> <span class="o">-&gt;</span> <span class="o">`</span><span class="nc">Drop</span> <span class="s2">&quot;target client not running&quot;</span>
</span><span class="line">
</span><span class="line"><span class="c">(** Decide what to do with a packet received from the outside world.</span>
</span><span class="line"><span class="c">    Note: If the packet matched an existing NAT rule then this isn&#39;t called. *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">from_netvm</span> <span class="o">=</span> <span class="k">function</span>
</span><span class="line">  <span class="o">|</span> <span class="o">_</span> <span class="o">-&gt;</span> <span class="o">`</span><span class="nc">Drop</span> <span class="s2">&quot;drop by default&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>For packets from clients to the outside world we use the <code>NAT</code> action to rewrite the source address so the packets appear to come from the firewall (via some unused port).
DNS queries sent to the firewall get redirected to NetVM (UDP port 53 is DNS).
In both cases, the NAT actions update the NAT table so that we will forward any responses back to the client.
Everything else is dropped, with a log message.</p>
<p>I think it's rather nice the way we can use OCaml's existing support for pattern matching to implement the rules, without having to invent a new syntax.
Originally, I had a default-drop rule at the end of <code>from_client</code>, but OCaml helpfully pointed out that it wasn't needed, as the previous rules already covered every case.</p>
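<p>As a minimal sketch of that exhaustiveness check (the <code>host</code> and <code>action</code> types below are simplified stand-ins invented for illustration, not the firewall's real types), a policy over a closed variant type needs no catch-all case:</p>
<pre><code class="ocaml">(* Hypothetical, simplified destination and action types. *)
type host = [ `External | `NetVM | `Client_gateway | `Firewall_uplink | `Client of string ]
type action = [ `NAT | `Drop of string ]

(* Every constructor of [host] is matched, so no default case is needed. *)
let from_client : host -&gt; action = function
  | `External | `NetVM -&gt; `NAT
  | `Client_gateway | `Firewall_uplink -&gt; `Drop &quot;packet addressed to firewall itself&quot;
  | `Client _ -&gt; `Drop &quot;prevent communication between client VMs&quot;
  (* | _ -&gt; `Drop &quot;drop by default&quot;
     Uncommenting this makes the compiler warn that the case is unused. *)

let () =
  match from_client (`Client &quot;dev&quot;) with
  | `NAT -&gt; print_endline &quot;NAT&quot;
  | `Drop reason -&gt; print_endline (&quot;Drop: &quot; ^ reason)
</code></pre>
<p>The useful property is the reverse one too: if a new constructor were added to <code>host</code>, the compiler would flag <code>from_client</code> as non-exhaustive until a rule for it was written.</p>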
<p>The incoming policy is to drop everything that wasn't already allowed by a rule added by the outbound NAT.</p>
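<p>To make the interaction between the NAT table and the default-drop policy concrete, here is a rough model using only the standard library. It is not how mirage-nat works internally: the <code>client</code> record, the <code>nat_table</code> hash table, the port counter and the function names are all assumptions made up for this sketch.</p>
<pre><code class="ocaml">(* Hypothetical model: the NAT table maps a public port on the firewall
   back to the client connection that caused it to be allocated. *)
type client = { ip : string; port : int }

let nat_table : (int, client) Hashtbl.t = Hashtbl.create 16
let next_port = ref 40000

(* Outbound packet: pick an unused public port and remember the mapping. *)
let nat_outbound client =
  let public_port = !next_port in
  incr next_port;
  Hashtbl.replace nat_table public_port client;
  public_port

(* Inbound packet: forward only if an outbound packet created an entry;
   otherwise fall through to the default-drop policy. *)
let handle_from_netvm ~dst_port =
  match Hashtbl.find_opt nat_table dst_port with
  | Some client -&gt; `Forward_to client
  | None -&gt; `Drop &quot;drop by default&quot;
</code></pre>
<p>In this model, a reply to a port returned by <code>nat_outbound</code> is forwarded to the recorded client, while a packet to any other port never reaches a client.</p>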
<p>I don't know much about firewalls, but this scheme works for my needs.
For comparison, the Linux iptables rules currently in my sys-firewall are:</p>
<pre><code>[user@sys-firewall ~]$ sudo iptables -vL -n -t filter
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       udp  --  vif+   *       0.0.0.0/0            0.0.0.0/0            udp dpt:68
55336   83M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
    0     0 ACCEPT     icmp --  *      *       0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0           
    0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
35540   23M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
    0     0 ACCEPT     all  --  vif0.0 *       0.0.0.0/0            0.0.0.0/0           
    0     0 DROP       all  --  vif+   vif+    0.0.0.0/0            0.0.0.0/0           
  519 33555 ACCEPT     udp  --  *      *       10.137.2.12          10.137.1.1           udp dpt:53
   16  1076 ACCEPT     udp  --  *      *       10.137.2.12          10.137.1.254         udp dpt:53
    0     0 ACCEPT     tcp  --  *      *       10.137.2.12          10.137.1.1           tcp dpt:53
    0     0 ACCEPT     tcp  --  *      *       10.137.2.12          10.137.1.254         tcp dpt:53
    0     0 ACCEPT     icmp --  *      *       10.137.2.12          0.0.0.0/0           
    0     0 DROP       tcp  --  *      *       10.137.2.12          10.137.255.254       tcp dpt:8082
  264 14484 ACCEPT     all  --  *      *       10.137.2.12          0.0.0.0/0           
  254 16404 ACCEPT     udp  --  *      *       10.137.2.9           10.137.1.1           udp dpt:53
    2   130 ACCEPT     udp  --  *      *       10.137.2.9           10.137.1.254         udp dpt:53
    0     0 ACCEPT     tcp  --  *      *       10.137.2.9           10.137.1.1           tcp dpt:53
    0     0 ACCEPT     tcp  --  *      *       10.137.2.9           10.137.1.254         tcp dpt:53
    0     0 ACCEPT     icmp --  *      *       10.137.2.9           0.0.0.0/0           
    0     0 DROP       tcp  --  *      *       10.137.2.9           10.137.255.254       tcp dpt:8082
  133  7620 ACCEPT     all  --  *      *       10.137.2.9           0.0.0.0/0           

Chain OUTPUT (policy ACCEPT 32551 packets, 1761K bytes)
 pkts bytes target     prot opt in     out     source               destination         

[user@sys-firewall ~]$ sudo iptables -vL -n -t nat
Chain PREROUTING (policy ACCEPT 362 packets, 20704 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  829 50900 PR-QBS     all  --  *      *       0.0.0.0/0            0.0.0.0/0           
  362 20704 PR-QBS-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 116 packets, 7670 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  *      vif+    0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  *      lo      0.0.0.0/0            0.0.0.0/0           
  945 58570 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain PR-QBS (1 references)
 pkts bytes target     prot opt in     out     source               destination         
  458 29593 DNAT       udp  --  *      *       0.0.0.0/0            10.137.2.1           udp dpt:53 to:10.137.1.1
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            10.137.2.1           tcp dpt:53 to:10.137.1.1
    9   603 DNAT       udp  --  *      *       0.0.0.0/0            10.137.2.254         udp dpt:53 to:10.137.1.254
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            10.137.2.254         tcp dpt:53 to:10.137.1.254

Chain PR-QBS-SERVICES (1 references)
 pkts bytes target     prot opt in     out     source               destination         

[user@sys-firewall ~]$ sudo iptables -vL -n -t mangle
Chain PREROUTING (policy ACCEPT 12090 packets, 17M bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain INPUT (policy ACCEPT 11387 packets, 17M bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain FORWARD (policy ACCEPT 703 packets, 88528 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 6600 packets, 357K bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 7303 packets, 446K bytes)
 pkts bytes target     prot opt in     out     source               destination         

[user@sys-firewall ~]$ sudo iptables -vL -n -t raw
Chain PREROUTING (policy ACCEPT 92093 packets, 106M bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  vif20.0 *      !10.137.2.9           0.0.0.0/0           
    0     0 DROP       all  --  vif19.0 *      !10.137.2.12          0.0.0.0/0           

Chain OUTPUT (policy ACCEPT 32551 packets, 1761K bytes)
 pkts bytes target     prot opt in     out     source               destination         

[user@sys-firewall ~]$ sudo iptables -vL -n -t security
Chain INPUT (policy ACCEPT 11387 packets, 17M bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain FORWARD (policy ACCEPT 659 packets, 86158 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 6600 packets, 357K bytes)
 pkts bytes target     prot opt in     out     source               destination      
</code></pre>
<p>I find it hard to tell, looking at these tables, exactly what sys-firewall's security policy will actually do.</p>
<h3 id="evaluation">Evaluation</h3>
<p>I timed start-up for the Linux-based &quot;sys-firewall&quot; and for &quot;mirage-firewall&quot; (after shutting them both down):</p>
<pre><code>[tal@dom0 ~]$ time qvm-start sys-firewall
--&gt; Creating volatile image: /var/lib/qubes/servicevms/sys-firewall/volatile.img...
--&gt; Loading the VM (type = ProxyVM)...
--&gt; Starting Qubes DB...
--&gt; Setting Qubes DB info for the VM...
--&gt; Updating firewall rules...
--&gt; Starting the VM...
--&gt; Starting the qrexec daemon...
Waiting for VM's qrexec agent......connected
--&gt; Starting Qubes GUId...
Connecting to VM's GUI agent: .connected
--&gt; Sending monitor layout...
--&gt; Waiting for qubes-session...
real    0m9.321s
user    0m0.163s
sys     0m0.262s

[tal@dom0 ~]$ time qvm-start mirage-firewall
--&gt; Loading the VM (type = ProxyVM)...
--&gt; Starting Qubes DB...
--&gt; Setting Qubes DB info for the VM...
--&gt; Updating firewall rules...
--&gt; Starting the VM...
--&gt; Starting the qrexec daemon...
Waiting for VM's qrexec agent.connected
--&gt; Starting Qubes GUId...
Connecting to VM's GUI agent: .connected
--&gt; Sending monitor layout...
--&gt; Waiting for qubes-session...
real    0m1.079s
user    0m0.130s
sys     0m0.192s
</code></pre>
<p>So, <code>mirage-firewall</code> starts in 1 second rather than 9. However, even then most of that second is spent in Qubes code running in dom0. <code>xl list</code> shows:</p>
<pre><code>[tal@dom0 ~]$ sudo xl list
Name                     ID   Mem VCPUs      State   Time(s)
dom0                      0  6097     4     r-----     623.8
sys-net                   4   294     4     -b----      79.2
sys-firewall             17  1293     4     -b----       9.9
mirage-firewall          18    30     1     -b----       0.0
</code></pre>
<p>I guess <code>sys-firewall</code> did more work after telling Qubes it was ready, because Xen reports it used 9.9 seconds of CPU time.
<code>mirage-firewall</code> uses too little time for Xen to report anything.</p>
<p>Notice also that sys-firewall is using 1293 MB with no clients (it's configured to balloon up or down; it could probably go down to 300 MB without much trouble). I gave mirage-firewall a fixed 30 MB allocation, which seems to be enough.</p>
<p>I'm not sure how it compares with Linux for transmission performance, but it can max out my 30 Mbit/s Internet connection with its single CPU, so it's unlikely to matter.</p>
<h3 id="exercises">Exercises</h3>
<p>I've only implemented the minimal features to let me use it as my firewall.
The great thing about having a simple unikernel is that you can modify it easily.
Here are some suggestions you can try at home (easy ones first):</p>
<ul>
<li>
<p>Change the policy to allow communication between client VMs.</p>
</li>
<li>
<p>Query the QubesDB <code>/qubes-debug-mode</code> key. If present and set, set logging to debug level.</p>
</li>
<li>
<p>Edit <code>command.ml</code> to provide a <a href="https://www.qubes-os.org/doc/qrexec3/">qrexec</a> command to add or remove rules at runtime.</p>
</li>
<li>
<p>When a packet is rejected, add the frame to a ring buffer. Edit <code>command.ml</code> to provide a &quot;dump-rejects&quot; command that returns the rejected packets in <code>pcap</code> format, ready to be loaded into Wireshark. Hint: you can use the <a href="https://github.com/mirage/ocaml-pcap">ocaml-pcap</a> library to read and write the pcap format.</p>
</li>
<li>
<p>All client VMs are reported as <code>Client</code> to the policy. Add a table mapping IP addresses to symbolic names, so you can e.g. allow <code>DevVM</code> to talk to <code>TestVM</code> or control access to specific external machines.</p>
</li>
<li>
<p><a href="https://github.com/yomimono/mirage-nat">mirage-nat</a> doesn't do NAT for ICMP packets. Add support, so ping works (see <a href="https://github.com/yomimono/mirage-nat/issues/15">https://github.com/yomimono/mirage-nat/issues/15</a>).</p>
</li>
<li>
<p>Qubes allows each VM to have two DNS servers. I only implemented the primary. Read the <code>/qubes-secondary-dns</code> and <code>/qubes-netvm-secondary-dns</code> keys from QubesDB and proxy the secondary server too.</p>
</li>
<li>
<p>Implement <a href="https://en.wikipedia.org/wiki/Port_knocking">port knocking</a> for new connections.</p>
</li>
<li>
<p>Add a <code>Reject</code> action that sends an ICMP rejection message.</p>
</li>
<li>
<p>Find out what we're supposed to do when a domain shuts down. Currently, we set the netback state to closed, but the directory in XenStore remains. Who is responsible for deleting it?</p>
</li>
<li>
<p>Update the firewall to use the latest version of the <a href="https://github.com/yomimono/mirage-nat">mirage-nat</a> library, which has extra features such as expiry of old NAT table entries.</p>
</li>
</ul>
<p>Finally, <a href="https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-004-2012.txt">Qubes Security Bulletin #4</a> says:</p>
<blockquote>
<p>Due to a silly mistake made by the Qubes Team, the IPv6 filtering rules
have been set to ALLOW by default in all Service VMs, which results in
lack of filtering for IPv6 traffic originating between NetVM and the
corresponding FirewallVM, as well as between AppVMs and the
corresponding FirewallVM. Because the RPC services (rpcbind and
rpc.statd) are, by default, bound also to the IPv6 interfaces in all the
VMs by default, this opens up an avenue to attack a FirewallVM from a
corresponding NetVM or AppVM, and further attack another AppVM from the
compromised FirewallVM, using a hypothetical vulnerability in the above
mentioned RPC services (chained attack).</p>
</blockquote>
<p>What changes would be needed to mirage-firewall to reproduce this bug?</p>
<h2 id="summary">Summary</h2>
<p>QubesOS provides a desktop environment made from multiple virtual machines, isolated using Xen.
It runs the network drivers (which it doesn't trust) in a Linux &quot;NetVM&quot;, which it assumes may be compromised, and places a &quot;FirewallVM&quot; between that and the VMs running user applications.
This design is intended to protect users from malicious or buggy network drivers.</p>
<p>However, the Linux kernel code running in FirewallVM is written with the assumption that NetVM is trustworthy.
It is fairly likely that a compromised NetVM could successfully attack FirewallVM.
Since FirewallVM and the client VMs all run Linux, it is likely that the same exploit would then allow the client VMs to be compromised too.</p>
<p>I used MirageOS to write a replacement FirewallVM in OCaml.
The new virtual machine contains almost no C code (little more than <code>malloc</code>, <code>printk</code>, the OCaml GC and <code>libm</code>), and should therefore avoid problems such as the unchecked array bounds problem that recently affected the Qubes firewall.
It also uses less than a tenth of the minimum memory of the Linux FirewallVM, boots several times faster, and when it starts handling network traffic it is already fully configured, avoiding e.g. any race setting up firewalls or DNS forwarding.</p>
<p>The code is around 1000 lines of OCaml, and makes it easy to follow the progress of a network frame from the point where the network driver reads it from a Xen shared memory ring, through the Ethernet handling, to the IP firewall code, to the user firewall policy, and then finally to the shared memory ring of the output interface.</p>
<p>The code has only been lightly tested (I've just started using it as the FirewallVM on my main laptop), but will hopefully prove easy to extend (and, if necessary, debug).</p>
]]></content>
  </entry>
  <entry>
    <title type="html">CueKeeper internals: Experiences with Irmin, React, TyXML and IndexedDB</title>
    <link href="https://roscidus.com/blog/blog/2015/06/22/cuekeeper-internals-irmin/"></link>
    <updated>2015-06-22T09:25:23+00:00</updated>
    <id>https://roscidus.com/blog/blog/2015/06/22/cuekeeper-internals-irmin</id>
    <content type="html"><![CDATA[<p>In <a href="/blog/blog/2015/04/28/cuekeeper-gitting-things-done-in-the-browser/">CueKeeper: Gitting Things Done in the browser</a>, I wrote about CueKeeper, a <a href="http://en.wikipedia.org/wiki/Getting_Things_Done">Getting Things Done</a> application that runs client-side in your browser.
It stores your actions in a Git-like data-store provided by <a href="https://github.com/mirage/irmin/">Irmin</a>, allowing you to browse the history, revert changes, and sync (between tabs and, once the server backend is available, between devices).
Several people asked about the technologies used to build it, so that's what this post will cover.</p>
<!-- more -->
<p><a href="/blog/cuekeeper/"><span class="caption-wrapper center"><img src="/blog/images/cuekeeper/cuekeeper-0.1.png" title="CueKeeper screenshot. Click for interactive version." class="caption"/><span class="caption-text">CueKeeper screenshot. Click for interactive version.</span></span></a></p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#overview">Overview</a>
</li>
<li><a href="#compiling-to-javascript">Compiling to Javascript</a>
</li>
<li><a href="#using-javascript-apis">Using Javascript APIs</a>
</li>
<li><a href="#data-structures">Data structures</a>
</li>
<li><a href="#backwards-compatibility">Backwards compatibility</a>
</li>
<li><a href="#irmin">Irmin</a>
</li>
<li><a href="#indexeddb-backend">IndexedDB backend</a>
</li>
<li><a href="#revisions-and-merging">Revisions and merging</a>
</li>
<li><a href="#react">React</a>
</li>
<li><a href="#tyxml">TyXML</a>
</li>
<li><a href="#reactive-tyxml">Reactive TyXML</a>
</li>
<li><a href="#problems-with-react">Problems with React</a>
</li>
<li><a href="#debugging">Debugging</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
<li><a href="#next-steps">Next steps</a>
</li>
<li><a href="#acknowledgements">Acknowledgements</a>
</li>
</ul>
<p>(This post also appeared on <a href="http://www.reddit.com/r/ocaml/comments/3apdw7/cuekeeper_internals_experiences_with_irmin_react/">Reddit</a>.)</p>
<h2 id="overview">Overview</h2>
<p>CueKeeper is written in <a href="http://ocaml.org/">OCaml</a> and compiled to Javascript using <a href="http://ocsigen.org/js_of_ocaml/">js_of_ocaml</a>.
The HTML is produced using <a href="http://ocsigen.org/tyxml/">TyXML</a>, and kept up-to-date with <a href="http://erratique.ch/software/react">React</a> (note: that's OCaml React, not Facebook React).
Records are serialised using <a href="https://github.com/janestreet/sexplib">Sexplib</a> and stored by <a href="https://github.com/mirage/irmin/">Irmin</a> in a local <a href="http://www.w3.org/TR/IndexedDB/">IndexedDB</a> database in your browser.
Action descriptions are written in <a href="http://daringfireball.net/projects/markdown/syntax">Markdown</a>, which is parsed using <a href="https://github.com/ocaml/omd">Omd</a>.</p>
<p>Here's a diagram of the main modules that make up CueKeeper:</p>
<p><img src="/blog/images/cuekeeper/modules.png" class="border center"/></p>
<ul>
<li><code>disk_node</code> defines the on-disk data types representing stored items such as actions and projects.
</li>
<li><code>rev</code> loads all the items in a single Git commit (revision), which together represent the state of the system at some point.
</li>
<li><code>update</code> keeps track of the current branch head, loading the new version when it updates. It is also responsible for writing changes to storage.
</li>
<li><code>merge</code> can merge any two branches using a 3-way merge.
</li>
<li><code>model</code> queries the state to extract the information to be displayed (e.g. list of current actions).
</li>
<li><code>template</code> renders the results to HTML. It also uses e.g. the Pikaday date-picker widget.
</li>
<li><code>client</code> is the main entry point for the Javascript (client-side) part of CueKeeper (the server is currently under development).
</li>
<li><code>git_storage</code> provides a Git-like interface to Irmin, which uses the <code>irmin_IDB</code> backend to store the data in the browser using IndexedDB (plus a little HTML storage for cross-tab notifications).
</li>
</ul>
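<p>For readers unfamiliar with the 3-way merge mentioned for <code>merge</code>, the idea for a single value can be sketched as below. This is only an illustration of the general technique, not CueKeeper's merge code (which has to merge whole trees of items); the <code>merge3</code> name and the <code>`Conflict</code> error are made up for the example.</p>
<pre><code class="ocaml">(* 3-way merge of one value: [base] is the common ancestor,
   [ours] and [theirs] are the two branch heads. *)
let merge3 ~base ~ours ~theirs =
  if ours = theirs then Ok ours          (* both sides agree *)
  else if ours = base then Ok theirs     (* only their side changed *)
  else if theirs = base then Ok ours     (* only our side changed *)
  else Error `Conflict                   (* both changed, differently *)

let () =
  match merge3 ~base:&quot;Buy milk&quot; ~ours:&quot;Buy milk&quot; ~theirs:&quot;Buy oat milk&quot; with
  | Ok v -&gt; print_endline v
  | Error `Conflict -&gt; print_endline &quot;conflict&quot;
</code></pre>
<p>Having the common ancestor is what lets the merge tell &quot;they changed it&quot; apart from &quot;we changed it back&quot;, which a plain 2-way comparison cannot do.</p>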
<p>The full code is available at <a href="https://github.com/talex5/cuekeeper">https://github.com/talex5/cuekeeper</a>.</p>
<h2 id="compiling-to-javascript">Compiling to Javascript</h2>
<p>To generate Javascript code from OCaml, first compile to OCaml bytecode and then run <a href="http://ocsigen.org/js_of_ocaml/">js_of_ocaml</a> on the result, like this:</p>
<figure class="code"><figcaption><span>test.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">print_endline</span> <span class="s2">&quot;Hello from OCaml!&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><pre><code>$ ocamlc test.ml -o test.byte
$ js_of_ocaml test.byte -o test.js
</code></pre>
<p>To test it, create an HTML file to load the new <code>test.js</code> code and open the HTML file in a web browser:</p>
<figure class="code"><figcaption><span>test.html</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="html"><span class="line"><span class="cp">&lt;!DOCTYPE html&gt;</span>
</span><span class="line"><span class="p">&lt;</span><span class="nt">html</span><span class="p">&gt;</span>
</span><span class="line">  <span class="p">&lt;</span><span class="nt">body</span><span class="p">&gt;</span>
</span><span class="line">    <span class="p">&lt;</span><span class="nt">p</span><span class="p">&gt;</span>Open the browser&#39;s Javascript console to see the output.<span class="p">&lt;/</span><span class="nt">p</span><span class="p">&gt;</span>
</span><span class="line">    <span class="p">&lt;</span><span class="nt">script</span> <span class="na">src</span><span class="o">=</span><span class="s">&quot;test.js&quot;</span><span class="p">&gt;&lt;/</span><span class="nt">script</span><span class="p">&gt;</span>
</span><span class="line">  <span class="p">&lt;/</span><span class="nt">body</span><span class="p">&gt;</span>
</span><span class="line"><span class="p">&lt;/</span><span class="nt">html</span><span class="p">&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>OCaml bytecode statically includes any OCaml libraries it uses, so this method also works for complex real-world programs.
Many OCaml libraries can be used directly.
For example, I used <a href="https://github.com/mirage/ocaml-tar">ocaml-tar</a> to create <code>.tar</code> archives in the browser for the export feature, and
the <a href="https://github.com/ocaml/omd">omd</a> Markdown parser for the descriptions.</p>
<p>If the OCaml code uses external C functions (that aren't already provided) then you need to implement them in Javascript.
In the case of CueKeeper, I had to implement a few trivial functions for blitting blocks of memory between OCaml strings, bigarrays and <code>bin_prot</code> buffers.
I put these in a <a href="https://github.com/talex5/cuekeeper/blob/master/js/helpers.js">helpers.js</a> file and added it to the <code>js_of_ocaml</code> arguments.</p>
<h2 id="using-javascript-apis">Using Javascript APIs</h2>
<p>My first attempt at writing code for the browser was my <a href="/blog/blog/2014/10/27/visualising-an-asynchronous-monad/">Lwt trace visualiser</a>.
I initially wrote that for the desktop but it turned out that running it in the browser was just a matter of replacing calls to GTK's Cairo canvas with calls to the very similar HTML canvas.
Writing CueKeeper required learning a bit more about the mysterious world of the Javascript DOM.</p>
<p>I also needed to integrate with the <a href="https://github.com/dbushell/Pikaday">Pikaday</a> date picker widget.
To do this, you first declare an OCaml class type for each Javascript class (you only have to define the methods you want to use), like this:</p>
<figure class="code"><figcaption><span>pikaday.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">class</span> <span class="k">type</span> <span class="n">pikaday</span> <span class="o">=</span>
</span><span class="line">  <span class="k">object</span>
</span><span class="line">    <span class="k">method</span> <span class="n">getDate</span> <span class="o">:</span> <span class="nn">Js</span><span class="p">.</span><span class="n">date</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="nn">Js</span><span class="p">.</span><span class="nn">Opt</span><span class="p">.</span><span class="n">t</span> <span class="nn">Js</span><span class="p">.</span><span class="n">meth</span>
</span><span class="line">  <span class="k">end</span>
</span><span class="line">
</span><span class="line"><span class="k">class</span> <span class="k">type</span> <span class="n">config</span> <span class="o">=</span>
</span><span class="line">  <span class="k">object</span>
</span><span class="line">    <span class="k">method</span> <span class="n">container</span> <span class="o">:</span> <span class="nn">Dom_html</span><span class="p">.</span><span class="n">element</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="nn">Js</span><span class="p">.</span><span class="n">prop</span>
</span><span class="line">    <span class="k">method</span> <span class="n">onSelect</span> <span class="o">:</span> <span class="o">(</span><span class="n">pikaday</span><span class="o">,</span> <span class="nn">Js</span><span class="p">.</span><span class="n">date</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span><span class="o">)</span> <span class="nn">Js</span><span class="p">.</span><span class="n">meth_callback</span> <span class="nn">Js</span><span class="p">.</span><span class="n">prop</span>
</span><span class="line">    <span class="k">method</span> <span class="n">defaultDate</span> <span class="o">:</span> <span class="nn">Js</span><span class="p">.</span><span class="n">date</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="nn">Js</span><span class="p">.</span><span class="nn">Optdef</span><span class="p">.</span><span class="n">t</span> <span class="nn">Js</span><span class="p">.</span><span class="n">prop</span>
</span><span class="line">    <span class="k">method</span> <span class="n">setDefaultDate</span> <span class="o">:</span> <span class="kt">bool</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="nn">Js</span><span class="p">.</span><span class="n">prop</span>
</span><span class="line">  <span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that a <code>pikaday</code> object has a <code>getDate</code> method which returns an optional Javascript date object, and that
a <code>config</code> object provides properties such as <code>onSelect</code>, which is a callback invoked as a method on a <code>pikaday</code> object, taking a date and returning nothing.</p>
<p>The constructors are built using <code>Js.Unsafe</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">make_config</span> <span class="bp">()</span> <span class="o">:</span> <span class="n">config</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span> <span class="nn">Js</span><span class="p">.</span><span class="nn">Unsafe</span><span class="p">.</span><span class="n">obj</span> <span class="o">[|</span> <span class="o">|]</span>
</span><span class="line"><span class="k">let</span> <span class="n">pikaday_constr</span> <span class="o">:</span> <span class="o">(</span><span class="n">config</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">pikaday</span> <span class="nn">Js</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="nn">Js</span><span class="p">.</span><span class="n">constr</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Js</span><span class="p">.</span><span class="nn">Unsafe</span><span class="p">.</span><span class="n">global</span><span class="o">##_</span><span class="n">Pikaday</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>They're &quot;unsafe&quot; because this isn't type checked; there's no way to know whether Pikaday really implements the interface we defined above.
However, from this point on everything we do with Pikaday is statically checked against our definitions.</p>
<p><code>js_of_ocaml</code> provides a syntax extension for OCaml to make using native Javascript objects easier.
<code>object##property</code> reads a property, <code>object##property &lt;- value</code> sets a property, and <code>object##method(args)</code> calls a method.
Note that parentheses around the arguments are required, unlike with regular OCaml method calls.
Note also that <code>js_of_ocaml</code> ignores underscores in various places to avoid differences between Javascript and OCaml naming conventions (OCaml method names can't start with an uppercase character, for example, hence <code>_Pikaday</code> above).</p>
<p>It's interesting the way OCaml's type inference is used here: <code>Js.Unsafe.global</code> can take any type, and OCaml infers that its type is &quot;object with a <code>Pikaday</code> property, which is a <code>pikaday</code> constructor taking a <code>config</code> argument&quot; because that's how we use it.</p>
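<p>The same kind of inference can be seen in plain OCaml, without any Javascript involved. In this small sketch (the <code>describe</code> function and <code>cat</code> object are invented for illustration), the compiler works out which methods an object must have purely from how it is used:</p>
<pre><code class="ocaml">(* No type annotations: OCaml infers
     val describe : &lt; name : string; .. &gt; -&gt; string
   i.e. &quot;any object with at least a [name] method returning a string&quot;. *)
let describe o = &quot;I am &quot; ^ o#name

(* An object with extra methods is still accepted. *)
let cat = object
  method name = &quot;cat&quot;
  method legs = 4
end

let () = print_endline (describe cat)
</code></pre>
<p>The <code>..</code> in the inferred type is what allows additional methods, which is roughly the role the open object type plays for <code>Js.Unsafe.global</code>.</p>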
<p>Finally, here's the code that creates a new Pikaday object:</p>
<figure class="code"><figcaption><span>pikaday.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">make</span> <span class="o">?(</span><span class="n">initial</span><span class="o">:</span><span class="nn">Ck_time</span><span class="p">.</span><span class="n">user_date</span> <span class="n">option</span><span class="o">)</span> <span class="o">~</span><span class="n">on_select</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">div</span> <span class="o">=</span> <span class="nn">Html5</span><span class="p">.</span><span class="n">div</span> <span class="bp">[]</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">elem</span> <span class="o">=</span> <span class="nn">Tyxml_js</span><span class="p">.</span><span class="nn">To_dom</span><span class="p">.</span><span class="n">of_div</span> <span class="n">div</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">config</span> <span class="o">=</span> <span class="n">make_config</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="n">config</span><span class="o">##</span><span class="n">container</span> <span class="o">&lt;-</span> <span class="n">elem</span><span class="o">;</span>
</span><span class="line">  <span class="n">config</span><span class="o">##</span><span class="n">onSelect</span> <span class="o">&lt;-</span> <span class="nn">Js</span><span class="p">.</span><span class="n">wrap_callback</span> <span class="o">(</span><span class="k">fun</span> <span class="n">d</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">on_select</span> <span class="o">(</span><span class="n">to_user_date</span> <span class="n">d</span><span class="o">)</span>
</span><span class="line">  <span class="o">);</span>
</span><span class="line">  <span class="k">begin</span> <span class="k">match</span> <span class="o">(</span><span class="n">initial</span> <span class="o">:&gt;</span> <span class="o">(</span><span class="kt">int</span> <span class="o">*</span> <span class="kt">int</span> <span class="o">*</span> <span class="kt">int</span><span class="o">)</span> <span class="n">option</span><span class="o">)</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="n">m</span><span class="o">,</span> <span class="n">d</span><span class="o">)</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="k">let</span> <span class="n">js_date</span> <span class="o">=</span> <span class="n">jsnew</span> <span class="nn">Js</span><span class="p">.</span><span class="n">date_day</span> <span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="n">m</span><span class="o">,</span> <span class="n">d</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">      <span class="n">config</span><span class="o">##</span><span class="n">defaultDate</span> <span class="o">&lt;-</span> <span class="nn">Js</span><span class="p">.</span><span class="nn">Optdef</span><span class="p">.</span><span class="n">return</span> <span class="n">js_date</span><span class="o">;</span>
</span><span class="line">      <span class="n">config</span><span class="o">##</span><span class="n">setDefaultDate</span> <span class="o">&lt;-</span> <span class="nn">Js</span><span class="p">.</span><span class="n">_true</span><span class="o">;</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="bp">()</span> <span class="k">end</span><span class="o">;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">pd</span> <span class="o">=</span> <span class="n">jsnew</span> <span class="n">pikaday_constr</span> <span class="o">(</span><span class="n">config</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="o">(</span><span class="n">div</span><span class="o">,</span> <span class="n">pd</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, we create a <code>&lt;div&gt;</code> element and use it as the <code>container</code> field of the Pikaday config object.
<code>on_select</code> is an OCaml function to handle the result, which we wrap with <code>Js.wrap_callback</code> and set as the Javascript callback.
If an initial date is given, we construct a Javascript <code>Date</code> object and set that as the default.
Finally, we create the <code>Pikaday</code> object and return it, along with the containing <code>div</code>.</p>
<p>All this means that binding to Javascript APIs is very easy and, thanks to the extra type-checking, feels even more pleasant than using Javascript libraries directly from Javascript.</p>
<h2 id="data-structures">Data structures</h2>
<p>In CueKeeper, areas, projects and actions all share a common set of fields, which I defined using an OCaml record:</p>
<figure class="code"><figcaption><span>ck_disk_node.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="n">node_details</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">  <span class="n">parent</span> <span class="o">:</span> <span class="nn">Ck_id</span><span class="p">.</span><span class="n">t</span> <span class="n">sexp_option</span><span class="o">;</span>
</span><span class="line">  <span class="n">name</span> <span class="o">:</span> <span class="kt">string</span><span class="o">;</span>
</span><span class="line">  <span class="n">description</span> <span class="o">:</span> <span class="kt">string</span><span class="o">;</span>
</span><span class="line">  <span class="n">ctime</span> <span class="o">:</span> <span class="kt">float</span><span class="o">;</span>
</span><span class="line">  <span class="n">contact</span> <span class="o">:</span> <span class="nn">Ck_id</span><span class="p">.</span><span class="n">t</span> <span class="n">sexp_option</span><span class="o">;</span>
</span><span class="line">  <span class="n">conflicts</span> <span class="o">:</span> <span class="kt">string</span> <span class="n">sexp_list</span><span class="o">;</span>
</span><span class="line"><span class="o">}</span> <span class="k">with</span> <span class="n">sexp</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><img src="/blog/images/cuekeeper/area.png" class="center small"/></p>
<p>A <code>Ck_id.t</code> is a <a href="http://en.wikipedia.org/wiki/Universally_unique_identifier">UUID</a> (unique string).
I refer to other nodes using UUIDs so that renaming a node doesn't require updating everything that points to it.
This simplifies merging.
Each record is stored as a single file, and the name of the file is the item's UUID.
The <code>conflicts</code> field is used to store messages about any conflicts that had to be resolved during merging.</p>
<p>The <code>with sexp</code> annotation makes use of <a href="https://github.com/janestreet/sexplib">Sexplib</a> to auto-generate code for serialising and deserialising these structures.
I use <code>sexp_option</code> and <code>sexp_list</code> rather than <code>option</code> and <code>list</code> to provide slightly nicer output: these fields will be omitted if empty.</p>
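<p>To make the "omitted if empty" behaviour concrete, here is a minimal hand-written sketch of what such a serialiser does. It does not use Sexplib: the <code>sexp</code> type, the cut-down <code>node_details</code> record (with <code>parent</code> as a plain string) and the unquoted <code>to_string</code> printer are all simplified stand-ins invented for this illustration.</p>

```ocaml
(* Minimal stand-in for Sexplib's S-expression type (an assumption, not the real library). *)
type sexp = Atom of string | List of sexp list

(* A cut-down node_details; parent is a plain string here for simplicity. *)
type node_details = {
  parent : string option;
  name : string;
  conflicts : string list;
}

(* Emit a (key value) pair only when there is something to store, which is
   roughly the effect that sexp_option and sexp_list have on the output. *)
let sexp_of_node_details n =
  let field key value = List [Atom key; value] in
  let parent =
    match n.parent with
    | None -> []
    | Some p -> [field "parent" (Atom p)] in
  let conflicts =
    match n.conflicts with
    | [] -> []
    | cs -> [field "conflicts" (List (List.map (fun c -> Atom c) cs))] in
  List (parent @ [field "name" (Atom n.name)] @ conflicts)

(* Simplified printer: unlike Sexplib, it never quotes atoms. *)
let rec to_string = function
  | Atom s -> s
  | List items -> "(" ^ String.concat " " (List.map to_string items) ^ ")"

let () =
  let area = { parent = None; name = "Work"; conflicts = [] } in
  (* Only the name field is written, because the other two are empty. *)
  assert (to_string (sexp_of_node_details area) = "((name Work))")
```

<p>With plain <code>option</code> and <code>list</code> fields, every record would carry explicit empty values for <code>parent</code> and <code>conflicts</code> instead.</p>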
<p>I also (rather lazily) reuse this structure for contacts and contexts, but always keep <code>parent</code> and <code>contact</code> as <code>None</code> for them.</p>
<p>For actions and projects, we also need to record some extra data:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="n">project_details</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">  <span class="n">pstarred</span> <span class="o">:</span> <span class="kt">bool</span> <span class="k">with</span> <span class="n">default</span><span class="o">(</span><span class="bp">false</span><span class="o">);</span>
</span><span class="line">  <span class="n">pstate</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Active</span> <span class="o">|</span> <span class="o">`</span><span class="nc">SomedayMaybe</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Done</span> <span class="o">]</span>
</span><span class="line"><span class="o">}</span> <span class="k">with</span> <span class="n">sexp</span>
</span><span class="line">
</span><span class="line"><span class="k">type</span> <span class="n">astate</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[</span> <span class="o">`</span><span class="nc">Next</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Waiting</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Waiting_for_contact</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Waiting_until</span> <span class="k">of</span> <span class="nn">Ck_time</span><span class="p">.</span><span class="n">user_date</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Future</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Done</span> <span class="o">]</span> <span class="k">with</span> <span class="n">sexp</span>
</span><span class="line">
</span><span class="line"><span class="k">type</span> <span class="n">action_details</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">  <span class="n">astarred</span> <span class="o">:</span> <span class="kt">bool</span> <span class="k">with</span> <span class="n">default</span><span class="o">(</span><span class="bp">false</span><span class="o">);</span>
</span><span class="line">  <span class="n">astate</span> <span class="o">:</span> <span class="n">astate</span><span class="o">;</span>
</span><span class="line">  <span class="n">context</span> <span class="o">:</span> <span class="nn">Ck_id</span><span class="p">.</span><span class="n">t</span> <span class="n">sexp_option</span><span class="o">;</span>
</span><span class="line">  <span class="n">repeat</span><span class="o">:</span> <span class="nn">Ck_time</span><span class="p">.</span><span class="n">repeat</span> <span class="n">sexp_option</span><span class="o">;</span>
</span><span class="line"><span class="o">}</span> <span class="k">with</span> <span class="n">sexp</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here's what a project looks like when printed with <code>Sexplib.Sexp.to_string_hum</code>
(the <code>hum</code> suffix turns on pretty-printing; the real code uses plain <code>to_string</code>):</p>
<p><img src="/blog/images/cuekeeper/project.png" class="center small"/></p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="lisp"><span class="line"><span class="p">(</span><span class="nv">Project</span>
</span><span class="line"><span class="w"> </span><span class="p">(((</span><span class="nv">pstarred</span><span class="w"> </span><span class="nv">false</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nv">pstate</span><span class="w"> </span><span class="nv">Active</span><span class="p">))</span>
</span><span class="line"><span class="w">  </span><span class="p">((</span><span class="nv">parent</span><span class="w"> </span><span class="nv">3eba7466-dafd-4b96-9fad-c9859ef825f2</span><span class="p">)</span>
</span><span class="line"><span class="w">  </span><span class="p">(</span><span class="nv">name</span><span class="w"> </span><span class="s">&quot;Make a Mirage unikernel&quot;</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nv">description</span><span class="w"> </span><span class="s">&quot;&quot;</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nv">ctime</span><span class="w"> </span><span class="mf">1429555212.546</span><span class="p">))))</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note that the child nodes don't appear at all here.
Instead, we find them through their parent field.</p>
<p>Finally, I wrapped everything up in some polymorphic variants:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="n">action_node</span> <span class="o">=</span> <span class="n">action_details</span> <span class="o">*</span> <span class="n">node_details</span>
</span><span class="line"><span class="k">type</span> <span class="n">project_node</span> <span class="o">=</span> <span class="n">project_details</span> <span class="o">*</span> <span class="n">node_details</span>
</span><span class="line"><span class="k">type</span> <span class="n">area_node</span> <span class="o">=</span> <span class="n">node_details</span>
</span><span class="line"><span class="k">type</span> <span class="n">contact_node</span> <span class="o">=</span> <span class="n">node_details</span>
</span><span class="line"><span class="k">type</span> <span class="n">context_node</span> <span class="o">=</span> <span class="n">node_details</span>
</span><span class="line">
</span><span class="line"><span class="k">type</span> <span class="n">action</span> <span class="o">=</span> <span class="o">[`</span><span class="nc">Action</span> <span class="k">of</span> <span class="n">action_node</span><span class="o">]</span>
</span><span class="line"><span class="k">type</span> <span class="n">project</span> <span class="o">=</span> <span class="o">[`</span><span class="nc">Project</span> <span class="k">of</span> <span class="n">project_node</span><span class="o">]</span>
</span><span class="line"><span class="k">type</span> <span class="n">area</span> <span class="o">=</span> <span class="o">[`</span><span class="nc">Area</span> <span class="k">of</span> <span class="n">area_node</span><span class="o">]</span>
</span><span class="line"><span class="k">type</span> <span class="n">contact</span> <span class="o">=</span> <span class="o">[`</span><span class="nc">Contact</span> <span class="k">of</span> <span class="n">contact_node</span><span class="o">]</span>
</span><span class="line"><span class="k">type</span> <span class="n">context</span> <span class="o">=</span> <span class="o">[`</span><span class="nc">Context</span> <span class="k">of</span> <span class="n">context_node</span><span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="k">type</span> <span class="n">generic</span> <span class="o">=</span> <span class="o">[</span> <span class="n">area</span> <span class="o">|</span> <span class="n">project</span> <span class="o">|</span> <span class="n">action</span> <span class="o">|</span> <span class="n">contact</span> <span class="o">|</span> <span class="n">context</span> <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This is very useful, because other parts of the code often want to deal with subsets of the types.
The interface lists which types can be used in each operation:</p>
<figure class="code"><figcaption><span>ck_disk_node.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">parent</span> <span class="o">:</span> <span class="o">[&lt;</span> <span class="n">area</span> <span class="o">|</span> <span class="n">project</span> <span class="o">|</span> <span class="n">action</span> <span class="o">]</span> <span class="o">-&gt;</span> <span class="nn">Ck_id</span><span class="p">.</span><span class="n">t</span> <span class="n">option</span>
</span><span class="line"><span class="k">val</span> <span class="n">starred</span> <span class="o">:</span> <span class="o">[&lt;</span> <span class="n">project</span> <span class="o">|</span> <span class="n">action</span><span class="o">]</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line"><span class="k">val</span> <span class="n">action_repeat</span> <span class="o">:</span> <span class="n">action</span> <span class="o">-&gt;</span> <span class="nn">Ck_time</span><span class="p">.</span><span class="n">repeat</span> <span class="n">option</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that only areas, projects and actions have parents; only projects and actions can be starred; and only actions can repeat.</p>
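<p>The following self-contained sketch shows how bounds like these behave. The record types are cut down to a couple of hypothetical fields and IDs are plain strings, so it only illustrates the typing, not the real <code>Ck_disk_node</code> implementation.</p>

```ocaml
(* Cut-down stand-ins for the types above; just enough fields to show the idea. *)
type node_details = { parent : string option; name : string }
type project_details = { pstarred : bool }
type action_details = { astarred : bool }

type area = [ `Area of node_details ]
type project = [ `Project of project_details * node_details ]
type action = [ `Action of action_details * node_details ]

(* Accepts any subset of areas, projects and actions. *)
let parent : [< area | project | action ] -> string option = function
  | `Area n -> n.parent
  | `Project (_, n) | `Action (_, n) -> n.parent

(* Accepts only projects and actions. *)
let starred : [< project | action ] -> bool = function
  | `Project (p, _) -> p.pstarred
  | `Action (a, _) -> a.astarred

let work = { parent = None; name = "Work" }
let blog : project =
  `Project ({ pstarred = true }, { parent = Some "work-id"; name = "Blog" })

let () =
  assert (parent (`Area work) = None);
  assert (parent blog = Some "work-id");
  assert (starred blog)
  (* starred (`Area work) would be rejected at compile time. *)
```

<p>Because <code>blog</code> has the narrower type <code>project</code>, it can be passed to both functions without any conversion.</p>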
<p>Using variants makes it easy for other modules to match on the different types.
For example, here's the code for generating the <code>Process</code> tab's tree:</p>
<p><img src="/blog/images/cuekeeper/process.png" class="center"/></p>
<figure class="code"><figcaption><span>ck_model.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">  <span class="k">let</span> <span class="n">make_process_tree</span> <span class="n">r</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="k">rec</span> <span class="n">aux</span> <span class="n">items</span> <span class="o">=</span>
</span><span class="line">      <span class="nn">M</span><span class="p">.</span><span class="n">fold</span> <span class="o">(</span><span class="k">fun</span> <span class="o">_</span><span class="n">key</span> <span class="n">item</span> <span class="n">acc</span> <span class="o">-&gt;</span>
</span><span class="line">        <span class="k">match</span> <span class="n">item</span> <span class="k">with</span>
</span><span class="line">        <span class="o">|</span> <span class="o">`</span><span class="nc">Action</span> <span class="o">_</span> <span class="o">-&gt;</span> <span class="n">acc</span>
</span><span class="line">        <span class="o">|</span> <span class="o">`</span><span class="nc">Project</span> <span class="o">_</span> <span class="k">as</span> <span class="n">p</span> <span class="k">when</span> <span class="nn">Node</span><span class="p">.</span><span class="n">project_state</span> <span class="n">p</span> <span class="o">&lt;&gt;</span> <span class="o">`</span><span class="nc">Active</span> <span class="o">-&gt;</span> <span class="n">acc</span>
</span><span class="line">        <span class="o">|</span> <span class="o">`</span><span class="nc">Area</span> <span class="o">_</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Project</span> <span class="o">_</span> <span class="o">-&gt;</span>
</span><span class="line">            <span class="k">let</span> <span class="n">children</span> <span class="o">=</span> <span class="n">aux</span> <span class="o">(</span><span class="nn">R</span><span class="p">.</span><span class="n">child_nodes</span> <span class="n">item</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">            <span class="n">acc</span> <span class="o">|&gt;</span> <span class="nn">TreeNode</span><span class="p">.</span><span class="n">add</span> <span class="o">(</span><span class="nn">TreeNode</span><span class="p">.</span><span class="n">unique_of_node</span> <span class="o">~</span><span class="n">children</span> <span class="n">item</span><span class="o">)</span>
</span><span class="line">      <span class="o">)</span> <span class="n">items</span> <span class="nn">TreeNode</span><span class="p">.</span><span class="nn">Child_map</span><span class="p">.</span><span class="n">empty</span> <span class="k">in</span>
</span><span class="line">    <span class="n">aux</span> <span class="o">(</span><span class="nn">R</span><span class="p">.</span><span class="n">roots</span> <span class="n">r</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Actions and inactive projects don't appear (they return the accumulator unmodified),
while for areas and active projects we add a node, including a recursive call to get the children.</p>
<p>One problem I had with this scheme was the return types for modifications:</p>
<figure class="code"><figcaption><span>ck_disk_node.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">val</span> <span class="n">with_conflict</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="o">([&lt;</span> <span class="n">generic</span><span class="o">]</span> <span class="k">as</span> <span class="k">&#39;</span><span class="n">a</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="k">&#39;</span><span class="n">a</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>with_conflict msg node</code> returns a copy of <code>node</code> with its conflict messages field extended with the given message.
The type says that it works with any subset of the node types, and that the result will be the same subset.
For example, when adding a conflict about repeats to an action, the result will be an action.
When adding a message to something that could be an area, project or action, the result will be another area, project or action.</p>
<p>However, I couldn't work out how to implement this signature without using <code>Obj.magic</code> (unsafe cast).
I asked on StackOverflow (<a href="http://stackoverflow.com/questions/29589479/map-a-subset-of-a-polymorphic-variant">Map a subset of a polymorphic variant</a>) and it seems there's no easy answer.</p>
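<p>As a rough illustration of the difficulty, here is what happens when the function is written in the obvious way. The types are simplified stand-ins (every node is just a hypothetical <code>node</code> record, and only two kinds of node exist), so this is a sketch of the typing problem rather than the real code.</p>

```ocaml
(* Cut-down stand-ins: every node is just a record here. *)
type node = { name : string; conflicts : string list }
type action = [ `Action of node ]

let add msg n = { n with conflicts = msg :: n.conflicts }

(* Written directly, the inferred type is roughly
     string -> [< `Action of node | `Area of node ]
            -> [> `Action of node | `Area of node ]
   so the result is not tied to whichever subset was passed in. *)
let with_conflict msg = function
  | `Area n -> `Area (add msg n)
  | `Action n -> `Action (add msg n)

let () =
  let a : action = `Action { name = "Write post"; conflicts = [] } in
  (* The caller knows a is an action, but must still handle `Area. *)
  match with_conflict "repeat changed" a with
  | `Action n -> assert (n.conflicts = ["repeat changed"])
  | `Area _ -> assert false
```

<p>The input and output rows are inferred independently, so this version does not promise that an action in gives an action out, which is exactly what the signature above asks for.</p>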
<p>I also experimented with a couple of other approaches:</p>
<ul>
<li>Using GADTs didn't work because they don't support arbitrary subsets. A function either handles a specific node type, or all node types, but not e.g. just projects and actions.
</li>
<li>Using objects avoided the need for the unsafe cast, but required more code elsewhere. Objects work well when you have a fixed set of operations and you want to make it easy to add new kinds of thing, but in GTD the set of types is fixed, while the operations (report generation, rendering on different devices, merging) are more open-ended.
</li>
</ul>
<h2 id="backwards-compatibility">Backwards compatibility</h2>
<p>Although this is the 0.1-alpha release, I made various changes to the format during development and it's never too early to check that smooth upgrades are possible.
Besides, I've been recklessly using it as my action tracker during development and I don't like typing things in twice.</p>
<p>You'll notice some fields above have a <code>with default(...)</code> annotation.
This provides a default value when loading from earlier versions.
For more complex cases, it's possible to write custom code.
For example, I changed the date representation at one point from Unix timestamps to calendar dates (which I think gives more intuitive behaviour when moving between time zones).
There is code in <code>Ck_time</code> to handle this:</p>
<figure class="code"><figcaption><span>ck_time.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="n">user_date</span> <span class="o">=</span> <span class="o">(</span><span class="kt">int</span> <span class="o">*</span> <span class="kt">int</span> <span class="o">*</span> <span class="kt">int</span><span class="o">)</span> <span class="k">with</span> <span class="n">sexp_of</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">user_date_of_sexp</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="k">open</span> <span class="nn">Sexplib</span><span class="p">.</span><span class="nc">Type</span> <span class="k">in</span> <span class="k">function</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Atom</span> <span class="o">_</span> <span class="k">as</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="o">&lt;:</span><span class="n">of_sexp</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;&gt;</span> <span class="n">x</span> <span class="o">|&gt;</span> <span class="n">of_unix_time</span> <span class="c">(* Old format *)</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">List</span> <span class="o">_</span> <span class="k">as</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="o">&lt;:</span><span class="n">of_sexp</span><span class="o">&lt;</span><span class="kt">int</span> <span class="o">*</span> <span class="kt">int</span> <span class="o">*</span> <span class="kt">int</span><span class="o">&gt;&gt;</span> <span class="n">x</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>In the implementation (<code>ck_time.ml</code>) I use <code>with sexp_of</code> so that only the serialisation code is generated automatically, and my custom code is used for deserialising.
In the interface (<code>ck_time.mli</code>), I just declare it as <code>with sexp</code> and code outside doesn't see anything special.</p>
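<p>The simpler case from the start of this section, where a missing field falls back to a default, can be pictured with a hand-written loader. This is only a sketch: the <code>sexp</code> type is a minimal stand-in for Sexplib's, <code>pstate</code> is stored as a plain string, and the <code>field</code> helper is hypothetical.</p>

```ocaml
(* Minimal stand-in for Sexplib's S-expression type (an assumption, not the real library). *)
type sexp = Atom of string | List of sexp list

(* Simplified record: pstate is a plain string rather than a variant. *)
type project_details = { pstarred : bool; pstate : string }

(* Hypothetical helper: look up (key value) in a list of record fields. *)
let rec field key = function
  | [] -> None
  | List [Atom k; v] :: _ when k = key -> Some v
  | _ :: rest -> field key rest

let project_details_of_sexp = function
  | Atom _ -> failwith "expected a record"
  | List fields ->
    let pstarred =
      match field "pstarred" fields with
      | None -> false (* absent in data saved by older versions: use the default *)
      | Some (Atom b) -> bool_of_string b
      | Some (List _) -> failwith "pstarred: expected an atom" in
    let pstate =
      match field "pstate" fields with
      | Some (Atom s) -> s
      | _ -> failwith "pstate: missing or malformed" in
    { pstarred; pstate }

let () =
  (* Data written before the pstarred field existed still loads. *)
  let old_data = List [List [Atom "pstate"; Atom "Active"]] in
  assert (project_details_of_sexp old_data = { pstarred = false; pstate = "Active" })
```

<p>A field without a default, such as <code>pstate</code> here, still causes loading to fail if it is missing.</p>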
<h2 id="irmin">Irmin</h2>
<p>The next step was to write the data to the Irmin repository.
Irmin itself provides a fairly traditional key/value store API with some extra features for version control.
That might be useful for existing applications, but I wanted a more Git-like API.
For example, Irmin allows you to read files directly from the branch head, but in the browser another tab might update the branch between the two reads, leading to inconsistent results.
I wanted something that would force me to use atomic operations.
Also, the Irmin API is still being finalised, so I wanted to provide an example of my &quot;ideal&quot; API.</p>
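<p>The inconsistent-read problem can be shown with a small in-memory model. Nothing here uses Irmin: the <code>commit</code> type, the <code>head</code> reference and <code>other_tab_commits</code> are hypothetical stand-ins, and the other tab's update is simulated with a direct call where the real code would have a yield point.</p>

```ocaml
module Files = Map.Make (String)

(* A commit is an immutable snapshot; the branch head is a mutable pointer to one. *)
type commit = string Files.t

let v1 : commit = Files.of_seq (List.to_seq ["a", "1"; "b", "1"])
let v2 : commit = Files.of_seq (List.to_seq ["a", "2"; "b", "2"])
let head = ref v1

(* Stand-in for another tab committing a change to both files. *)
let other_tab_commits () = head := v2

(* Reading through the head each time: the two reads can see different commits. *)
let racy_read () =
  let a = Files.find "a" !head in
  other_tab_commits ();
  let b = Files.find "b" !head in
  (a, b)

(* Taking the snapshot once: both reads come from the same commit. *)
let snapshot_read () =
  let snapshot = !head in
  let a = Files.find "a" snapshot in
  other_tab_commits ();
  let b = Files.find "b" snapshot in
  (a, b)

let () =
  head := v1;
  assert (racy_read () = ("1", "2")); (* a mixed state that was never committed *)
  head := v1;
  assert (snapshot_read () = ("1", "1"))
```

<p>An API that only hands out immutable commits makes the second style the only one available.</p>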
<p>Here's the API wrapper I used (it doesn't provide access to all of Irmin's features, just the ones I needed):</p>
<p>A <code>Staging.t</code> corresponds to the Git staging area / working directory. It is mutable, and not shared with other tabs:</p>
<figure class="code"><figcaption><span>git_storage_s.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Staging</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="kt">list</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">path</span> <span class="o">-&gt;</span> <span class="n">path</span> <span class="kt">list</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">read_exn</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">path</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">update</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">path</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">remove</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">path</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">mem</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">path</span> <span class="o">-&gt;</span> <span class="kt">bool</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A <code>Commit.t</code> represents a single (immutable) Git commit.
You can check out a commit to get a staging area, modify that, and then commit it to create a new <code>Commit.t</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Commit</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">checkout</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Staging</span><span class="p">.</span><span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">commit</span> <span class="o">:</span> <span class="o">?</span><span class="n">parents</span><span class="o">:</span><span class="n">t</span> <span class="kt">list</span> <span class="o">-&gt;</span> <span class="nn">Staging</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">msg</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">equal</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">history</span> <span class="o">:</span> <span class="o">?</span><span class="n">depth</span><span class="o">:</span><span class="kt">int</span> <span class="o">-&gt;</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Log_entry</span><span class="p">.</span><span class="n">t</span> <span class="nn">Log_entry_map</span><span class="p">.</span><span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">export_tar</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A branch is a mutable pointer to a commit.
The head is represented as a reactive signal (more on React later), making it easy to follow updates.
The only thing you can do with a branch is fast-forward it to a new commit.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Branch</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">head</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="n">option</span> <span class="nn">React</span><span class="p">.</span><span class="nn">S</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">fast_forward_to</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Ok</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Not_fast_forward</span> <span class="o">]</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Finally, a <code>Repository.t</code> represents a repository as a whole.
You can look up a branch by name or a commit by hash:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Repository</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">branch</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">if_new</span><span class="o">:(</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span> <span class="nn">Lazy</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="n">branch_name</span> <span class="o">-&gt;</span> <span class="nn">Branch</span><span class="p">.</span><span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="c">(** Get the named branch.</span>
</span><span class="line"><span class="c">   * If the branch does not exist yet, [if_new] is called to get the initial commit. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">commit</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Irmin</span><span class="p">.</span><span class="nn">Hash</span><span class="p">.</span><span class="nn">SHA1</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="n">option</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="c">(** Look up a commit by its hash. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">empty</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Staging</span><span class="p">.</span><span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="c">(** Create an empty checkout with no parent. *)</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Thomas Gazagnaire provided many useful updates to Irmin to let me implement this API: atomic operations (needed for a reliable <code>fast_forward_to</code>), support for creating commits with arbitrary parents (needed for custom merging, described below), and performance improvements (very important for running in a browser!).</p>
<h2 id="indexeddb-backend">IndexedDB backend</h2>
<p>Irmin provides a Git backend that supports normal Git repositories, as well as a simpler filesystem backend, a remote HTTP backend, and an in-memory-only backend.
To run Irmin in the browser, I initially added a backend for HTML 5 storage.</p>
<p>However, HTML 5 storage is limited to 5 MB of data and, since my backend lacked compression, it eventually ran out of space, so I replaced it with a backend for <a href="http://www.w3.org/TR/IndexedDB/">IndexedDB</a>.</p>
<p><code>js_of_ocaml</code> supports most (standardised) HTML features, but IndexedDB had only just come out, so I had to write my own bindings (as for Pikaday, above).</p>
<p>IndexedDB is rather complicated compared to local storage, so I split it across several modules.
I first defined the JavaScript API (<a href="https://github.com/talex5/cuekeeper/blob/master/js/indexedDB.mli">indexedDB.mli</a>), then wrapped it in a nicer OCaml API, providing asynchronous operations with Lwt threads rather than callbacks (<a href="https://github.com/talex5/cuekeeper/blob/master/js/indexedDB_lwt.mli">indexedDB_lwt.mli</a>).
I then made an Irmin backend that uses it (<a href="https://github.com/talex5/cuekeeper/blob/master/js/irmin_IDB.ml">irmin_IDB.ml</a>).</p>
<p>The Irmin API for backends can be a little confusing at first.
An Irmin &quot;branch consistent&quot; (Git-like) repository internally consists of two simpler stores: an <a href="https://github.com/mirage/irmin/blob/master/lib/ir_ao.mli">append-only</a> store that stores immutable blobs (files, directories and commits), indexed by their SHA1 hash, and a <a href="https://github.com/mirage/irmin/blob/master/lib/ir_rw.mli">read-write</a> store that is used to record which commit each branch currently points to.
If you can provide implementations of these two APIs, Irmin can automatically provide the full <a href="https://github.com/mirage/irmin/blob/master/lib/ir_bc.mli">branch-consistent database</a> API itself.</p>
<p>One problem with moving to IndexedDB is that it doesn't support notifications.
To get around this, when CueKeeper updates the <code>master</code> branch to point at a new commit, it also writes the SHA1 hash to local storage.
Other open windows or tabs get notified of this and then read the new data from IndexedDB.</p>
<p>I also found a couple of browser bugs while testing this.
Firefox seems to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1147942">not clean up IndexedDB transactions</a>, though this doesn't cause any obvious problems in practice.</p>
<p>Safari, however, has a more serious problem: if two threads (tabs) try to read from the database at the same time, one of the transactions will fail!
I was able to reproduce the error with a few lines of JavaScript (see <a href="http://test.roscidus.com/static/idb_reads.html">idb_reads.html</a>).</p>
<p>The page will:</p>
<ol>
<li>Open a test database, creating a single key=value pair if it doesn't exist.
</li>
<li>Attempt to read the value of the key ten times, once per second.
</li>
</ol>
<p>If you open this in two windows in Safari at once, one of them will likely fail with <code>AbortError</code>.
I reported it to Apple, but their feedback form says they don't respond to feedback, and they were as good as their word.
In the end, I added some code to sleep for a random period and retry on aborted reads.</p>
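<p>The retry workaround can be sketched roughly as follows.
This is a synchronous, standard-library-only model rather than CueKeeper's actual code (which uses Lwt threads): the <code>Aborted</code> exception, the <code>with_retry</code> name, the attempt limit and the half-second bound are all assumptions made for illustration, and the <code>sleep</code> function is supplied by the caller.</p>
<pre><code class="ocaml">(* Hypothetical stand-in for the AbortError raised by a failed read. *)
exception Aborted

(* Run [read ()]. If it is aborted, sleep for a random period (up to 0.5s,
   an arbitrary choice) and try again, giving up after [attempts] tries. *)
let rec with_retry ?(attempts = 5) ~sleep read =
  try read ()
  with Aborted when attempts &gt; 1 -&gt;
    sleep (Random.float 0.5);
    with_retry ~attempts:(attempts - 1) ~sleep read

(* Example: a read that is aborted twice and then succeeds. *)
let () =
  let failures = ref 2 in
  let read () =
    if !failures &gt; 0 then (decr failures; raise Aborted) else &quot;value&quot; in
  (* [ignore] stands in for a real sleep so that the example runs instantly. *)
  print_endline (with_retry ~sleep:ignore read)
</code></pre>
<p>The random delay matters: if both tabs retried after the same fixed period, they would probably just collide again.</p>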
<h2 id="revisions-and-merging">Revisions and merging</h2>
<p>Each object (project, action, contact, etc) in CueKeeper is a file in Irmin and each change creates a new commit.
This model tends to avoid race conditions.
For example, when you edit the title of an action and press Return, CueKeeper will:</p>
<ol>
<li>Create a new commit with the updates, whose parent is the commit you started editing.
</li>
<li>Merge this commit with the <code>master</code> branch.
</li>
</ol>
<p>Usually nothing else has changed since you started editing and the merge is a trivial &quot;fast-forward&quot; merge.
However, if you had edited something else about that action at the same time then instead of overwriting the changes, CueKeeper will merge them.</p>
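<p>The fast-forward check in step 2 can be modelled with a toy commit graph.
This is only a sketch of the idea, using made-up <code>commit</code> and <code>branch</code> record types rather than the real Irmin-backed ones; only the name <code>fast_forward_to</code> and its two result tags come from the API shown earlier.</p>
<pre><code class="ocaml">(* A toy commit graph: each commit knows its parents. *)
type commit = { id : string; parents : commit list }

(* A branch is a mutable pointer to a commit. *)
type branch = { mutable head : commit }

(* [is_ancestor a c] is true if [a] is [c] or can be reached from [c]
   by following parent links. *)
let rec is_ancestor a c =
  a.id = c.id || List.exists (is_ancestor a) c.parents

(* Move the branch to [c] only if no history would be lost. *)
let fast_forward_to branch c =
  if is_ancestor branch.head c then (branch.head &lt;- c; `Ok)
  else `Not_fast_forward

let () =
  let root = { id = &quot;root&quot;; parents = [] } in
  let edit1 = { id = &quot;edit1&quot;; parents = [root] } in
  let edit2 = { id = &quot;edit2&quot;; parents = [root] } in
  let master = { head = root } in
  assert (fast_forward_to master edit1 = `Ok);
  (* [edit2] was based on [root], which is no longer the head,
     so a real merge is needed before the branch can move. *)
  assert (fast_forward_to master edit2 = `Not_fast_forward)
</code></pre>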
<p>If you change the same field in two tabs at once, CueKeeper will pick one value and add a merge conflict note telling you which change it discarded.
You can try it here (click the image for an interactive page running two copies of CueKeeper split-screen):</p>
<p><a href="/blog/cuekeeper/sync.html"><img src="/blog/images/cuekeeper/sync.png" class="center"/></a></p>
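<p>The per-field behaviour described above (keep both changes when they don't clash, otherwise pick one and leave a note) can be modelled as a three-way merge of a single string field.
This is a simplified sketch, not the code in CueKeeper: <code>merge_field</code> is a hypothetical name, and keeping <code>ours</code> on a conflict is an arbitrary choice made for the example.</p>
<pre><code class="ocaml">(* Three-way merge of one field. Returns the merged value and an optional
   conflict note. [base] is the value in the common ancestor, if known. *)
let merge_field ?base ~theirs ours =
  if ours = theirs then (ours, None)              (* both sides agree *)
  else if base = Some ours then (theirs, None)    (* only [theirs] changed *)
  else if base = Some theirs then (ours, None)    (* only [ours] changed *)
  else
    (* Both changed, differently: keep one and report the other. *)
    (ours, Some (Printf.sprintf &quot;Conflict: discarded %S&quot; theirs))

let () =
  (* Only one side changed the title, so that change wins silently. *)
  assert (merge_field ~base:&quot;Buy milk&quot; ~theirs:&quot;Buy oat milk&quot; &quot;Buy milk&quot;
          = (&quot;Buy oat milk&quot;, None));
  (* Both sides changed it, so one value is kept and a note is produced. *)
  let merged, note =
    merge_field ~base:&quot;Buy milk&quot; ~theirs:&quot;Buy soy milk&quot; &quot;Buy oat milk&quot; in
  assert (merged = &quot;Buy oat milk&quot;);
  assert (note = Some &quot;Conflict: discarded \&quot;Buy soy milk\&quot;&quot;)
</code></pre>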
<p>The merge code takes three commits (the tips of the two branches being merged and an optional-but-usually-present &quot;least common ancestor&quot;), and produces a resulting commit (which may include merge conflict notes for the user to check):</p>
<figure class="code"><figcaption><span>ck_merge.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Make</span> <span class="o">(</span><span class="nc">Git</span> <span class="o">:</span> <span class="nn">Git_storage_s</span><span class="p">.</span><span class="nc">S</span><span class="o">)</span> <span class="o">(</span><span class="nc">R</span> <span class="o">:</span> <span class="nn">Ck_rev</span><span class="p">.</span><span class="nc">S</span> <span class="k">with</span> <span class="k">type</span> <span class="n">commit</span> <span class="o">=</span> <span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">val</span> <span class="n">merge</span> <span class="o">:</span> <span class="o">?</span><span class="n">base</span><span class="o">:</span><span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">theirs</span><span class="o">:</span><span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="o">[`</span><span class="nc">Ok</span> <span class="k">of</span> <span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Nothing_to_do</span><span class="o">]</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="c">(* [merge ?base ~theirs ours] merges changes from [base] to [ours] into [theirs] and</span>
</span><span class="line"><span class="c">   * returns the resulting merge commit. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">revert</span> <span class="o">:</span> <span class="n">repo</span><span class="o">:</span><span class="nn">Git</span><span class="p">.</span><span class="nn">Repository</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">master</span><span class="o">:</span><span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Git_storage_s</span><span class="p">.</span><span class="nn">Log_entry</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="o">[`</span><span class="nc">Ok</span> <span class="k">of</span> <span class="nn">Git</span><span class="p">.</span><span class="nn">Commit</span><span class="p">.</span><span class="n">t</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Nothing_to_do</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Error</span> <span class="k">of</span> <span class="kt">string</span><span class="o">]</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="c">(** [revert ~repo ~master log_entry] returns a new commit on [master] which reverts the changes in [log_entry]. *)</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A <code>revert</code> operation is essentially the same as a merge, except that the base commit is the commit being undone, its (single) parent is one branch and the current state is the other.
It's a separate operation in the API because the commit it generates has a different format (it has only one parent and gives the commit being reverted in the log message).</p>
<p>I wanted to make sure that the merge code would always produce a valid result (e.g. the <code>parent</code> field of a node should point to a node that exists in the merged version).
I wrote a unit test that performs many merges at random and checks that the result loads without error.</p>
<p>My first thought was to perform edits at random to get a base commit, then make two branches from that and perform more random edits on each one.
After a while, I realised that you can edit any valid state into pretty much any other valid state, so a simpler approach is to generate three commits <a href="https://github.com/talex5/cuekeeper/blob/90a12e71834ae10416e0ec86ce15408ec25d33e6/test.ml#L177">at random</a>.</p>
<p>You have to be a little bit careful here, however.
If the three commits are completely random then they won't have any UUIDs in common and the merges will be trivial.
Therefore, the UUIDs (and all field values) are chosen from a small set of candidates to ensure they're often the same.</p>
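<p>As an illustration of the idea (not the generator in <code>test.ml</code>), here is a minimal sketch that draws every UUID and title from a small fixed pool.
The <code>node</code> record, the pools and the function names are all invented for the example, and it makes no attempt to produce a full, valid CueKeeper state.</p>
<pre><code class="ocaml">type node = { uuid : string; title : string }

(* Small candidate pools (made up for this sketch), so that independently
   generated states often share UUIDs and field values. *)
let uuids = [| &quot;n1&quot;; &quot;n2&quot;; &quot;n3&quot;; &quot;n4&quot; |]
let titles = [| &quot;Buy milk&quot;; &quot;Write report&quot;; &quot;Call Bob&quot; |]

let pick rng arr = arr.(Random.State.int rng (Array.length arr))

(* Include each candidate UUID with probability 1/2, so that UUIDs stay
   unique within a state, and give each node a title from the pool. *)
let random_state rng =
  Array.to_list uuids
  |&gt; List.filter (fun _ -&gt; Random.State.bool rng)
  |&gt; List.map (fun uuid -&gt; { uuid; title = pick rng titles })

let () =
  (* A fixed seed makes a failing run reproducible. *)
  let rng = Random.State.make [| 42 |] in
  let base = random_state rng in
  let ours = random_state rng in
  let theirs = random_state rng in
  List.iter
    (fun state -&gt; Printf.printf &quot;%d nodes\n&quot; (List.length state))
    [base; ours; theirs]
</code></pre>
<p>With only four possible UUIDs and three possible titles, the three states frequently contain the same node, sometimes with the same title and sometimes not, which is what makes the merges interesting.</p>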
<p>I wrote the tests before the merge code and, as I wrote the merge code, I deliberately left out each required feature at first, to check that the tests caught the corresponding failure.
The tests found these problems automatically:</p>
<ul>
<li>Commit refers to a contact or context that doesn't exist.
</li>
<li>Project has area as child.
</li>
<li>Repeating action marked as done.
</li>
<li>Action, project or area has a missing parent.
</li>
</ul>
<p>It was easy enough to check which cases I'd missed because each possible failure corresponds to a call to <code>bug</code> in <a href="https://github.com/talex5/cuekeeper/blob/master/lib/ck_rev.ml">ck_rev.ml</a>.
These problems weren't detected initially by the tests:</p>
<dl><dt>Cycles in the parent relation</dt>
<dd>
Initially, my random state generator created nodes with random IDs, but at the time there was a bug in Irmin (since fixed) where it didn't sort the entries before writing them, which confused the Git tools I was using to examine the test results.
To work around this, I changed the test code to create the nodes with monotonically increasing IDs.
However, it would only set the parent to a previously-created node, so e.g. node 1 could never have node 2 as its parent, which meant it could never generate two commits with the parent relation the other way around.
Easily fixed.
</dd>
<dt>A <code>Waiting_for_contact</code> action has no contact</dt>
<dd>
I was running 1000 iterations of the tests with a fixed seed while writing them.
This particular case only triggered after 1500 iterations, but it would have been found eventually when I removed the fixed seed.
To help things along, I added a <code>slow_test</code> make target that compiles the tests to native code and runs 10,000 iterations, and set this to run on the Travis build (it still only takes 14 seconds, but that's too long to do on every build, and long enough that the extra couple of seconds compiling to native code is worth it).
</dd>
<dt>An action is a parent of another node</dt>
<dd>
This one was a bit surprising.
To trigger it, you start with a base containing an action and a project.
On one branch, make the action a child of the project (only the action changes).
On the other, convert the project to an action (only the project changes).
If the bug is present, you end up with one action being a child of the other.
This wasn't picked up because it only happens if the project doesn't change at all in the first branch.
If it does change, the code for merging nodes gets called, and that copes with trying to merge a project with an action by converting the action to a project, which avoids the bug.
Because the nodes were being generated at random, the chance that every field in the base and the first branch would be identical was very low.
To fix it, I now generate a random number at the start of each test iteration and use it to bias the creation of the three states so that many of the fields will be shared.
This is enough to trigger detection of the bug.
</dd>
</dl>
<h2 id="react">React</h2>
<p>The OCaml <a href="http://erratique.ch/software/react">React</a> library provides support for <a href="http://en.wikipedia.org/wiki/Functional_reactive_programming">functional reactive programming</a> (FRP).
The idea here is to represent a (mutable) variable as a &quot;signal&quot;.
Instead of operating on the current value of the variable, you operate on the signal as a whole.</p>
<p>Say you want to show a live display of the number of actions.
A traditional approach might be:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">actions</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">update</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">show</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;There are %d actions&quot;</span> <span class="o">!</span><span class="n">actions</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line"><span class="n">update</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then you have to remember to call <code>update</code> whenever you change <code>actions</code>.
Instead, in FRP you work on the signal as a whole:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">actions</span><span class="o">,</span> <span class="n">set_actions</span> <span class="o">=</span> <span class="nn">S</span><span class="p">.</span><span class="n">create</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="n">actions</span> <span class="o">|&gt;</span> <span class="nn">S</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;There are %d actions&quot;</span><span class="o">)</span> <span class="o">|&gt;</span> <span class="n">show</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, <code>actions</code> is a signal, the <code>S.map</code> creates a new (string valued) signal from the old (int valued) one, and <code>show</code> ensures that the current value of the signal is displayed on the screen.</p>
<p>It's a bit like using a spreadsheet: you just enter the formulae, and the system ensures everything stays up-to-date.
CueKeeper uses signals all over the place: the Git commit at the tip of a branch, the description of an action, the currently selected tab, etc.</p>
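<p>To make the spreadsheet analogy concrete, here is a toy, push-based signal written with only the standard library.
It is emphatically not how React is implemented (React deals with update ordering, equality checks and garbage collection, none of which appear here), and the <code>create</code>, <code>map</code> and <code>value</code> functions below are local definitions that merely echo the names used above.</p>
<pre><code class="ocaml">(* A signal holds a current value and a list of callbacks to run on change. *)
type 'a signal = {
  mutable value : 'a;
  mutable observers : ('a -&gt; unit) list;
}

(* Returns the signal together with a function for setting its value. *)
let create v =
  let s = { value = v; observers = [] } in
  let set x =
    s.value &lt;- x;
    List.iter (fun notify -&gt; notify x) s.observers in
  (s, set)

(* The derived signal is recomputed whenever the source signal changes. *)
let map f s =
  let derived, set_derived = create (f s.value) in
  s.observers &lt;- (fun x -&gt; set_derived (f x)) :: s.observers;
  derived

let value s = s.value

let () =
  let actions, set_actions = create 0 in
  let label = map (Printf.sprintf &quot;There are %d actions&quot;) actions in
  print_endline (value label);
  set_actions 3;
  (* [label] was never touched directly, but it has been kept up to date. *)
  print_endline (value label)
</code></pre>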
<h2 id="tyxml">TyXML</h2>
<p>I initially tried to generate the HTML using <a href="https://github.com/mirage/ocaml-cow">Caml on the Web (COW)</a>.
This provides a syntax extension for embedding HTML in your code.
For example, I wrote some code to render a tree to HTML, something like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">render_list</span> <span class="n">items</span> <span class="o">=</span>
</span><span class="line">  <span class="o">&lt;:</span><span class="n">html</span><span class="o">&lt;</span>
</span><span class="line">    <span class="o">&lt;</span><span class="n">ul</span><span class="o">&gt;</span>
</span><span class="line">      <span class="o">$</span><span class="kt">list</span><span class="o">:</span><span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="n">render_item</span> <span class="n">items</span><span class="o">$</span>
</span><span class="line">    <span class="o">&lt;/</span><span class="n">ul</span><span class="o">&gt;</span>
</span><span class="line">  <span class="o">&gt;&gt;</span>
</span><span class="line"><span class="ow">and</span> <span class="n">render_item</span> <span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="nc">Node</span> <span class="n">children</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="o">&lt;:</span><span class="n">html</span><span class="o">&lt;</span>
</span><span class="line">    <span class="o">&lt;</span><span class="n">li</span><span class="o">&gt;$</span><span class="n">str</span><span class="o">:</span><span class="n">name</span><span class="o">$&lt;/</span><span class="n">li</span><span class="o">&gt;</span>
</span><span class="line">    <span class="o">$</span><span class="n">render_list</span> <span class="n">children</span><span class="o">$</span>
</span><span class="line">  <span class="o">&gt;&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It wasn't really suitable for what I wanted, though, because there was no obvious way to make it update (except by regenerating the whole thing and setting the <code>innerHTML</code> DOM attribute).
Also, while embedding another language with its own syntax is usually a nice feature, in the case of HTML I'm happy to make an exception.</p>
<p>I'd come across <a href="http://ocsigen.org/tyxml/">TyXML</a> before, but had given up after being baffled by the documentation.
However, spurred on by the promise of React integration, I started reading the source code and it turned out to be fairly simple.</p>
<p>For every HTML element, TyXML provides a function with the same name.
The function takes a list of child nodes as its argument and, optionally, a list of attributes.
Written this way, the above code looks something like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nn">Tyxml_js</span><span class="p">.</span><span class="nc">Html5</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">render_list</span> <span class="n">items</span> <span class="o">=</span>
</span><span class="line">  <span class="n">ul</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="n">render_item</span> <span class="n">items</span> <span class="o">|&gt;</span> <span class="nn">List</span><span class="p">.</span><span class="n">concat</span><span class="o">)</span>
</span><span class="line"><span class="ow">and</span> <span class="n">render_item</span> <span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="nc">Node</span> <span class="n">children</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[</span>
</span><span class="line">    <span class="n">li</span> <span class="o">[</span><span class="n">pcdata</span> <span class="n">name</span><span class="o">];</span>
</span><span class="line">    <span class="n">render_list</span> <span class="n">children</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It didn't compile, though, with a typically complicated error:</p>
<pre><code>Error: This expression has type
     ([&gt; Html5_types.ul ] as 'a) Tyxml_js.Html5.elt
   but an expression was expected of type
     Html5_types.li Tyxml_js.Html5.elt
   Type 'a = [&gt; `Ul ] is not compatible with type
     Html5_types.li = [ `Li of Html5_types.li_attrib ] 
   The second variant type does not allow tag(s) `Ul
</code></pre>
<p>Eventually, I realised what it was saying.
My COW code above was wrong: it output each item as <code>&lt;li&gt;name&lt;/li&gt;&lt;ul&gt;...&lt;/ul&gt;</code>.
The browser accepted this, but it's not valid HTML - the <code>&lt;ul&gt;</code> needs to go inside the <code>&lt;li&gt;</code>.
In fact, all we need is:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">render_list</span> <span class="n">items</span> <span class="o">=</span>
</span><span class="line">  <span class="n">ul</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="n">render_item</span> <span class="n">items</span><span class="o">)</span>
</span><span class="line"><span class="ow">and</span> <span class="n">render_item</span> <span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="nc">Node</span> <span class="n">children</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="n">li</span> <span class="o">[</span><span class="n">pcdata</span> <span class="n">name</span><span class="o">;</span> <span class="n">render_list</span> <span class="n">children</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Shorter and more correct - a win for TyXML!
It also type-checks attributes.
For example, if you provide an <code>onclick</code> attribute then you can't provide a handler function with the wrong type (or get the name of the attribute wrong, or use a non-standard attribute, at least without explicit use of &quot;unsafe&quot; features).</p>
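<p>To show how this kind of checking can work, here is a toy model of typed elements using a phantom type parameter and polymorphic variants.
It is a sketch of the general technique only: the definitions of <code>elt</code>, <code>tag</code>, <code>ul</code>, <code>li</code> and <code>pcdata</code> below are simplified stand-ins written for this example, not TyXML's real ones, and the <code>tree</code> type is my guess at a suitable input.</p>
<pre><code class="ocaml">(* The type parameter is a &quot;phantom&quot;: it records what kind of element this
   is, but the value itself is just the rendered HTML. *)
type 'kind elt = Elt of string

let to_string (Elt s) = s

let tag name children =
  let inner = String.concat &quot;&quot; (List.map to_string children) in
  Elt (Printf.sprintf &quot;&lt;%s&gt;%s&lt;/%s&gt;&quot; name inner name)

(* No escaping is done here; a real library would have to escape [s]. *)
let pcdata s : [&gt; `Pcdata ] elt = Elt s

(* A list may only contain list items... *)
let ul (children : [&lt; `Li ] elt list) : [&gt; `Ul ] elt = tag &quot;ul&quot; children

(* ...but a list item may contain text and nested lists. *)
let li (children : [&lt; `Pcdata | `Ul ] elt list) : [&gt; `Li ] elt =
  tag &quot;li&quot; children

type tree = Node of (string * tree) list

let rec render_list items =
  ul (List.map render_item items)
and render_item (name, Node children) =
  li [pcdata name; render_list children]

let () =
  let items = [(&quot;Work&quot;, Node [(&quot;Email&quot;, Node [])])] in
  print_endline (to_string (render_list items))
</code></pre>
<p>With these definitions, putting a <code>ul</code> directly inside another <code>ul</code> (as my COW version effectively did) is rejected by the compiler, because <code>ul</code> only accepts children tagged <code>`Li</code>.</p>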
<h2 id="reactive-tyxml">Reactive TyXML</h2>
<p>The <code>Tyxml_js.Html5</code> module provides static elements, while <code>Tyxml_js.R.Html5</code> provides reactive ones.
These take signals for attribute values and child lists and update the display automatically as the signal changes.
You can mix them freely (e.g. a static element with a reactive attribute).</p>
<p>For example, here's a (slightly simplified) version of the code that displays the tabs along the top:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nc">Tyxml_js</span>
</span><span class="line"><span class="k">open</span> <span class="nn">Tyxml_js</span><span class="p">.</span><span class="nc">Html5</span>
</span><span class="line"><span class="k">open</span> <span class="nc">React</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">current_mode</span><span class="o">,</span> <span class="n">set_mode</span> <span class="o">=</span> <span class="nn">S</span><span class="p">.</span><span class="n">create</span> <span class="o">`</span><span class="nc">Work</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="o">(&gt;|~=)</span> <span class="n">x</span> <span class="n">f</span> <span class="o">=</span> <span class="nn">S</span><span class="p">.</span><span class="n">map</span> <span class="n">f</span> <span class="n">x</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">tab</span> <span class="n">name</span> <span class="n">mode</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">clicked</span> <span class="o">_</span><span class="n">ev</span> <span class="o">=</span> <span class="n">set_mode</span> <span class="n">mode</span><span class="o">;</span> <span class="bp">false</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">classes</span> <span class="o">=</span> <span class="n">current_mode</span> <span class="o">&gt;|~=</span> <span class="k">fun</span> <span class="n">current</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">if</span> <span class="n">current</span> <span class="o">=</span> <span class="n">mode</span> <span class="k">then</span> <span class="o">[</span><span class="s2">&quot;active&quot;</span><span class="o">]</span> <span class="k">else</span> <span class="bp">[]</span> <span class="k">in</span>
</span><span class="line">  <span class="n">li</span> <span class="o">~</span><span class="n">a</span><span class="o">:[</span><span class="nn">R</span><span class="p">.</span><span class="nn">Html5</span><span class="p">.</span><span class="n">a_class</span> <span class="n">classes</span><span class="o">]</span> <span class="o">[</span>
</span><span class="line">    <span class="n">a</span> <span class="o">~</span><span class="n">a</span><span class="o">:[</span><span class="n">a_onclick</span> <span class="n">clicked</span><span class="o">]</span> <span class="o">[</span><span class="n">pcdata</span> <span class="n">name</span><span class="o">]</span>
</span><span class="line">  <span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">mode_switcher</span> <span class="o">=</span>
</span><span class="line">  <span class="n">ul</span> <span class="o">~</span><span class="n">a</span><span class="o">:[</span><span class="n">a_class</span> <span class="o">[</span><span class="s2">&quot;ck-mode-selector&quot;</span><span class="o">]]</span> <span class="o">[</span>
</span><span class="line">    <span class="n">tab</span> <span class="s2">&quot;Process&quot;</span> <span class="o">`</span><span class="nc">Process</span><span class="o">;</span>
</span><span class="line">    <span class="n">tab</span> <span class="s2">&quot;Work&quot;</span> <span class="o">`</span><span class="nc">Work</span><span class="o">;</span>
</span><span class="line">    <span class="n">tab</span> <span class="s2">&quot;Contact&quot;</span> <span class="o">`</span><span class="nc">Contact</span><span class="o">;</span>
</span><span class="line">    <span class="n">tab</span> <span class="s2">&quot;Schedule&quot;</span> <span class="o">`</span><span class="nc">Schedule</span><span class="o">;</span>
</span><span class="line">    <span class="n">tab</span> <span class="s2">&quot;Review&quot;</span> <span class="o">`</span><span class="nc">Review</span><span class="o">;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The tabs are an HTML <code>&lt;ul&gt;</code> element with one <code>&lt;li&gt;</code> for each tab.
<code>current_mode</code> is a reactive signal for the currently selected mode, which is initially <code>`Work</code>.
Each <code>&lt;li&gt;</code> has a reactive <code>class</code> attribute which is <code>&quot;active&quot;</code> when the tab's mode is equal to the current mode.
Clicking the tab sets the mode.</p>
<div id='ck-mode-div'></div>
<script src="/blog/javascripts/tabs-demo.js"></script>
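<p>The signal logic can be tried without a browser.
This is a minimal sketch that needs only the <code>react</code> library; the <code>mode</code> type annotation and the names <code>classes_for</code>, <code>work_classes</code> and <code>review_classes</code> are assumptions added for the example:</p>
<pre><code class="ocaml">(* Requires the react library; no browser or TyXML needed. *)
type mode = [ `Process | `Work | `Contact | `Schedule | `Review ]

let current_mode, set_mode = React.S.create (`Work : mode)

(* The class list for one tab, recomputed whenever current_mode changes. *)
let classes_for (mode : mode) =
  React.S.map
    (fun current -&gt; if current = mode then [&quot;active&quot;] else [])
    current_mode

let work_classes = classes_for `Work
let review_classes = classes_for `Review

let () =
  (* Simulate a click on the &quot;Review&quot; tab. *)
  set_mode `Review;
  assert (React.S.value work_classes = []);
  assert (React.S.value review_classes = [&quot;active&quot;])
</code></pre>
<p>In the real code, these class-list signals are what gets passed to <code>R.Html5.a_class</code>.</p>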
<h2 id="problems-with-react">Problems with React</h2>
<p>My experience with using React is that it's very easy to write code that is short, clear, and subtly wrong.
Consider this (slightly contrived) example, which shows up-to-date information about how many of our actions have been completed:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">total_actions</span><span class="o">,</span> <span class="n">set_total</span> <span class="o">=</span> <span class="nn">React</span><span class="p">.</span><span class="nn">S</span><span class="p">.</span><span class="n">create</span> <span class="mi">0</span>
</span><span class="line"><span class="k">let</span> <span class="n">complete_actions</span><span class="o">,</span> <span class="n">set_complete</span> <span class="o">=</span> <span class="nn">React</span><span class="p">.</span><span class="nn">S</span><span class="p">.</span><span class="n">create</span> <span class="mi">0</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="o">(&gt;&gt;~=)</span> <span class="o">=</span> <span class="nn">React</span><span class="p">.</span><span class="nn">S</span><span class="p">.</span><span class="n">bind</span>
</span><span class="line"><span class="k">let</span> <span class="o">(&gt;|~=)</span> <span class="n">x</span> <span class="n">f</span> <span class="o">=</span> <span class="nn">React</span><span class="p">.</span><span class="nn">S</span><span class="p">.</span><span class="n">map</span> <span class="n">f</span> <span class="n">x</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="o">_</span> <span class="o">=</span>
</span><span class="line">    <span class="n">complete_actions</span> <span class="o">&gt;&gt;~=</span> <span class="o">(</span><span class="k">function</span>
</span><span class="line">    <span class="o">|</span> <span class="mi">0</span> <span class="o">-&gt;</span> <span class="nn">React</span><span class="p">.</span><span class="nn">S</span><span class="p">.</span><span class="n">const</span> <span class="s2">&quot;(nothing complete)&quot;</span>
</span><span class="line">    <span class="o">|</span> <span class="n">complete</span> <span class="o">-&gt;</span>
</span><span class="line">        <span class="n">total_actions</span> <span class="o">&gt;|~=</span> <span class="k">fun</span> <span class="n">total</span> <span class="o">-&gt;</span>
</span><span class="line">          <span class="nn">Thread</span><span class="p">.</span><span class="n">delay</span> <span class="mi">0</span><span class="o">.</span><span class="mi">1</span><span class="o">;</span>
</span><span class="line">          <span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;%d/%d (%d%%)&quot;</span>
</span><span class="line">            <span class="n">complete</span> <span class="n">total</span>
</span><span class="line">            <span class="o">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">complete</span> <span class="o">/</span> <span class="n">total</span><span class="o">)</span>
</span><span class="line">    <span class="o">)</span> <span class="o">&gt;|~=</span> <span class="nn">Printf</span><span class="p">.</span><span class="n">printf</span> <span class="s2">&quot;Update: %s</span><span class="se">\n</span><span class="s2">%!&quot;</span> <span class="k">in</span>
</span><span class="line">  <span class="k">while</span> <span class="bp">true</span> <span class="k">do</span>
</span><span class="line">    <span class="k">let</span> <span class="n">t</span> <span class="o">=</span> <span class="nn">Random</span><span class="p">.</span><span class="n">int</span> <span class="mi">1000</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">step</span> <span class="o">=</span> <span class="nn">React</span><span class="p">.</span><span class="nn">Step</span><span class="p">.</span><span class="n">create</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="n">set_total</span> <span class="o">~</span><span class="n">step</span> <span class="n">t</span><span class="o">;</span>
</span><span class="line">    <span class="n">set_complete</span> <span class="o">~</span><span class="n">step</span> <span class="o">(</span><span class="nn">Random</span><span class="p">.</span><span class="n">int</span> <span class="o">(</span><span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="o">));</span>
</span><span class="line">    <span class="nn">React</span><span class="p">.</span><span class="nn">Step</span><span class="p">.</span><span class="n">execute</span> <span class="n">step</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We have two signals, representing the total number of actions and the number of them that are complete.
We connect the <code>complete_actions</code> signal to a function that outputs one of two signals: either a constant &quot;(nothing complete)&quot; signal if there are no complete actions, or a signal that shows the number and percentage complete.
This string signal is then connected up to an output function which, in this case, just prints it to the console with <code>Update: </code> prepended.</p>
<p>The loop at the end sets the total to a random number and the number complete to a random number less than or equal to that.
We use a &quot;step&quot; to ensure that the two signals are updated atomically.
Looks reasonable, right?
Running it, it works for a bit, but gets slower and slower as it runs, before eventually failing in one of two ways:</p>
<p>First, if the <code>total_actions</code> signal ever becomes zero again after being non-zero, it will crash with <code>Division_by_zero</code>:</p>
<pre><code>Update: (nothing complete)
Update: 25/44 (56%)
Update: 27/82 (32%)
Update: 20/39 (51%)
Update: (nothing complete)
Update: 8/21 (38%)
Update: 11/17 (64%)
Update: 28/49 (57%)
Update: 12/15 (80%)
Update: (nothing complete)
Update: 24/28 (85%)
Update: 43/89 (48%)
Update: 3/14 (21%)
Update: 8/23 (34%)
Update: 87/96 (90%)
Fatal error: exception Division_by_zero
</code></pre>
<p>The reason is that callbacks are not removed immediately.
When <code>complete_actions</code> is non-zero, we attach a callback to <code>total_actions</code> that captures <code>complete</code> and shows the percentage.
When <code>complete_actions</code> becomes zero again, this callback continues to run, even though its output is no longer used, so it divides by zero if <code>total_actions</code> is zero too.</p>
<p>If it doesn't crash, the garbage collector will eventually be run and the old callbacks will be removed.
Unfortunately, this will also garbage collect the callback that prints the status updates, and the program will simply stop producing any new output at this point.</p>
<p>At least, that's what happens with native code.
Javascript doesn't have weak references, so old callbacks are never removed there.</p>
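<p>For this particular example, one way to side-step both failures is to avoid creating signals dynamically and to keep a reference to the output signal.
The following is only a sketch under those assumptions, not a general fix: <code>status</code>, <code>printer</code> and <code>update</code> are names invented here, and I'd expect (but haven't verified in every build configuration) that a top-level binding keeps the printing signal reachable.</p>
<pre><code class="ocaml">(* Requires the react library. *)
let total_actions, set_total = React.S.create 0
let complete_actions, set_complete = React.S.create 0

(* One static signal that depends on both inputs.
   No signals are created during updates, so no stale callbacks remain. *)
let status =
  React.S.l2
    (fun complete total -&gt;
       if complete = 0 || total = 0 then &quot;(nothing complete)&quot;
       else
         Printf.sprintf &quot;%d/%d (%d%%)&quot;
           complete total (100 * complete / total))
    complete_actions total_actions

(* Bound to a name rather than discarded with let _ =, so that a strong
   reference to the printing signal remains. *)
let printer = React.S.map (Printf.printf &quot;Update: %s\n%!&quot;) status

(* Update both inputs atomically, as in the loop above. *)
let update ~total ~complete =
  let step = React.Step.create () in
  set_total ~step total;
  set_complete ~step complete;
  React.Step.execute step
</code></pre>
<p>This works here because the choice between the two messages can be made inside a single function; it doesn't help when you really do need <code>bind</code> to switch between different signals.</p>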
<p>Early versions of CueKeeper used React extensively, but I had to scale back my use of it due to these kinds of problems with callback lifetimes.
My general work-around is to break the reactive signal chains into disconnected sub-graphs, which can be garbage-collected individually.
For example, each panel in the display (e.g. showing the details of an action) contains a number of signals (name, parent, children, etc) which are used to keep the display up-to-date, but these signals are updated using imperative code, not by connecting them to the signal of the Irmin branch head.
When you close the panel, the functions for updating these signals become unreachable, allowing them to be GC'd, and they immediately stop being called.
Thus, we may leak a few callbacks while the panel is open, but closing it returns us to a clean state.</p>
<p>In a similar way, the tree view in the left column is a collection of signals that are updated manually.
Switching to a different tab will allow them to be freed.
It's not ideal, but it works.</p>
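<p>A minimal sketch of this pattern, using hypothetical names rather than CueKeeper's actual code: each open panel owns its own signal, and ordinary imperative code pushes new values into it.
The <code>panel</code> record, the <code>open_panels</code> table and <code>on_rename</code> are all assumptions made for the example.</p>
<pre><code class="ocaml">(* Requires the react library. Hypothetical names throughout. *)
type panel = {
  name : string React.signal;   (* drives the panel's display *)
  set_name : string -&gt; unit;    (* called imperatively, not via a signal chain *)
}

(* The only place that holds on to the setters. *)
let open_panels : (string, panel) Hashtbl.t = Hashtbl.create 8

let open_panel id initial =
  let name, set = React.S.create initial in
  let panel = { name; set_name = (fun s -&gt; set s) } in
  Hashtbl.replace open_panels id panel;
  panel

(* After this, nothing updates the panel's signals, so they can be collected. *)
let close_panel id = Hashtbl.remove open_panels id

(* Called by imperative code whenever the underlying data changes. *)
let on_rename id new_name =
  match Hashtbl.find_opt open_panels id with
  | Some panel -&gt; panel.set_name new_name
  | None -&gt; ()
</code></pre>
<p>Because the panel's signals are not derived from a long-lived signal, removing the panel from the table is enough to stop its updates immediately, without waiting for the garbage collector.</p>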
<p>I think that to complete its goal of having well-defined semantics, React needs to stop relying on weak references.
I imagine it would be possible to define a &quot;global sink&quot; object of some sort, such that a signal is live if and only if it is connected to that sink, or is a dependency of something else that is.
Then the <code>let _ =</code> above could be replaced with a connection to the global sink and the rest of the program would behave as expected.
I haven't thought too much about exactly how this would work, though.</p>
<h2 id="debugging">Debugging</h2>
<p>There was another problem, which I hit twice.
OCaml always optimises tail calls, but Javascript doesn't (I'm not sure about ES6).
In most cases where it matters, <code>js_of_ocaml</code> turns the code into a loop, but it doesn't handle continuation-passing style.
Both Sexplib and Omd failed to parse larger documents, and did so unpredictably.
I suspect that Firefox's JIT may be affecting things, because my <a href="https://github.com/janestreet/sexplib/pull/14">test case</a> didn't always trigger at the same point.
In both cases, I was able to modify the code to avoid the problem.</p>
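<p>To show the difference between the two styles, here is a small sketch (not the actual Sexplib or Omd code).
Both functions only make tail calls, so both run in constant stack space in native OCaml.
The first is written in continuation-passing style, which is the pattern that caused trouble in the browser; the second uses a direct self-recursive tail call, which is the kind of code that can be turned into a loop.</p>
<pre><code class="ocaml">(* Continuation-passing style: every step ends in a tail call, but the
   calls go through closures rather than back to the same function. *)
let rec length_cps xs k =
  match xs with
  | [] -&gt; k 0
  | _ :: rest -&gt; length_cps rest (fun n -&gt; k (n + 1))

(* Accumulator style: a direct self-recursive tail call. *)
let length_acc xs =
  let rec go acc = function
    | [] -&gt; acc
    | _ :: rest -&gt; go (acc + 1) rest
  in
  go 0 xs

let () =
  let xs = List.init 100_000 (fun i -&gt; i) in
  assert (length_cps xs (fun n -&gt; n) = length_acc xs)
</code></pre>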
<h2 id="conclusions">Conclusions</h2>
<p>For CueKeeper, I used <code>js_of_ocaml</code> to let me write reliable type-checked OCaml code and compile it to Javascript.
<code>js_of_ocaml</code> is surprisingly easy to use, provides most of the standard DOM APIs and is easy to extend to other APIs such as Pikaday or IndexedDB.
TyXML provides a pleasant way to generate HTML output, checking at compile time that it will produce valid (not just well-formed) HTML.</p>
<p>Functional reactive programming makes it easy to define user interfaces that always show up-to-date information.
However, I had problems with signals leaking or running after they were no longer needed due to the React library's reliance on weak references and the garbage collector to clean up old signals.
If this problem could be fixed, this would be an ideal way to write interactive applications.
As it is, it is still useful but must be used with care.</p>
<p>Data structures are defined in OCaml and (de)serialised automatically using the Sexplib library.
Sexplib is easy to extend with custom behaviour, for example to support changes in the format, but required a minor patch to work reliably in the browser.</p>
<p>Most applications store data using filesystems or relational databases.
CueKeeper uses Irmin to store data in a Git-like repository.
Writing the merge code can be somewhat tricky, but you have to do this anyway if you want your application to support off-line use or multiple users, and once done you get race-free operation, multi-tab support, history and revert for free.</p>
<p>Irmin can be extended with new backends and I created one that uses IndexedDB to store the data client-side in the browser.
The standard is rather new and there are still browser bugs to watch out for, but it seems to be working reliably now.</p>
<p>The full code is available at <a href="https://github.com/talex5/cuekeeper">https://github.com/talex5/cuekeeper</a>.</p>
<h2 id="next-steps">Next steps</h2>
<p>I hope to get back to working on sync between devices.
I made a start on the <code>server</code> branch, which runs a sync service as a Mirage unikernel, but there's no access control yet, so don't use it unless you want to share your TODO list with the whole world!</p>
<p>However, I got distracted by an interesting TCP bug, where a connection would sometimes hang, and wondering what caused that made me think there should be a way to ask the system why a thread didn't resolve, which resulted in some <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2015-06/msg00079.html">interesting improvements to the tracing and visualisation system</a>...</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>Some of the research leading to these results has received funding from the European Union's Seventh Framework Programme FP7/2007-2013 under the UCN project, grant agreement no 611001.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">CueKeeper: Gitting Things Done in the browser</title>
    <link href="https://roscidus.com/blog/blog/2015/04/28/cuekeeper-gitting-things-done-in-the-browser/"></link>
    <updated>2015-04-28T10:18:23+00:00</updated>
    <id>https://roscidus.com/blog/blog/2015/04/28/cuekeeper-gitting-things-done-in-the-browser</id>
    <content type="html"><![CDATA[<p>Git repositories store data with history, supporting replication, merging and revocation.
The Irmin library lets applications use Git-style storage for their data.
To try it out, I've written a GTD-based action tracker that runs entirely client-side in the browser.</p>
<p>CueKeeper uses Irmin to handle history and merges, with state saved in the browser using the new IndexedDB standard (requires a recent browser; Firefox 37, Chromium 41 and IE 11.0.9600 all work, but Safari apparently has problems if you open the page in multiple tabs).</p>
<div style='text-align: center; margin-bottom: 1em;'><strong><a href='/blog/cuekeeper/'>Open interactive version full screen</a></strong></div>
<!-- more -->
<p><a href="/blog/cuekeeper/"><img src="/blog/images/cuekeeper/cuekeeper-0.1.png" class="center"/></a></p>
<p>In the future, I plan to have the browser sync to a master Git repository and use the browser storage only for off-line use, but for now note that:</p>
<ul>
<li>All data is stored only in your browser.
</li>
<li>There is no server communication.
</li>
<li>Any changes you make will persist for you, but will not affect other users.
</li>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Basic_Concepts_Behind_IndexedDB">Mozilla's IndexedDB docs</a> say that &quot;the general philosophy of the browser vendors is to make the best effort to keep the data when possible&quot;, but vaguely note that your data may be deleted if you run out of space! If someone can clarify things, that would be great. I've been using it for 5 weeks on Firefox, and haven't lost anything, but it would be nice to know the exact conditions for safety.
</li>
<li>Take backups! On my Linux/Firefox system, the data is stored here: <code>$HOME/.mozilla/firefox/SALT.default/storage/default/http+++roscidus.com/idb</code>
</li>
<li>This is version 0.1 alpha ;-)
</li>
</ul>
<p>This post contains a brief introduction to using GTD and CueKeeper, followed by a look at some nice features that result from using Irmin.
The code is available at <a href="https://github.com/talex5/cuekeeper">https://github.com/talex5/cuekeeper</a>.
Alpha testers welcome!</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#background">Background</a>
<ul>
<li><a href="#getting-things-done-gtd">Getting Things Done (GTD)</a>
</li>
<li><a href="#irmin">Irmin</a>
</li>
<li><a href="#mgsd">mGSD</a>
</li>
<li><a href="#nymote-mirageos-and-ucn">Nymote, MirageOS and UCN</a>
</li>
</ul>
</li>
<li><a href="#using-cuekeeper">Using CueKeeper</a>
<ul>
<li><a href="#core-concepts">Core concepts</a>
</li>
<li><a href="#editing-items">Editing items</a>
</li>
<li><a href="#processing">Processing</a>
</li>
<li><a href="#work">Work</a>
</li>
<li><a href="#contact">Contact</a>
</li>
<li><a href="#schedule">Schedule</a>
</li>
<li><a href="#the-weekly-review">The weekly review</a>
</li>
<li><a href="#the-top-right-controls">The top-right controls</a>
</li>
</ul>
</li>
<li><a href="#interesting-irmin-features">Interesting Irmin features</a>
<ul>
<li><a href="#sync">Sync</a>
</li>
<li><a href="#history">History</a>
</li>
<li><a href="#revert">Revert</a>
</li>
<li><a href="#check-before-merge">Check-before-merge</a>
</li>
<li><a href="#out-of-date-ui-actions">Out-of-date UI actions</a>
</li>
</ul>
</li>
<li><a href="#next-steps">Next steps</a>
</li>
<li><a href="#acknowledgements">Acknowledgements</a>
</li>
</ul>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=9451595">Hacker News</a> and
<a href="http://www.reddit.com/r/programming/comments/3457y4/cuekeeper_gitting_things_done_in_the_browser/">Reddit</a>.)</p>
<h2 id="background">Background</h2>
<h3 id="getting-things-done-gtd">Getting Things Done (GTD)</h3>
<p>The core idea behind David Allen's <a href="http://en.wikipedia.org/wiki/Getting_Things_Done">GTD</a> is that the human brain is terrible at remembering things at the right time:</p>
<ol>
<li>You go into work in the morning thinking about a phone call you need to make this evening.
</li>
<li>As you read through your emails, you keep reminding yourself to remember the call.
</li>
<li>You're in a meeting and someone is speaking. You're thinking you shouldn't forget the call.
</li>
<li>etc
</li>
</ol>
<p>Maybe you end up remembering and maybe you don't, but either way you've distracted yourself all day from the other things you wanted to work on.</p>
<p>The goal of using GTD is to have a system where:</p>
<ol>
<li>the system reminds you of things <em>when you need to be reminded about them</em>, and
</li>
<li>you trust it enough that your brain can <em>stop thinking about them</em> until then.
</li>
</ol>
<blockquote>
<p>There is no reason ever to have the same thought twice, unless you like having that thought.</p>
<footer><strong>David Allen</strong> <cite>Getting Things Done</cite></footer>
</blockquote>
<h3 id="irmin">Irmin</h3>
<p><a href="https://github.com/mirage/irmin/">Irmin</a> is &quot;a library for persistent stores with built-in snapshot, branching and reverting mechanisms&quot;. It has multiple backends (including one that uses a regular Git repository, allowing you to view and modify your application's data using the real <code>git</code> commands).</p>
<p>Git's storage model is useful for many applications because it gives you race-free updates (each worker writes to its own branch and then merges), disconnected operation, history, remote sync and incremental backups.</p>
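<p>The merge step is what makes the &quot;own branch, then merge&quot; approach work.
As a rough illustration, here is a toy three-way merge for a single value.
It is a sketch only: <code>merge3</code> is a hypothetical helper, not part of Irmin's API, and real merges work over whole trees of values rather than one field.</p>
<pre><code class="ocaml">(* Toy three-way merge for one value. Hypothetical: not Irmin's API.
   base is the common ancestor; ours and theirs are the two branch heads. *)
let merge3 ~base ~ours ~theirs =
  if ours = theirs then Ok ours          (* both sides agree *)
  else if ours = base then Ok theirs     (* only they changed it *)
  else if theirs = base then Ok ours     (* only we changed it *)
  else Error `Conflict                   (* both changed it, differently *)

let () =
  assert (merge3 ~base:&quot;Buy milk&quot; ~ours:&quot;Buy milk&quot; ~theirs:&quot;Buy oat milk&quot;
          = Ok &quot;Buy oat milk&quot;);
  assert (merge3 ~base:&quot;Buy milk&quot; ~ours:&quot;Buy tea&quot; ~theirs:&quot;Buy oat milk&quot;
          = Error `Conflict)
</code></pre>
<p>Since each writer only ever appends to its own branch, two writers never overwrite each other; any disagreement shows up at merge time, where the application can decide what to do.</p>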
<p>Using <a href="http://ocsigen.org/js_of_ocaml/">js_of_ocaml</a> I was able to compile Irmin to JavaScript and run it in the browser, adding a <a href="https://github.com/talex5/cuekeeper/blob/6e3c2bf3f5e9a117a997fa452cc22f8a4c10fe1d/js/irmin_IDB.ml">new IndexedDB backend</a>.</p>
<h3 id="mgsd">mGSD</h3>
<p>Simon Baird's <a href="http://mgsd.tiddlyspot.com/">mGSD</a> is an excellent GTD system, which I've been using for the last few years.
It's a set of extensions built on the <a href="http://tiddlywiki.com/">TiddlyWiki</a> &quot;personal wiki&quot; system.
Like CueKeeper, mGSD runs entirely in your browser and doesn't require a server.
It's implemented as a piece of self-modifying HTML that writes itself back to your local disk when you save.
That's pretty scary, but I've found it surprisingly robust.</p>
<p><a href="http://mgsd.tiddlyspot.com/demo3.html"><img src="/blog/images/cuekeeper/mGSD.png" class="center"/></a></p>
<p>However, it's largely unmaintained and there were various areas I wanted to improve:</p>
<dl><dt>Browser security</dt>
<dd>
Over the years, browsers have become more locked down, and no longer allow web-pages to write to the disk,
requiring a browser plugin to override the check.
CueKeeper uses the IndexedDB support that modern browsers provide to store data (mGSD pre-dates IndexedDB).
</dd>
<dt>History</dt>
<dd>
Sometimes you click on a button by mistake and have no idea what changed.
Thanks to Irmin, CueKeeper logs all changes and provides the ability to view earlier states.
Also, CueKeeper uses brief animations to make it easier to see what changed.
</dd>
<dt>Navigation</dt>
<dd>
Navigation in mGSD can be awkward because overview panels and details are all mixed in together as wiki pages.
With CueKeeper, I'm experimenting with a two-column layout to separate overview pages from the details.
</dd>
<dt>Safe multi-tab use</dt>
<dd>
If you accidentally open mGSD in two tabs, changes in one tab will overwrite changes made in the other. CueKeeper uses Irmin to keep multiple tabs in sync, merging changes between them automatically.
</dd>
<dt>Sync between devices</dt>
<dd>
There's no easy way to <a href="http://stackoverflow.com/questions/85994/how-do-you-keep-a-personal-wiki-tiddlywiki-current-and-in-sync-in-multiple-loc">sync multiple mGSD instances</a>. CueKeeper doesn't implement sync yet either, but it should be easy to add (it can sync between tabs already, so the core logic is there).
</dd>
<dt>Escaping bugs</dt>
<dd>
mGSD has various bugs related to escaping (e.g. things will go wrong if you use square brackets in a title). CueKeeper uses type-safe <a href="http://ocsigen.org/tyxml/">TyXML</a> to avoid such problems.
</dd>
<dt>Stale-display bugs</dt>
<dd>
mGSD mostly does a good job of keeping all elements of the display up-to-date, but there are some flaws.
For example, if you add a new contact in one panel, then open the contacts menu from another, the new contact doesn't show up.
CueKeeper uses <a href="http://en.wikipedia.org/wiki/Functional_reactive_programming">Functional reactive programming</a> with the <a href="http://erratique.ch/software/react">React</a> library to make sure everything is current.
</dd>
<dt>Clean separation of code and data</dt>
<dd>
As a self-modifying <code>.html</code> file, updating mGSD is terrifying! CueKeeper can be recompiled and reloaded like any other program.
</dd>
</dl>
<h3 id="nymote-mirageos-and-ucn">Nymote, MirageOS and UCN</h3>
<p><a href="http://nymote.org/">The Nymote project</a> describes itself as &quot;Lifelong control of your networked personal data&quot;:</p>
<blockquote>
<p>By adopting large centralised services we've answered the call of the siren servers and made an implicit trade. That we will share our habits and data with them in exchange for something useful. In doing so we've empowered internet behemoths while simultaneously reducing our ability to influence them. We risk becoming slaves to the current system unless we can create alternatives that compete. It's time to work on those alternatives.</p>
</blockquote>
<p>The idea here is to provide services that people can run in their own homes (e.g. on a PC, a low-powered ARM board, or the house router).
The three key pieces of infrastructure it needs are Mirage, Irmin and Signpost.</p>
<p>I've talked about <a href="http://openmirage.org/">MirageOS</a> before (see <a href="/blog/blog/2014/07/28/my-first-unikernel/">My first unikernel</a>): it allows you to run extremely small, highly secure services as Xen guests (a few MB in size, written in type-safe OCaml, rather than the 100s of MB you would have with a Linux guest).
I haven't looked at Signpost yet.
Irmin is the subject of this blog post.</p>
<p><a href="http://usercentricnetworking.eu/">UCN</a> (User Centric Networking) is an EC-funded project that is building a &quot;Personal Information Hub&quot; (PIH), responsible for storing users' personal data in their home, and then using that data for content recommendation.
If you use Google to manage your ToDo-list then when you add &quot;Book holiday&quot; to it, Google can show you relevant ads.
But what if you want good recommendations without sharing personal data with third parties?
Tools such as CueKeeper could be configured to sync with a local PIH to provide input for its recommendations without the data leaving your home.</p>
<h2 id="using-cuekeeper">Using CueKeeper</h2>
<p>You can either use the <a href="/blog/cuekeeper/">example on roscidus.com</a>, or download the standalone release <a href="https://github.com/talex5/cuekeeper/releases/download/v0.1/cuekeeper-bin-0.1.zip">cuekeeper-bin-0.1.zip</a>.
To use the release, unzip the directory and open <code>index.html</code> in a browser (no need for a web-server).
If you do this, note that the database is tied to the path of the file, so if you move or rename the directory, it will show a different database (which might make it look like your items have disappeared).</p>
<h3 id="core-concepts">Core concepts</h3>
<p>There are five kinds of &quot;thing&quot; in CueKeeper:</p>
<dl><dt>Action</dt>
<dd>
<p>Something you will do (e.g. &quot;Follow Mirage tutorial&quot;).</p>
<p><img src="/blog/images/cuekeeper/action.png" class="center small"/>
Beside each action you will see some toggles showing its state:
The tick means done,
&quot;n&quot; is a next action (something you could start now),
&quot;w&quot; means waiting-for (something you can't start now),
&quot;f&quot; means future (something you don't want to think about yet).
The star is for whatever you want.
Repeating actions can't be completed, so for those the tick box will be blank.</p>
</dd>
<dt>Project</dt>
<dd>
<p>Something you want to achieve (e.g. &quot;Make a Mirage unikernel&quot;).</p>
<p><img src="/blog/images/cuekeeper/project.png" class="center small"/>
A project may require several actions to be taken.
The possible states are done (the tick),
&quot;a&quot; for active projects,
and &quot;sm&quot; for &quot;Someday/Maybe&quot; (a project you don't plan to work on yet).</p>
</dd>
<dt>Area</dt>
<dd>
<p>An &quot;Area of responsibility&quot; is a way of grouping things (e.g. &quot;Personal/Hobbies&quot; or &quot;Job/Accounts&quot;).</p>
<p><img src="/blog/images/cuekeeper/area.png" class="center small"/>
Unlike projects, areas generally cannot be completed.
One thing that confused me when I started with GTD was that what my organisation called &quot;projects&quot; were actually areas.
If your boss says &quot;You're working on project X until further notice&quot; then &quot;X&quot; is probably an &quot;area&quot; in GTD terms.</p>
</dd>
<dt>Contact</dt>
<dd>
<p>Someone you work with.</p>
<p><img src="/blog/images/cuekeeper/contact.png" class="center small"/>
You can associate any area, project or action with a contact, which provides a quick way to find all the things you need to discuss with someone when you meet them.
If an action is being performed by someone else, you can also mark it as waiting for them.
It will then appear on the <code>Review/Waiting</code> list.</p>
</dd>
<dt>Context</dt>
<dd>
<p>Another way of grouping actions: by the kind of activity involved, or by where it will take place.</p>
<p><img src="/blog/images/cuekeeper/context.png" class="center small"/>
Assigning a context to an action is an important check that the action isn't too vague.
Your eye will tend to glide over vague actions like &quot;Sort out car&quot;; choosing a context &quot;Phone&quot; (garage) or &quot;Shopping&quot; (buy tools) forces you to clarify things.</p>
</dd>
</dl>
<p>Notes:</p>
<ul>
<li>GTD also has the concept of a &quot;tickler&quot;. In CueKeeper this is just an action waiting until some time.
</li>
<li>GTD also has &quot;reference material&quot;, but I never used this in mGSD, so I didn't implement it.
Regular files on your computer seem to work fine for this.
</li>
<li>mGSD has the concept of &quot;realms&quot; to group areas. CueKeeper uses sub-areas for this instead (e.g. CueKeeper's &quot;Personal/Health&quot; sub-area corresponds to an mGSD &quot;Health&quot; area within a &quot;Personal&quot; realm).
</li>
</ul>
<h3 id="editing-items">Editing items</h3>
<p>Clicking on an item or creating a new one opens a panel showing its details in the right column.
There are various things you can edit here:</p>
<ul>
<li>The toggles here work just as elsewhere (see above).
</li>
<li>Click the title to rename.
</li>
<li>Click <code>(edit)</code> to edit the notes.
These can be whatever you like.
They're in <a href="http://daringfireball.net/projects/markdown/syntax">Markdown</a> format, so you can add structure, links, etc.
</li>
<li>Click <code>(add log entry)</code> to start editing with today's date added at the end.
This is convenient to add date-stamped notes quickly.
</li>
<li>The <code>(delete)</code> button at the bottom will remove it (without confirmation; use <code>Show history</code> to revert accidental deletions, as explained later).
</li>
</ul>
<p>For areas, projects and actions:</p>
<ul>
<li>You can convert between these types by clicking on the type (e.g. &quot;An <strong>action</strong> in ...&quot;).
This is useful if e.g. you realise that an action is really a project with multiple steps.
</li>
<li>Click on the parent to move it to a different parent.
</li>
<li>You can set the contact field for any of these types too.
</li>
</ul>
<p>For actions, you can also set the context, which is useful for grouping actions on the Work page, and helps to make sure the action is well-defined.</p>
<p>You can also make an action repeat.
Setting the repeat for an action will move it to the waiting state until the given date.
There are only two differences between repeating actions and regular (one-shot) scheduled actions:</p>
<ul>
<li>You can't mark a repeating action as done (clear the repeat first if you want to).
</li>
<li>When you click on the &quot;w&quot; on a repeating action, the next repeat date after it was last scheduled is highlighted by default. If that date has already arrived, CueKeeper keeps moving it forward by the specified interval until it's in the future.
</li>
</ul>
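<p>The date calculation for repeating actions can be sketched roughly as follows. This is only an illustration: representing dates as day numbers, and the names <code>next_repeat</code>, <code>today</code> and <code>interval_days</code>, are assumptions for the example rather than CueKeeper's real code.</p>

```ocaml
(* Hypothetical sketch: dates are plain day numbers and the repeat interval
   is a whole number of days. *)
let next_repeat ~today ~interval_days last_scheduled =
  if interval_days <= 0 then invalid_arg "interval_days must be positive";
  (* Keep stepping forward until the date is strictly in the future. *)
  let rec advance date =
    if date > today then date else advance (date + interval_days)
  in
  advance (last_scheduled + interval_days)

let () =
  (* Last scheduled for day 10, repeating weekly, and today is day 30:
     days 17 and 24 have already arrived, so day 31 is chosen. *)
  assert (next_repeat ~today:30 ~interval_days:7 10 = 31)
```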
<h3 id="processing">Processing</h3>
<p>There are several stages to applying GTD, corresponding to the tabs along the top.
The first is processing, which is about going through your various inboxes (email, paper, voicemail, etc) and determining what actions each item requires.
After processing, your inbox should be empty and everything you need to do either done (for quick items) or recorded in CueKeeper.
Also, see if you can think of any projects or actions that are only in your head and add those too.</p>
<ol>
<li>Click the <code>+</code> next to an area and enter a name for the new project.<br />
(the Work tab will go red at this point, indicating an alert: &quot;Active project with no next action&quot;)
</li>
<li>Click <code>(edit)</code> in the new project panel to add some details, if desired.
</li>
<li>Click <code>+action</code> to add the next action to perform towards this project.
</li>
</ol>
<p>Note that it is not necessary to add all the actions needed to complete the project.
Just add the next thing that you can do now.
When you later mark the action as done, CueKeeper will then prompt you to think about a new next action.</p>
<p>If a project will only require a single action (e.g. &quot;Buy milk&quot;), then instead of adding a project and an action, you can just convert the new project to an action and not bother about having a project at all.</p>
<p>If you don't plan to work on the project soon, click &quot;sm&quot; to convert it to a &quot;Someday/Maybe&quot; project.</p>
<h3 id="work">Work</h3>
<p>This is the default view, showing all the things you could be working on now.</p>
<p>The filters just below the tab allow you to hide top-level areas (e.g. if you don't want to see any personal actions while you're at work).</p>
<p>When an item is done, click on the tick mark.</p>
<p>If it's not possible to start it now, click on the &quot;w&quot; to mark it as waiting:</p>
<ul>
<li>If an action is waiting for someone else, first add them as the contact, then click the <code>w</code> and select &quot;Waiting for <em>name</em>&quot; from the menu.
</li>
<li>If an action can't be started until some date, click the &quot;w&quot; and choose the date from the popup calendar.
</li>
<li>Otherwise, you can mark it as &quot;Waiting (reason unspecified)&quot;.
</li>
</ul>
<p>If you're not going to do it this week, click on the &quot;f&quot; (future) to defer it until the next review.</p>
<h3 id="contact">Contact</h3>
<p>This view lists your contacts and any actions you're waiting for them to do.
It's useful if someone phones and you want to see everything you need to discuss with them, for example.
The list only shows actions you're actually waiting for, but if you open up a particular contact then you'll also see things they're merely associated with.</p>
<h3 id="schedule">Schedule</h3>
<p>Lists actions that can't be done until some date.</p>
<p>When due, scheduled actions will appear highlighted on the <code>Work</code> tab (even if their area is filtered out).
If you pin the browser tab showing the CueKeeper page, the tab icon will also go red to indicate attention is needed.
If you want to test the effect, schedule an action for a date in the past.
Click <code>n</code> to acknowledge a due action and convert it to a next action.</p>
<h3 id="the-weekly-review">The weekly review</h3>
<p>GTD only works if you trust yourself to look at the system regularly.
There are various reports available under the <em>Review</em> tab to help with this.</p>
<p><img src="/blog/images/cuekeeper/review.png" class="center"/></p>
<p>The available reports are:</p>
<ul>
<li><strong>Done</strong> shows completed actions and projects and provides a button to delete them all.
If you're the sort of person who likes to write weekly summaries, this might be useful input to that.
</li>
<li><strong>Waiting</strong> shows actions that are waiting for someone or something (but not scheduled actions).
You might want to check up on the status of these, or do something to unblock them.
</li>
<li><strong>Future</strong> shows all actions you marked as &quot;Future&quot; and all projects you marked as &quot;Someday/Maybe&quot;.
</li>
<li><strong>Areas</strong> lists all your areas of responsibility.
</li>
<li><strong>Everything</strong> shows every item in the system in one place (you don't need to review this; it's just handy sometimes to see everything).
</li>
</ul>
<p>mGSD has more reports, but these are the ones I use.
The default configuration has a repeating action scheduled for next Sunday to review things.
This is what I do:</p>
<ul>
<li><strong>Process</strong> tab<br />
Empty inboxes, adding any actions to CueKeeper:
<ul>
<li>email inbox
</li>
<li>paper inbox
</li>
</ul>
</li>
<li><strong>Review/Done</strong>
<ul>
<li>Admire done list, then delete all.
</li>
</ul>
</li>
<li><strong>Review/Waiting</strong>
<ul>
<li>Any reminders needed?
</li>
</ul>
</li>
<li><strong>Review/Future</strong>
<ul>
<li>Make any of these current?
</li>
<li>Delete any that will never get done.
</li>
</ul>
</li>
<li><strong>Review/Areas</strong>
<ul>
<li>Any areas that need new projects?
</li>
</ul>
</li>
<li><strong>Work</strong>
<ul>
<li>Make sure each action is obvious (not vague).
</li>
<li>Could it be started now? Set to <strong>Waiting</strong> if not.
</li>
<li>List too long? Mark some actions as <strong>Future</strong>, or their projects as <strong>Someday/Maybe</strong>.
</li>
</ul>
</li>
</ul>
<p>It's important to look at all these items during the review.
Knowing you're going to look at each waiting or future item soon is what allows you to forget about them during the rest of the week!</p>
<h3 id="the-top-right-controls">The top-right controls</h3>
<p><img src="/blog/images/cuekeeper/top-actions.png" class="border center small"/></p>
<p>To search, enter some text (or a regular expression) into the box and select from the drop-down menu that appears.
Pressing Return opens the first result.</p>
<p>To create a new item, enter a label for it and select one of the &quot;Add&quot; items from the menu.
Pressing Return when there are no search results will create a new action.</p>
<p><code>Export</code> allows you to save the current state (without history) as a tar file.
There's no import feature currently, though.</p>
<p><code>Show history</code> shows some recent entries from the Irmin log (see below).</p>
<h2 id="interesting-irmin-features">Interesting Irmin features</h2>
<p>So, what benefits do we get from using Irmin?</p>
<h3 id="sync">Sync</h3>
<p>The first benefit, of course, is that we can synchronise between multiple instances.
You may have already tried opening CueKeeper in two windows (of the same browser) and observed that changes made in one propagate to the other.
Here's an easier way to experiment with sync (click the screenshot for the interactive version):</p>
<p><a href="/blog/cuekeeper/sync.html"><img src="/blog/images/cuekeeper/sync.png" class="center"/></a></p>
<p>This page has two instances of CueKeeper running, representing two separate devices such as a laptop and mobile phone.
You can edit them separately and then click the buttons in the middle to see how the changes are merged.</p>
<p>Clicking <code>Upper to lower</code> pushes all changes from the upper pane to the lower (the lower instance will merge them with its current state). Clicking <code>Lower to upper</code> does the reverse. A full sync would do these two in sequence, but of course it could be interrupted part way through.</p>
<p>The &quot;Criss-cross&quot; button can be used to test the unusual-but-interesting case of merging in both directions simultaneously (i.e. each instance merges with the <em>previous</em> state of the other instance, generating two new merges). CueKeeper tries to merge deterministically, so that both instances should end up in the same state, avoiding unnecessary conflicts on future merges.</p>
<p>Where you make conflicting edits, CueKeeper will pick a suitable resolution and add a conflict note to say what it did.
For example, if you edit the title of the &quot;Try OCaml tutorials&quot; action to different strings in each instance and then sync, you'll see something like:</p>
<p><img src="/blog/images/cuekeeper/conflict.png" class="center"/></p>
<p>CueKeeper uses a <a href="http://en.wikipedia.org/wiki/Merge_%28revision_control%29#Three-way_merge">three-way merge</a> - the merge algorithm takes the states of the two branches to be merged and their most recent common ancestor, and generates a new commit from these.
The common ancestor is used to determine which branch changed which things (anything that is the same as in the common ancestor wasn't changed on that branch).
If there are multiple possible ancestors (which can happen after a criss-cross merge) we just pick one of them.</p>
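<p>As a rough illustration of the idea (not CueKeeper's actual code), a three-way merge of a single field might look like the sketch below. The name <code>merge3</code>, the use of plain strings, and the rule of taking the smaller string when both sides changed are all assumptions made for the example; the only point of that last rule is that it gives the same answer whichever side performs the merge.</p>

```ocaml
(* Merge one field, given its value in the common ancestor ([base]) and in
   each branch. Returns the merged value and whether a conflict note would
   be needed. The tie-break rule is an assumption made for this sketch. *)
let merge3 ~base ~ours ~theirs =
  if ours = theirs then (ours, false)         (* both sides agree *)
  else if ours = base then (theirs, false)    (* only their branch changed it *)
  else if theirs = base then (ours, false)    (* only our branch changed it *)
  else (min ours theirs, true)                (* both changed it: pick one *)

let () =
  let base = "Try OCaml tutorials" in
  let a = "Try OCaml tutorial" and b = "Read OCaml tutorials" in
  (* A change on one side only is taken without a conflict. *)
  assert (merge3 ~base ~ours:base ~theirs:b = (b, false));
  (* Conflicting edits give the same result in both directions. *)
  assert (merge3 ~base ~ours:a ~theirs:b = merge3 ~base ~ours:b ~theirs:a);
  assert (merge3 ~base ~ours:a ~theirs:b = (b, true))
```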
<p>CueKeeper has a unit test for merging that repeatedly generates three commits at random and ensures the merge code produces a valid (loadable) result.
This should ensure that we can merge any pair of states, but it can't check that the result will necessarily seem sensible to a human, so let me know if you spot anything odd!</p>
<h3 id="history">History</h3>
<p>We have the full history, which you can view with the <strong>Show history</strong> button:</p>
<p><img src="/blog/images/cuekeeper/history.png" class="center"/></p>
<p>The history view is useful if you clicked on something by accident and you're not sure what you did.
Click on an entry to see the state of the system just after that change.
A box appears at the top of the page to indicate that you're in &quot;time travel&quot; mode - close the box to return to the present.</p>
<p>If you edit anything while viewing a historical version, CueKeeper will commit against that version and then merge the changes to master and return to the present.</p>
<p>You might like to open each instance's history panel while trying the sync demo above.</p>
<h3 id="revert">Revert</h3>
<p>When in time-travel mode, you can click on the <strong>Revert this change</strong> button to undo the change you're viewing.</p>
<p>Reverting was easy to add, as it reuses the existing three-way merge code.
The only difference is that the &quot;common ancestor&quot; is the commit being reverted and the parent of that commit is used as the &quot;branch&quot; to be merged.</p>
<p>Because CueKeeper can merge any three commits, it can also revert any commit (with a single parent), although you'll get the most sensible results if you revert the most recent changes first.</p>
<p>For example, if you create an action and then modify it, and then revert the creation then CueKeeper will see that as:</p>
<ul>
<li>One branch that modified the action (the main branch).
</li>
<li>One branch that deleted the action (the revert of the creation).
</li>
</ul>
<p>When something is modified and deleted, CueKeeper will opt to keep it, so the effect of the &quot;revert&quot; will simply be to add a note that it decided to keep it.
Of course, the sensible way to delete something is to use the regular <code>(delete)</code> button.</p>
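<p>Here is a minimal sketch of revert-as-merge for a single item, assuming an item is just an optional title (<code>None</code> meaning &quot;doesn't exist&quot;). The names <code>merge3</code> and <code>revert</code> and the tie-break for two conflicting edits are hypothetical; only the choice of ancestor and branch, and the keep-if-modified rule, come from the description above.</p>

```ocaml
type item = string option  (* None means the item does not exist *)

(* Three-way merge of one item. *)
let merge3 ~(base : item) ~(ours : item) ~(theirs : item) : item =
  if ours = theirs then ours
  else if ours = base then theirs
  else if theirs = base then ours
  else
    match ours, theirs with
    | Some _, None -> ours          (* modified and deleted: keep it *)
    | None, Some _ -> theirs
    | _ -> min ours theirs          (* assumed deterministic tie-break *)

(* Reverting [commit]: the commit itself acts as the common ancestor and its
   parent is the "branch" merged into the current state [head]. *)
let revert ~head ~commit ~parent = merge3 ~base:commit ~ours:head ~theirs:parent

let () =
  let parent = None in                    (* before the action was created *)
  let commit = Some "Buy milk" in         (* the creation being reverted *)
  (* Reverting the most recent change simply undoes it. *)
  assert (revert ~head:commit ~commit ~parent = None);
  (* If the action was modified afterwards, it is kept. *)
  let head = Some "Buy oat milk" in
  assert (revert ~head ~commit ~parent = Some "Buy oat milk")
```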
<h3 id="check-before-merge">Check-before-merge</h3>
<p>It's important to make sure that the system doesn't get into an inconsistent state, and Irmin can help here.
Whenever CueKeeper updates the database, it first generates the new commit, then it loads the new commit to check it works, then it updates the master branch to point at the new commit.</p>
<p>This means that CueKeeper will never put the master branch into a state that it can't itself load.</p>
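<p>The pattern is roughly the following. This is a simplified sketch rather than the real implementation: the <code>commit</code> and <code>store</code> types, and a <code>load</code> function that only checks that every parent exists, are stand-ins invented for the example.</p>

```ocaml
(* A commit here is just a list of (item name, optional parent name). *)
type commit = (string * string option) list

type store = { mutable master : commit }

(* Hypothetical validity check: every parent mentioned must exist. *)
let load (c : commit) : (commit, string) result =
  let missing (_, parent) =
    match parent with
    | Some p -> not (List.mem_assoc p c)
    | None -> false
  in
  match List.find_opt missing c with
  | Some (name, _) -> Error (name ^ ": parent not found")
  | None -> Ok c

(* Only move master once the new commit has been loaded successfully. *)
let update store new_commit =
  match load new_commit with
  | Ok _ -> store.master <- new_commit; Ok ()
  | Error msg -> Error msg

let () =
  let store = { master = [ ("Personal", None) ] } in
  let good = [ ("Personal", None); ("Buy milk", Some "Personal") ] in
  let bad = [ ("Buy milk", Some "Shopping") ] in
  assert (update store good = Ok ());
  (* The broken commit is rejected and master is left alone. *)
  assert (update store bad = Error "Buy milk: parent not found");
  assert (store.master = good)
```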
<h3 id="out-of-date-ui-actions">Out-of-date UI actions</h3>
<p>Perhaps the most interesting effect of using Irmin is that it eliminates various edge cases related to out-of-date UI elements.
Consider this example:</p>
<ol>
<li>You open a menu in one tab to set the contact for an action.
</li>
<li>You delete one of the contacts in another tab.
</li>
<li>You choose the deleted contact from the menu in the first tab.
</li>
</ol>
<p>With a regular database, this would probably result in some kind of error that you'd need to handle.
These edge cases don't occur often and are hard to test.</p>
<p>With CueKeeper though, we record which revision each UI element came from and commit against that revision.
We then merge the new commit with the master branch, using the existing merge logic to deal with any problems (normally, there is nothing to merge and we do a trivial &quot;fast-forward&quot; merge here).
This means we never have to worry about concurrent updates.</p>
<p>A similar system is used with editable fields.
When you click on a panel's title to edit it, make some changes and press Return, we commit against the version you started editing, not the current one.
This means that CueKeeper won't silently overwrite changes, even if you edit something in two tabs at the same time (you'll get a merge conflict note containing the version it discarded instead).</p>
<h2 id="next-steps">Next steps</h2>
<p>If you'd like to help out, there's still plenty more to do, both coding and testing.
For example:</p>
<ul>
<li>Editing doesn't work well on mobile phones. Menus and input boxes should fill the screen in this case.
</li>
<li>I've had reports that merging between tabs is unreliable on Safari for some reason (<code>AbortError</code> from IndexedDB).
</li>
<li>It would be good to use pack files for compression. Needs a JavaScript compression library.
</li>
<li>It should be possible for an action to be marked as waiting for some other action or project to be completed.
</li>
<li>Remote sync needs to be implemented.
</li>
<li>The UI needs some work. In particular, could someone find a tasteful way to style the fields in the panels so they look like drop-downs? I keep clicking on the item's name instead of the <code>(show)</code> button by mistake (although this might be because I used to have it the other way around, with a <code>(change)</code> button, but that was worse).
</li>
<li>CueKeeper's IndexedDB Irmin backend should be split off so other people can use it easily.
</li>
</ul>
<p>The code is available at <a href="https://github.com/talex5/cuekeeper">https://github.com/talex5/cuekeeper</a> and discussion happens on the <a href="http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel">MirageOS-devel</a> mailing list.
If there's interest, I may write a follow-up post documenting my experiences implementing CueKeeper (using <a href="https://github.com/mirage/irmin/">Irmin</a>, <a href="http://erratique.ch/software/react">React</a>, <a href="http://ocsigen.org/js_of_ocaml/">js_of_ocaml</a> and <a href="http://www.w3.org/TR/IndexedDB/">IndexedDB</a>).</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>Some of the research leading to these results has received funding from the European Union's Seventh Framework Programme FP7/2007-2013 under the UCN project, grant agreement no 611001.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Securing the Unikernel</title>
    <link href="https://roscidus.com/blog/blog/2015/01/21/securing-the-unikernel/"></link>
    <updated>2015-01-21T10:03:08+00:00</updated>
    <id>https://roscidus.com/blog/blog/2015/01/21/securing-the-unikernel</id>
    <content type="html"><![CDATA[<p>Back in July, I used <a href="http://openmirage.org/">MirageOS</a> to create <a href="/blog/blog/2014/07/28/my-first-unikernel/">my first unikernel</a>, a simple REST service for queuing file uploads, deployable as a virtual machine.
While a traditional VM would be a complete Linux system (kernel, init system, package manager, shell, etc), a Mirage unikernel is a single OCaml program which pulls in just the features (network driver, TCP stack, web server, etc) it needs as libraries.
Now it's time to look at securing the system with HTTPS and access controls, ready for deployment.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
</li>
<li><a href="#ocaml">OCaml</a>
</li>
<li><a href="#xen">Xen</a>
</li>
<li><a href="#transport-layer-security">Transport Layer Security</a>
<ul>
<li><a href="#c-stubs-for-xen">C stubs for Xen</a>
</li>
<li><a href="#ethernet-frame-alignment">Ethernet frame alignment</a>
</li>
<li><a href="#http-api">HTTP API</a>
</li>
<li><a href="#the-private-key">The private key</a>
</li>
<li><a href="#the-partition-code">The partition code</a>
</li>
<li><a href="#entropy">Entropy</a>
</li>
</ul>
</li>
<li><a href="#access-control">Access control</a>
<ul>
<li><a href="#python-client">Python client</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=8922568">Hacker News</a> and
<a href="http://www.reddit.com/r/netsec/comments/2t67e2/securing_the_unikernel/">Reddit</a>.)</p>
<h2 id="introduction">Introduction</h2>
<p>As a quick reminder, the service (&quot;Incoming queue&quot;) accepts uploads from various contributors and queues them until the (firewalled) repository software downloads them, checks the GPG signatures, and merges them into the public software repository, signed with the repository's key:</p>
<p><img src="/blog/images/0repo-multi.png" class="center"/></p>
<p>Although the queue service isn't security critical, since the GPG signatures are made and checked elsewhere, I would like to ensure it has a few properties:</p>
<ul>
<li>Only the repository can fetch items from the queue.
</li>
<li>Only authorised users can upload to it.
</li>
<li>I can see where an upload came from and revoke access if necessary.
</li>
<li>An attacker cannot take control of the system and use it to attack other systems.
</li>
</ul>
<p>We often think of security as a set of things we want to <em>prevent</em> - taking away possible actions from a fundamentally vulnerable underlying system (such as my original implementation, which had no security features).
But ideally I'd like every component of the system to be isolated by default, with allowed interactions (shown here by arrows) specified explicitly.
We should then be able to argue (informally) that the system will meet the goals above without having to verify that every line of code is correct.</p>
<p>My unikernel is written in OCaml and runs as a guest OS under the Xen hypervisor, so let's look at how well those technologies support isolation first...</p>
<h2 id="ocaml">OCaml</h2>
<p>I want to isolate the components of my unikernel, giving each just the access it requires.
When writing an OS, some unsafe code will occasionally be needed, but it should be clear which components use unsafe features (so they can be audited more carefully), and unsafe features shouldn't be needed often.</p>
<p>For example, the code for handling an HTTP upload request should only be able to use our on-disk queue's <a href="https://github.com/0install/0repo-queue/blob/6ff713d353316447eda66b310adc42634accf98a/upload_queue.mli#L22">Uploader interface</a> and its own HTTP connection.
Then we would know that an attacker with upload permission can only cause new items to be added to the queue, no matter how buggy that code is.
It should not be able to read the web server's private key, establish new out-bound connections, corrupt the disk, etc.</p>
<p>Like most modern languages, OCaml is memory-safe, so components can't interfere with each other through buggy pointer arithmetic or unsafe casts of the kind commonly found in C code.</p>
<p>But we also need to avoid global variables, which would allow two components to communicate without us explicitly connecting them.
I can't reason about the security of the system by looking at arrows in the architecture diagrams if unconnected components can magically create new arrows by themselves!
I've seen a few interesting approaches to this problem (please correct me if I've got this wrong):</p>
<dl><dt>Haskell</dt>
<dd>
Haskell avoids all side-effects, which makes global variables impossible (without using &quot;unsafe&quot; features), since updating them would be a side-effect.
</dd>
<dt>Rust</dt>
<dd>
Rust almost avoids the problem of globals through its ownership types.
Since only one thread can have a pointer to a mutable value at a time, mutable values can't normally be global.
Rust does allow &quot;mutable statics&quot; (which contain only pointers to fixed data), but requires an explicit &quot;unsafe&quot; block to use them, which is good.
</dd>
<dt>E</dt>
<dd>
E allows modules to declare top-level variables, but each time the module is imported it creates a fresh copy.
</dd>
</dl>
<p>OCaml does allow global variables, but by convention they are generally not used.</p>
<p>A second problem is controlling access to the outside world, including the network and disks (which you could consider to be more global variables):</p>
<dl><dt>Haskell</dt>
<dd>
Haskell doesn't allow functions to access the outside world at all, but they can return an <code>IO</code> type if they want to do something (the caller must then pass this value up to the top level).
This makes it easy to see that e.g. evaluating a function &quot;<code>uriPath :: URI -&gt; String</code>&quot; cannot access the network.
However, it appears that all IO gets lumped in together: a value of type <code>IO String</code> may cause any side-effects at all (disk, network, etc), so the entire side-effecting part of the program needs to be audited.
</dd>
<dt>Rust</dt>
<dd>
Rust allows all code full access to the system via its standard library.
For example, any code can read or write any file.
</dd>
<dt>E</dt>
<dd>
E passes all access to the outside world to the program's entry point.
For example, <code>&lt;file&gt;</code> grants access to the file system and <code>&lt;unsafe&gt;</code> grants access to all unsafe features.
These can be passed to libraries to grant them access, and can be attenuated (wrapped) to provide limited access.
</dd>
</dl>
<p>For example:</p>
<figure class="code"><figcaption><span>main.e</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">def</span><span class="w"> </span><span class="n">queue</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="o">&lt;</span><span class="n">import</span><span class="p">:</span><span class="n">makeQueue</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&lt;</span><span class="n">file</span><span class="p">:</span><span class="o">/</span><span class="n">var</span><span class="o">/</span><span class="n">myprog</span><span class="o">/</span><span class="n">queue</span><span class="o">&gt;</span><span class="p">)</span>
</span><span class="line"><span class="p">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, <code>queue</code> has read-write access to the <code>/var/myprog/queue</code> sub-tree (and nothing else).
It also has no way to share data with any other parts of the program, including other queues.</p>
<p>Like Rust, OCaml does not limit access to the outside world.
However, Mirage itself uses E-style dependency injection everywhere, with the unikernel's <code>start</code> function being passed all external resources as arguments:</p>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">B</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">BLOCK</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">H</span> <span class="o">:</span> <span class="nn">Cohttp_lwt</span><span class="p">.</span><span class="nc">Server</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="k">struct</span>
</span><span class="line">    <span class="k">module</span> <span class="nc">Q</span> <span class="o">=</span> <span class="nn">Upload_queue</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">B</span><span class="o">)</span>
</span><span class="line">
</span><span class="line">    <span class="k">let</span> <span class="n">start</span> <span class="n">console</span> <span class="n">block</span> <span class="n">http</span> <span class="o">=</span>
</span><span class="line">      <span class="n">lwt</span> <span class="n">queue</span> <span class="o">=</span> <span class="nn">Q</span><span class="p">.</span><span class="n">create</span> <span class="n">block</span> <span class="k">in</span>
</span><span class="line">      <span class="o">...</span>
</span><span class="line">  <span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Because everything in Mirage is defined using abstract types, libraries always expect to be passed the things they need explicitly.
We know that <code>Upload_queue</code> above won't access a block device directly because it needs to support different kinds of block device.</p>
<p>OCaml <em>does</em> enforce its abstractions.
There's no way for the <code>Upload_queue</code> to discover that <code>block</code> is really a Xen block device with some extra functionality
(as a Java program might do with <code>if (block instanceof XenBlock)</code>, for example).
This means that we can reason about the limits of what functions may do by looking only at their type signatures.</p>
<p>The use of functors means you can attenuate access as desired.
For example, if we want to grant just part of <code>block</code> to the queue then we can create our own module implementing the <code>BLOCK</code> type that exposes just some partition of the device, and pass that to the <code>Upload_queue.Make</code> functor.</p>
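<p>To make that concrete, here is a small self-contained sketch. The <code>BLOCK</code> signature below is a deliberately simplified, synchronous stand-in that I made up for the example (the real <code>V1_LWT.BLOCK</code> has more operations and uses Lwt), and <code>Ram_disk</code>, <code>Partition</code> and <code>Queue_on</code> are hypothetical names, with <code>Queue_on</code> playing the role of <code>Upload_queue.Make</code>.</p>

```ocaml
(* Simplified stand-in for a block device signature. *)
module type BLOCK = sig
  type t
  val sectors : t -> int
  val read : t -> int -> string
  val write : t -> int -> string -> unit
end

(* An in-memory device, standing in for a real disk. *)
module Ram_disk = struct
  type t = string array
  let create n = Array.make n ""
  let sectors = Array.length
  let read t i = t.(i)
  let write t i data = t.(i) <- data
end

(* Expose [length] sectors of an underlying device, starting at [first],
   as a block device in its own right. *)
module Partition (B : BLOCK) : sig
  include BLOCK
  val create : B.t -> first:int -> length:int -> t
end = struct
  type t = { dev : B.t; first : int; length : int }
  let create dev ~first ~length =
    if first < 0 || length < 0 || first + length > B.sectors dev then
      invalid_arg "Partition.create";
    { dev; first; length }
  let sectors t = t.length
  let check t i =
    if i < 0 || i >= t.length then invalid_arg "sector out of range"
  let read t i = check t i; B.read t.dev (t.first + i)
  let write t i data = check t i; B.write t.dev (t.first + i) data
end

(* The queue only sees the abstract BLOCK operations. *)
module Queue_on (B : BLOCK) = struct
  let store dev data = B.write dev 0 data
  let fetch dev = B.read dev 0
end

module P = Partition (Ram_disk)
module Q = Queue_on (P)

let () =
  let disk = Ram_disk.create 8 in
  Ram_disk.write disk 0 "boot sector";
  let part = P.create disk ~first:4 ~length:4 in
  Q.store part "upload";
  (* The queue's sector 0 is really sector 4 of the disk. *)
  assert (Ram_disk.read disk 0 = "boot sector");
  assert (Ram_disk.read disk 4 = "upload");
  assert (Q.fetch part = "upload")
```

<p>Because <code>P.t</code> is abstract, <code>Queue_on</code> has no way to reach the underlying <code>Ram_disk</code>, and out-of-range sector numbers are rejected before they get there.</p>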
<p>In summary then, we can reason about security fairly well in Mirage if we assume the libraries are not malicious, but we cannot make hard guarantees.
It should be possible to check with automatic static analysis that we're not using any &quot;unsafe&quot; features such as global variables, direct access to devices, or allocating uninitialised memory, but I don't know of any tools to do that (except <a href="http://www.skyhunter.com/marcs/emilyWalnut.html">Emily</a>, but that seems to be just a proof-of-concept).
But these issues are minor: any reasonably safe modern language will be a huge improvement over legacy C or C++ systems!</p>
<h2 id="xen">Xen</h2>
<p>The Xen hypervisor allows multiple guest operating systems to run on a single physical machine.
It is used by many cloud hosting providers, including Amazon's AWS.
I run it on my CubieTruck - a small ARM board.
Xen allows me to run my unikernel on the same machine as other services, but ideally with the same security properties as if it had its own dedicated machine.
If some other guest on the machine is compromised, it shouldn't affect my unikernel, and if the unikernel is compromised then it shouldn't affect other guests.</p>
<p><span class="caption-wrapper center"><img src="/blog/images/xen.png" title="A typical Xen deployment, running four domains." class="caption"/><span class="caption-text">A typical Xen deployment, running four domains.</span></span></p>
<p>The diagram above shows a deployment with Linux and Mirage guests. Only dom0 has access to the physical hardware; the other guests only see virtual devices, provided by dom0.</p>
<p>How secure is Xen?
The <a href="http://xenbits.xen.org/xsa/">Xen security advisories</a> page shows that there are about 3 new Xen advisories each month.
However, it's hard to compare programs this way because the number of vulnerabilities reported depends greatly on the number of people using the program, whether they use it for security-critical tasks, and how seriously the project takes problems (e.g. whether a denial-of-service attack is considered a security bug).</p>
<p>I started using Xen in April 2014.
These are the security problems I've found myself so far:</p>
<dl><dt><a href="http://xenbits.xen.org/xsa/advisory-93.html">XSA-93</a> : Hardware features unintentionally exposed to guests on ARM</dt>
<dd>
While trying to get Mini-OS to boot, I tried implementing ARM's recommended boot code for invalidating the cache
(at startup, you should invalidate the cache, dropping any previous contents).
When running under a hypervisor this should be a no-op since the cache is always valid by the time a guest is running, but Xen allowed the operation to go ahead, discarding pending writes by the hypervisor and other guests and crashing the system.
This could probably be used for a successful attack.
</dd>
<dt><a href="http://xenbits.xen.org/xsa/advisory-94.html">XSA-94</a> : Hypervisor crash setting up GIC on arm32</dt>
<dd>
I tried to program an out-of-range register, causing the hypervisor to dereference a null pointer and panic.
This is just a denial of service (the host machine needs to be rebooted); it shouldn't allow access to other VMs.
</dd>
<dt><a href="http://xenbits.xen.org/xsa/advisory-95.html">XSA-95</a> : Input handling vulnerabilities loading guest kernel on ARM</dt>
<dd>
I got the length wrong when creating the zImage for the unikernel (I included the .bss section in the length).
The <code>xl create</code> tool didn't notice and tried to read the extra data, causing the tool to segfault.
You could use this to read a bit of private data from the <code>xl</code> process, but it's unlikely there would be anything useful there.
</dd>
</dl>
<p>Although that's more bugs than you might expect, note that they're all specific to the relatively new ARM support.
The second and third are both due to using C, and would have been avoided in a safer language.
I'm not really sure why the &quot;xl&quot; tool needs to be in C - that seems to be asking for trouble.</p>
<p>To drive the physical hardware, Xen runs the first guest (dom0) with access to everything.
This is usually Linux, and I had various problems with that. For example:</p>
<dl><dt><a href="http://lists.xen.org/archives/html/xen-devel/2014-08/msg00053.html">Bug sharing the same page twice</a></dt>
<dd>
Linux got confused when the unikernel shared the same page twice (it split a single page of RAM into multiple TCP segments).
This wasn't a security bug (I think), but after it was fixed, I then got:
</dd>
<dt><a href="http://lists.xen.org/archives/html/xen-devel/2014-08/msg00596.html">Page still granted</a></dt>
<dd>
If my unikernel sent a network packet and then exited quickly, dom0 would get stuck and I'd have to reboot.
</dd>
<dt><a href="https://github.com/talex5/linux/commit/6b6dcc2857d84070c94fe4e3498486337d292870">Oops if network is used too quickly</a></dt>
<dd>
The Linux dom0 initialises the virtual network device while the VM is booting.
My unikernel booted fast enough to send packets before the device structure had been fully filled in, leading to an oops.
</dd>
<dt><a href="http://lists.xen.org/archives/html/xen-devel/2014-09/msg02254.html">Linux Dom0 oops and reboot on indirect block requests</a></dt>
<dd>
Sending lots of block device requests to Linux dom0 from the unikernel would cause Linux to oops in <code>swiotlb_tbl_unmap_single</code> and reboot the host.
I wasn't the first to find this though, and backporting the patch from Linux 3.17 seemed to fix it (I don't actually know what the problem was).
</dd>
</dl>
<p>So it might seem that using Xen doesn't get us very far.
We're still running Linux in dom0, and it still has full access to the machine.
For example, a malicious network packet from outside or from a guest might still give an attacker full control of the machine.
Why not just use <a href="http://www.linux-kvm.org/">KVM</a> and run the guests under Linux directly?</p>
<p>The big (potential) advantage of Xen here is <a href="http://wiki.xen.org/wiki/Dom0_Disaggregation">Dom0 Disaggregation</a>.
With this, Dom0 gives control of different pieces of physical hardware to different VMs rather than driving them itself.
For example, <a href="https://wiki.qubes-os.org/">Qubes</a> (a security-focused desktop OS using Xen) runs a separate &quot;<a href="https://wiki.qubes-os.org/wiki/QubesNet">NetVM</a>&quot; Linux guest just to handle the network device.
This is connected only to the <a href="https://qubes-os.org/wiki/QubesFirewall">FirewallVM</a> - another Linux guest that just routes packets to other VMs.</p>
<p>This is interesting for two reasons.
First, if an attacker exploits a bug in the network device driver, they're still outside your firewall.
Secondly, it provides a credible path to replacing parts of Linux with alternative implementations, possibly written in safer languages.
You could, for example, have Linux running dom0 but use FreeBSD to drive the network card, Mirage to provide the firewall, and OpenBSD to handle USB.</p>
<p>Finally, it's worth noting that Mirage is not tied to Xen, but can target various systems (mainly Unix and Xen currently, but there is some JavaScript support too).
If it turns out that e.g. <a href="http://genode.org/about/challenges">Genode on seL4</a> (a formally verified microkernel) provides better security, we should be able to support that too.</p>
<h2 id="transport-layer-security">Transport Layer Security</h2>
<p>We won't get far securing the system while attackers can read and modify our communications.
The <a href="https://github.com/mirleft/ocaml-tls/">ocaml-tls</a> project provides an OCaml implementation of TLS (<a href="https://en.wikipedia.org/wiki/Transport_Layer_Security">Transport Layer Security</a>), and
in September <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2014-09/msg00100.html">Hannes Mehnert showed it running on Mirage/Xen/ARM devices</a>.
Given the various flaws exposed recently in popular C TLS libraries, an OCaml implementation is very welcome.
Getting the Xen support in a state where it could be widely used took a bit of work, but
I've submitted all the patches I made, so it should be easier for other people now - see <a href="https://github.com/mirage/mirage-dev/pull/52">https://github.com/mirage/mirage-dev/pull/52</a>.</p>
<h3 id="c-stubs-for-xen">C stubs for Xen</h3>
<p>TLS needs some C code for the low-level cryptographic functions, which have to be constant time to avoid leaking information about the key, so first I had to make packages providing versions of libgmp, ctypes, zarith and nocrypto compiled to run in kernel mode.</p>
<p>C code has to be compiled specially to run in kernel mode because, on x86-64 processors, user-mode code can assume the existence of a <a href="http://en.wikipedia.org/wiki/Red_zone_%28computing%29">red zone</a> (a small area just below the stack pointer that a function may use without reserving it first), which allows some optimisations that aren't safe in kernel mode, where an interrupt can overwrite that area.</p>
<h3 id="ethernet-frame-alignment">Ethernet frame alignment</h3>
<p>The Mirage network driver sends Ethernet frames to dom0 by sharing pages of memory.
Each frame must therefore be contained in a single page.
The TLS code was (correctly) passing a large buffer to the TCP layer, which <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2015-01/msg00029.html">incorrectly</a> asked the network device to send each TCP-sized chunk of it.
Chunks overlapping page boundaries then got rejected.</p>
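<p>The single-page constraint is easy to state as a predicate. This is only an illustrative sketch: the 4096-byte page size and the function name <code>fits_in_one_page</code> are assumptions, not code from the Mirage network driver:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Assumed page size for this sketch. *)
let page_size = 4096

(* True if the [len] bytes starting at byte offset [off] (with [off] &gt;= 0)
   all lie within a single page. *)
let fits_in_one_page ~off ~len =
  len &gt; 0 &amp;&amp; len &lt;= page_size
  &amp;&amp; off / page_size = (off + len - 1) / page_size

let () =
  (* A 1460-byte chunk at the start of a page is fine... *)
  assert (fits_in_one_page ~off:0 ~len:1460);
  (* ...but the same chunk starting near the end of a page straddles two. *)
  assert (not (fits_in_one_page ~off:4000 ~len:1460))
</code></pre></td></tr></tbody></table></div></figure>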
<p>My previous experiments with <a href="/blog/blog/2014/10/27/visualising-an-asynchronous-monad/#udp-transmission">tracing the network layer</a> had shown that we actually share two pages for each packet: one for the IP header and one for the payload.
Doing this avoids the need to copy the data to a new buffer, but adds the overhead of granting and revoking access to both pages.
I modified the network driver to copy the data into a single block inside a single page and got <a href="https://github.com/mirage/mirage-net-xen/pull/17">a large speed boost</a>.
Indeed, it got so much faster that it triggered <a href="https://github.com/mirage/mirage-net-xen/pull/16">a bug handling full transmit buffers</a> - which made it initially appear slower!</p>
<p>In addition to fixing the alignment problem when using TLS, and being faster, this has a nice security benefit:
the only data shared with the network driver domain is data explicitly sent to it.
Before, we had to share the entire page of memory containing the application's buffer, and there was no way to know what else might have been there.
This offers some protection if the network driver domain is compromised.</p>
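<p>A rough sketch of the copying approach, using plain <code>Bytes</code> as a stand-in for the page-aligned buffers the real driver uses (the function name and the 4096-byte page size are assumptions for illustration):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Assumed page size for this sketch. *)
let page_size = 4096

(* Copy the header and payload into a fresh, zero-filled page, so that the
   page to be shared contains only the frame and nothing else from the
   application's memory. Returns the page and the frame length. *)
let frame_in_fresh_page ~header ~payload =
  let hlen = Bytes.length header and plen = Bytes.length payload in
  if hlen + plen &gt; page_size then invalid_arg &quot;frame larger than a page&quot;;
  let page = Bytes.make page_size '\000' in
  Bytes.blit header 0 page 0 hlen;
  Bytes.blit payload 0 page hlen plen;
  (page, hlen + plen)

let () =
  let header = Bytes.of_string &quot;HDR&quot; and payload = Bytes.of_string &quot;data&quot; in
  let page, len = frame_in_fresh_page ~header ~payload in
  assert (len = 7);
  assert (Bytes.sub_string page 0 len = &quot;HDRdata&quot;)
</code></pre></td></tr></tbody></table></div></figure>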
<h3 id="http-api">HTTP API</h3>
<p>My original code configured a plain HTTP server on port 8080 like this:</p>
<figure class="code"><figcaption><span>config.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">server</span> <span class="o">=</span>
</span><span class="line">  <span class="n">http_server</span> <span class="o">(`</span><span class="nc">TCP</span> <span class="o">(`</span><span class="nc">Port</span> <span class="mi">8080</span><span class="o">))</span> <span class="o">(</span><span class="n">conduit_direct</span> <span class="o">(</span><span class="n">stack</span> <span class="n">default_console</span><span class="o">))</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">register</span> <span class="s2">&quot;queue&quot;</span> <span class="o">[</span>
</span><span class="line">    <span class="n">queue</span> <span class="o">$</span> <span class="n">default_console</span> <span class="o">$</span> <span class="n">storage</span> <span class="o">$</span> <span class="n">server</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>stack</code> creates a TCP/IP stack.
<code>conduit_direct</code> can dynamically select different transports (e.g. http or vchan).
<code>http_server</code> applies the configuration to the conduit to get an HTTP server using plain HTTP.</p>
<p>I added support to <code>Conduit_mirage</code> to let it wrap any underlying conduit with TLS.
However, the configuration needed for TLS is fairly complicated, and involves a secret key which must be protected.
Therefore, I switched to creating only the <code>conduit</code> in <code>config.ml</code> and having the unikernel itself load the key and certificate by copying a local &quot;keys&quot; directory into the unikernel image as a &quot;crunch&quot; filesystem:</p>
<figure class="code"><figcaption><span>config.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">conduit</span> <span class="o">=</span> <span class="n">conduit_direct</span>
</span><span class="line">    <span class="o">~</span><span class="n">tls</span><span class="o">:(</span><span class="n">tls_over_conduit</span> <span class="n">default_entropy</span><span class="o">)</span>
</span><span class="line">    <span class="o">(</span><span class="n">stack</span> <span class="n">default_console</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">register</span> <span class="s2">&quot;queue&quot;</span> <span class="o">[</span>
</span><span class="line">    <span class="n">queue</span> <span class="o">$</span> <span class="n">default_console</span> <span class="o">$</span> <span class="n">storage</span> <span class="o">$</span> <span class="n">conduit</span> <span class="o">$</span> <span class="n">crunch</span> <span class="s2">&quot;keys&quot;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">    <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="n">b</span> <span class="n">conduit</span> <span class="n">kv</span> <span class="o">=</span>
</span><span class="line">      <span class="n">lwt</span> <span class="n">certificate</span> <span class="o">=</span> <span class="nn">X509</span><span class="p">.</span><span class="n">certificate</span> <span class="n">kv</span> <span class="o">`</span><span class="nc">Default</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">tls_config</span> <span class="o">=</span> <span class="nn">Tls</span><span class="p">.</span><span class="nn">Config</span><span class="p">.</span><span class="o">(</span><span class="n">of_server</span> <span class="o">(</span><span class="n">server</span> <span class="o">~</span><span class="n">certificate</span> <span class="bp">()</span><span class="o">))</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">http</span> <span class="n">spec</span> <span class="o">=</span>
</span><span class="line">        <span class="k">let</span> <span class="n">ctx</span> <span class="o">=</span> <span class="n">conduit</span> <span class="k">in</span>
</span><span class="line">        <span class="k">let</span> <span class="n">mode</span> <span class="o">=</span> <span class="o">`</span><span class="nc">TLS</span> <span class="o">(</span><span class="n">tls_config</span><span class="o">,</span> <span class="o">`</span><span class="nc">TCP</span> <span class="o">(`</span><span class="nc">Port</span> <span class="mi">8443</span><span class="o">))</span> <span class="k">in</span>
</span><span class="line">        <span class="nn">Conduit</span><span class="p">.</span><span class="n">serve</span> <span class="o">~</span><span class="n">ctx</span> <span class="o">~</span><span class="n">mode</span> <span class="o">(</span><span class="nn">H</span><span class="p">.</span><span class="n">listen</span> <span class="n">spec</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">      <span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This runs an HTTPS server on port 8443.
The rest of the code is as before.</p>
<h3 id="the-private-key">The private key</h3>
<p>The next question is where to store the real private key.
The examples provided by the TLS package compile it into the unikernel image using crunch, but it's common to keep unikernel binaries in Git repositories and people don't expect kernel images to contain secrets.
In a traditional Linux system, we'd store the private key on the disk, so I decided to try the same here.
(I did think about storing the key encrypted in the unikernel and storing the password on the disk so you'd need both to get the key, but the TLS library doesn't support encrypted keys yet.)</p>
<p>I don't use a regular filesystem for my queuing service, and I wouldn't want to share it with the key if I did, so instead I reserved a separate 4KB partition of the disk for the key.
It turned out that Mirage already has partitioning support in the form of the <a href="https://github.com/mirage/ocaml-mbr">ocaml-mbr</a> library.
I didn't actually create an MBR at the start, but just used the <a href="https://github.com/talex5/ocaml-mbr/blob/master/lib/mbr_partition.mli"><code>Mbr_partition</code></a> functor to wrap the underlying block device into two parts.
The configuration looks like this:</p>
<p><img src="/blog/images/queue-baseline.png" class="border center"/></p>
<p>How safe is this?
I don't want to audit all the code for handling the queue, and I shouldn't have to: we can see from the diagram that the only components with access to the key are the disk, the partitions and the TLS library.
We need to trust that the TLS library will protect the key (not easy, but that's its job) and that <code>queue_partition</code> won't let <code>queue</code> access the part of the disk with the key.</p>
<p>We also need to trust the disk, but if the partitions are only allowing correct requests through, that shouldn't be too much to ask.</p>
<h3 id="the-partition-code">The partition code</h3>
<p>Before relying on the partition code, we'd better take a look at it because it may not be designed to enforce security.
Indeed, a quick look at the code shows that it isn't:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">  <span class="k">let</span> <span class="n">read</span> <span class="n">t</span> <span class="n">start_sector</span> <span class="n">buffers</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">length</span> <span class="o">=</span> <span class="nn">Int64</span><span class="p">.</span><span class="n">add</span> <span class="n">start_sector</span> <span class="o">(</span><span class="n">length</span> <span class="n">buffers</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">    <span class="k">if</span> <span class="n">length</span> <span class="o">&gt;</span> <span class="n">t</span><span class="o">.</span><span class="n">length_sectors</span>
</span><span class="line">    <span class="k">then</span> <span class="n">return</span> <span class="o">(`</span><span class="nc">Error</span> <span class="o">(`</span><span class="nc">Unknown</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;read %Ld %Ld out of range&quot;</span> <span class="n">start_sector</span> <span class="n">length</span><span class="o">)))</span>
</span><span class="line">    <span class="k">else</span> <span class="nn">B</span><span class="p">.</span><span class="n">read</span> <span class="n">t</span><span class="o">.</span><span class="n">b</span> <span class="o">(</span><span class="nn">Int64</span><span class="p">.</span><span class="n">add</span> <span class="n">start_sector</span> <span class="n">t</span><span class="o">.</span><span class="n">start_sector</span><span class="o">)</span> <span class="n">buffers</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">write</span> <span class="n">t</span> <span class="n">start_sector</span> <span class="n">buffers</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">length</span> <span class="o">=</span> <span class="nn">Int64</span><span class="p">.</span><span class="n">add</span> <span class="n">start_sector</span> <span class="o">(</span><span class="n">length</span> <span class="n">buffers</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">    <span class="k">if</span> <span class="n">length</span> <span class="o">&gt;</span> <span class="n">t</span><span class="o">.</span><span class="n">length_sectors</span>
</span><span class="line">    <span class="k">then</span> <span class="n">return</span> <span class="o">(`</span><span class="nc">Error</span> <span class="o">(`</span><span class="nc">Unknown</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;write %Ld %Ld out of range&quot;</span> <span class="n">start_sector</span> <span class="n">length</span><span class="o">)))</span>
</span><span class="line">    <span class="k">else</span> <span class="nn">B</span><span class="p">.</span><span class="n">write</span> <span class="n">t</span><span class="o">.</span><span class="n">b</span> <span class="o">(</span><span class="nn">Int64</span><span class="p">.</span><span class="n">add</span> <span class="n">start_sector</span> <span class="n">t</span><span class="o">.</span><span class="n">start_sector</span><span class="o">)</span> <span class="n">buffers</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It checks only that the requested start sector plus the length of the results buffer does not exceed the length of the partition.
To (hopefully) make this bullet-proof, I:</p>
<ul>
<li>moved the checks into a single function so we don't have to check two copies,
</li>
<li>added a check that the start sector isn't negative,
</li>
<li>modified the end check to avoid integer overflow, and
</li>
<li>added some unit-tests.
</li>
</ul>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">  <span class="k">let</span> <span class="n">adjust_start</span> <span class="n">name</span> <span class="n">op</span> <span class="n">t</span> <span class="n">start_sector</span> <span class="n">buffers</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">buffers_len_sectors</span> <span class="o">=</span> <span class="n">length</span> <span class="n">t</span> <span class="n">buffers</span> <span class="k">in</span>
</span><span class="line">    <span class="k">if</span> <span class="n">start_sector</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="n">L</span> <span class="o">||</span>
</span><span class="line">       <span class="n">start_sector</span> <span class="o">&gt;</span> <span class="n">t</span><span class="o">.</span><span class="n">id</span><span class="o">.</span><span class="n">length_sectors</span> <span class="o">--</span> <span class="n">buffers_len_sectors</span>
</span><span class="line">    <span class="k">then</span> <span class="n">return</span> <span class="o">(`</span><span class="nc">Error</span> <span class="o">(`</span><span class="nc">Unknown</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span>
</span><span class="line">      <span class="s2">&quot;%s %Ld+%Ld out of range&quot;</span> <span class="n">name</span> <span class="n">start_sector</span> <span class="n">buffers_len_sectors</span><span class="o">)))</span>
</span><span class="line">    <span class="k">else</span> <span class="n">op</span> <span class="n">t</span><span class="o">.</span><span class="n">id</span><span class="o">.</span><span class="n">b</span> <span class="o">(</span><span class="nn">Int64</span><span class="p">.</span><span class="n">add</span> <span class="n">start_sector</span> <span class="n">t</span><span class="o">.</span><span class="n">id</span><span class="o">.</span><span class="n">start_sector</span><span class="o">)</span> <span class="n">buffers</span>
</span><span class="line">    
</span><span class="line">  <span class="k">let</span> <span class="n">read</span> <span class="o">=</span> <span class="n">adjust_start</span> <span class="s2">&quot;read&quot;</span> <span class="nn">B</span><span class="p">.</span><span class="n">read</span>
</span><span class="line">  <span class="k">let</span> <span class="n">write</span> <span class="o">=</span> <span class="n">adjust_start</span> <span class="s2">&quot;write&quot;</span> <span class="nn">B</span><span class="p">.</span><span class="n">write</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The need to protect against overflow is an annoyance.
OCaml's <code>Int64.(add max_int one)</code> doesn't abort, but returns <code>Int64.min_int</code>.
That's disappointing, but not surprising.
I wrote a unit-test that tried to read sector <code>Int64.max_int</code> and ran it (before updating the code) to check it detected the problem.
I was expecting the partition code to pass the request to the underlying block device, which I expected to return an error about the invalid sector, but it didn't!
It turns out that <code>Int64.to_int</code> (used by my in-memory test block device) silently truncates out-of-range integers (a native <code>int</code> is only 63 bits wide on 64-bit platforms):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="o">#</span> <span class="nn">Int64</span><span class="p">.</span><span class="n">max_int</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="n">int64</span> <span class="o">=</span> <span class="mi">9223372036854775807</span><span class="n">L</span>
</span><span class="line"><span class="o">#</span> <span class="nn">Int64</span><span class="p">.</span><span class="o">(</span><span class="n">add</span> <span class="n">max_int</span> <span class="n">one</span><span class="o">);;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="n">int64</span> <span class="o">=</span> <span class="o">-</span><span class="mi">9223372036854775808</span><span class="n">L</span>
</span><span class="line"><span class="o">#</span> <span class="nn">Int64</span><span class="p">.</span><span class="n">min_int</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="n">int64</span> <span class="o">=</span> <span class="o">-</span><span class="mi">9223372036854775808</span><span class="n">L</span>
</span><span class="line"><span class="o">#</span> <span class="nn">Int64</span><span class="p">.</span><span class="o">(</span><span class="n">to_int</span> <span class="n">min_int</span><span class="o">);;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="kt">int</span> <span class="o">=</span> <span class="mi">0</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>So, if the queue can be tricked into asking for sector 9223372036854775807 then the partition would accept it as valid and the block device would truncate it and give access to sector 0 - the sector with the private key!</p>
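<p>The difference between the two styles of range check can be shown in isolation. This is a simplified sketch rather than the actual <code>Mbr_partition</code> code; the function names and the 100-sector partition are made up for the example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Naive check, in the style of the original code: the addition can wrap
   around to a negative number, which then passes the comparison. *)
let naive_ok ~partition_len ~start ~len =
  Int64.add start len &lt;= partition_len

(* Overflow-safe check: reject negative values, then subtract from the
   partition length (which cannot overflow once 0 &lt;= len &lt;= partition_len). *)
let safe_ok ~partition_len ~start ~len =
  len &gt;= 0L &amp;&amp; len &lt;= partition_len
  &amp;&amp; start &gt;= 0L &amp;&amp; start &lt;= Int64.sub partition_len len

let () =
  let partition_len = 100L in
  (* The naive check wrongly accepts a request for sector Int64.max_int. *)
  assert (naive_ok ~partition_len ~start:Int64.max_int ~len:1L);
  assert (not (safe_ok ~partition_len ~start:Int64.max_int ~len:1L));
  (* Ordinary requests behave as expected. *)
  assert (safe_ok ~partition_len ~start:99L ~len:1L);
  assert (not (safe_ok ~partition_len ~start:100L ~len:1L))
</code></pre></td></tr></tbody></table></div></figure>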
<p>Still, this is a nice demonstration of how we can add security in Mirage by inserting a new module (<code>Mbr_partition</code>) between two existing ones.
Rather than having some complicated fixed policy language (e.g. SELinux), we can build whatever security abstractions we like.
Here I just limited which parts of the disk the queue could access, but we could do many other things: make a partition read-only, make it readable only until the unikernel finishes initialising, apply rate-limiting on reads, etc.</p>
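<p>For instance, a wrapper that is read-only and can be switched off once initialisation is done might look roughly like this. It is a sketch against a simplified, synchronous <code>BLOCK</code> signature invented for the example (not Mirage's <code>V1_LWT.BLOCK</code>), and <code>Revocable</code>, <code>Ram</code> and <code>Key_store</code> are hypothetical names:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical, simplified block-device signature (not Mirage's API). *)
module type BLOCK = sig
  type t
  val read : t -&gt; int -&gt; (char, string) result
  val write : t -&gt; int -&gt; char -&gt; (unit, string) result
end

(* A toy in-memory device: one byte per &quot;sector&quot;. *)
module Ram : BLOCK with type t = Bytes.t = struct
  type t = Bytes.t
  let read t i =
    if i &lt; 0 || i &gt;= Bytes.length t then Error &quot;out of range&quot;
    else Ok (Bytes.get t i)
  let write t i c =
    if i &lt; 0 || i &gt;= Bytes.length t then Error &quot;out of range&quot;
    else Ok (Bytes.set t i c)
end

(* Read-only access that the creator can revoke at any time. *)
module Revocable (B : BLOCK) : sig
  include BLOCK
  val make : B.t -&gt; t
  val revoke : t -&gt; unit
end = struct
  type t = { b : B.t; mutable live : bool }
  let make b = { b; live = true }
  let revoke t = t.live &lt;- false
  let read t i = if t.live then B.read t.b i else Error &quot;revoked&quot;
  let write _ _ _ = Error &quot;read-only&quot;
end

module Key_store = Revocable (Ram)

let () =
  let key = Key_store.make (Bytes.of_string &quot;secret&quot;) in
  assert (Key_store.read key 0 = Ok 's');
  assert (Key_store.write key 0 'x' = Error &quot;read-only&quot;);
  (* Once initialisation has finished, drop access entirely. *)
  Key_store.revoke key;
  assert (Key_store.read key 0 = Error &quot;revoked&quot;)
</code></pre></td></tr></tbody></table></div></figure>
<p>Wrappers like this compose: the result is itself a <code>BLOCK</code>, so it can be passed to any functor that expects one.</p>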
<p>Here's the final code. It:</p>
<ol>
<li>Takes a block device, a conduit, and a KV store as inputs.
</li>
<li>Creates two partitions (views) onto the block device.
</li>
<li>Creates a queue on one partition.
</li>
<li>Reads the private key from the other, and the certificate from the KV store.
</li>
<li>Begins serving HTTPS requests.
</li>
</ol>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">B</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">BLOCK</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">Conduit</span> <span class="o">:</span> <span class="nn">Conduit_mirage</span><span class="p">.</span><span class="nc">S</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">CertStore</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">KV_RO</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Part</span> <span class="o">=</span> <span class="nn">Mbr_partition</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">B</span><span class="o">)</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Q</span> <span class="o">=</span> <span class="nn">Upload_queue</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">Part</span><span class="o">)</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Http</span> <span class="o">=</span> <span class="nn">HTTP</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">Conduit</span><span class="o">)</span>
</span><span class="line">
</span><span class="line">  <span class="o">[...]</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="n">b</span> <span class="n">conduit</span> <span class="n">cert_store</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">B</span><span class="p">.</span><span class="n">get_info</span> <span class="n">b</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">info</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">key_sectors</span> <span class="o">=</span> <span class="o">(</span><span class="n">key_partition_size</span> <span class="o">+</span> <span class="n">info</span><span class="o">.</span><span class="nn">B</span><span class="p">.</span><span class="n">sector_size</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span> <span class="o">/</span>
</span><span class="line">                      <span class="n">info</span><span class="o">.</span><span class="nn">B</span><span class="p">.</span><span class="n">sector_size</span> <span class="o">|&gt;</span> <span class="nn">Int64</span><span class="p">.</span><span class="n">of_int</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">queue_sectors</span> <span class="o">=</span> <span class="n">info</span><span class="o">.</span><span class="nn">B</span><span class="p">.</span><span class="n">size_sectors</span> <span class="o">--</span> <span class="n">key_sectors</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">    <span class="nn">Part</span><span class="p">.</span><span class="o">(</span><span class="n">connect</span> <span class="o">{</span><span class="n">b</span><span class="o">;</span> <span class="n">start_sector</span> <span class="o">=</span> <span class="mi">0</span><span class="n">L</span><span class="o">;</span>
</span><span class="line">                      <span class="n">length_sectors</span> <span class="o">=</span> <span class="n">key_sectors</span><span class="o">})</span>
</span><span class="line">    <span class="o">&gt;&gt;|=</span> <span class="k">fun</span> <span class="n">key_partition</span> <span class="o">-&gt;</span>
</span><span class="line">
</span><span class="line">    <span class="nn">Part</span><span class="p">.</span><span class="o">(</span><span class="n">connect</span> <span class="o">{</span><span class="n">b</span><span class="o">;</span> <span class="n">start_sector</span> <span class="o">=</span> <span class="n">key_sectors</span><span class="o">;</span>
</span><span class="line">                      <span class="n">length_sectors</span> <span class="o">=</span> <span class="n">queue_sectors</span><span class="o">})</span>
</span><span class="line">    <span class="o">&gt;&gt;|=</span> <span class="k">fun</span> <span class="n">queue_partition</span> <span class="o">-&gt;</span>
</span><span class="line">
</span><span class="line">    <span class="nn">Q</span><span class="p">.</span><span class="n">create</span> <span class="n">queue_partition</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">q</span> <span class="o">-&gt;</span>
</span><span class="line">
</span><span class="line">    <span class="n">lwt</span> <span class="n">certs</span> <span class="o">=</span>
</span><span class="line">      <span class="n">read_from_kv</span> <span class="n">cert_store</span> <span class="s2">&quot;tls/server.pem&quot;</span>
</span><span class="line">      <span class="o">&gt;|=</span> <span class="nn">X509</span><span class="p">.</span><span class="nn">Cert</span><span class="p">.</span><span class="n">of_pem_cstruct</span> <span class="k">in</span>
</span><span class="line">    <span class="n">lwt</span> <span class="n">pk</span> <span class="o">=</span>
</span><span class="line">      <span class="n">read_from_partition</span> <span class="n">key_partition</span> <span class="o">~</span><span class="n">len</span><span class="o">:</span><span class="n">private_key_len</span>
</span><span class="line">      <span class="o">&gt;|=</span> <span class="nn">X509</span><span class="p">.</span><span class="nn">PK</span><span class="p">.</span><span class="n">of_pem_cstruct1</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">tls_config</span> <span class="o">=</span> <span class="nn">Tls</span><span class="p">.</span><span class="nn">Config</span><span class="p">.</span><span class="o">(</span><span class="n">of_server</span>
</span><span class="line">      <span class="o">(</span><span class="n">server</span> <span class="o">~</span><span class="n">certificate</span><span class="o">:(</span><span class="n">certs</span><span class="o">,</span> <span class="n">pk</span><span class="o">)</span> <span class="bp">()</span><span class="o">))</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">    <span class="k">let</span> <span class="n">spec</span> <span class="o">=</span> <span class="nn">Http</span><span class="p">.</span><span class="nn">Server</span><span class="p">.</span><span class="n">make</span> <span class="o">~</span><span class="n">callback</span><span class="o">:(</span><span class="n">callback</span> <span class="n">q</span><span class="o">)</span> <span class="o">~</span><span class="n">conn_closed</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">mode</span> <span class="o">=</span> <span class="o">`</span><span class="nc">TLS</span> <span class="o">(</span><span class="n">tls_config</span><span class="o">,</span> <span class="o">`</span><span class="nc">TCP</span> <span class="o">(`</span><span class="nc">Port</span> <span class="mi">8443</span><span class="o">))</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">Conduit</span><span class="p">.</span><span class="n">serve</span> <span class="o">~</span><span class="n">ctx</span><span class="o">:</span><span class="n">conduit</span> <span class="o">~</span><span class="n">mode</span> <span class="o">(</span><span class="nn">Http</span><span class="p">.</span><span class="nn">Server</span><span class="p">.</span><span class="n">listen</span> <span class="n">spec</span><span class="o">)</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="entropy">Entropy</h3>
<p>Another issue is getting good random numbers, which the cryptography requires.
On start-up, the unikernel displayed:</p>
<pre><code>Entropy_xen_weak: using a weak entropy source seeded only from time.
</code></pre>
<p>To fix this, you need to use <a href="https://github.com/djs55/mirage-entropy">Dave Scott's version</a> of <code>mirage-entropy</code> (with a slight patch from me):</p>
<pre><code>opam pin add mirage-entropy-xen 'https://github.com/talex5/mirage-entropy.git#handshake'
</code></pre>
<p>You should then see:</p>
<pre><code>Entropy_xen: attempting to connect to Xen entropy source org.openmirage.entropy.1
Console.connect org.openmirage.entropy.1: doesn't currently exist, waiting for hotplug
</code></pre>
<p>Now run <a href="https://github.com/mirage/xentropyd">xentropyd</a> in dom0 to share the host's entropy with guests.</p>
<p>The interesting question here is what Linux guests do for entropy, especially on ARM where there's no <code>RdRand</code> instruction.</p>
<h2 id="access-control">Access control</h2>
<p>Traditional means of access control involve issuing users with passwords or X.509 client certificates, which they share with the software they're running.
All requests sent by the client can then be authenticated as coming from them and approved based on some access control policy.
This approach leads to all the well-known problems with traditional access control: the <a href="http://en.wikipedia.org/wiki/Confused_deputy_problem">confused deputy problem</a>, cross-site request forgery, clickjacking, etc., so I want to avoid that kind of &quot;ambient authority&quot;.</p>
<p>The previous diagram let us reason about how the different components within the unikernel could interact with each other, showing the possible (initial) interactions with arrows.
Now I want to stretch arrows across the Internet, so I can reason in the same way about the larger distributed system that includes my queue service with the uploaders and downloaders.</p>
<p>Like C pointers, traditional web URLs do not give us what we want: a compromised CA anywhere in the world will allow an attacker to impersonate our service, and our URLs may be guessable.
Instead, I decided to try a <a href="http://www.waterken.com/dev/YURL/Definition/">YURL</a>:</p>
<blockquote>
<p>&quot;[...] the identifier MUST provide enough information to: locate the target site; authenticate the target site; and, if required, establish a private communication channel with the target site. A URL that meets these requirements is a YURL.&quot;</p>
</blockquote>
<p>The latest version of this (draft) scheme I could find was some brief notes in <a href="http://iiw.idcommons.net/HTTPSY_%E2%80%93_Leave_the_Certificate_Authority_Behind">HTTPSY</a> (2014), which uses the format:</p>
<pre><code>httpsy://algorithm:fingerprint@domain:port/path1/!redactedPath2/…
</code></pre>
<p>There are two parts we need to consider: how the client determines that it is connected to the real service, and how the service determines what the client can do.</p>
<p>To let the client authenticate the server without relying on the CA system, YURLs include a hash (fingerprint) of the server's public key.
Here I use the SHA-256 fingerprint of the server's whole X.509 certificate, which you can get like this:</p>
<pre><code>$ openssl x509 -in server.pem -fingerprint -sha256 -noout
SHA256 Fingerprint=3F:27:2D:E6:D6:3D:7C:08:E0:E3:EF:02:A8:DA:9A:74:62:84:57:21:B4:72:39:FD:D0:72:0E:76:71:A5:E9:94
</code></pre>
<p>Base32-encoding the raw bytes of the hash (lower-cased, with the <code>=</code> padding removed) shortens this to <code>h4ts3zwwhv6aryhd54bkrwu2orriivzbwrzdt7oqoihhm4nf5gka</code>.</p>
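<p>As a quick cross-check, here is a minimal Python sketch of the same conversion, assuming the colon-separated hex output from <code>openssl</code> shown above. The helper name <code>fingerprint_to_base32</code> is made up for this illustration:</p>

```python
import base64


def fingerprint_to_base32(openssl_fingerprint):
    """Convert colon-separated hex (as printed by openssl) to Base32.

    The result is lower-cased and has its '=' padding removed, which is
    the form used in the YURLs in this post.
    """
    raw = bytes.fromhex(openssl_fingerprint.replace(":", ""))
    return base64.b32encode(raw).decode("ascii").lower().rstrip("=")


hex_fp = (
    "3F:27:2D:E6:D6:3D:7C:08:E0:E3:EF:02:A8:DA:9A:74:"
    "62:84:57:21:B4:72:39:FD:D0:72:0E:76:71:A5:E9:94"
)
print(fingerprint_to_base32(hex_fp))
# h4ts3zwwhv6aryhd54bkrwu2orriivzbwrzdt7oqoihhm4nf5gka
```

<p>A SHA-256 digest is 32 bytes, so the unpadded Base32 form is always 52 characters, against 95 for the colon-separated hex.</p>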
<p>Alternatively, to get the value with OCaml, use:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Certificate</span><span class="p">.</span><span class="n">cs_of_cert</span> <span class="n">cert</span> <span class="o">|&gt;</span> <span class="nn">Nocrypto</span><span class="p">.</span><span class="nn">Hash</span><span class="p">.</span><span class="n">digest</span> <span class="o">`</span><span class="nc">SHA256</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>To control what each user of the service can do, we give each user a unique YURL containing a <a href="http://wiki.erights.org/wiki/Swiss_number">Swiss number</a>, which is like a password except that it applies only to a specific resource, not to the whole site.
The Swiss number comes after the <code>!</code> in the URL, which indicates to browsers that it shouldn't be displayed, included in Referer headers, etc.
You can use any long, unguessable random string here (I used <code>pwgen 32 1</code>).
After checking the server's fingerprint, the client requests the path with the Swiss number included.</p>
<p>Putting it all together, then, a sample URL to give to the downloader looks like this:</p>
<pre><code>httpsy://sha256:h4ts3zwwhv6aryhd54bkrwu2orriivzbwrzdt7oqoihhm4nf5gka@10.0.0.2:8443/downloader/!eequuthieyexahzahShain0abeiwaej4
</code></pre>
<p>The old code for handling requests looked like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">match</span> <span class="nn">Uri</span><span class="p">.</span><span class="n">path</span> <span class="n">request</span><span class="o">.</span><span class="nn">Cohttp</span><span class="p">.</span><span class="nn">Request</span><span class="p">.</span><span class="n">uri</span> <span class="k">with</span>
</span><span class="line"><span class="o">|</span> <span class="s2">&quot;/uploader&quot;</span> <span class="o">-&gt;</span> <span class="n">handle_uploader</span> <span class="n">q</span> <span class="n">request</span> <span class="n">body</span>
</span><span class="line"><span class="o">|</span> <span class="s2">&quot;/downloader&quot;</span> <span class="o">-&gt;</span> <span class="n">handle_downloader</span> <span class="n">q</span> <span class="n">request</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This becomes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">re_endpoint</span> <span class="o">=</span> <span class="nn">Str</span><span class="p">.</span><span class="n">regexp</span> <span class="s2">&quot;^/</span><span class="se">\\</span><span class="s2">(uploader</span><span class="se">\\</span><span class="s2">|downloader</span><span class="se">\\</span><span class="s2">)/!</span><span class="se">\\</span><span class="s2">(.*</span><span class="se">\\</span><span class="s2">)$&quot;</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="nn">Uri</span><span class="p">.</span><span class="n">path</span> <span class="n">request</span><span class="o">.</span><span class="nn">Cohttp</span><span class="p">.</span><span class="nn">Request</span><span class="p">.</span><span class="n">uri</span> <span class="k">in</span>
</span><span class="line"><span class="k">if</span> <span class="nn">Str</span><span class="p">.</span><span class="n">string_match</span> <span class="n">re_endpoint</span> <span class="n">path</span> <span class="mi">0</span> <span class="k">then</span> <span class="o">(</span>
</span><span class="line">  <span class="k">let</span> <span class="n">label</span> <span class="o">=</span> <span class="nn">Str</span><span class="p">.</span><span class="n">matched_group</span> <span class="mi">1</span> <span class="n">path</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">swiss_hash</span> <span class="o">=</span> <span class="nn">Str</span><span class="p">.</span><span class="n">matched_group</span> <span class="mi">2</span> <span class="n">path</span> <span class="o">|&gt;</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">of_string</span>
</span><span class="line">    <span class="o">|&gt;</span> <span class="nn">Nocrypto</span><span class="p">.</span><span class="nn">Hash</span><span class="p">.</span><span class="n">digest</span> <span class="o">`</span><span class="nc">SHA256</span> <span class="o">|&gt;</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">to_string</span>
</span><span class="line">    <span class="o">|&gt;</span> <span class="nn">B64</span><span class="p">.</span><span class="n">encode</span> <span class="k">in</span>
</span><span class="line">  <span class="k">match</span> <span class="n">label</span><span class="o">,</span> <span class="n">swiss_hash</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="s2">&quot;uploader&quot;</span><span class="o">,</span> <span class="s2">&quot;kW8VOKYP1eu/cWInpJx/jzYDSJzo1RUR14GoxV/CImM=&quot;</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">handle_uploader</span> <span class="n">q</span> <span class="n">request</span> <span class="n">body</span> <span class="o">~</span><span class="n">user</span><span class="o">:</span><span class="s2">&quot;Alice&quot;</span>
</span><span class="line">  <span class="o">|</span> <span class="s2">&quot;downloader&quot;</span><span class="o">,</span> <span class="s2">&quot;PEj3nuboy3BktVGzi9y/UBmgkrGuhHD1i6WsXDw1naI=&quot;</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">handle_downloader</span> <span class="n">q</span> <span class="n">request</span> <span class="o">~</span><span class="n">user</span><span class="o">:</span><span class="s2">&quot;0repo&quot;</span>
</span><span class="line">  <span class="o">|</span> <span class="o">_</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">bad_path</span> <span class="n">path</span>
</span><span class="line"><span class="o">)</span> <span class="k">else</span> <span class="n">bad_path</span> <span class="n">path</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I hashed the Swiss number here so that the unikernel doesn't have to contain any secrets and I therefore don't have to worry about timing attacks.
Even if the attacker knows the hash we're looking for, they still shouldn't be able to generate a URL which hashes to that value.</p>
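<p>To produce the hash for a new user's entry, something like the following Python sketch should give the same Base64-encoded SHA-256 value as the OCaml pipeline above. The helper names and the <code>grants</code> table are made up for this illustration, and <code>secrets.token_urlsafe</code> stands in for <code>pwgen</code>:</p>

```python
import base64
import hashlib
import secrets


def new_swiss_number(nbytes=24):
    """Return a fresh URL-safe random string (hypothetical helper)."""
    return secrets.token_urlsafe(nbytes)


def swiss_hash(swiss_number):
    """Base64-encoded SHA-256 digest of the Swiss number."""
    digest = hashlib.sha256(swiss_number.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Illustrative table: (endpoint label, hash of Swiss number) maps to a user.
# Only the hash is stored, so the table itself contains no secrets.
alice_secret = new_swiss_number()
grants = {("uploader", swiss_hash(alice_secret)): "Alice"}


def authorise(label, presented_swiss_number):
    """Return the user name for a valid label and Swiss number, or None."""
    return grants.get((label, swiss_hash(presented_swiss_number)))


print(authorise("uploader", alice_secret))    # Alice
print(authorise("downloader", alice_secret))  # None
```

<p>Revoking one user's access then just means deleting their entry from the table; the other users' URLs keep working.</p>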
<p>By giving each user of the service a different Swiss number we can keep records of who authorised each request and revoke access individually if needed (here the <code>~user:&quot;Alice&quot;</code> indicates this is the uploader URL we gave to Alice).</p>
<p>Of course, the YURLs need to be sent to users securely too.
In my case, the users already have known GPG keys, so I can just email them an encrypted version.</p>
<h3 id="python-client">Python client</h3>
<p>The downloader (0repo) is written in Python, so the next step was to check that it could still access the service.
The Python SSL API was rather confusing, but this seems to work:</p>
<figure class="code"><figcaption><span>client.py</span></figcaption><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="python"><span class="line"><span class="ch">#!/usr/bin/python3</span>
</span><span class="line">
</span><span class="line"><span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlparse</span>
</span><span class="line"><span class="kn">from</span> <span class="nn">http</span> <span class="kn">import</span> <span class="n">client</span>
</span><span class="line">
</span><span class="line"><span class="kn">import</span> <span class="nn">ssl</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">hashlib</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">base64</span>
</span><span class="line">
</span><span class="line"><span class="c1"># Note: MUST check fingerprint BEFORE sending URL path with Swiss number</span>
</span><span class="line"><span class="k">class</span> <span class="nc">FingerprintContext</span><span class="p">:</span>
</span><span class="line">    <span class="n">required_fingerprint</span> <span class="o">=</span> <span class="kc">None</span>
</span><span class="line">    <span class="n">verify_mode</span> <span class="o">=</span> <span class="n">ssl</span><span class="o">.</span><span class="n">CERT_REQUIRED</span>
</span><span class="line">    <span class="n">check_hostname</span> <span class="o">=</span> <span class="kc">False</span>
</span><span class="line">
</span><span class="line">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fingerprint</span><span class="p">):</span>
</span><span class="line">        <span class="bp">self</span><span class="o">.</span><span class="n">required_fingerprint</span> <span class="o">=</span> <span class="n">fingerprint</span>
</span><span class="line">
</span><span class="line">    <span class="k">def</span> <span class="nf">wrap_socket</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sock</span><span class="p">,</span> <span class="n">server_hostname</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
</span><span class="line">        <span class="n">wrapped</span> <span class="o">=</span> <span class="n">ssl</span><span class="o">.</span><span class="n">wrap_socket</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</span><span class="line">        <span class="n">cert</span> <span class="o">=</span> <span class="n">wrapped</span><span class="o">.</span><span class="n">getpeercert</span><span class="p">(</span><span class="n">binary_form</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
</span><span class="line">        <span class="n">actual_fingerprint</span> <span class="o">=</span> <span class="n">base64</span><span class="o">.</span><span class="n">b32encode</span><span class="p">(</span><span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">cert</span><span class="p">)</span><span class="o">.</span><span class="n">digest</span><span class="p">())</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">&#39;ascii&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;=&#39;</span><span class="p">)</span>
</span><span class="line">        <span class="k">if</span> <span class="n">actual_fingerprint</span> <span class="o">!=</span> <span class="bp">self</span><span class="o">.</span><span class="n">required_fingerprint</span><span class="p">:</span>
</span><span class="line">            <span class="k">raise</span> <span class="ne">Exception</span><span class="p">(</span><span class="s2">&quot;Expected server certificate to have fingerprint:</span><span class="se">\n</span><span class="si">%s</span><span class="s2"> but got:</span><span class="se">\n</span><span class="si">%s</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">required_fingerprint</span><span class="p">,</span> <span class="n">actual_fingerprint</span><span class="p">))</span>
</span><span class="line">        <span class="k">return</span> <span class="n">wrapped</span>
</span><span class="line">
</span><span class="line"><span class="c1"># Testing...</span>
</span><span class="line"><span class="n">url</span> <span class="o">=</span> <span class="s2">&quot;httpsy://sha256:h4ts3zwwhv6aryhd54bkrwu2orriivzbwrzdt7oqoihhm4nf5gka@localhost:8443/downloader/!eequuthieyexahzahShain0abeiwaej4&quot;</span>
</span><span class="line"><span class="n">parsed</span> <span class="o">=</span> <span class="n">urlparse</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</span><span class="line"><span class="k">assert</span> <span class="n">parsed</span><span class="o">.</span><span class="n">scheme</span> <span class="o">==</span> <span class="s2">&quot;httpsy&quot;</span>
</span><span class="line"><span class="k">assert</span> <span class="n">parsed</span><span class="o">.</span><span class="n">username</span> <span class="o">==</span> <span class="s2">&quot;sha256&quot;</span>
</span><span class="line">
</span><span class="line"><span class="n">conn</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">HTTPSConnection</span><span class="p">(</span>
</span><span class="line">    <span class="n">host</span> <span class="o">=</span> <span class="n">parsed</span><span class="o">.</span><span class="n">hostname</span><span class="p">,</span>
</span><span class="line">    <span class="n">port</span> <span class="o">=</span> <span class="n">parsed</span><span class="o">.</span><span class="n">port</span><span class="p">,</span>
</span><span class="line">    <span class="n">context</span> <span class="o">=</span> <span class="n">FingerprintContext</span><span class="p">(</span><span class="n">parsed</span><span class="o">.</span><span class="n">password</span><span class="p">),</span>
</span><span class="line">    <span class="n">check_hostname</span> <span class="o">=</span> <span class="kc">False</span><span class="p">)</span>
</span><span class="line"><span class="n">conn</span><span class="o">.</span><span class="n">request</span><span class="p">(</span><span class="s2">&quot;GET&quot;</span><span class="p">,</span> <span class="n">parsed</span><span class="o">.</span><span class="n">path</span><span class="p">)</span>
</span><span class="line"><span class="n">resp</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">getresponse</span><span class="p">()</span>
</span><span class="line"><span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Fetching item of size </span><span class="si">%s</span><span class="s2">...&quot;</span> <span class="o">%</span> <span class="n">resp</span><span class="o">.</span><span class="n">getheader</span><span class="p">(</span><span class="s1">&#39;Content-Length&#39;</span><span class="p">))</span>
</span><span class="line"><span class="n">d</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</span><span class="line"><span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Fetched </span><span class="si">%d</span><span class="s2"> bytes&quot;</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">d</span><span class="p">))</span>
</span></code></pre></td></tr></tbody></table></div></figure><h2 id="conclusions">Conclusions</h2>
<p>MirageOS should allow us to build systems that are far more secure than those built on traditional operating systems.
By starting with isolated components and then connecting them together in a controlled way we can feel some confidence that our security goals will be met.</p>
<p>At the language level, OCaml's abstract types and functors make it easy to reason informally about how the components of our system will interact. Mirage passes values granting access to the outside world (disks, network cards, etc) to our unikernel's <code>start</code> function.
Our code can then delegate these permissions to the rest of the code in a controlled fashion.
For example, we can grant the queuing code access only to its part of the disk (and not the bit containing the TLS private key) by wrapping the disk in a partition functor.
Although OCaml doesn't actually prevent us from bypassing this system and accessing devices directly, code that does so would not be able to support the multiple different back-ends (e.g. Unix and Xen) that Mirage requires and so could not be written accidentally.
It should be possible for a static analysis tool to verify that modules don't do this.</p>
<p>Moving up a level from separating the components of our unikernel, Xen allows us to isolate multiple unikernels and other VMs running on a single physical host.
Just as we interposed a disk partition between the queue and the disk within the unikernel, we can use Xen to interpose a firewall VM between the physical network device and our unikernel.</p>
<p>Finally, the use of transport layer security and YURLs allows us to continue this pattern of isolation to the level of networks, so that we can reason in the same way about distributed systems.
My current code mixes the handling of YURLs with the existing application logic, but it should be possible to abstract this and make it reusable, so that remote services appear just like any local service.
In many systems this is awkward because local APIs are used synchronously while remote ones are asynchronous, but in Mirage everything is non-blocking anyway, so there is no difference.</p>
<p>I feel I should put some kind of warning here about these very new security features not being ready for real use and how you should instead use mature, widely deployed systems such as Linux and OpenSSL.
But really, can it be any worse?</p>
<p>If you've spotted any flaws in my reasoning or code, please add comments!
The code for this unikernel can be found on the <code>tls</code> branch at <a href="https://github.com/0install/0repo-queue/tree/tls">https://github.com/0install/0repo-queue/tree/tls</a>.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Visualising an asynchronous monad</title>
    <link href="https://roscidus.com/blog/blog/2014/10/27/visualising-an-asynchronous-monad/"></link>
    <updated>2014-10-27T07:21:04+00:00</updated>
    <id>https://roscidus.com/blog/blog/2014/10/27/visualising-an-asynchronous-monad</id>
    <content type="html"><![CDATA[<p>Many asynchronous programs make use of <a href="http://en.wikipedia.org/wiki/Promise_%28programming%29">promises</a> (also known as using <em>light-weight threads</em> or an <em>asynchronous monad</em>) to manage concurrency.
I've been working on tools to collect trace data from such programs and visualise the results, to help with profiling and debugging.</p>
<p>The diagram below shows a trace from a <a href="http://openmirage.org/">Mirage unikernel</a> reading data from disk in a loop.
You should be able to pan around by dragging in the diagram, and zoom by using your mouse's scroll wheel.
If you're on a mobile device then pinch-to-zoom should work if you follow the full-screen link, although it will probably be slow.
If nothing else works, the ugly zoom buttons at the bottom zoom around the last point clicked.</p>
<!-- more -->
<p>
<canvas style='width: 100%; height: 500px' id='canvas-intro'>
  <img src='/blog/images/mirage-profiling/trace-viewer-console.png' width='622' height='484' alt='Mirage Trace Viewer screenshot'><br/>
  <strong>WARNING: No HTML canvas support (this is just a static image)! Try a newer browser...</strong>
</canvas>
</p>
<p><a href="/blog/javascripts/trace-viewer.html?trace=intro">View full screen</a></p>
<p>The web viewer requires JavaScript and HTML canvas support.
If it doesn't work, you can also build the <a href="https://github.com/talex5/mirage-trace-viewer">trace viewer</a> as a (much faster) native GTK application.</p>
<p>In this post I'll explain how to read these diagrams, and how to trace your own programs.</p>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=8514923">Hacker News</a> and <a href="http://www.reddit.com/r/programming/comments/2kgp50/visualising_an_asynchronous_monad/">Reddit</a>.)</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
<ul>
<li><a href="#bind">Bind</a>
</li>
<li><a href="#join">Join</a>
</li>
<li><a href="#choose">Choose</a>
</li>
<li><a href="#pick">Pick</a>
</li>
<li><a href="#exceptions">Exceptions</a>
</li>
</ul>
</li>
<li><a href="#making-your-own-traces">Making your own traces</a>
</li>
<li><a href="#examples">Examples</a>
<ul>
<li><a href="#profiling-the-console">Profiling the console</a>
</li>
<li><a href="#udp-transmission">UDP transmission</a>
</li>
<li><a href="#tcp-transmission">TCP transmission</a>
</li>
<li><a href="#disk-access">Disk access</a>
</li>
</ul>
</li>
<li><a href="#implementation-notes">Implementation notes</a>
</li>
<li><a href="#summary">Summary</a>
</li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>Many asynchronous programs make use of <em>promises</em> (also known as using <em>light-weight threads</em> or an <em>asynchronous monad</em>).
A promise/thread is a place-holder for a value that will arrive in the future.</p>
<p>Here's a really simple example (an OCaml program using <a href="http://ocsigen.org/lwt/">Lwt</a>).
It creates a thread that resolves to unit (void) after one second:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nc">Lwt</span>
</span><span class="line"><span class="k">open</span> <span class="nc">Lwt_unix</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">example_1</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">sleep</span> <span class="mi">1</span><span class="o">.</span><span class="mi">0</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="c">(* Run the main loop until example_1 resolves. *)</span>
</span><span class="line">  <span class="nn">Lwt_main</span><span class="p">.</span><span class="n">run</span> <span class="o">(</span><span class="n">example_1</span> <span class="bp">()</span><span class="o">);</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><canvas style='width: 100%; height: 106px' id='canvas-example_1'></canvas></p>
<p>In the diagram, time runs from left to right.
Threads (promises) are shown as horizontal lines.
The diagram shows:</p>
<ul>
<li>Initially, only the main thread (&quot;0&quot;) is present.
</li>
<li>The main thread then creates a new thread, labelled &quot;sleep&quot;.
</li>
<li>The sleep thread is a &quot;task&quot; thread (this is just a helpful label added when the thread is created).
</li>
<li>The whole process then goes to sleep for one second, which is shown by the darker background.
</li>
<li>At the end of the second, the process wakes up and resolves the sleep thread to its value (unit), shown by the green arrow.
</li>
<li>The main thread, which was waiting for the sleep thread, reads the value (the blue arrow) and exits the program.
</li>
</ul>
<p>If you zoom in on the arrows (go down to a grid division of about 10 microseconds), you'll also see a white segment on the main thread, which shows when it was running (only one thread runs at a time).</p>
<p>Because thread 0 is actually the main event loop (rather than a Lwt thread), things are a little more complicated than normal.
When the process has nothing to do, thread 0 puts the process to sleep until the next scheduled timer.
When the OS wakes the process, thread 0 resumes, determines that the &quot;sleep&quot; thread can be resolved, and does so.
This causes any callbacks registered on the sleep thread to be called, but in this case there aren't any and control returns to thread 0.
Thread 0 then checks the sleep thread (because that determines when to finish), and ends the loop because it's resolved.</p>
<h3 id="bind">Bind</h3>
<p>Callbacks can be attached to a promise/thread to process the value when it arrives.
Attaching the callback immediately creates a new promise for the final result of the callback.</p>
<p>Here's a program that sleeps twice in series.
The <code>&gt;&gt;=</code> (bind) operator attaches the callback function to the first thread.
I've made the sleeps very short so you can see the process waking up without having to zoom.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example_2</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">00001</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">00001</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><canvas style='width: 100%; height: 136px' id='canvas-example_2'></canvas></p>
<p>In this case, the main thread creates two new threads at the start: one for the result of the first sleep and a second (&quot;bind&quot;) for the result of running the callback on the result.
It's easier to see how the first thread is resolved here: the main thread handling the polling loop wakes up and resolves the sleep thread, which then causes the bind thread to resume.</p>
<p>You might wonder why the bind thread disappears when the second sleep starts.
It hasn't finished, but when the bind's callback function returns the second sleep thread as its result, the bind thread is merged with the sleep thread.
This is the asynchronous equivalent of a tail call optimisation, allowing us to create loops without needing an unbounded number of threads.</p>
<p>Actually, displaying binds in this way tends to clutter up the diagrams, so the viewer has a simplification rule that is enabled by default: if the first event on a bind thread is a read, the part of the bind up to that point isn't drawn.
Therefore, the default display for this program is:</p>
<p><canvas style='width: 100%; height: 106px' id='canvas-example_2b'></canvas></p>
<p>If you zoom in on the central green arrow, you can see the tiny remaining bind thread between the two sleeps.</p>
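<p>To make the point about loops concrete, here is a small sketch of a loop built from binds.
It assumes the <code>lwt</code> and <code>lwt.unix</code> packages are available; <code>sleep_n</code> is just a name I picked for the example.
Each callback returns the promise for the rest of the loop, which is the case where the bind thread gets merged rather than left waiting:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">open Lwt.Infix

(* Sleep [n] times in series. The callback's result is the promise for the
   remaining iterations, so each bind can be merged with it. *)
let rec sleep_n n =
  if n = 0 then Lwt.return ()
  else Lwt_unix.sleep 0.00001 &gt;&gt;= fun () -&gt; sleep_n (n - 1)

let () =
  (* Run the main loop until the whole chain has resolved. *)
  Lwt_main.run (sleep_n 100)
</code></pre></td></tr></tbody></table></div></figure>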
<h3 id="join">Join</h3>
<p><code>Lwt.join</code> waits for a collection of threads to finish:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example_3</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">join</span> <span class="o">[</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">003</span><span class="o">;</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">001</span><span class="o">;</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">002</span><span class="o">;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><canvas style='width: 100%; height: 196px' id='canvas-example_3'></canvas></p>
<p>In the trace, you can see the join thread being notified each time one of the threads it's waiting for completes.
When they're all done, it resolves itself.</p>
<h3 id="choose">Choose</h3>
<p><code>Lwt.choose</code> is similar to join, but only waits until one of its threads finishes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example_4</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">choose</span> <span class="o">[</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">003</span><span class="o">;</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">00001</span><span class="o">;</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">002</span><span class="o">;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><canvas style='width: 100%; height: 196px' id='canvas-example_4'></canvas></p>
<p>I cheated a bit here.
To avoid clutter, the viewer only draws each thread until its last recorded event (without this, threads that get garbage collected span the whole width of the trace), so I used <code>Profile.label ~thread &quot;(continues)&quot;</code> to create extra label events on the two remaining threads to make it clearer what's happening here.</p>
<h3 id="pick">Pick</h3>
<p><code>Lwt.pick</code> is similar to choose, but additionally cancels the other threads:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example_5</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">pick</span> <span class="o">[</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">003</span><span class="o">;</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">00001</span><span class="o">;</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">001</span><span class="o">;</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><canvas style='width: 100%; height: 196px' id='canvas-example_5'></canvas></p>
<h3 id="exceptions">Exceptions</h3>
<p>Failed threads are shown with a red bar at the end and the exception message is displayed.
Also, any &quot;reads&quot; arrow coming from it is shown in red rather than blue.
Here, the bind thread fails but the try one doesn't because it catches the exception and returns unit.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">example_6</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">try_lwt</span>
</span><span class="line">    <span class="n">sleep</span> <span class="mi">0</span><span class="o">.</span><span class="mi">0001</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">failwith</span> <span class="s2">&quot;oops&quot;</span>
</span><span class="line">  <span class="k">with</span> <span class="o">_</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">return</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: I'm using the Lwt syntax extension here for <code>try_lwt</code>, but you can use <code>Lwt.try_bind</code> if you prefer.</p>
<p><canvas style='width: 100%; height: 106px' id='canvas-example_6'></canvas></p>
<p>The same simplification done for &quot;bind&quot; threads also applies to &quot;try&quot; threads, so the try thread doesn't appear at all until you zoom in on the red arrow.</p>
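<p>For reference, here is a sketch of the same example without the syntax extension.
I've used <code>Lwt.catch</code> here rather than <code>Lwt.try_bind</code>, since there is nothing to do in the success case, and this assumes the <code>lwt</code> and <code>lwt.unix</code> packages are available:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">open Lwt.Infix

(* Like example_6: the bind fails with &quot;oops&quot;, and the handler turns the
   failure back into a successful unit result. *)
let example_6 () =
  Lwt.catch
    (fun () -&gt;
      Lwt_unix.sleep 0.0001 &gt;&gt;= fun () -&gt;
      failwith &quot;oops&quot;)
    (fun _ -&gt; Lwt.return ())

let () = Lwt_main.run (example_6 ())
</code></pre></td></tr></tbody></table></div></figure>
<p>An exception raised inside a bind callback is caught by Lwt and turned into a failed thread, which is why a plain <code>failwith</code> is enough here.</p>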
<h2 id="making-your-own-traces">Making your own traces</h2>
<p>Update: These instructions were out-of-date so I've removed them. See the <a href="https://github.com/mirage/mirage-profile">mirage-profile</a> page for up-to-date instructions.</p>
<h2 id="examples">Examples</h2>
<p>A few months ago, I made <a href="/blog/blog/2014/07/28/my-first-unikernel/">my first unikernel</a> - a REST service for queuing files as a Xen guest OS.
Unlike a normal guest, which would include a Linux kernel, init system, libc, shell, Apache, etc, a Mirage unikernel is a single executable, and almost pure OCaml (apart from malloc, the garbage collector, etc).
Unikernels can be very small and simple, and have a much smaller attack surface than traditional systems.</p>
<p>For my first attempt at <a href="/blog/blog/2014/08/15/optimising-the-unikernel/">optimising the unikernel</a>, I used OCaml's built-in profiling support.
This recorded function calls and how much time was spent in each one.
But I quickly discovered that CPU time was rarely the important factor - how the various asynchronous threads were scheduled was more important, and the function-level profile made it difficult to see this.</p>
<p>So, let's see how the new tracing does on my previous problems...</p>
<h3 id="profiling-the-console">Profiling the console</h3>
<p>In the previous profiling post, I generated this graph using LibreOffice:</p>
<p><img src="/blog/images/mirage-profiling/xen-console-messages.png" class="border center"/></p>
<p>As a reminder, Xen guests output to the console by writing the text to a shared memory page, increasing a counter to indicate this, and signalling dom0. The console logger in dom0 reads the data, increments another counter to confirm it got it, and signals back to the guest that that part of the buffer is free again.</p>
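<p>The counter handshake can be modelled in a few lines of plain OCaml.
This is only a toy sketch: the record fields, the 8-byte buffer and the function names are invented for illustration, and the real ring lives in a shared page with event-channel signalling where the comments say so:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Toy model of a console ring; names and sizes are illustrative only. *)
type ring = {
  buf : Bytes.t;
  mutable prod : int;  (* total bytes written by the guest *)
  mutable cons : int;  (* total bytes consumed by dom0 *)
}

let create size = { buf = Bytes.make size ' '; prod = 0; cons = 0 }

let free_space r = Bytes.length r.buf - (r.prod - r.cons)

(* Guest side: copy as much of [s] as fits and return the number of bytes
   written. A real guest would then signal dom0. *)
let write r s =
  let n = min (String.length s) (free_space r) in
  for i = 0 to n - 1 do
    Bytes.set r.buf ((r.prod + i) mod Bytes.length r.buf) s.[i]
  done;
  r.prod &lt;- r.prod + n;
  n

(* dom0 side: read everything outstanding, then advance the consumer counter
   to confirm receipt. A real backend would then signal the guest. *)
let drain r =
  let n = r.prod - r.cons in
  let out = String.init n (fun i -&gt;
    Bytes.get r.buf ((r.cons + i) mod Bytes.length r.buf)) in
  r.cons &lt;- r.prod;
  out

let () =
  let r = create 8 in
  assert (write r &quot;XXXXX\n&quot; = 6);
  assert (write r &quot;XXXXX\n&quot; = 2);  (* only two bytes fit: the guest must wait *)
  assert (drain r = &quot;XXXXX\nXX&quot;);
  assert (free_space r = 8)
</code></pre></td></tr></tbody></table></div></figure>
<p>Once the buffer is full, the writer can make no progress until the reader advances its counter, which is the waiting that shows up in the graph.</p>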
<p>To use the new tracing system, I added a <code>Profile.note_increase &quot;sent&quot; len</code> to the main loop, which increments a &quot;sent&quot; count on each iteration (i.e. each time we write a line to the console).
The viewer adds a mark on the trace for each increment and overlays a graph (the red line) so you can see overall progress easily:</p>
<p><canvas style='width: 100%; height: 300px' id='canvas-console'></canvas></p>
<p><a href="/blog/javascripts/trace-viewer.html?trace=console">View full screen</a> | <a href="/blog/data/traces/console.sexp">Download console.sexp</a></p>
<p>As before, we can see that we send messages rapidly in bursts, followed by long periods without progress.
Zooming in to the places where the red line is increasing, we can see the messages being written to the buffer without any delays.
Looking at the edges of the sleeping regions, it's clear that we're simply waiting for Xen to notify us of space by signalling us on event channel 2.</p>
<p>Here's the complete test code:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nc">Lwt</span>
</span><span class="line">
</span><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span><span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Profile</span><span class="p">.</span><span class="n">start</span> <span class="o">~</span><span class="n">size</span><span class="o">:</span><span class="mi">10000</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">String</span><span class="p">.</span><span class="n">make</span> <span class="o">(</span><span class="n">len</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span> <span class="sc">&#39;X&#39;</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">iterations</span> <span class="o">=</span> <span class="mi">1800</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">len</span> <span class="o">*</span> <span class="n">iterations</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="k">rec</span> <span class="n">loop</span> <span class="o">=</span> <span class="k">function</span>
</span><span class="line">      <span class="o">|</span> <span class="mi">0</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">      <span class="o">|</span> <span class="n">i</span> <span class="o">-&gt;</span>
</span><span class="line">          <span class="nn">Profile</span><span class="p">.</span><span class="n">note_increase</span> <span class="s2">&quot;sent&quot;</span> <span class="n">len</span><span class="o">;</span>
</span><span class="line">          <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="n">msg</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="n">loop</span> <span class="o">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">    <span class="k">let</span> <span class="n">t0</span> <span class="o">=</span> <span class="nn">Clock</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="n">loop</span> <span class="n">iterations</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">t1</span> <span class="o">=</span> <span class="nn">Clock</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">time</span> <span class="o">=</span> <span class="n">t1</span> <span class="o">-.</span> <span class="n">t0</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;Wrote %d bytes in %.3f seconds (%.2f KB/s)&quot;</span>
</span><span class="line">      <span class="n">bytes</span> <span class="n">time</span>
</span><span class="line">      <span class="o">(</span><span class="n">float_of_int</span> <span class="n">bytes</span> <span class="o">/.</span> <span class="n">time</span> <span class="o">/.</span> <span class="mi">1024</span><span class="o">.))</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">events</span> <span class="o">=</span> <span class="nn">Profile</span><span class="p">.</span><span class="n">events</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="n">for_lwt</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="nn">Array</span><span class="p">.</span><span class="n">length</span> <span class="n">events</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span>
</span><span class="line">      <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="o">(</span><span class="nn">Profile</span><span class="p">.</span><span class="n">to_string</span> <span class="n">events</span><span class="o">.(</span><span class="n">i</span><span class="o">))</span>
</span><span class="line">    <span class="k">done</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">OS</span><span class="p">.</span><span class="nn">Time</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">1</span><span class="o">.</span><span class="mi">0</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="udp-transmission">UDP transmission</h3>
<p>Last time, we saw packet transmission being interrupted by periods of sleeping, garbage collection, and some brief but mysterious pauses.
I noted that the GC tended to run during <code>Ring.ack_responses</code>, suggesting that was getting called quite often, and with the new tracing we can see why.</p>
<p><img src="/blog/images/mirage-profiling/udp-packets.png" class="border center"/></p>
<p>This trace shows a unikernel booting (scroll left if you want to see that) and then sending 10 UDP packets.
I've left the trace running a little longer so you can see the acks (this is very obvious when sending more than 10 packets, but I wanted to keep this trace small):</p>
<p><canvas style='width: 100%; height: 700px' id='canvas-udp'></canvas></p>
<p><a href="/blog/javascripts/trace-viewer.html?trace=udp">View full screen</a> | <a href="/blog/data/traces/udp.sexp">Download udp.sexp</a></p>
<p>Mirage creates two threads for each packet that we add to the ring buffer and they stick around until we get a notification back from dom0 that the packet has been read (actually, we create six threads for each packet, but the bind simplification hides four of them).</p>
<p>It looks like each packet is in two parts, as each one generates two acks, one much later than the other.
I think the two parts are the UDP header and the payload, which each have their own IO page.
Given the time needed to share and unshare pages, it would probably be more efficient to copy the payload into the same page as the header.
Interestingly, dom0 seems to ack all the headers first, but holds on to the payload pages for longer.</p>
<p>With 20 threads for just ten packets, you can imagine that the trace gets rather crowded when sending thousands!</p>
<h3 id="tcp-transmission">TCP transmission</h3>
<p>As before, the TCP picture is rather complicated:</p>
<p><canvas style='width: 100%; height: 700px' id='canvas-tcp'></canvas></p>
<p><a href="/blog/javascripts/trace-viewer.html?trace=tcp">View full screen</a> | <a href="/blog/data/traces/tcp.sexp">Download tcp.sexp</a></p>
<p>The above shows a unikernel running on my ARM Cubietruck connecting to netcat on my laptop and sending 100 TCP packets over the stream.
There are three counters here:</p>
<ul>
<li><code>main-to-tcp</code> (purple) is incremented by the main thread just before sending a block of data to the TCP stream (just enough to fill one TCP segment).
</li>
<li><code>tcp-to-ip</code> (red) shows when the TCP system sent a segment to the IP layer for transmission.
</li>
<li><code>tcp-ackd-segs</code> (orange) shows when the TCP system got confirmation of receipt from the remote host (note: a TCP ack is not the same as a dom0 ring ack, which just says the network driver has accepted the segment for transmission).
</li>
</ul>
<p>There is clearly scope to improve the viewer here, but a few things can be seen already.
The general cycle is:</p>
<ul>
<li>The unikernel is sleeping, waiting for TCP acks.
</li>
<li>The remote end acks some packets (the orange line goes up).
</li>
<li>The TCP layer transmits some of the buffered packets (red line goes up).
</li>
<li>The TCP layer allows the main code to send more data (purple line goes up).
</li>
<li>The transmitted pages are freed (the dense vertical green lines) once Xen acks them.
</li>
</ul>
<p>I did wonder whether we unshared the pages as soon as dom0 had read the segment, or only when the remote end sent the TCP ack.
Having the graphs overlaid on the trace lets us answer this question. When the red line goes up (segments sent to dom0), a <code>ring.write</code> thread is created. That thread ends (and the page is unshared) in response to <code>ring.poll ack_responses</code>, before the TCP acks arrive.</p>
<p>TCP starts slowly, but as the window size gets bigger and more packets are transmitted at a time, the sleeping periods get shorter and then disappear as the process becomes CPU-bound.</p>
<p>There's also a long garbage collection period near the end (shortly before we close the socket).
This might be partly the fault of the tracing system, which currently allocates lots of small values, rather than writing to a preallocated buffer.</p>
<h3 id="disk-access">Disk access</h3>
<p>For our final example, let's revisit the block device profiling from last time.
Back then, making a series of read requests, each for 32 pages of data, produced this chart:</p>
<p><img src="/blog/images/mirage-profiling/block-reads-1-32.png" class="border center"/></p>
<p>With the new tracing, we can finally see what those mysterious wake-ups in the middle are:</p>
<p><canvas style='width: 100%; height: 500px' id='canvas-disk-direct'></canvas></p>
<p><a href="/blog/javascripts/trace-viewer.html?trace=disk-direct">View full screen</a> | <a href="/blog/data/traces/disk-direct.sexp">Download disk-direct.sexp</a></p>
<p>Each time the main test code's read call returns, the orange trace (&quot;read&quot;) goes up.
You can see that we make three blocking calls to dom0 for each request.
I added another counter for the number of active grant refs (pages shared with dom0), shown as the red line (&quot;gntref&quot;).
You can see that for each call we share a bunch of pages, wait, and then unshare them all again.</p>
<p>In each group of three, we share 11 pages for the first two requests, but only 10 for the third.
This makes obvious what previously required a careful reading of the block code: requests for more than 11 pages have to be split up because that's all you can fit in the request structure.
Our request for 32 pages is split into requests for 11 + 11 + 10 pages, which are sent in series.</p>
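<p>As a rough sketch of that arithmetic (this is not the mirage-block-xen code; <code>max_pages_per_request</code> and <code>split_request</code> are names made up for this illustration, and the limit of 11 is simply the one observed above):</p>
<pre><code class="ocaml">(* Hypothetical illustration: split a request for [n] pages into chunks of
   at most [max_pages_per_request] pages each. *)
let max_pages_per_request = 11

let rec split_request n =
  if n &lt;= 0 then []
  else
    let chunk = min n max_pages_per_request in
    chunk :: split_request (n - chunk)

let () =
  split_request 32
  |&gt; List.map string_of_int
  |&gt; String.concat " + "
  |&gt; print_endline
</code></pre>
<p>Running this prints <code>11 + 11 + 10</code>, matching the three groups in the trace.</p>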
<p>In fact, Xen also supports &quot;indirect&quot; requests, where the request structure references full pages of requests.
I added support for this to mirage-block-xen, which improved the speed nicely.
Here's a trace with indirect requests enabled:</p>
<p><canvas style='width: 100%; height: 500px' id='canvas-disk-indirect'></canvas></p>
<p><a href="/blog/javascripts/trace-viewer.html?trace=disk-indirect">View full screen</a> | <a href="/blog/data/traces/disk-indirect.sexp">Download disk-indirect.sexp</a></p>
<p>If you zoom in where the red line starts to rise, you can see it has 32 steps, as we allocate all the pages in one go, followed by a final later increment for the indirect page.</p>
<p>Zooming out, you can see we paused for GC a little later.
We got lucky here, with the GC occurring just after we sent the request and just before we started waiting for the reply, so it hardly slowed us down.
If we'd been unlucky, the GC might have run before we sent the request, leaving dom0 idle and wasting that time.
Keeping multiple requests in flight would eliminate this risk.</p>
<h2 id="implementation-notes">Implementation notes</h2>
<p>I originally wrote the viewer as a native GTK application in OCaml.
The browser version was created by running the magical <a href="http://ocsigen.org/js_of_ocaml/">js_of_ocaml</a> tool, which turned out to be incredibly easy.
I just had to add support for the HTML canvas API alongside the code for GTK's Cairo canvas, but they're almost the same anyway.
Now my embarrassing inability to learn JavaScript need not hold me back!</p>
<p>Finding a layout algorithm that produced sensible results was the hardest part.
I'm quite pleased with the result.
The basic algorithm is:</p>
<ul>
<li>Generate an <a href="http://en.wikipedia.org/wiki/Interval_tree">interval tree</a> of the thread lifetimes.
</li>
<li>Starting with the root thread, place each thread at the highest place on the screen where it doesn't overlap any other threads, and no higher than its parent.
</li>
<li>Visit the threads recursively, depth first, visiting the child threads created in <em>reverse</em> order.
</li>
<li>If one thread merges with another, allow them to overlap.
</li>
<li>Don't show bind-type threads as children of their actual creator, but instead delay their start time to when they get activated and make them children of the thread that activates them, <em>unless</em> their parent merges with them.
</li>
</ul>
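<p>The placement step can be sketched roughly as follows. This is a simplified illustration, not the viewer's code: it uses a plain list per row rather than an interval tree, it ignores merging and bind-type threads, and the names (<code>thread</code>, <code>first_free_row</code>, <code>place</code>) are invented for the example. Row 0 is the top of the screen.</p>
<pre><code class="ocaml">(* Hypothetical, simplified sketch of the placement step only. *)
type thread = { name : string; start : float; stop : float; children : thread list }

(* Two lifetimes overlap if each starts before the other ends. *)
let overlaps a b = a.start &lt; b.stop &amp;&amp; b.start &lt; a.stop

(* Maps a row number (0 = top of the screen) to the threads placed there. *)
let rows : (int, thread list) Hashtbl.t = Hashtbl.create 16

let row_contents r = Option.value (Hashtbl.find_opt rows r) ~default:[]

(* Starting at row [r] and moving down, find the first row where [t] fits. *)
let rec first_free_row t r =
  if List.exists (overlaps t) (row_contents r) then first_free_row t (r + 1)
  else r

(* Place [t] no higher than its parent's row, then visit its children
   depth first, in reverse order of creation. *)
let rec place parent_row t =
  let r = first_free_row t parent_row in
  Hashtbl.replace rows r (t :: row_contents r);
  Printf.printf "%s -&gt; row %d\n" t.name r;
  List.iter (place r) (List.rev t.children)

let () =
  let leaf name start stop = { name; start; stop; children = [] } in
  let root =
    { name = "main"; start = 0.; stop = 10.;
      children = [ leaf "a" 1. 3.; leaf "b" 2. 5.; leaf "c" 6. 8. ] } in
  place 0 root
</code></pre>
<p>Here <code>b</code> and <code>c</code> can share a row because their lifetimes don't overlap, while <code>a</code> overlaps <code>b</code> and so is pushed down a row.</p>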
<p>For the vertical layout I originally used scrolling, but it was hard to navigate.
The viewer now transforms the vertical coordinates from the layout engine by passing them through the <code>tanh</code> function, allowing you to focus on a particular thread but still see all the others, just more bunched up.
The main difficulty here is focusing on one of the top or bottom threads without wasting half the display area, which complicated the code a bit.</p>
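<p>A naive version of this transform might look like the sketch below. The function and parameter names (<code>view_y</code>, <code>focus</code>, <code>scale</code>, <code>height</code>) are assumptions for the example, and the real code is more involved for the reason just given.</p>
<pre><code class="ocaml">(* Hypothetical sketch: map a layout y-coordinate to a position in the view.
   [focus] is the layout coordinate to centre on, [scale] controls how much
   of the layout gets most of the space, and [height] is the view height. *)
let view_y ~focus ~scale ~height y =
  let t = tanh ((y -. focus) /. scale) in   (* between -1 and 1 *)
  (t +. 1.) /. 2. *. height

let () =
  List.iter
    (fun y -&gt;
       Printf.printf "%7.1f -&gt; %6.1f\n" y
         (view_y ~focus:100. ~scale:50. ~height:500. y))
    [0.; 50.; 100.; 150.; 1000.]
</code></pre>
<p>The focused coordinate always lands in the middle of the view, and distant coordinates are squeezed towards the edges. This simple version also shows the problem mentioned above: focusing on the topmost thread leaves the top half of the view almost empty.</p>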
<h2 id="summary">Summary</h2>
<p>Understanding concurrent programs can be much easier with a good visualisation.
By instrumenting Lwt, it was quite easy to collect useful information about what threads were doing.
Libraries that use Lwt only needed to be modified in order to label the threads.</p>
<p>My particular interest in making these tools is to explore the behaviour of <a href="http://openmirage.org/">Mirage unikernels</a> - tiny virtual machines written in OCaml that run without the overhead of traditional operating systems.</p>
<p>The traces produced provide much more information than the graphs I made previously.
We can see now not just when the unikernel isn't making progress, but why.
We saw that the networking code spends a lot of time handling ack messages from dom0 saying that it has read the data we shared with it, and that the disk code was splitting requests into small chunks because it didn't support indirect requests.</p>
<p>There is plenty of scope for improvement in the tools - some things I'd like include:</p>
<ul>
<li>A way to group or hide threads if you want to focus on something else, as diagrams can become very cluttered with e.g. threads waiting for shared pages to be released.
</li>
<li>The ability to stitch together traces from multiple machines so you can e.g. follow the progress of an IP packet after it leaves the VM.
</li>
<li>A visual indication of when interrupts occur vs when Mirage gets around to servicing them.
</li>
<li>More stats collection and reporting (e.g. average response time to handle a TCP request).
</li>
<li>A more compact log format and more efficient tracing.
</li>
</ul>
<p>But hopefully, these tools will already help people to learn more about how their unikernels behave.
If you're interested in tracing or unikernels, the <a href="http://openmirage.org/community/">Mirage mailing list</a> is a good place to discuss things.</p>
<script src="/blog/javascripts/visualising-lwt.js"></script>
<script>
  function addSidebarToggler() {
  if(!$('body').hasClass('sidebar-footer')) {
    $('#content').append('<span class="toggle-sidebar"></span>');
    $('.toggle-sidebar').bind('click', function(e) {
      e.preventDefault();
      if ($('body').hasClass('collapse-sidebar')) {
        $('body').removeClass('collapse-sidebar');
      } else {
        $('body').addClass('collapse-sidebar');
      }
      resizeCanvasElements();
    });
  }
  }
</script>
]]></content>
  </entry>
  <entry>
    <title type="html">Simplifying the solver with functors</title>
    <link href="https://roscidus.com/blog/blog/2014/09/17/simplifying-the-solver-with-functors/"></link>
    <updated>2014-09-17T08:55:11+00:00</updated>
    <id>https://roscidus.com/blog/blog/2014/09/17/simplifying-the-solver-with-functors</id>
    <content type="html"><![CDATA[<p>After <a href="/blog/blog/2014/06/06/python-to-ocaml-retrospective/">converting 0install to OCaml</a>, I've been looking at using more of OCaml's features to further clean up the APIs.
In this post, I describe how using OCaml functors has made 0install's dependency solver easier to understand and more flexible.</p>
<!-- more -->
<p>( this post also appeared on <a href="https://news.ycombinator.com/item?id=8342826">Hacker News</a> and <a href="http://www.reddit.com/r/programming/comments/2gw0dn/simplifying_0installs_solver_with_ocamls_functors/">Reddit</a> )</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
<ul>
<li><a href="#how-dependency-solvers-work">How dependency solvers work</a>
</li>
<li><a href="#optimising-the-result">Optimising the result</a>
</li>
</ul>
</li>
<li><a href="#the-current-solver-code">The current solver code</a>
</li>
<li><a href="#discovering-the-interface">Discovering the interface</a>
</li>
<li><a href="#comparison-with-java">Comparison with Java</a>
</li>
<li><a href="#diagnostics">Diagnostics</a>
</li>
<li><a href="#selections">Selections</a>
</li>
<li><a href="#summary">Summary</a>
</li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>To run a program you need to pick a version of it to use, as well as compatible versions of all its dependencies.
For example, if you wanted <a href="http://0install.net/">0install</a> to select a suitable set of components to run <a href="http://www.serscis.eu/0install/serscis-access-modeller">SAM</a>, you could do it like this:</p>
<pre><code>$ 0install select http://www.serscis.eu/0install/serscis-access-modeller
- URI: http://www.serscis.eu/0install/serscis-access-modeller
  Version: 0.16
  Path: (not cached)
  
  - URI: http://repo.roscidus.com/utils/graphviz
    Version: 2.38.0-2
    Path: (package:arch:graphviz:2.38.0-2:x86_64)
  
  - URI: http://repo.roscidus.com/java/swt
    Version: 3.6.1
    Path: (not cached)
  
  - URI: http://repo.roscidus.com/java/iris
    Version: 0.6.0
    Path: (not cached)
  
  - URI: http://repo.roscidus.com/java/openjdk-jre
    Version: 7.65-2.5.2-1
    Path: (package:arch:jre7-openjdk:7.65-2.5.2-1:x86_64)
</code></pre>
<p>Here, the solver selected SAM version 0.16, along with its dependencies. GraphViz 2.38.0 and OpenJDK-JRE 7.65 are already installed from my distribution repository, while SWT 3.6.1 and IRIS 0.6.0 need to be downloaded (e.g. using <code>0install download</code>).</p>
<p>This post is about the code that decides which versions to use, and my attempts to make it easier to understand using OCaml functors and abstraction.
For a gentle introduction to functors, see <a href="https://realworldocaml.org/v1/en/html/functors.html">Real World OCaml: Chapter 9. Functors</a>.</p>
<h3 id="how-dependency-solvers-work">How dependency solvers work</h3>
<p>( This section isn't about functors, but it's quite interesting background. You can skip it if you prefer. )</p>
<p>Let's say I want to run <code>foo</code>, a graphical Java application.
There are three versions available:</p>
<ul>
<li><code>foo1</code> (stable)
<ul>
<li>requires Java 6..!7 (at least version 6, but before version 7)
</li>
</ul>
</li>
<li><code>foo2</code> (stable)
<ul>
<li>requires Java 6.. (at least version 6)
</li>
<li>requires SWT
</li>
</ul>
</li>
<li><code>foo3</code> (testing)
<ul>
<li>requires Java 7..
</li>
<li>requires SWT
</li>
</ul>
</li>
</ul>
<p>Let's imagine we have some candidates for Java and SWT too:</p>
<ul>
<li>Java: <code>java6_32bit</code>, <code>java6_64bit</code>, <code>java7_64bit</code>, <code>java8_64bit</code>
</li>
<li>SWT: <code>swt35_32bit</code>, <code>swt35_64bit</code>, <code>swt36_32bit</code>, <code>swt36_64bit</code>
</li>
</ul>
<p>My computer can run 32- and 64-bit binaries, so we need to consider both.</p>
<p>We start by generating a set of boolean constraints defining the necessary and sufficient conditions for a set of valid selections.
Each candidate becomes one variable, with <code>true</code> meaning it will be used and <code>false</code> that it won't (following the approach in <a href="https://cseweb.ucsd.edu/~lerner/papers/opium.html">OPIUM</a>).
For example, we don't want to select more than one version of each component, so the following must all be true:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="n">at_most_one</span><span class="p">(</span><span class="n">foo1</span><span class="p">,</span> <span class="n">foo2</span><span class="p">,</span> <span class="n">foo3</span><span class="p">)</span>
</span><span class="line"><span class="n">at_most_one</span><span class="p">(</span><span class="n">java6_32bit</span><span class="p">,</span> <span class="n">java6_64bit</span><span class="p">,</span> <span class="n">java7_64bit</span><span class="p">,</span> <span class="n">java8_64bit</span><span class="p">)</span>
</span><span class="line"><span class="n">at_most_one</span><span class="p">(</span><span class="n">swt35_32bit</span><span class="p">,</span> <span class="n">swt35_64bit</span><span class="p">,</span> <span class="n">swt36_32bit</span><span class="p">,</span> <span class="n">swt36_64bit</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We must select some version of <code>foo</code> itself, since that's our goal:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="n">foo1</span> <span class="ow">or</span> <span class="n">foo2</span> <span class="ow">or</span> <span class="n">foo3</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If we select <code>foo1</code>, we must select one of the Java 6 candidates.
Another way to say this is that we must either <em>not</em> select <code>foo1</code> or select a compatible Java version:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">foo1</span><span class="p">)</span> <span class="ow">or</span> <span class="n">java6_32bit</span> <span class="ow">or</span> <span class="n">java6_64bit</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">foo2</span><span class="p">)</span> <span class="ow">or</span> <span class="n">java6_32bit</span> <span class="ow">or</span> <span class="n">java6_64bit</span> <span class="ow">or</span> <span class="n">java7_64bit</span> <span class="ow">or</span> <span class="n">java8_64bit</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">foo3</span><span class="p">)</span> <span class="ow">or</span> <span class="n">java7_64bit</span> <span class="ow">or</span> <span class="n">java8_64bit</span>
</span><span class="line">
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">foo2</span><span class="p">)</span> <span class="ow">or</span> <span class="n">swt35_32bit</span> <span class="ow">or</span> <span class="n">swt35_64bit</span> <span class="ow">or</span> <span class="n">swt36_32bit</span> <span class="ow">or</span> <span class="n">swt36_64bit</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">foo3</span><span class="p">)</span> <span class="ow">or</span> <span class="n">swt35_32bit</span> <span class="ow">or</span> <span class="n">swt35_64bit</span> <span class="ow">or</span> <span class="n">swt36_32bit</span> <span class="ow">or</span> <span class="n">swt36_64bit</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>SWT doesn't work with Java 8 (in this imaginary example):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt35_32bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">java8_64bit</span><span class="p">)</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt35_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">java8_64bit</span><span class="p">)</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt36_32bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">java8_64bit</span><span class="p">)</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt36_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">java8_64bit</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Finally, although we can use 32-bit or 64-bit programs, we can't mix different types within a single program:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">java6_32bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">x64</span><span class="p">)</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">java6_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x64</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">java7_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x64</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">java8_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x64</span>
</span><span class="line">
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt35_32bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">x64</span><span class="p">)</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt35_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x64</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt36_32bit</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">x64</span><span class="p">)</span>
</span><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">swt36_64bit</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x64</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Once we have all the equations, we throw them at a standard <a href="http://en.wikipedia.org/wiki/Boolean_satisfiability_problem">SAT</a> solver to get a set of valid versions.
0install's SAT solver is based on the <a href="http://minisat.se/Papers.html">MiniSat</a> algorithm.
The basic algorithm, <a href="http://en.wikipedia.org/wiki/DPLL_algorithm">DPLL</a>, works like this:</p>
<ol>
<li>The SAT solver simplifies the problem by looking for clauses with only a single variable.
If it finds any, it knows the value of that variable and can simplify the other clauses.
</li>
<li>When no further simplification is possible, it asks its caller to pick a variable to try.
Here, we might try <code>foo2=true</code>, for example.
It then goes back to step 1 to simplify again and so on, until it has either a solution or a conflict.
If a variable assignment leads to a conflict, then we go back and try it the other way (e.g. <code>foo2=false</code>).
</li>
</ol>
<p>In the above example, the process would be:</p>
<ol>
<li>No initial simplification is possible.
</li>
<li>Try <code>foo2=true</code>.
</li>
<li>This immediately leads to <code>foo1=false</code> and <code>foo3=false</code>.
</li>
<li>Now we try <code>java8_64bit=true</code>.
</li>
<li>This immediately removes the other versions of Java from consideration.
</li>
<li>It also immediately leads to <code>x64=true</code>, which eliminates all the 32-bit binaries.
</li>
<li>It also eliminates all versions of SWT.
</li>
<li><code>foo2</code> depends on SWT, so eliminating all versions leads to <code>foo2=false</code>, which is a conflict because we already set it to <code>true</code>.
</li>
</ol>
<p>MiniSat, unlike basic DPLL, doesn't just backtrack when it gets a conflict.
It also works backwards from the conflicting clause to find a small set of variables that are sufficient to cause the conflict.
In this case, we find that <code>foo2 and java8_64bit</code> implies a conflict.
To avoid the conflict, we must make sure that at least one of these is false, and we can do this by adding a new clause:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="ow">not</span><span class="p">(</span><span class="n">foo2</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span><span class="p">(</span><span class="n">java8_64bit</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>In this example, this has the same effect as simple backtracking, but if we'd chosen other variables between these two then learning the general rule could save us from exploring many other dead-ends.
We now try with <code>java7_64bit=true</code> and then <code>swt36_64bit=true</code>, which leads to a solution.</p>
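<p>To make the basic algorithm concrete, here is a minimal DPLL sketch applied to the constraints above. It is only an illustration and not 0install's solver: there is no clause learning, the variable to try is simply the first one found rather than one chosen by the caller, and all the function names are invented for the example.</p>
<pre><code class="ocaml">(* A minimal DPLL sketch without clause learning, for illustration only. *)
type literal = string * bool   (* a variable and the value that satisfies the literal *)
type clause = literal list     (* at least one literal must be satisfied *)

let pos v : literal = (v, true)
let neg v : literal = (v, false)

(* at_most_one expands to "not a or not b" for every pair. *)
let rec at_most_one : string list -&gt; clause list = function
  | [] -&gt; []
  | v :: rest -&gt; List.map (fun w -&gt; [neg v; neg w]) rest @ at_most_one rest

(* Record that [var] has value [value]: drop the clauses this satisfies and
   remove the now-false literal from the others. *)
let assign (var, value) clauses =
  clauses
  |&gt; List.filter (fun c -&gt; not (List.mem (var, value) c))
  |&gt; List.map (List.filter (fun (v, _) -&gt; v &lt;&gt; var))

(* Returns [Some assignment] if the clauses can be satisfied, else [None]. *)
let rec solve clauses assignment =
  if clauses = [] then Some assignment
  else if List.mem [] clauses then None   (* conflict *)
  else
    match List.find_opt (fun c -&gt; List.length c = 1) clauses with
    | Some [lit] -&gt; solve (assign lit clauses) (lit :: assignment)   (* step 1 *)
    | Some _ -&gt; assert false
    | None -&gt;
      (* Step 2: pick a variable, try [true], and on conflict try [false]. *)
      let (var, _) = List.hd (List.hd clauses) in
      match solve (assign (var, true) clauses) ((var, true) :: assignment) with
      | Some _ as solution -&gt; solution
      | None -&gt; solve (assign (var, false) clauses) ((var, false) :: assignment)

let foo = ["foo1"; "foo2"; "foo3"]
let java = ["java6_32bit"; "java6_64bit"; "java7_64bit"; "java8_64bit"]
let swt = ["swt35_32bit"; "swt35_64bit"; "swt36_32bit"; "swt36_64bit"]

let clauses =
  at_most_one foo @ at_most_one java @ at_most_one swt
  @ [ List.map pos foo;
      neg "foo1" :: List.map pos ["java6_32bit"; "java6_64bit"];
      neg "foo2" :: List.map pos java;
      neg "foo3" :: List.map pos ["java7_64bit"; "java8_64bit"];
      neg "foo2" :: List.map pos swt;
      neg "foo3" :: List.map pos swt ]
  @ List.map (fun s -&gt; [neg s; neg "java8_64bit"]) swt
  @ List.map (fun v -&gt; [neg v; neg "x64"])
      ["java6_32bit"; "swt35_32bit"; "swt36_32bit"]
  @ List.map (fun v -&gt; [neg v; pos "x64"])
      ["java6_64bit"; "java7_64bit"; "java8_64bit"; "swt35_64bit"; "swt36_64bit"]

(* A clause is satisfied if one of its literals appears in the assignment. *)
let satisfied assignment clause =
  List.exists (fun lit -&gt; List.mem lit assignment) clause

let () =
  match solve clauses [] with
  | None -&gt; print_endline "No solution"
  | Some assignment -&gt;
    assert (List.for_all (satisfied assignment) clauses);
    assignment
    |&gt; List.filter (fun (_, value) -&gt; value)
    |&gt; List.iter (fun (var, _) -&gt; print_endline var)
</code></pre>
<p>Because this sketch picks variables arbitrarily, it finds <em>a</em> valid selection, not necessarily the one we would prefer.</p>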
<h3 id="optimising-the-result">Optimising the result</h3>
<p>The above process will always find some valid solution if one exists, but we generally want an &quot;optimal&quot; solution (e.g. preferring newer versions).
There are several ways to do this.</p>
<p>In his talk at <a href="https://ocaml.org/meetings/ocaml/2014/#Program">OCaml 2014</a>, <em>Using Preferences to Tame your Package Manager</em>, Roberto Di Cosmo explained how their tools allow you to specify a function to be optimised (e.g. <code>-count(removed),-count(changed)</code> to minimise the number of packages to be removed or changed).
When you have a global set of package versions (as in OPAM, Debian, etc) this is very useful, because the package manager needs to find a solution that balances the needs of all installed programs.</p>
<p>0install has an interesting advantage here.
We can install multiple versions of libraries in parallel, and we don't allow one program to influence another program's choices.
If we run another SWT application, <code>bar</code>, we'll probably pick the same SWT version (<code>bar</code> will probably also prefer the latest stable 64-bit version) and so <code>foo</code> and <code>bar</code> will share the copy.
But if we run a non-SWT Java application, we are free to pick a better version of Java (e.g. Java 8) just for that program.</p>
<p>This means that we don't need to find a compromise solution for multiple programs.
When running <code>foo</code>, the version of <code>foo</code> itself is far more important than the versions of the libraries it uses.
We therefore define the &quot;optimal&quot; solution as the one optimising first the version of the main program, then the version of its first dependency and so on.
This means that:</p>
<ul>
<li>The behaviour is predictable and easy to explain.
</li>
<li>The behaviour is stable (small changes to libraries deep in the tree have little effect).
</li>
<li>Programs can rank their dependencies by importance, because earlier dependencies are optimised first.
</li>
<li>If <code>foo2</code> is the best version for our policy (e.g. &quot;prefer latest stable version&quot;) then every solution with <code>foo2=true</code> is better than every solution without.
If we direct the solver to try <code>foo2=true</code> and get a solution, there's no point considering the <code>foo2=false</code> cases.
This means that the first solution we find will always be optimal, which makes solving very fast!
</li>
</ul>
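<p>The last point can be illustrated with a small backtracking search that tries each component's candidates best first. This is a sketch under several assumptions, not the real solver: it uses plain search rather than SAT, it assumes every selection needs one candidate from each component, it only checks two of the rules from the example (the SWT and Java 8 conflict, and not mixing 32-bit and 64-bit), and the ranking of the candidates is made up apart from <code>foo2</code> being the best <code>foo</code>. It needs OCaml 4.10 or later for <code>List.find_map</code>.</p>
<pre><code class="ocaml">(* [components] lists each component's candidates, best first, with the most
   important component first. [ok] says whether a (partial) selection is
   still acceptable. The first complete selection found is returned. *)
let rec first_solution ok selected = function
  | [] -&gt; Some (List.rev selected)
  | candidates :: rest -&gt;
    List.find_map
      (fun c -&gt;
         let selected = c :: selected in
         if ok selected then first_solution ok selected rest else None)
      candidates

(* Hypothetical rankings, best first. *)
let foo = ["foo2"; "foo1"; "foo3"]
let java = ["java8_64bit"; "java7_64bit"; "java6_64bit"; "java6_32bit"]
let swt = ["swt36_64bit"; "swt35_64bit"; "swt36_32bit"; "swt35_32bit"]

let is_32bit c = List.mem c ["java6_32bit"; "swt36_32bit"; "swt35_32bit"]
let is_64bit c =
  List.mem c ["java8_64bit"; "java7_64bit"; "java6_64bit"; "swt36_64bit"; "swt35_64bit"]

let ok selected =
  let uses_swt = List.exists (fun c -&gt; List.mem c swt) selected in
  not (List.mem "java8_64bit" selected &amp;&amp; uses_swt)
  &amp;&amp; not (List.exists is_32bit selected &amp;&amp; List.exists is_64bit selected)

let () =
  match first_solution ok [] [foo; java; swt] with
  | None -&gt; print_endline "No solution"
  | Some selection -&gt; print_endline (String.concat ", " selection)
</code></pre>
<p>This prints <code>foo2, java7_64bit, swt36_64bit</code>. The search only moves on to a worse candidate for a component when every completion using the better one has failed, so the first result is the best one under this ordering and there is no need to compare it against other solutions.</p>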
<h2 id="the-current-solver-code">The current solver code</h2>
<p>The existing code is made up of several components:</p>
<p><span class="caption-wrapper border center"><img src="/blog/images/solver/components.png" title="Arrows indicate &quot;uses&quot; relationship." class="caption"/><span class="caption-text">Arrows indicate &quot;uses&quot; relationship.</span></span></p>
<dl><dt>SAT Solver</dt>
<dd>
Implements the SAT solver itself (if you want to use this code in your own projects, Simon Cruanes has
made a standalone version at <a href="https://github.com/c-cube/sat">https://github.com/c-cube/sat</a>, which has some additional features and
optimisations).
</dd>
<dt>Solver</dt>
<dd>
Fetches the candidate versions for each interface URI (equivalent to the package name in other systems) and builds up the SAT problem, then uses the SAT solver to solve it.
It directs the SAT solver in the direction of the optimal solution when there is a choice, as explained above.
</dd>
<dt>Impl provider</dt>
<dd>
Takes an interface URI and provides the candidates (&quot;implementations&quot;) to the solver.
The candidates can come from multiple XML &quot;feed&quot; files: the one at the given URI itself, plus any extra feeds that one imports, plus any additional local or remote feeds the user has added (e.g. a version in a local Git checkout or which has been built from source locally).
The impl provider also filters out obviously impossible choices (e.g. Windows binaries if we're on Linux, or uncached versions in off-line mode) and then ranks the remaining candidates according to the local policy (e.g. preferring stable versions or higher version numbers).
</dd>
<dt>Feed provider</dt>
<dd>
Simply loads the feed XML from the disk cache or local file, if available.
It does not access the network.
</dd>
<dt>Driver</dt>
<dd>
Uses the solver to get an initial solution.
Then it asks the feed provider which feeds the solver tried to access and starts downloading them.
As new feeds arrive, it runs the solver again, possibly starting more downloads.
Managing these downloads is quite interesting; I used it as the example in <a href="https://roscidus.com/blog/blog/2013/11/28/asynchronous-python-vs-ocaml/">Asynchronous Python vs OCaml</a>.
</dd>
</dl>
<p>This is reasonably well split up already, thanks to occasional refactoring efforts, but we can always do better.
The subject of today's refactoring is the solver module itself.</p>
<p>Here's the problem:</p>
<figure class="code"><figcaption><span>solver.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">class</span> <span class="k">type</span> <span class="n">result</span> <span class="o">=</span>
</span><span class="line">  <span class="k">object</span>
</span><span class="line">    <span class="k">method</span> <span class="n">get_selections</span> <span class="o">:</span> <span class="nn">Selections</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">
</span><span class="line">    <span class="c">(* The remaining methods are used to provide diagnostics *)</span>
</span><span class="line">    <span class="k">method</span> <span class="n">get_selected</span> <span class="o">:</span> <span class="n">source</span><span class="o">:</span><span class="kt">bool</span> <span class="o">-&gt;</span> <span class="nn">General</span><span class="p">.</span><span class="n">iface_uri</span> <span class="o">-&gt;</span>
</span><span class="line">    			  <span class="nn">Impl</span><span class="p">.</span><span class="n">generic_implementation</span> <span class="n">option</span>
</span><span class="line">    <span class="k">method</span> <span class="n">impl_provider</span> <span class="o">:</span> <span class="nn">Impl_provider</span><span class="p">.</span><span class="n">impl_provider</span>
</span><span class="line">    <span class="k">method</span> <span class="n">implementations</span> <span class="o">:</span>
</span><span class="line">      <span class="o">((</span><span class="nn">General</span><span class="p">.</span><span class="n">iface_uri</span> <span class="o">*</span> <span class="kt">bool</span><span class="o">)</span> <span class="o">*</span>
</span><span class="line">       <span class="o">(</span><span class="n">diagnostics</span> <span class="o">*</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">generic_implementation</span><span class="o">)</span> <span class="n">option</span><span class="o">)</span> <span class="kt">list</span>
</span><span class="line">    <span class="k">method</span> <span class="n">requirements</span> <span class="o">:</span> <span class="n">requirements</span>
</span><span class="line">  <span class="k">end</span>
</span><span class="line">
</span><span class="line"><span class="c">(** Find a set of implementations which satisfy these</span>
</span><span class="line"><span class="c">  * requirements.</span>
</span><span class="line"><span class="c">  * @param closest_match adds a lowest-ranked (but valid)</span>
</span><span class="line"><span class="c">  *        implementation to every interface, so we can always</span>
</span><span class="line"><span class="c">  *        select something. Useful for diagnostics. *)</span>
</span><span class="line"><span class="k">val</span> <span class="n">do_solve</span> <span class="o">:</span>
</span><span class="line">  <span class="nn">Impl_provider</span><span class="p">.</span><span class="n">impl_provider</span> <span class="o">-&gt;</span> <span class="n">requirements</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="n">closest_match</span><span class="o">:</span><span class="kt">bool</span> <span class="o">-&gt;</span> <span class="n">result</span> <span class="n">option</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>What does <code>do_solve</code> actually do?
It gets candidates (of type <code>Impl.generic_implementation</code>) from <code>Impl_provider</code> and produces a <code>Selections.t</code>.
<code>Impl.generic_implementation</code> is a complex type including, among other things, the raw XML <code>&lt;implementation&gt;</code> element from the feed XML.
A <code>Selections.t</code> is a set of <code>&lt;selection&gt;</code> XML elements.</p>
<p>In other words: <code>do_solve</code> takes some arbitrary XML and produces some other XML.
It's very hard to tell from the <code>solver.mli</code> interface file what features of the input data it uses in the solve, and which it simply passes through.</p>
<p>Now imagine that instead of working on these messy concrete types, the solver used only a module with this type:</p>
<figure class="code"><figcaption><span>sigs.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_INPUT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl_provider</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dependency</span>
</span><span class="line">  <span class="k">type</span> <span class="n">restriction</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">implementations</span> <span class="o">:</span> <span class="n">impl_provider</span> <span class="o">-&gt;</span> <span class="n">iface_uri</span> <span class="o">-&gt;</span> <span class="n">impl</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">dependencies</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">required_interface</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="n">iface_uri</span>
</span><span class="line">  <span class="k">val</span> <span class="n">restrictions</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">meets_restriction</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We could then see exactly what information the solver needed to do its job.
For example, we could see just from the type signature that the solver doesn't understand version numbers, but just uses <code>meets_restriction</code> to check whether an abstract implementation (candidate) meets an abstract restriction.</p>
<p>Using OCaml's functors we can do just this, splitting out the core (non-XML) parts of the solver into a <code>Solver_core</code> module with a signature something like:</p>
<figure class="code"><figcaption><span>solver_core.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Make</span> <span class="o">:</span> <span class="k">functor</span> <span class="o">(</span><span class="nc">Model</span> <span class="o">:</span> <span class="nn">Sigs</span><span class="p">.</span><span class="nc">SOLVER_INPUT</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">val</span> <span class="n">do_solve</span> <span class="o">:</span>
</span><span class="line">    <span class="nn">Model</span><span class="p">.</span><span class="n">impl_provider</span> <span class="o">-&gt;</span> <span class="n">requirements</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">closest_match</span><span class="o">:</span><span class="kt">bool</span> <span class="o">-&gt;</span> <span class="nn">Model</span><span class="p">.</span><span class="n">impl</span> <span class="nn">StringMap</span><span class="p">.</span><span class="n">t</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This says that, given <em>any</em> concrete module that matches the <code>SOLVER_INPUT</code> type, the <code>Make</code> functor will return a module with a suitable <code>do_solve</code> function.
In particular, the compiler will check that the solver core makes no further assumptions about the types.
If it assumes that a <code>Model.impl_provider</code> is any particular concrete type then <code>solver_core.ml</code> will fail to compile, for example.</p>
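<p>To make the compiler check concrete, here is a minimal self-contained sketch. The cut-down <code>SOLVER_INPUT</code>, the <code>filter_candidates</code> function and the <code>Toy</code> model are all made up for illustration and are not the real 0install code; only the shape of the functor follows the signature above.</p>

```ocaml
(* Hypothetical, cut-down input signature: all types are abstract. *)
module type SOLVER_INPUT = sig
  type impl
  type restriction
  val meets_restriction : impl -> restriction -> bool
  val impl_to_string : impl -> string
end

(* The core can only use what the signature offers. *)
module Make (Model : SOLVER_INPUT) = struct
  (* Keep only the candidates that satisfy every restriction. *)
  let filter_candidates restrictions impls =
    List.filter
      (fun impl -> List.for_all (Model.meets_restriction impl) restrictions)
      impls

  (* Uncommenting the next line is a type error, because the core
     cannot assume that Model.impl is an int:
     let bad (impl : Model.impl) = impl + 1 *)
end

(* A toy model: implementations are version numbers and a
   restriction is a minimum version. *)
module Toy = struct
  type impl = int
  type restriction = int
  let meets_restriction impl min_version = impl >= min_version
  let impl_to_string = string_of_int
end

module Toy_solver = Make (Toy)

(* Candidates 1, 2 and 3 with the restriction "at least version 2". *)
let selected = Toy_solver.filter_candidates [2] [1; 2; 3]
```

<p>The functor body never learns that <code>Toy.impl</code> is an <code>int</code>, so the same core would work unchanged with a model backed by XML elements.</p>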
<h2 id="discovering-the-interface">Discovering the interface</h2>
<p>The above sounds nice in theory, but how easy is it to change the existing code to the new design?
I didn't even know what <code>SOLVER_INPUT</code> would actually look like - surely more complex than the example above!
Actually, it turned out to be quite easy.
You can start with just a few concrete types, e.g.</p>
<figure class="code"><figcaption><span>sigs.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_INPUT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl_provider</span> <span class="o">=</span> <span class="nn">Impl_provider</span><span class="p">.</span><span class="n">impl_provider</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span> <span class="o">=</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">generic_implementation</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This doesn't constrain <code>Solver_core</code> at all, since it's allowed to know the real types and use them as before.
This step just lets us make the <code>Solver_core.Make</code> functor and have things still compile and work.</p>
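<p>The difference between keeping the type equation and removing it can be sketched as follows. This is a toy illustration, not the real code: the <code>candidates</code> field, the <code>INPUT_CONCRETE</code> and <code>INPUT_ABSTRACT</code> names and the example URI are all assumptions.</p>

```ocaml
(* Hypothetical stand-in for the real provider module. *)
module Impl_provider = struct
  type impl_provider = { candidates : (string * string list) list }
end

(* Step 1: the type equation is exposed, so the core may still look inside. *)
module type INPUT_CONCRETE = sig
  type impl_provider = Impl_provider.impl_provider
end

module Make_v1 (Model : INPUT_CONCRETE) = struct
  let candidates (p : Model.impl_provider) iface =
    try List.assoc iface p.Impl_provider.candidates with Not_found -> []
end

(* Step 2: the type is abstract, so every use the compiler flags
   has to become a function in the signature. *)
module type INPUT_ABSTRACT = sig
  type impl_provider
  val implementations : impl_provider -> string -> string list
end

module Make_v2 (Model : INPUT_ABSTRACT) = struct
  let candidates p iface = Model.implementations p iface
end

(* One concrete module satisfies both signatures. *)
module Real_model = struct
  type impl_provider = Impl_provider.impl_provider
  let implementations p iface =
    try List.assoc iface p.Impl_provider.candidates with Not_found -> []
end

module V1 = Make_v1 (Real_model)
module V2 = Make_v2 (Real_model)

let provider =
  { Impl_provider.candidates =
      [ ("http://example.com/prog.xml", ["prog-1.0"; "prog-2.0"]) ] }

let from_v1 = V1.candidates provider "http://example.com/prog.xml"
let from_v2 = V2.candidates provider "http://example.com/prog.xml"
```

<p>Both versions behave the same; the second just records, in the signature, exactly what the core needs from the provider.</p>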
<p>Next, I made <code>impl_provider</code> abstract (removing the <code>= Impl_provider.impl_provider</code>), letting the compiler find all the places where the code assumed the concrete type.</p>
<p>First, the solver was getting the candidate implementations for an interface from the <code>impl_provider</code>.
The <code>impl_provider</code> was actually returning several things: the valid candidates, the rejects, and an optional conflicting interface (used when one interface replaces another).
The solver doesn't use the rejects, which are only needed for the diagnostics system, so we can simplify that interface here.</p>
<p>Secondly, not all dependencies need to be considered (e.g. a Windows-only dependency when we're on Linux).
Since the <code>impl_provider</code> already knows the platform in order to filter out incompatible binaries, we also use it to filter the dependencies.
Our new module type is:</p>
<figure class="code"><figcaption><span>sigs.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_INPUT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl_provider</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span> <span class="o">=</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">generic_implementation</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dependency</span> <span class="o">=</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">dependency</span>
</span><span class="line">
</span><span class="line">  <span class="k">type</span> <span class="n">iface_info</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="n">replacement</span> <span class="o">:</span> <span class="n">iface_uri</span> <span class="n">option</span><span class="o">;</span>
</span><span class="line">    <span class="n">impls</span> <span class="o">:</span> <span class="n">impl</span> <span class="kt">list</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">implementations</span> <span class="o">:</span> <span class="n">impl_provider</span> <span class="o">-&gt;</span> <span class="n">iface_uri</span> <span class="o">-&gt;</span>
</span><span class="line">                        <span class="n">source</span><span class="o">:</span><span class="kt">bool</span> <span class="o">-&gt;</span> <span class="n">iface_info</span>
</span><span class="line">  <span class="k">val</span> <span class="n">is_dep_needed</span> <span class="o">:</span> <span class="n">impl_provider</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>At this point, I'm not trying to improve the interface, just to find out what it is.
Continuing to make the types abstract in this way is a fairly mechanical process, which led to:</p>
<figure class="code"><figcaption><span>sigs.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_INPUT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>  <span class="c">(* new name for impl_provider *)</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dependency</span>
</span><span class="line">  <span class="k">type</span> <span class="n">restriction</span>
</span><span class="line">  <span class="k">type</span> <span class="n">iface_info</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="n">replacement</span> <span class="o">:</span> <span class="n">iface_uri</span> <span class="n">option</span><span class="o">;</span>
</span><span class="line">    <span class="n">impls</span> <span class="o">:</span> <span class="n">impl</span> <span class="kt">list</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="k">val</span> <span class="n">implementations</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">iface_uri</span> <span class="o">-&gt;</span> <span class="n">source</span><span class="o">:</span><span class="kt">bool</span> <span class="o">-&gt;</span> <span class="n">iface_info</span>
</span><span class="line">  <span class="k">val</span> <span class="n">get_command</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="n">command</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">requires</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_requires</span> <span class="o">:</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">machine_group</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="nn">Arch</span><span class="p">.</span><span class="n">machine_group</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">restrictions</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">meets_restriction</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">dep_iface</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="n">iface_uri</span>
</span><span class="line">  <span class="k">val</span> <span class="n">dep_required_commands</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">dep_essential</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">restricts_only</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">is_dep_needed</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">impl_self_commands</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_self_commands</span> <span class="o">:</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">impl_to_string</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_to_string</span> <span class="o">:</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The main addition is the <code>command</code> type, which is essentially an optional entry point to an implementation.
For example, if a library can also be used as a program then it may provide a &quot;run&quot; command, perhaps adding a dependency on an option parser library.
Programs often also provide a &quot;test&quot; command for running the unit-tests, etc.
There are also two <code>to_string</code> functions at the end for debugging and diagnostics.</p>
<p>Having elicited this API between the solver and the rest of the system, it was clear that it had a few flaws:</p>
<ul>
<li>
<p><code>is_dep_needed</code> is pointless. There are only two places where we pass a dependency to the solver (<code>requires</code> and <code>command_requires</code>), so we can just filter the unnecessary dependencies out there and not bother the solver core with them at all.
The abstraction ensures there's no other way for the solver core to get a dependency.</p>
</li>
<li>
<p><code>impl_self_commands</code> and <code>command_self_commands</code> worried me.
These are used for the (rare) case that one command in an implementation depends on another command in the same implementation.
This might happen if, for example, the &quot;test&quot; command wants to test the &quot;run&quot; command.
Logically, these are just another kind of dependency; returning them separately means code that follows dependencies might forget them.</p>
<p>Sure enough, there was just such a bug in the code.
When we build the SAT problem we <em>do</em> consider self commands (so we always find a valid result), but when we're optimising the result we ignore them, possibly leading to non-optimal solutions.
I added a unit-test and made <code>requires</code> return both dependencies and self commands together to avoid the same mistake in future.</p>
</li>
<li>
<p>For a similar reason, I replaced <code>dep_iface</code>, <code>dep_required_commands</code>, <code>restricts_only</code> and <code>dep_essential</code> with a single <code>dep_info</code> function returning a record type.</p>
</li>
<li>
<p>I added <code>type command_name = private string</code>.
This means that the solver can't confuse command names with other strings and makes the type signature more obvious.
I didn't make it fully abstract, but was a bit lazy and used <code>private</code>, which lets the solver cast a command name to a string for debug logging and use command names as keys in a <code>StringMap</code>.</p>
</li>
<li>
<p>There is a boolean <code>source</code> attribute in <code>do_solve</code> and <code>implementations</code>.
This is used if the user wants to select source code rather than a binary.
I wanted to support the case of a compiler that is compiled using an older version of itself (though I never completed this work).
In that case, we need to select different versions of the same interface, so the solver actually picks a unique implementation for each (interface, source) pair.</p>
<p>I tried giving these pairs a new abstract type - &quot;role&quot; - and that simplified things nicely.
It turned out that every place where we passed only an interface (e.g. <code>dep_iface</code>), we eventually ended up doing <code>(iface, false)</code> to get back to a role, so I was able to replace these with roles too.</p>
<p>This is quite significant.
Currently, the main interface can be source or binary but dependencies are always binary.
For example, source code may depend on a compiler, build tool, etc.
People have wondered in the past how easy it would be to support dependencies on source code too - it turns out this now requires no changes to the solver, just an extra attribute in the XML format!</p>
</li>
<li>
<p>With the role type now abstract, I removed <code>Model.t</code> (the <code>impl_provider</code>) and moved it inside the role type.
This simplifies the API and allows us to use different providers for different roles (imagine solving for components to cross-compile a program; some dependencies like <code>make</code> should be for the build platform, while others are for the target).</p>
</li>
</ul>
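<p>As a small illustration of the <code>private</code> point in the list above, here is a self-contained sketch. The <code>Command</code> module and its <code>of_string</code> function are hypothetical names chosen for the example, and <code>StringMap</code> is defined here with the standard library's <code>Map.Make</code> rather than taken from 0install.</p>

```ocaml
(* Hypothetical module: outside code sees a name as a private string. *)
module Command : sig
  type name = private string
  val of_string : string -> name
end = struct
  type name = string
  let of_string s = s
end

module StringMap = Map.Make (String)

let run = Command.of_string "run"

(* Reading is allowed: a coercion turns a name back into a string. *)
let as_string = (run :> string)

(* The coerced value can be used as an ordinary map key. *)
let priorities = StringMap.add (run :> string) 1 StringMap.empty

(* Writing is not allowed: the next line would be rejected by the type checker.
   let bad : Command.name = "run" *)
```

<p>A fully abstract type would need its own <code>to_string</code> and <code>compare</code> functions instead of the coercion.</p>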
<p>Here's the new API:</p>
<figure class="code"><figcaption><span>sigs.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_INPUT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Role</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">    <span class="k">type</span> <span class="n">t</span>
</span><span class="line">    <span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">    <span class="k">val</span> <span class="n">compare</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">int</span>
</span><span class="line">  <span class="k">end</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command</span>
</span><span class="line">  <span class="k">type</span> <span class="n">restriction</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command_name</span> <span class="o">=</span> <span class="k">private</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dependency</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="n">dep_role</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span><span class="o">;</span>
</span><span class="line">    <span class="n">dep_restrictions</span> <span class="o">:</span> <span class="n">restriction</span> <span class="kt">list</span><span class="o">;</span>
</span><span class="line">    <span class="n">dep_importance</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="n">essential</span> <span class="o">|</span> <span class="o">`</span><span class="n">recommended</span> <span class="o">|</span> <span class="o">`</span><span class="n">restricts</span> <span class="o">];</span>
</span><span class="line">    <span class="n">dep_required_commands</span> <span class="o">:</span> <span class="n">command_name</span> <span class="kt">list</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="k">type</span> <span class="n">role_information</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="n">replacement</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="n">option</span><span class="o">;</span>
</span><span class="line">    <span class="n">impls</span> <span class="o">:</span> <span class="n">impl</span> <span class="kt">list</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="k">type</span> <span class="n">requirements</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="n">role</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span><span class="o">;</span>
</span><span class="line">    <span class="n">command</span> <span class="o">:</span> <span class="n">command_name</span> <span class="n">option</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="k">val</span> <span class="n">implementations</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">role_information</span>
</span><span class="line">  <span class="k">val</span> <span class="n">get_command</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">command_name</span> <span class="o">-&gt;</span> <span class="n">command</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">requires</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span> <span class="o">*</span> <span class="n">command_name</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_requires</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span> <span class="o">*</span> <span class="n">command_name</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">machine_group</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="nn">Arch</span><span class="p">.</span><span class="n">machine_group</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">meets_restriction</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">impl_to_string</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_to_string</span> <span class="o">:</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: <code>Role</code> is a submodule to make it easy to use it as the key in a map.</p>
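<p>A sketch of why the submodule shape is convenient: a module with a type <code>t</code> and a <code>compare</code> function can be passed straight to the standard library's <code>Map.Make</code>, even while <code>t</code> stays abstract. The <code>INPUT</code> signature, the <code>count_roles</code> helper and the record used for the toy role are assumptions made for this example.</p>

```ocaml
(* Hypothetical, cut-down input signature with an abstract role. *)
module type INPUT = sig
  module Role : sig
    type t
    val to_string : t -> string
    val compare : t -> t -> int
  end
end

module Make (Model : INPUT) = struct
  (* Role has a type t and a compare function, so it can be a map key. *)
  module RoleMap = Map.Make (Model.Role)

  (* Count the distinct roles in a list by using them as keys. *)
  let count_roles roles =
    List.fold_left (fun m r -> RoleMap.add r () m) RoleMap.empty roles
    |> RoleMap.cardinal
end

(* Toy model: a role is an interface URI plus the source flag. *)
module Toy = struct
  module Role = struct
    type t = { iface : string; source : bool }
    let to_string r = if r.source then r.iface ^ "#source" else r.iface
    let compare (a : t) (b : t) = Stdlib.compare a b
  end
end

module Core = Make (Toy)

let bin = { Toy.Role.iface = "http://example.com/compiler.xml"; source = false }
let src = { bin with Toy.Role.source = true }

(* The binary and source roles of one interface are distinct keys. *)
let distinct = Core.count_roles [bin; src; bin]
```

<p>Because the binary and source roles compare as different keys, a map like this can hold one selection for each of them.</p>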
<p>Hopefully you find it much easier to understand what the solver does (and doesn't do) from this type.
The <code>Solver_core</code> code no longer depends on the rest of 0install and can be understood on its own.</p>
<p>The remaining code in <code>Solver</code> defines the implementation of a <code>SOLVER_INPUT</code> module and applies the functor, like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">CoreModel</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span> <span class="o">=</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">generic_implementation</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command</span> <span class="o">=</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">command</span>
</span><span class="line">  <span class="k">type</span> <span class="n">restriction</span> <span class="o">=</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">restriction</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command_name</span> <span class="o">=</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dependency</span> <span class="o">=</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">*</span> <span class="nn">Impl</span><span class="p">.</span><span class="n">dependency</span>
</span><span class="line">  <span class="o">...</span>
</span><span class="line">  <span class="k">let</span> <span class="n">get_command</span> <span class="n">impl</span> <span class="n">name</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">StringMap</span><span class="p">.</span><span class="n">find</span> <span class="n">name</span> <span class="nn">Impl</span><span class="p">.</span><span class="o">(</span><span class="n">impl</span><span class="o">.</span><span class="n">props</span><span class="o">.</span><span class="n">commands</span><span class="o">)</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">dep_info</span> <span class="o">(</span><span class="n">role</span><span class="o">,</span> <span class="n">dep</span><span class="o">)</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="n">dep_role</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">      <span class="n">scope</span> <span class="o">=</span> <span class="n">role</span><span class="o">.</span><span class="n">scope</span><span class="o">;</span>
</span><span class="line">      <span class="n">iface</span> <span class="o">=</span> <span class="n">dep</span><span class="o">.</span><span class="nn">Impl</span><span class="p">.</span><span class="n">dep_iface</span><span class="o">;</span>
</span><span class="line">      <span class="c">(* note: only dependencies on binaries supported for now. *)</span>
</span><span class="line">      <span class="n">source</span> <span class="o">=</span> <span class="bp">false</span><span class="o">;</span>
</span><span class="line">    <span class="o">};</span>
</span><span class="line">    <span class="n">dep_importance</span> <span class="o">=</span> <span class="n">dep</span><span class="o">.</span><span class="nn">Impl</span><span class="p">.</span><span class="n">dep_importance</span><span class="o">;</span>
</span><span class="line">    <span class="n">dep_required_commands</span> <span class="o">=</span> <span class="n">dep</span><span class="o">.</span><span class="nn">Impl</span><span class="p">.</span><span class="n">dep_required_commands</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="o">...</span>
</span><span class="line"><span class="k">end</span>
</span><span class="line">
</span><span class="line"><span class="k">module</span> <span class="nc">Core</span> <span class="o">=</span> <span class="nn">Solver_core</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">CoreModel</span><span class="o">)</span>
</span><span class="line"><span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>CoreModel</code> implementation of <code>SOLVER_INPUT</code> simply maps the abstract types and functions to the real ones.
The limitation that dependencies are always binary is easier to see here, and it's fairly obvious how to fix it.</p>
<p>Note that we <em>don't</em> define the module as <code>module CoreModel : SOLVER_INPUT = ...</code>.
The rest of the code in <code>Solver</code> still needs to see the concrete types; only <code>Solver_core</code> is restricted to see it just as a <code>SOLVER_INPUT</code>.</p>
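<p>As a rough sketch of why the module is left unsealed (all the names here, <code>INPUT</code>, <code>Make</code>, <code>Model</code> and <code>Sealed</code>, are made-up stand-ins rather than the real 0install modules), compare what the rest of the program can see in each case:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical stand-in for SOLVER_INPUT: the functor sees only this. *)
module type INPUT = sig
  type impl
  val impl_to_string : impl -&gt; string
end

(* Hypothetical stand-in for Solver_core.Make. *)
module Make (M : INPUT) = struct
  let describe (i : M.impl) = "selected " ^ M.impl_to_string i
end

(* Left unsealed, so code outside the functor still knows impl = string. *)
module Model = struct
  type impl = string
  let impl_to_string i = i
end

module Core = Make (Model)

(* Type checks, because Model.impl is known to be string out here. *)
let message = Core.describe "gcc-4.8"

(* Sealing hides that equality: Sealed.impl is abstract, so passing a
   plain string to Sealed.impl_to_string would be a type error. *)
module Sealed : INPUT = Model
</code></pre></td></tr></tbody></table></div></figure>
<p>Inside <code>Make</code>, <code>M.impl</code> is abstract either way; the difference is only in what callers of the resulting module are allowed to know.</p>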
<h2 id="comparison-with-java">Comparison with Java</h2>
<p>Using functors for this seemed pretty easy, and I started wondering how I'd solve this problem in other languages.
Python simply can't do this kind of thing, of course - there you have to read all the code to understand what it does.
In Java, we might declare some abstract interfaces, though:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">interface</span> <span class="nc">RoleInfo</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="n">List</span><span class="o">&lt;</span><span class="n">Impl</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">get_implementations</span><span class="p">();</span>
</span><span class="line"><span class="w">  </span><span class="n">Role</span><span class="w"> </span><span class="nf">get_replacement</span><span class="p">();</span>
</span><span class="line"><span class="p">}</span>
</span><span class="line"><span class="kd">interface</span> <span class="nc">Impl</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="n">Command</span><span class="w"> </span><span class="nf">get_command</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">name</span><span class="p">);</span>
</span><span class="line"><span class="w">  </span><span class="n">String</span><span class="w"> </span><span class="nf">machine_group</span><span class="p">();</span>
</span><span class="line"><span class="w">  </span><span class="n">List</span><span class="o">&lt;</span><span class="n">Dependency</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">requires</span><span class="p">();</span>
</span><span class="line"><span class="p">}</span>
</span><span class="line"><span class="kd">interface</span> <span class="nc">Restriction</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="kt">boolean</span><span class="w"> </span><span class="nf">meets_restriction</span><span class="p">(</span><span class="n">Impl</span><span class="w"> </span><span class="n">i</span><span class="p">);</span>
</span><span class="line"><span class="p">}</span>
</span><span class="line"><span class="kd">interface</span> <span class="nc">Dependency</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="n">Role</span><span class="w"> </span><span class="nf">get_role</span><span class="p">();</span>
</span><span class="line"><span class="w">  </span><span class="n">List</span><span class="o">&lt;</span><span class="n">Restriction</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">get_restrictions</span><span class="p">();</span>
</span><span class="line"><span class="w">  </span><span class="n">Importance</span><span class="w"> </span><span class="nf">get_importance</span><span class="p">();</span>
</span><span class="line"><span class="w">  </span><span class="n">List</span><span class="o">&lt;</span><span class="n">String</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">get_required_commands</span><span class="p">();</span>
</span><span class="line"><span class="p">}</span>
</span><span class="line"><span class="kd">interface</span> <span class="nc">Role</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="n">RoleInfo</span><span class="w"> </span><span class="nf">get_implementations</span><span class="p">();</span>
</span><span class="line"><span class="p">}</span>
</span><span class="line"><span class="kd">class</span> <span class="nc">Solver</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">  </span><span class="kd">public</span><span class="w"> </span><span class="n">Map</span><span class="o">&lt;</span><span class="n">Role</span><span class="p">,</span><span class="n">Impl</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">do_solve</span><span class="p">(</span><span class="n">Role</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="p">...</span>
</span><span class="line"><span class="w">  </span><span class="p">}</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>There's a problem, though.
We can create a <code>ConcreteRole</code> and pass that to <code>Solver.do_solve</code>, but we'll get back a map from abstract roles to abstract impls.
We need to get concrete types out to do anything useful with the result.</p>
<p>A Java programmer would probably cast the results back to the concrete types, but there's a problem with this (beyond the obvious fact that it's not statically type checked):
If we accept dynamic casting as a legitimate technique (OCaml doesn't support it), there's nothing to stop the abstract solver core from doing it too.
We're back to reading all the code to find out what information it really uses.</p>
<p>There are other places where dynamic casts are needed too, such as in <code>meets_restriction</code> (which needs a concrete implementation, not an abstract one).</p>
<p>I did try using generics, but I didn't manage to get it to compile, and I stopped when I got to:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="w">  </span><span class="kd">public</span><span class="w"> </span><span class="o">&lt;</span><span class="n">I</span><span class="p">,</span><span class="n">R</span><span class="p">,</span><span class="n">D</span><span class="o">&gt;</span><span class="w"> </span><span class="n">Map</span><span class="o">&lt;</span><span class="n">Role</span><span class="o">&lt;</span><span class="n">I</span><span class="p">,</span><span class="n">R</span><span class="p">,</span><span class="n">D</span><span class="o">&gt;</span><span class="p">,</span><span class="n">Impl</span><span class="o">&lt;</span><span class="n">I</span><span class="p">,</span><span class="n">R</span><span class="p">,</span><span class="n">D</span><span class="o">&gt;&gt;</span>
</span><span class="line"><span class="w">         </span><span class="nf">do_solve</span><span class="p">(</span><span class="n">Role</span><span class="o">&lt;</span><span class="n">I</span><span class="p">,</span><span class="n">R</span><span class="p">,</span><span class="n">D</span><span class="o">&gt;</span><span class="w"> </span><span class="n">role</span><span class="p">,</span><span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">command</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="p">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I think it's fair to say that if this ever did compile, it certainly wouldn't have made the code easier to read.</p>
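<p>For comparison, here is a much-reduced OCaml sketch of the same shape. The names (<code>INPUT</code>, <code>Make</code>, <code>Concrete</code>) and the trivial "pick the first candidate" solver are assumptions for illustration only. The point is that the result comes back in terms of the concrete types without any cast:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical, cut-down stand-in for SOLVER_INPUT. *)
module type INPUT = sig
  type role
  type impl
  val implementations : role -&gt; impl list
end

(* Hypothetical solver core: it just picks the first candidate. *)
module Make (M : INPUT) = struct
  let do_solve (r : M.role) : (M.role * M.impl) option =
    match M.implementations r with
    | [] -&gt; None
    | i :: _ -&gt; Some (r, i)
end

(* Concrete types, left unsealed. *)
module Concrete = struct
  type role = { iface : string }
  type impl = { id : string; version : int }
  let implementations r = [ { id = r.iface ^ "#1"; version = 1 } ]
end

module S = Make (Concrete)

(* The result has type (Concrete.role * Concrete.impl) option, so the
   record fields can be read directly. *)
let selected_version =
  match S.do_solve { Concrete.iface = "http://example.com/prog.xml" } with
  | Some (_, impl) -&gt; Some impl.Concrete.version
  | None -&gt; None
</code></pre></td></tr></tbody></table></div></figure>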
<h2 id="diagnostics">Diagnostics</h2>
<p>The <code>Diagnostics</code> module takes a failed solver result (produced with <code>do_solve ~closest_match:true</code>) that is close to what we think the user wanted but with some components left blank, and tries to explain to the user why none of the available candidates was suitable
(see the <a href="http://0install.net/injector-trouble.html">Trouble-shooting</a> guide for some examples and pictures).</p>
<p>I made a <code>SOLVER_RESULT</code> module type which extended <code>SOLVER_INPUT</code> with the final selections and diagnostic information:</p>
<figure class="code"><figcaption><span>sigs.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_RESULT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">include</span> <span class="nc">SOLVER_INPUT</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">  <span class="k">val</span> <span class="n">requirements</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">requirements</span>
</span><span class="line">  <span class="k">val</span> <span class="n">user_restrictions</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">get_selected</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">impl</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">selected_commands</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">command_name</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">explain</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">
</span><span class="line">  <span class="k">type</span> <span class="n">rejection</span>
</span><span class="line">  <span class="k">val</span> <span class="n">rejects</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">impl</span> <span class="o">*</span> <span class="n">rejection</span><span class="o">)</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">describe_problem</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">rejection</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">
</span><span class="line">  <span class="k">type</span> <span class="n">version</span>
</span><span class="line">  <span class="k">val</span> <span class="n">version</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">version</span>
</span><span class="line">  <span class="k">val</span> <span class="n">format_version</span> <span class="o">:</span> <span class="n">version</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">id_of_impl</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">format_machine</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">string_of_restriction</span> <span class="o">:</span> <span class="n">restriction</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">dummy_impl</span> <span class="o">:</span> <span class="n">impl</span>	<span class="c">(** Placeholder for missing impls *)</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: the <code>explain</code> function here is for the diagnostics-of-last-resort; the diagnostics system uses it on cases it can't explain itself, which generally indicates a bug somewhere.</p>
<p>Then I made a <code>Diagnostics.Make</code> functor as with <code>Solver_core</code>.
This means that the diagnostics now sees the same information as the solver, with the above additions.
For example, it sees the same dependencies as the solver did (e.g. we can't forget to filter them out with <code>is_dep_needed</code>).
Like the solver, the diagnostics assumed that a dependency was always a binary dependency and used <code>(iface, false)</code> to get the role.
Since the role is now abstract, it can't do this and should cope with source dependencies automatically.</p>
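<p>A minimal sketch of what such a functor might look like, assuming a cut-down result signature (<code>RESULT</code>, <code>Toy</code> and <code>explain_role</code> are hypothetical names, and the real <code>Diagnostics</code> module does far more than this):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical, cut-down stand-in for SOLVER_RESULT. *)
module type RESULT = sig
  type t
  type role
  type impl
  type rejection
  val get_selected : t -&gt; role -&gt; impl option
  val rejects : role -&gt; (impl * rejection) list
  val describe_problem : impl -&gt; rejection -&gt; string
end

(* Hypothetical stand-in for Diagnostics.Make: it can use only what
   RESULT exposes, so it cannot build a role from (iface, false). *)
module Make (R : RESULT) = struct
  let explain_role result role =
    match R.get_selected result role with
    | Some _ -&gt; []
    | None -&gt;
      List.map (fun (i, why) -&gt; R.describe_problem i why) (R.rejects role)
end

(* A toy result in which nothing was selected. *)
module Toy = struct
  type t = unit
  type role = string
  type impl = string
  type rejection = string
  let get_selected () _role = None
  let rejects _role = [ ("prog-1.0", "wrong architecture") ]
  let describe_problem i why = i ^ ": " ^ why
end

module D = Make (Toy)

let report = D.explain_role () "prog"
</code></pre></td></tr></tbody></table></div></figure>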
<p>The new API prompted me to consider self-command dependencies again, so the diagnostics code can now correctly explain problems caused by missing self-commands (previously, I forgot to handle this case).</p>
<h2 id="selections">Selections</h2>
<p>Sometimes we use the results of the solver directly.
In other cases, we save them to disk as an <a href="http://0install.net/selections-spec.html">XML selections document</a> first.
These XML documents are handled by the <code>Selections</code> module, which had its own API.</p>
<p>For consistency, I decided to share type names and methods as much as possible.
I split out the core of <code>SOLVER_INPUT</code> into another module type:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">CORE_MODEL</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Role</span> <span class="o">:</span> <span class="k">sig</span> <span class="o">...</span> <span class="k">end</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command_name</span> <span class="o">=</span> <span class="k">private</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dependency</span>
</span><span class="line">  <span class="k">type</span> <span class="n">dep_info</span> <span class="o">=</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
</span><span class="line">  <span class="k">type</span> <span class="n">requirements</span> <span class="o">=</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
</span><span class="line">  <span class="k">val</span> <span class="n">requires</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span> <span class="o">*</span> <span class="n">command_name</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_requires</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="n">dependency</span> <span class="kt">list</span> <span class="o">*</span> <span class="n">command_name</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">dep_info</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="n">dep_info</span>
</span><span class="line">  <span class="k">val</span> <span class="n">get_command</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">command_name</span> <span class="o">-&gt;</span> <span class="n">command</span> <span class="n">option</span>
</span><span class="line"><span class="k">end</span>
</span><span class="line">
</span><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_INPUT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">include</span> <span class="nc">CORE_MODEL</span>
</span><span class="line">  <span class="k">type</span> <span class="n">role_information</span> <span class="o">=</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
</span><span class="line">  <span class="k">type</span> <span class="n">restriction</span>
</span><span class="line">  <span class="k">val</span> <span class="n">impl_to_string</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">command_to_string</span> <span class="o">:</span> <span class="n">command</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">implementations</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">role_information</span>
</span><span class="line">  <span class="k">val</span> <span class="n">restrictions</span> <span class="o">:</span> <span class="n">dependency</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">meets_restriction</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="o">-&gt;</span> <span class="kt">bool</span>
</span><span class="line">  <span class="k">val</span> <span class="n">machine_group</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="nn">Arch</span><span class="p">.</span><span class="n">machine_group</span> <span class="n">option</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Actually, there is some overlap with <code>SOLVER_RESULT</code> too, so I created a <code>SELECTIONS</code> type as well:</p>
<p><img src="/blog/images/solver/sigs.png" class="border center"/></p>
<p>Now the relationship becomes clear.
<code>SOLVER_INPUT</code> extends the core model with ways to get the possible candidates and restrictions on their use.
<code>SELECTIONS</code> extends the core with ways to find out which implementations were selected.
<code>SOLVER_RESULT</code> combines the above two, providing extra information for diagnostics by relating the selections back to the candidates (information that isn't available when loading saved selections).</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SELECTIONS</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">  <span class="k">type</span> <span class="n">role</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command_name</span>
</span><span class="line">  <span class="k">type</span> <span class="n">requirements</span>
</span><span class="line">  <span class="k">val</span> <span class="n">get_selected</span> <span class="o">:</span> <span class="n">role</span> <span class="o">-&gt;</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">impl</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">selected_commands</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">role</span> <span class="o">-&gt;</span> <span class="n">command_name</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">val</span> <span class="n">requirements</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">requirements</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">RoleMap</span> <span class="o">:</span> <span class="nc">MAP</span> <span class="k">with</span> <span class="k">type</span> <span class="n">key</span> <span class="o">=</span> <span class="n">role</span>
</span><span class="line"><span class="k">end</span>
</span><span class="line">
</span><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">SOLVER_RESULT</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="k">include</span> <span class="nc">SOLVER_INPUT</span>
</span><span class="line">  <span class="k">include</span> <span class="nc">SELECTIONS</span> <span class="k">with</span>
</span><span class="line">    <span class="k">type</span> <span class="n">impl</span> <span class="o">:=</span> <span class="n">impl</span> <span class="ow">and</span>
</span><span class="line">    <span class="k">type</span> <span class="n">command_name</span> <span class="o">:=</span> <span class="n">command_name</span> <span class="ow">and</span>
</span><span class="line">    <span class="k">type</span> <span class="n">requirements</span> <span class="o">:=</span> <span class="n">requirements</span> <span class="ow">and</span>
</span><span class="line">    <span class="k">type</span> <span class="n">role</span> <span class="o">=</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">type</span> <span class="n">rejection</span>
</span><span class="line">  <span class="k">val</span> <span class="n">rejects</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">impl</span> <span class="o">*</span> <span class="n">rejection</span><span class="o">)</span> <span class="kt">list</span>
</span><span class="line">  <span class="k">type</span> <span class="n">version</span>
</span><span class="line">  <span class="k">val</span> <span class="n">version</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">version</span>
</span><span class="line">  <span class="k">val</span> <span class="n">format_version</span> <span class="o">:</span> <span class="n">version</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">user_restrictions</span> <span class="o">:</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">restriction</span> <span class="n">option</span>
</span><span class="line">  <span class="k">val</span> <span class="n">id_of_impl</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">format_machine</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">string_of_restriction</span> <span class="o">:</span> <span class="n">restriction</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">describe_problem</span> <span class="o">:</span> <span class="n">impl</span> <span class="o">-&gt;</span> <span class="n">rejection</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">explain</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span>
</span><span class="line">  <span class="k">val</span> <span class="n">raw_selections</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">impl</span> <span class="nn">RoleMap</span><span class="p">.</span><span class="n">t</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I had a bit of trouble here.
I wanted to include <code>CORE_MODEL</code> in <code>SELECTIONS</code>, but doing that caused an error when I tried to bring them together in <code>SOLVER_RESULT</code>,
because of the duplicate <code>Role</code> submodule.
So instead I just define the types I need and let the user of the signature link them up.</p>
<p>Update: I've since discovered that you can just do <code>with module Role := Role</code> to solve this.</p>
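<p>A sketch of how that could look, using cut-down versions of the signatures (the bodies here are simplified assumptions, and <code>Example</code> is a made-up implementation added only to show that the combined signature is usable):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">module type CORE_MODEL = sig
  module Role : sig type t end
  type impl
end

module type SOLVER_INPUT = sig
  include CORE_MODEL
  val implementations : Role.t -&gt; impl list
end

(* Here SELECTIONS includes the core model directly. *)
module type SELECTIONS = sig
  include CORE_MODEL
  type t
  val get_selected : Role.t -&gt; t -&gt; impl option
end

(* Including both as-is would declare Role and impl twice.
   Destructive substitution drops the second copies and points
   their uses at the ones already in scope. *)
module type SOLVER_RESULT = sig
  include SOLVER_INPUT
  include SELECTIONS with module Role := Role and type impl := impl
end

(* Hypothetical implementation, with roles and impls as plain strings. *)
module Example : SOLVER_RESULT
  with type Role.t = string
   and type impl = string
   and type t = (string * string) list = struct
  module Role = struct type t = string end
  type impl = string
  type t = (string * string) list
  let implementations r = [ r ^ "-1.0" ]
  let get_selected r sels = List.assoc_opt r sels
end

let chosen = Example.get_selected "prog" [ ("prog", "prog-1.0") ]
</code></pre></td></tr></tbody></table></div></figure>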
<p>Correct use of the <code>with</code> keyword seems to be the key to a happy life with OCaml functors.
When defining the <code>RoleMap</code> submodule in <code>SELECTIONS</code> I use it to let users know the keys of a <code>RoleMap</code> are the same type as <code>SELECTIONS.role</code> (otherwise the key type would be abstract and you couldn't assume anything about it).
In <code>SOLVER_RESULT</code>, I use it to link the types in <code>SELECTIONS</code> with the types in <code>SOLVER_INPUT</code>.</p>
<p>Notice the use of <code>=</code> vs <code>:=</code>.
<code>=</code> says that two types are the same.
<code>:=</code> additionally removes the type from the module signature.
We use <code>:=</code> for <code>impl</code> because we already have a type with that name from <code>SOLVER_INPUT</code> and we can't have two.
However, we use <code>=</code> for <code>role</code> because that doesn't exist in <code>CORE_MODEL</code> and we'd like <code>SOLVER_RESULT</code> to include everything in <code>SELECTIONS</code>.</p>
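<p>To make the difference concrete, here is a minimal, self-contained sketch.
The signatures <code>HAS_IMPL</code>, <code>HAS_ROLE</code> and <code>COMBINED</code> are made up for illustration and are not part of 0install:</p>
<pre><code class="ocaml">(* Hypothetical signatures, for illustration only (not from 0install). *)
module type HAS_IMPL = sig
  type impl
  val name : impl -&gt; string
end

module type HAS_ROLE = sig
  type impl
  type role
  val role_of : impl -&gt; role
end

(* Including both signatures unchanged would declare [impl] twice.
   [:=] substitutes HAS_ROLE's [impl] away, so only the first one remains.
   [=] keeps [role] in the result, but reveals that it is a [string]. *)
module type COMBINED = sig
  include HAS_IMPL
  include HAS_ROLE with type impl := impl and type role = string
end

module M : COMBINED with type impl = int = struct
  type impl = int
  type role = string
  let name = string_of_int
  let role_of i = "role-" ^ string_of_int i
end

let () = print_endline (M.role_of 3 ^ " " ^ M.name 3)
</code></pre>
<p>Because <code>role</code> was constrained with <code>=</code>, it is still a member of <code>COMBINED</code> and users can treat <code>M.role_of 3</code> as an ordinary string.</p>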
<p>Finally, I included the new <code>SELECTIONS</code> signature in the interface file for the <code>Selections</code> module:</p>
<figure class="code"><figcaption><span>selections.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">type</span> <span class="n">selection</span> <span class="o">=</span> <span class="o">[`</span><span class="n">selection</span><span class="o">]</span> <span class="nn">Element</span><span class="p">.</span><span class="n">t</span>
</span><span class="line"><span class="k">type</span> <span class="n">role</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">  <span class="n">iface</span> <span class="o">:</span> <span class="nn">General</span><span class="p">.</span><span class="n">iface_uri</span><span class="o">;</span>
</span><span class="line">  <span class="n">source</span> <span class="o">:</span> <span class="kt">bool</span><span class="o">;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line"><span class="k">include</span> <span class="nn">Sigs</span><span class="p">.</span><span class="nc">CORE_MODEL</span> <span class="k">with</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span> <span class="o">=</span> <span class="n">selection</span> <span class="ow">and</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command_name</span> <span class="o">=</span> <span class="kt">string</span> <span class="ow">and</span>
</span><span class="line">  <span class="k">type</span> <span class="nn">Role</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span> <span class="n">role</span>
</span><span class="line"><span class="k">include</span> <span class="nn">Sigs</span><span class="p">.</span><span class="nc">SELECTIONS</span> <span class="k">with</span>
</span><span class="line">  <span class="k">type</span> <span class="n">role</span> <span class="o">:=</span> <span class="n">role</span> <span class="ow">and</span>
</span><span class="line">  <span class="k">type</span> <span class="n">command_name</span> <span class="o">:=</span> <span class="n">command_name</span> <span class="ow">and</span>
</span><span class="line">  <span class="k">type</span> <span class="n">requirements</span> <span class="o">:=</span> <span class="n">requirements</span> <span class="ow">and</span>
</span><span class="line">  <span class="k">type</span> <span class="n">impl</span> <span class="o">:=</span> <span class="n">selection</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>With this change, anything that uses <code>SELECTIONS</code> can work with both loaded selections and with the solver output, even though they have different implementations.
For example, the <code>Tree</code> module generates a tree (for display) from a selections dependency graph (pruning loops and diamonds).
It's now a functor, which can be used with either.
The <code>0install show selections.xml</code> command, for instance, applies it to the <code>Selections</code> module:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">SelectionsTree</span> <span class="o">=</span> <span class="nn">Tree</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">Selections</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="o">...</span> <span class="nn">SelectionsTree</span><span class="p">.</span><span class="n">as_tree</span> <span class="n">sels</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The same code is used in the GUI to render the tree view, but now with <code>Make(Solver.Model)</code>.
As before, it's important to preserve the types - the GUI needs to know that each node in the tree is a <code>Solver.Model.Role.t</code> so that it can show information about the available candidates, for example.</p>
<h2 id="summary">Summary</h2>
<p>An important function of a package manager is finding a set of package versions that are compatible.
An efficient way to do this is to express the necessary and sufficient constraints as a set of boolean equations and then use a SAT solver to find a solution.
While finding a valid solution is easy, finding the optimal one can be harder.
Because 0install is able to install libraries in parallel and can choose to use different versions for different applications, it only needs to consider one application at a time.
As well as being faster, this makes it possible to use a simple definition of optimality that is easy to compute.</p>
<p>0install's existing solver code has already been broken down into modular components: downloading metadata, collecting candidates, rejecting invalid candidates and ranking the rest, building the SAT problem and solving it.
However, the code that builds the SAT problem and optimises the solution was tightly coupled to the concrete representation, making it harder to see what it was doing and harder to extend it with new features.
Its type signature essentially just said that it takes XML as input and returns XML as output.</p>
<p>OCaml functors are functions over modules.
They allow a module to declare the interface it expects from its dependencies in an abstract way, providing just the information the module requires and nothing else.
The module can then be compiled against this abstract interface, ensuring that it makes no assumptions about the actual types.
Later, the functor can be applied to the concrete representation to get a module that uses the concrete types.</p>
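<p>As a rough illustration of that idea (a toy, not the 0install solver), the sketch below uses a hypothetical <code>INPUT</code> signature and a <code>Make</code> functor that picks the candidate with the highest version.
All the names here are assumptions made for the example:</p>
<pre><code class="ocaml">(* Hypothetical interface: the only things this toy "solver" may assume. *)
module type INPUT = sig
  type impl
  val version : impl -&gt; int
end

module Make (I : INPUT) = struct
  (* Compiled against the abstract [I.impl]; it can only call [I.version]. *)
  let best (candidates : I.impl list) : I.impl option =
    List.fold_left (fun acc c -&gt;
      match acc with
      | Some b when I.version b &gt;= I.version c -&gt; acc
      | _ -&gt; Some c
    ) None candidates
end

(* A concrete representation, supplied later. *)
module Strings = struct
  type impl = string
  let version = String.length
end

module S = Make (Strings)

(* The result is a concrete [string option], not an abstract value. *)
let () =
  match S.best ["a"; "ccc"; "bb"] with
  | Some s -&gt; print_endline s
  | None -&gt; print_endline "none"
</code></pre>
<p>The functor body cannot inspect an <code>impl</code> directly, yet the caller of <code>S.best</code> gets back a plain string it can use normally.</p>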
<p>Turning the existing solver code into a functor turned out to be a simple iterative process that discovered the existing implicit API between the solver and the rest of the code.
Once this abstract API had been found, many possible improvements became obvious.
The new solver core is simpler than the original and can be understood on its own without looking at the rest of the code.
It is also more flexible: we could now add support for source dependencies, cross-compilation, etc., without changing the core of the solver.
The challenge now is only how to express these things in the XML format.</p>
<p>In a language without functors, such as Java, we could still define the solver to work over abstract interfaces, but the results returned would also be abstract, which is not useful.
Trying to achieve the same effect as functors using generics appears very difficult and the resulting code would likely be hard to read.</p>
<p>Splitting up the abstract interface into multiple module types allowed parts of the interface to be shared with the separate selections-handling module.
This in turn allowed another module - for turning selections into trees - to become a functor that could also work directly on the solver results.
Finally, it made the relationship between the solver results and the selections type clear - solver results are selections plus diagnostics information.</p>
<p>The code discussed in this post can be found at <a href="https://github.com/0install/0install">https://github.com/0install/0install</a>.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Optimising the unikernel</title>
    <link href="https://roscidus.com/blog/blog/2014/08/15/optimising-the-unikernel/"></link>
    <updated>2014-08-15T11:05:27+00:00</updated>
    <id>https://roscidus.com/blog/blog/2014/08/15/optimising-the-unikernel</id>
    <content type="html"><![CDATA[<p>After <a href="/blog/blog/2014/07/28/my-first-unikernel/">creating my REST queuing service as a Mirage unikernel</a>, I reported that it could serve the data at 2.46 MB/s from my ARM CubieTruck dev board.
That's fast enough for my use (it's faster than my Internet connection), but I was curious why it was slower than the Linux guest, which serves files with <code>nc</code> at 20 MB/s.</p>
<!-- more -->
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=8206882">Hacker News</a> and <a href="http://www.reddit.com/r/programming/comments/2e7jbx/optimising_the_unikernel_thomas_leonards_blog/">Reddit</a>.)</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-tcp-test-case">The TCP test-case</a>
</li>
<li><a href="#compiler-optimisations">Compiler optimisations</a>
</li>
<li><a href="#profiling-support">Profiling support</a>
</li>
<li><a href="#profiling-the-console">Profiling the console</a>
</li>
<li><a href="#profiling-udp">Profiling UDP</a>
</li>
<li><a href="#profiling-tcp">Profiling TCP</a>
</li>
<li><a href="#profiling-disk-access">Profiling disk access</a>
<ul>
<li><a href="#update-linux-is-slow-too">Update: Linux is slow too!</a>
</li>
</ul>
</li>
<li><a href="#profiling-the-queuing-service">Profiling the queuing service</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="the-tcp-test-case">The TCP test-case</h2>
<p>To avoid confusing things by testing the disk and the network at the same time, I made a simpler test case that waits for a TCP connection
and transmits a pre-allocated buffer multiple times:</p>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span><span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span> <span class="o">(</span><span class="nc">S</span><span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">STACKV4</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">let</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Io_page</span><span class="p">.</span><span class="n">pages</span> <span class="mi">16</span> <span class="o">|&gt;</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="nn">Io_page</span><span class="p">.</span><span class="n">to_cstruct</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="n">s</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">S</span><span class="p">.</span><span class="n">listen_tcpv4</span> <span class="n">s</span> <span class="o">~</span><span class="n">port</span><span class="o">:</span><span class="mi">80</span> <span class="o">(</span><span class="k">fun</span> <span class="n">flow</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="k">let</span> <span class="n">warmups</span> <span class="o">=</span> <span class="mi">10</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">iterations</span> <span class="o">=</span> <span class="mi">100</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">bytes</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">lenv</span> <span class="n">buffer</span> <span class="o">*</span> <span class="n">iterations</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="k">rec</span> <span class="n">loop</span> <span class="o">=</span> <span class="k">function</span>
</span><span class="line">        <span class="o">|</span> <span class="mi">0</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">        <span class="o">|</span> <span class="n">i</span> <span class="o">-&gt;</span>
</span><span class="line">            <span class="n">lwt</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">S</span><span class="p">.</span><span class="nn">TCPV4</span><span class="p">.</span><span class="n">writev</span> <span class="n">flow</span> <span class="n">buffer</span> <span class="k">in</span>
</span><span class="line">            <span class="n">loop</span> <span class="o">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">      <span class="n">loop</span> <span class="n">warmups</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">lwt</span> <span class="n">time</span> <span class="o">=</span> <span class="nn">Profile</span><span class="p">.</span><span class="n">time</span> <span class="o">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="n">loop</span> <span class="n">iterations</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">      <span class="nn">S</span><span class="p">.</span><span class="nn">TCPV4</span><span class="p">.</span><span class="n">close</span> <span class="n">flow</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;Wrote %d bytes in %.3f seconds (%.2f KB/s)&quot;</span>
</span><span class="line">        <span class="n">bytes</span> <span class="n">time</span>
</span><span class="line">        <span class="o">(</span><span class="n">float_of_int</span> <span class="n">bytes</span> <span class="o">/.</span> <span class="n">time</span> <span class="o">/.</span> <span class="mi">1024</span><span class="o">.))</span>
</span><span class="line">    <span class="o">);</span>
</span><span class="line">
</span><span class="line">    <span class="nn">S</span><span class="p">.</span><span class="n">listen</span> <span class="n">s</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Profile.time</code> just runs the function and returns how long it took in seconds.
I do a few warm-up iterations at the start because TCP starts slowly and we don't want to benchmark that.</p>
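<p>The source of <code>Profile.time</code> isn't shown here, so the following is only a synchronous stand-in to illustrate the idea.
It assumes a hypothetical <code>now</code> clock passed in as an argument (the real function works with Lwt threads), and <code>kb_per_second</code> repeats the arithmetic from the log line above:</p>
<pre><code class="ocaml">(* Synchronous stand-in for Profile.time; [now] is a hypothetical clock
   returning the current time in seconds. *)
let time ~now f =
  let t0 = now () in
  f ();
  now () -. t0

(* Same arithmetic as the log line in the test case. *)
let kb_per_second ~bytes ~seconds =
  float_of_int bytes /. seconds /. 1024.

let () =
  (* A fake clock that advances by 0.5s on every reading. *)
  let ticks = ref 0.0 in
  let now () = let t = !ticks in ticks := t +. 0.5; t in
  let seconds = time ~now (fun () -&gt; ()) in
  Printf.printf "%.3f seconds, %.2f KB/s\n"
    seconds (kb_per_second ~bytes:65536 ~seconds)
</code></pre>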
<h2 id="compiler-optimisations">Compiler optimisations</h2>
<p>While looking at the assembler output during some earlier debugging, I'd noticed that gcc was generating very poor code. For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="objdump"><span class="line"><span class="mh">0041a550</span><span class="w"> </span><span class="p">&lt;</span><span class="nf">_xmalloc</span><span class="p">&gt;:</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a550:</span><span class="w">	</span><span class="nf">e92d4800</span><span class="w"> 	</span><span class="no">push</span><span class="w">	</span><span class="p">{</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="no">lr</span><span class="p">}</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a554:</span><span class="w">	</span><span class="nf">e28db004</span><span class="w"> 	</span><span class="no">add</span><span class="w">	</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="no">sp</span><span class="p">,</span><span class="w"> </span><span class="mi">#4</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a558:</span><span class="w">	</span><span class="nf">e24dd028</span><span class="w"> 	</span><span class="no">sub</span><span class="w">	</span><span class="no">sp</span><span class="p">,</span><span class="w"> </span><span class="no">sp</span><span class="p">,</span><span class="w"> </span><span class="mi">#40</span><span class="w">	</span><span class="err">;</span><span class="w"> </span><span class="mi">0x28</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a55c:</span><span class="w">	</span><span class="nf">e50b0028</span><span class="w"> 	</span><span class="no">str</span><span class="w">	</span><span class="no">r0</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-40</span><span class="p">]</span><span class="w">	</span><span class="err">;</span><span class="w"> </span><span class="mi">0x28</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a560:</span><span class="w">	</span><span class="nf">e50b102c</span><span class="w"> 	</span><span class="no">str</span><span class="w">	</span><span class="no">r1</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-44</span><span class="p">]</span><span class="w">	</span><span class="err">;</span><span class="w"> </span><span class="mi">0x2c</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a564:</span><span class="w">	</span><span class="nf">e3a03000</span><span class="w"> 	</span><span class="no">mov</span><span class="w">	</span><span class="no">r3</span><span class="p">,</span><span class="w"> </span><span class="mi">#0</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a568:</span><span class="w">	</span><span class="nf">e50b300c</span><span class="w"> 	</span><span class="no">str</span><span class="w">	</span><span class="no">r3</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-12</span><span class="p">]</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a56c:</span><span class="w">	</span><span class="nf">e3a03010</span><span class="w"> 	</span><span class="no">mov</span><span class="w">	</span><span class="no">r3</span><span class="p">,</span><span class="w"> </span><span class="mi">#16</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a570:</span><span class="w">	</span><span class="nf">e50b3014</span><span class="w"> 	</span><span class="no">str</span><span class="w">	</span><span class="no">r3</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-20</span><span class="p">]</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a574:</span><span class="w">	</span><span class="nf">e51b002c</span><span class="w"> 	</span><span class="no">ldr</span><span class="w">	</span><span class="no">r0</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-44</span><span class="p">]</span><span class="w">	</span><span class="err">;</span><span class="w"> </span><span class="mi">0x2c</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a578:</span><span class="w">	</span><span class="nf">e3a01004</span><span class="w"> 	</span><span class="no">mov</span><span class="w">	</span><span class="no">r1</span><span class="p">,</span><span class="w"> </span><span class="mi">#4</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a57c:</span><span class="w">	</span><span class="nf">ebffff5d</span><span class="w"> 	</span><span class="no">bl</span><span class="w">	</span><span class="mh">41a2f8</span> <span class="p">&lt;</span><span class="no">align_up</span><span class="p">&gt;</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a580:</span><span class="w">	</span><span class="nf">e50b002c</span><span class="w"> 	</span><span class="no">str</span><span class="w">	</span><span class="no">r0</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-44</span><span class="p">]</span><span class="w">	</span><span class="err">;</span><span class="w"> </span><span class="mi">0x2c</span>
</span><span class="line"><span class="w">  </span><span class="nl">41a584:</span><span class="w">	</span><span class="nf">e51b002c</span><span class="w"> 	</span><span class="no">ldr</span><span class="w">	</span><span class="no">r0</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="no">fp</span><span class="p">,</span><span class="w"> </span><span class="mi">#-44</span><span class="p">]</span><span class="w">	</span><span class="err">;</span><span class="w"> </span><span class="mi">0x2c</span>
</span><span class="line"><span class="x">  ...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>gcc is using registers very inefficiently here.
For example, it stores <code>r1</code> to <code>[fp, #-44]</code> and then a few lines later loads from there into <code>r0</code>, when it could just have moved it directly.
The last two lines show it saving <code>r0</code> to the stack and then immediately loading it back again into the same register!</p>
<p>The fix here turned out to be simple.
Mini-OS by default compiles in debug mode with no optimisations.
Compiling with <code>debug=n</code> fixes this, and I <a href="https://github.com/mirage/mirage-xen-minios/pull/4">updated mirage-xen-minios</a> to do this.</p>
<table class="table"><thead><tr><th>Optimisations</th><th>TCP download speed</th></tr></thead><tbody><tr><td>none</td><td>6.92 MB/s</td></tr><tr><td>-O3</td><td>11.93 MB/s</td></tr></tbody></table><p>Even though Mirage is almost all OCaml, it does use Mini-OS's C functions for various low-level operations and these optimisations make a big difference!</p>
<h2 id="profiling-support">Profiling support</h2>
<p>The OCaml compiler provides a profiling option, which works the same way as gcc's <code>-pg</code> option for C code.
To enable it, you add <code>true: profile</code> to your <code>_tags</code> file and rebuild.</p>
<p>I decided to see what would happen if I enabled this for my Xen unikernel:</p>
<pre><code>_build/main.native.o: In function `caml_program':
:(.text+0x2): undefined reference to `__gnu_mcount_nc'
</code></pre>
<p>Profiling works by inserting a call to <code>__gnu_mcount_nc</code> at the start of every function.
It looks like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="objdump"><span class="line"><span class="mh">00000000</span><span class="w"> </span><span class="p">&lt;</span><span class="nf">caml_program</span><span class="p">&gt;:</span>
</span><span class="line"><span class="x">       0:       b500            push    {lr}</span>
</span><span class="line"><span class="x">       2:       f7ff fffe       bl      0 &lt;__gnu_mcount_nc&gt;</span>
</span><span class="line"><span class="x">       ...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>__gnu_mcount_nc</code> function gets the address of the callee function (<code>caml_program</code> in this example) in the link register (<code>lr</code>/<code>r14</code>) and the address of its caller on the stack (pushed by the code fragment above).
Normally, the profiler would use this information to build up a static call graph (saying which functions call which other functions).
Using a regular timer interrupt to sample the program counter it can estimate how much time was spent in each function,
and using the call graph it can show cumulative totals (time spent in each function plus time spent in its children).</p>
<p>I decided to start with something a bit simpler.
I wrote <a href="https://github.com/talex5/xen/blob/8f786348db50611e1251dc3ed505bdf5bb388fe9/extras/mini-os/arch/arm/arm32.S#L330">some ARM code</a> for <code>__gnu_mcount_nc</code> that simply writes the caller, callee and current time to a trace buffer (when the buffer is full, it stops tracing).
Ideally, I'd like to get notified each time we leave a function too.
gcc can do that for C code with its <code>-finstrument-functions</code> option, but I didn't see an option for that in OCaml.
Instead, I assume that every function runs until I see a call whose caller is not its parent.
This works surprisingly well, though it does mean that if a function seems to take a long time you need to check its parents too,
and it can be confused by recursive calls.
Also, for tail calls, we see the parent as the function we will return to rather than the function that actually called us.</p>
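<p>The analysis code isn't public, so the following is just one way the heuristic could be sketched.
It assumes the addresses have already been resolved to function names, and the <code>record</code> type and <code>spans</code> function are invented for this example.
Each new call pops frames off a stack until the caller is on top, and every popped frame is assumed to have returned at that moment:</p>
<pre><code class="ocaml">(* One trace record, with addresses already resolved to function names. *)
type record = { caller : string; callee : string; time : int }

(* Returns (function, entry time, assumed exit time) for each call that
   was seen to finish. Calls still on the stack at the end are dropped. *)
let spans records =
  (* Pop frames until the top of the stack is the caller of the new call. *)
  let rec unwind now caller stack finished =
    match stack with
    | (f, t0) :: rest when not (String.equal f caller) -&gt;
        unwind now caller rest ((f, t0, now) :: finished)
    | _ -&gt; (stack, finished)
  in
  let _unfinished, finished =
    List.fold_left (fun (stack, finished) r -&gt;
      let stack, finished = unwind r.time r.caller stack finished in
      ((r.callee, r.time) :: stack, finished)
    ) ([], []) records
  in
  List.rev finished
</code></pre>
<p>For the trace <code>main</code>→<code>a</code> at time 0, <code>a</code>→<code>b</code> at 10 and <code>main</code>→<code>c</code> at 25, this returns <code>[("b", 10, 25); ("a", 0, 25)]</code>.
Both <code>a</code> and <code>b</code> are charged up to time 25, which is why a function that looks slow may really be its parent doing the work.
A recursive call leaves two frames with the same name on the stack, so the inner one is never popped on its own.</p>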
<p>At the end, I dump out the trace buffer to the console with some OCaml code.
Back on my laptop, I wrote some code to parse this output and look up each address in the ELF image to get the corresponding function name.
(This code isn't public yet as it needs a lot of cleaning up.)</p>
<p>One thing I quickly discovered: compiling just the unikernel with profiling isn't sufficient.
As soon as you call a non-profiled function, the call graph can no longer be reconstructed and the results are useless.
I manually recompiled every C and OCaml library I was using with profiling, which was quite tedious.</p>
<p>Update: Thomas Gazagnaire has added an <a href="https://github.com/ocaml/opam-repository/pull/2479">OPAM profiling switch</a> which should make this much easier in future.</p>
<p>Initially, the trace buffer filled up almost instantly with calls to <code>stub_evtchn_test_and_clear</code>.
It seems that we call this once for each of the 4096 channels every time we look for work.
To avoid clutter, I reduced the number of event channels to 10 (this had no noticeable effect on performance).
I also tried removing the <code>memset</code> which zeroes out newly allocated IO pages.
This also made no difference.</p>
<p>I measured the overhead added by the tracing, both when compiled in but inactive and when actively writing to the trace buffer:</p>
<p><img src="/blog/images/mirage-profiling/tracing-overhead.png" class="border center"/></p>
<p>So, not too bad.</p>
<h2 id="profiling-the-console">Profiling the console</h2>
<p>The TCP code's trace was quite complicated, so I decided to start by profiling the much simpler console device,
which I'd noticed was surprisingly slow at dumping the trace results.</p>
<p>A Xen virtual console is a pair of ring buffers (one for input from the keyboard, one for output to the screen) in a shared memory page, defined like this:</p>
<figure class="code"><figcaption><span>console.h</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">struct</span><span class="w"> </span><span class="nc">xencons_interface</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="kt">char</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="mi">1024</span><span class="p">];</span>
</span><span class="line"><span class="w">    </span><span class="kt">char</span><span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="mi">2048</span><span class="p">];</span>
</span><span class="line"><span class="w">    </span><span class="n">XENCONS_RING_IDX</span><span class="w"> </span><span class="n">in_cons</span><span class="p">,</span><span class="w"> </span><span class="n">in_prod</span><span class="p">;</span>
</span><span class="line"><span class="w">    </span><span class="n">XENCONS_RING_IDX</span><span class="w"> </span><span class="n">out_cons</span><span class="p">,</span><span class="w"> </span><span class="n">out_prod</span><span class="p">;</span>
</span><span class="line"><span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We're only interested in the &quot;out&quot; side here.
The producer (i.e. our unikernel) writes the data to the buffer and advances the <code>out_prod</code> counter.
The consumer (<code>xenconsoled</code>, running in Dom0) reads the data and advances <code>out_cons</code>.
If the consumer catches up with the producer it sleeps until the producer notifies it there is more data.
If the producer catches up with the consumer (the buffer is full) it sleeps until the consumer notifies it there is space available again.</p>
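<p>Here's a rough model of the index arithmetic, as a sketch rather than the code Mirage or Xen actually use. It assumes the counters are free-running 32-bit values that are only reduced to a buffer position when the data is accessed. The names <code>used</code>, <code>free</code> and <code>slot</code> are invented for this example, and it needs a 64-bit OCaml so that the 32-bit mask fits in an <code>int</code>:</p>
<pre><code class="ocaml">(* Illustrative model of the "out" ring's counters; all names are hypothetical. *)
let ring_size = 2048                      (* size of [out] in xencons_interface *)

(* Assumption: the counters are 32 bits wide, so mask to model wrap-around. *)
let mask32 x = x land 0xFFFFFFFF

(* Bytes written by the producer but not yet read by the consumer. *)
let used ~prod ~cons = mask32 (prod - cons)

(* Bytes the producer may still write before it has to sleep. *)
let free ~prod ~cons = ring_size - used ~prod ~cons

(* Position in the buffer for a given counter value. *)
let slot idx = idx land (ring_size - 1)

let () =
  (* Empty ring: the consumer has caught up and would sleep. *)
  assert (used ~prod:100 ~cons:100 = 0);
  (* Full ring: the producer has caught up and would sleep. *)
  assert (free ~prod:(100 + ring_size) ~cons:100 = 0);
  (* The counters keep counting past the end; only the slot wraps. *)
  assert (slot 2050 = 2);
  (* Still correct when prod has wrapped past 2^32 but cons has not. *)
  assert (used ~prod:5 ~cons:0xFFFFFFFB = 10)
</code></pre>
<p>Keeping the counters free-running is what lets "empty" (difference of zero) be told apart from "full" (difference equal to the ring size).</p>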
<p>Here's my console test-case - writing a string to the console in a loop:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span><span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">String</span><span class="p">.</span><span class="n">make</span> <span class="o">(</span><span class="n">len</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span> <span class="sc">&#39;X&#39;</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">iterations</span> <span class="o">=</span> <span class="mi">10000</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">len</span> <span class="o">*</span> <span class="n">iterations</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="k">rec</span> <span class="n">loop</span> <span class="o">=</span> <span class="k">function</span>
</span><span class="line">      <span class="o">|</span> <span class="mi">0</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">      <span class="o">|</span> <span class="n">i</span> <span class="o">-&gt;</span>
</span><span class="line">          <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="n">msg</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="n">loop</span> <span class="o">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">    <span class="n">loop</span> <span class="mi">1000</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>  <span class="c">(* Warm up *)</span>
</span><span class="line">    <span class="k">let</span> <span class="n">t0</span> <span class="o">=</span> <span class="nn">Clock</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="n">lwt</span> <span class="bp">()</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">iterations</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">time</span> <span class="o">=</span> <span class="nn">Clock</span><span class="p">.</span><span class="n">time</span> <span class="bp">()</span> <span class="o">-.</span> <span class="n">t0</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="o">(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;Wrote %d bytes in %.3f seconds (%.2f KB/s)&quot;</span>
</span><span class="line">      <span class="n">bytes</span> <span class="n">time</span>
</span><span class="line">      <span class="o">(</span><span class="n">float_of_int</span> <span class="n">bytes</span> <span class="o">/.</span> <span class="n">time</span> <span class="o">/.</span> <span class="mi">1024</span><span class="o">.))</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">OS</span><span class="p">.</span><span class="nn">Time</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">2</span><span class="o">.</span><span class="mi">0</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I got around 45 KB/s. The trace output looked like this:</p>
<table class="table"><thead><tr><th style="text-align: right"> Start  </th><th> Function </th><th style="text-align: right"> Duration</th></tr></thead><tbody><tr><td style="text-align: right"> 118684 </td><td> - - <code>camlUnikernel__loop_1270</code> </td><td style="text-align: right"> 219</td></tr><tr><td style="text-align: right"> 118730 </td><td> - - &gt; <code>camlConsole__write_all_low_1123</code> </td><td style="text-align: right"> 158</td></tr><tr><td style="text-align: right"> 118737 </td><td> - - &gt; - <code>camlRing__repeat_1279</code> </td><td style="text-align: right"> 89</td></tr><tr><td style="text-align: right"> 118739 </td><td> - - &gt; - - <code>camlRing__write_1270</code> </td><td style="text-align: right"> 87</td></tr><tr><td style="text-align: right"> 118830 </td><td> - - &gt; - <code>camlEventchn__fun_1138</code> </td><td style="text-align: right"> 58</td></tr><tr><td style="text-align: right"> 118831 </td><td> - - &gt; - - <code>stub_evtchn_notify</code> </td><td style="text-align: right"> 57</td></tr><tr><td style="text-align: right"> 118907 </td><td> - - <code>camlUnikernel__loop_1270</code> </td><td style="text-align: right"> 211</td></tr><tr><td style="text-align: right"> 118944 </td><td> - - &gt; <code>camlConsole__write_all_low_1123</code> </td><td style="text-align: right"> 168</td></tr><tr><td style="text-align: right"> 118951 </td><td> - - &gt; - <code>camlRing__repeat_1279</code> </td><td style="text-align: right"> 100</td></tr><tr><td style="text-align: right"> 118953 </td><td> - - &gt; - - <code>camlRing__write_1270</code> </td><td style="text-align: right"> 98</td></tr><tr><td style="text-align: right"> 119054 </td><td> - - &gt; - <code>camlEventchn__fun_1138</code> </td><td style="text-align: right"> 58</td></tr><tr><td style="text-align: right"> 119055 </td><td> - - &gt; - - <code>stub_evtchn_notify</code> </td><td style="text-align: right"> 57</td></tr></tbody></table><p>The start time and duration are measured in counter 
clock ticks, and the counter is running at 24 MHz.
The <code>--&gt;--&gt;--&gt;</code> indicates the level of nesting (I vary the character to make it easier to scan vertically with the eye).
The output shows two iterations of the loop taken from the middle of the sample.
To make the output more readable, my analysis script prunes the tree at calls that took less than 50 ticks, and removes calls to the Lwt library (while still showing the functions they called as a result).
The durations include the times for their children (including pruned children).</p>
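<p>The pruning rule is simple enough to sketch. This is not the actual analysis script: the <code>call</code> type and the <code>prune</code> function are invented for illustration, and the Lwt filtering is left out. A call below the threshold is dropped together with its children, but its time still counts towards its parent's duration:</p>
<pre><code class="ocaml">(* Hypothetical call-tree type: each duration already includes the children. *)
type call = { name : string; duration : int; children : call list }

(* Drop every call shorter than [threshold] ticks, along with its subtree.
   Parent durations are left untouched, so they still include pruned children. *)
let rec prune ~threshold call =
  let keep c = c.duration &gt;= threshold in
  { call with
    children = List.map (prune ~threshold) (List.filter keep call.children) }

(* Durations taken from the first iteration in the table above, plus one
   made-up short call that should disappear. *)
let example =
  { name = "write_all_low"; duration = 158; children = [
      { name = "repeat"; duration = 89; children = [
          { name = "write"; duration = 87; children = [] } ] };
      { name = "short_helper"; duration = 7; children = [] };
      { name = "evtchn_notify"; duration = 58; children = [] } ] }

let () =
  let pruned = prune ~threshold:50 example in
  assert (List.map (fun c -&gt; c.name) pruned.children = ["repeat"; "evtchn_notify"]);
  assert (pruned.duration = 158)
</code></pre>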
<p>You can see that on each iteration we call <code>Console.write_all_low</code>, which writes the string to the shared memory ring and notifies the console daemon in Dom0.
Each iteration is taking roughly 200 ticks, which is about 8 us per iteration.
So we'd expect the speed to be around 6 bytes / 8 us, which is about 700 KB/s.</p>
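<p>As a quick check of that arithmetic (the figure of 200 ticks is only read off the trace by eye, so treat the result as an estimate):</p>
<pre><code class="ocaml">(* Back-of-the-envelope check of the figures above. *)
let counter_hz = 24e6           (* the trace counter runs at 24 MHz *)
let ticks_per_iteration = 200.  (* rough value from the trace *)
let bytes_per_iteration = 6.    (* [len] in the test case *)

let seconds_per_iteration = ticks_per_iteration /. counter_hz
let bytes_per_second = bytes_per_iteration /. seconds_per_iteration

let () =
  Printf.printf "%.1f us per iteration, %.0f KB/s\n"
    (seconds_per_iteration *. 1e6)
    (bytes_per_second /. 1024.)
</code></pre>
<p>This prints <code>8.3 us per iteration, 703 KB/s</code>, more than ten times the speed actually measured.</p>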
<p>Looking at the cumulative time spent in each function, the top entries are:</p>
<table class="table"><thead><tr><th> Function			</th><th> Ticks (at 24 MHz)</th></tr></thead><tbody><tr><td> <code>caml_c_call</code>    		</td><td> 9002738</td></tr><tr><td> <code>caml_block_domain</code> 	</td><td> 9001576</td></tr><tr><td> <code>block_domain</code> 		</td><td> 9001462</td></tr><tr><td> <code>camlUnikernel__loop_1270</code> 	</td><td> 374418</td></tr><tr><td> <code>camlConsole__write_all_low_1123</code> </td><td> 298735</td></tr></tbody></table><p>Note: the trace only includes calls until the trace buffer was full, so these aren't the total times for the whole run.
But we can immediately see that we spent most of the time in <code>block_domain</code>, which is what Mirage calls when it has nothing to do and is waiting for an external event.
Here's a graph showing how many iterations of the test loop we had started over time:</p>
<p><img src="/blog/images/mirage-profiling/xen-console-messages.png" class="border center"/></p>
<p>So, we wrote 679 messages very quickly, then waited a long time, then wrote 1027 more, then waited again, etc.
I thought there might be a bug in <code>block_domain</code> causing it to miss a wake-up event, so I limited the time it would spend blocking.
It didn't make any difference; it would keep waking up, seeing that it had nothing to do, and going back to sleep again.</p>
<p>In case the problem was with Mirage's implementation of the shared rings or console device,
I tried writing the same test directly in C in Mini-OS's <code>test.c</code> and got the same result (I had to modify it slightly because by default Mini-OS's <code>console_print</code> discards data when the buffer is full instead of waiting).
Finally, I tried it from a Linux guest and got 25 KB/s (interestingly, Linux uses 100% CPU while doing this).
The times were highly variable (each point on this plot is from writing the message 10,000 times and calculating the average):</p>
<p><img src="/blog/images/mirage-profiling/xen-console-speed.png" class="border center"/></p>
<p>After some investigation, it turned out that Xen was deliberately limiting the rate:</p>
<figure class="code"><figcaption><span>xen/tools/console/daemon/io.c</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="cm">/* How many events are allowed in each time period */</span>
</span><span class="line"><span class="cp">#define RATE_LIMIT_ALLOWANCE 30</span>
</span><span class="line"><span class="cm">/* Duration of each time period in ms */</span>
</span><span class="line"><span class="cp">#define RATE_LIMIT_PERIOD 200</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Mystery solved, although I don't know why the rates are so variable.
Mirage wasn't doing anything except running the test case and Linux was booted with <code>init=/bin/bash</code>, so there was nothing else running there either.</p>
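<p>To get a feel for what those two constants mean, here is a minimal sketch of an allowance-per-period limiter. It is a simplification for illustration, not the logic in <code>io.c</code>: the <code>limiter</code> type and <code>allow</code> function are made up, and the time is passed in explicitly (in ms) to keep the example self-contained.</p>
<pre><code class="ocaml">let rate_limit_allowance = 30   (* events allowed in each period *)
let rate_limit_period = 200     (* length of each period, in ms *)

(* Hypothetical limiter state: when the current period began and how many
   events it has seen so far. *)
type limiter = { mutable period_start : int; mutable events : int }

(* [allow t ~now] returns true if an event at time [now] (ms) may proceed. *)
let allow t ~now =
  if now - t.period_start &gt;= rate_limit_period then begin
    (* A new period has begun: reset the allowance. *)
    t.period_start &lt;- now;
    t.events &lt;- 0
  end;
  if t.events &lt; rate_limit_allowance then begin
    t.events &lt;- t.events + 1;
    true
  end else
    false

let () =
  let t = { period_start = 0; events = 0 } in
  (* Offer one event per millisecond for a second and count how many pass. *)
  let passed = ref 0 in
  for now = 0 to 999 do
    if allow t ~now then incr passed
  done;
  Printf.printf "%d of 1000 events allowed\n" !passed
</code></pre>
<p>This prints <code>150 of 1000 events allowed</code>: five periods of 30 events each. A limiter of this general shape would produce a burst of activity at the start of each period followed by a stall, which is at least consistent with the shape of the graph above.</p>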
<p>Lessons:</p>
<ul>
<li>Just scrolling through the raw trace can be misleading. It appears that it keeps calling the loop function, but in fact almost all the time was spent in the two very rare <code>block_domain</code> calls. Graphing iterations over time can show these problems effectively.
</li>
<li>Compare with the speed on Linux. Sometimes, Xen really is that slow and it's not our fault.
</li>
<li>Compile everything for profiling or the results aren't much use.
</li>
</ul>
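<p>If a graph isn't handy, the first lesson can also be applied mechanically by looking for unusually long gaps between iterations. A small sketch, with a hypothetical <code>long_gaps</code> helper and made-up timestamps rather than data from the real trace:</p>
<pre><code class="ocaml">(* Given the start tick of each loop iteration, return the gaps longer than
   [threshold] ticks as (iteration index, gap length) pairs. *)
let long_gaps ~threshold starts =
  let rec scan i acc = function
    | a :: (b :: _ as rest) -&gt;
        let gap = b - a in
        scan (i + 1) (if gap &gt; threshold then (i, gap) :: acc else acc) rest
    | _ -&gt; List.rev acc
  in
  scan 0 [] starts

let () =
  (* Made-up start times: steady iterations of about 200 ticks, one long stall. *)
  let starts = [0; 210; 430; 640; 4_000_640; 4_000_850] in
  assert (long_gaps ~threshold:10_000 starts = [(3, 4_000_000)])
</code></pre>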
<h2 id="profiling-udp">Profiling UDP</h2>
<p>TCP involves ack packets, expanding windows and other complications, so I next looked at the simpler <a href="http://en.wikipedia.org/wiki/User_Datagram_Protocol">UDP</a> protocol.
Here, we can throw packets out continuously without worrying about the other end.</p>
<p>With a payload size of 1476 bytes (the maximum possible for UDP), I got 17 MB/s.
All packets were successfully received on my laptop.
My Linux guest got 13.4 MB/s with <code>nc -u &lt; /dev/zero</code>, so we're actually faster!</p>
<p>Here's a sample iteration from the trace:</p>
<table class="table"><thead><tr><th style="text-align: right"> Start  	</th><th> Function 				</th><th style="text-align: right"> Duration</th></tr></thead><tbody><tr><td style="text-align: right"> 243418	</td><td> - - &gt; - <code>camlUnikernel__loop_1287</code>	</td><td style="text-align: right"> 938</td></tr><tr><td style="text-align: right"> 243436	</td><td> - - &gt; - - <code>camlUdpv4__fun_1430</code>	</td><td style="text-align: right"> 266</td></tr><tr><td style="text-align: right"> 243439	</td><td> - - &gt; - - &gt; <code>camlIpv4__allocate_frame_1369</code>	</td><td style="text-align: right"> 190</td></tr><tr><td style="text-align: right"> 243440	</td><td> - - &gt; - - &gt; - <code>camlIo_page__get_1122</code>	</td><td style="text-align: right"> 72</td></tr><tr><td style="text-align: right"> 243632	</td><td> - - &gt; - - &gt; <code>camlIpv4__fun_1509</code>	</td><td style="text-align: right"> 69</td></tr><tr><td style="text-align: right"> 243818	</td><td> - - &gt; - - <code>camlNetif__fun_2893</code>	</td><td style="text-align: right"> 185</td></tr><tr><td style="text-align: right"> 243852	</td><td> - - &gt; - - &gt; <code>camlNetif__fun_2618</code>	</td><td style="text-align: right"> 103</td></tr><tr><td style="text-align: right"> 243895	</td><td> - - &gt; - - &gt; - <code>camlLwt_ring__fun_1223</code> </td><td style="text-align: right"> 58</td></tr><tr><td style="text-align: right"> 244007	</td><td> - - &gt; - - <code>camlNetif__fun_2931</code>	</td><td style="text-align: right"> 132</td></tr><tr><td style="text-align: right"> 244009	</td><td> - - &gt; - - &gt; <code>camlNetif__xmit_1509</code>	</td><td style="text-align: right"> 123</td></tr><tr><td style="text-align: right"> 244028	</td><td> - - &gt; - - &gt; - <code>camlNetif__fun_2618</code>	</td><td style="text-align: right"> 76</td></tr><tr><td style="text-align: right"> 244141	</td><td> - - &gt; - - <code>camlNetif__fun_2958</code>	</td><td style="text-align: right"> 177</td></tr><tr><td 
style="text-align: right"> 244144	</td><td> - - &gt; - - &gt; <code>camlRing__push_requests_and_check_notify_1112</code>	</td><td style="text-align: right"> 174</td></tr><tr><td style="text-align: right"> 244150	</td><td> - - &gt; - - &gt; - <code>camlRing__sring_push_requests_1070</code>	</td><td style="text-align: right"> 158</td></tr><tr><td style="text-align: right"> 244151	</td><td> - - &gt; - - &gt; - - <code>caml_memory_barrier</code>	</td><td style="text-align: right"> 77</td></tr><tr><td style="text-align: right"> 244231	</td><td> - - &gt; - - &gt; - - <code>caml_memory_barrier</code>	</td><td style="text-align: right"> 77</td></tr></tbody></table><p>Here's a graph of loop iterations (packets sent) over time (each blue dot is one packet sent):</p>
<p><img src="/blog/images/mirage-profiling/udp-packets.png" class="border center"/></p>
<p>The gaps indicate places where we were not sending packets.
The garbage collector shows up twice in the trace (both times in <code>Ring.ack_responses</code> oddly).
However, we spend more time in <code>block_domain</code> than doing GC, indicating that we're often waiting for Xen.
Looking at the trace just before it blocks, I see calls to <code>Netif.wait_for_free_tx</code>, which seems reasonable.</p>
<h2 id="profiling-tcp">Profiling TCP</h2>
<p>The TCP header is larger than the UDP one, making it less efficient even in the best case,
and TCP needs to process acks, keep track of window sizes, and handle retransmissions.
Strange, then, that the Linux guest manages 39 MB/s over TCP compared with just 13.4 MB/s for UDP!
(Even stranger, I got 47.2 MB/s for Linux when I tried it for last
month's post; however, I am using a different version of Linux in dom0 now.)</p>
<p>I captured some packets sent by the Linux guest using <code>tshark</code> running in Dom0.
Loading the capture into Wireshark on my laptop, I saw that all the TCP checksums were wrong, so it
looks like Linux is using TCP checksum offloading.</p>
<p>According to <a href="http://lists.xen.org/archives/html/xen-devel/2013-12/msg00884.html">Question about TCP checksum offload in Xen</a>:</p>
<blockquote>
<p>A domain has no way of knowing how any given packet is going to leave the host (or even if it is) so it can't know ahead of time whether to calculate any checksums: the skb's [socket buffers] are just marked with &quot;checksum needed&quot; as usual and either the egress NIC will do the job or dom0 will do it.</p>
</blockquote>
<p>Getting this working on Mirage was a bit tricky.
The TCP layer can avoid adding the checksum only if the network device says it's capable of doing it itself, and packets have to be flagged as needing the checksum.
You can't just flag all packets because the Linux dom0 silently drops non-TCP/UDP packets with it set (e.g. ARP packets).
I hacked something together and got a modest speed improvement.</p>
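<p>The rule described above boils down to something like the following. This is only a sketch of the decision, not the change that was actually made: the <code>payload</code> type and the function name are hypothetical and don't correspond to Mirage's real types.</p>
<pre><code class="ocaml">(* Hypothetical packet classification, for illustration only. *)
type payload = Tcp | Udp | Arp | Icmp

(* Flag a packet as needing its checksum filled in later only when the
   device says it supports that AND the packet is TCP or UDP. Everything
   else must be sent without the flag, or dom0 drops it. *)
let request_checksum_offload ~device_can_offload = function
  | Tcp | Udp -&gt; device_can_offload
  | Arp | Icmp -&gt; false

let () =
  assert (request_checksum_offload ~device_can_offload:true Tcp);
  assert (not (request_checksum_offload ~device_can_offload:true Arp));
  assert (not (request_checksum_offload ~device_can_offload:false Udp))
</code></pre>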
<p>Here's a graph for the TCP test, where each iteration of the loop is sending one TCP packet (segment):</p>
<p><img src="/blog/images/mirage-profiling/tcp-packets.png" class="border center"/></p>
<p>Note: we send many warm up packets before starting the trace as TCP starts slowly (which looks pretty but isn't relevant here).</p>
<p>Zooming in, the picture is quite interesting (where it had gaps, I searched for a typical function that occurred in the gap and added a symbol for it):</p>
<p><img src="/blog/images/mirage-profiling/tcp-packets-zoom.png" class="border center"/></p>
<p>It looks like we start by transmitting packets steadily, until the current window is full.
Then we start buffering the packets instead of sending them, which is very fast.
At some point the TCP system stops accepting more data, which causes the main loop to block, allowing us to process other events.
<code>rx_poll response</code> indicates one iteration of the <code>Netif.rx_poll</code> loop, which seems to be dealing with acks from Xen saying that our packets have been transmitted (and the memory can therefore be recycled).
After a while, the TCP ack packets arrive and we process them, which opens up the transmit window again.
Then we send out the buffered packets, before returning to the main loop.</p>
<p>So, in each cycle we spend about 60% of the time transmitting packets, a quarter dealing with acks from Xen and the rest handling TCP acks from the remote host.
It might be possible to optimise things a bit here by reusing grant references, but I didn't investigate further.</p>
<h2 id="profiling-disk-access">Profiling disk access</h2>
<p>My next test case reads a series of sectors sequentially from the disk and then writes them. Reading or writing one sector (4096 bytes) at a time was very slow (2.7 MB/s read, 0.7 MB/s write).
Using larger buffers, so that we transfer more in each operation, helped but even at 64 sectors per op I only got 12.3 MB/s read / 5.12 MB/s write (the device is capable of 20 MB/s read and 10 MB/s write).
Here's a trace where we read using 32-sector buffers (10.9 MB/s):</p>
<p><img src="/blog/images/mirage-profiling/block-reads-1-32.png" class="border center"/></p>
<p>We spend a lot of time waiting for each block to arrive, although there are some curious ack messages, which we deal with quickly.
What if we have two requests in flight at once?
This gets us 18.27 MB/s:</p>
<p><img src="/blog/images/mirage-profiling/block-reads-2-32.png" class="border center"/></p>
<p>Strangely, the two blocks arrive close together.
Although it takes us longer to get the first one (I don't know why), we get them more quickly after that.
Having three requests in flight doesn't help though (18.25 MB/s):</p>
<p><img src="/blog/images/mirage-profiling/block-reads-3-32.png" class="border center"/></p>
<p>Looking at the block driver code, it batches requests into groups of 11. This probably explains why 32 sectors-per-read did well - it's very close to 33.</p>
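<p>To see why 32 and 33 behave so similarly, here's a sketch that assumes a read is simply cut into consecutive groups of at most 11 sectors. That is one reading of the batching described above, not a quote from the driver, and <code>split</code> is a made-up helper:</p>
<pre><code class="ocaml">let max_per_group = 11   (* group size, as described above *)

(* Split a read of [n] sectors into groups of at most 11 sectors. *)
let split n =
  let rec go n acc =
    if n &lt;= 0 then List.rev acc
    else
      let k = min n max_per_group in
      go (n - k) (k :: acc)
  in
  go n []

let () =
  assert (split 32 = [11; 11; 10]);   (* three groups, the last nearly full *)
  assert (split 33 = [11; 11; 11]);   (* three full groups *)
  assert (split 64 = [11; 11; 11; 11; 11; 9])
</code></pre>
<p>Under that assumption, a 32-sector read needs the same three groups as a 33-sector one, with just one sector of capacity unused.</p>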
<p>For writing, the number of requests in flight makes little difference, but writing 8 sectors in each request is by far the best (7 MB/s).</p>
<p>I don't understand why we're not getting the full speed of the card here, since we're spending most of the time blocking.
However, we are pretty close (18r/7w out of a possible 20r/10w), which is good enough for today.</p>
<h3 id="update-linux-is-slow-too">Update: Linux is slow too!</h3>
<p>I originally tested with <code>hdparm</code>, which reports about 20 MB/s as expected:</p>
<pre><code>$ hdparm -t /dev/mmcblk0
 Timing buffered disk reads:  62 MB in  3.07 seconds =  20.21 MB/sec
</code></pre>
<p>But testing with <code>dd</code>, I don't get this speed.
<code>dd</code>'s speed seems to depend a lot on the block size. Using <code>4096 * 11</code> bytes (which I assume is what dom0 would do in response to a single guest request), I get just 16.9 MB/s:</p>
<pre><code>$ dd iflag=direct if=/dev/vg0/bench of=/dev/null bs=45056 count=1000
1000+0 records in
1000+0 records out
45056000 bytes (45 MB) copied, 2.65911 s, 16.9 MB/s
</code></pre>
<table class="table"><thead><tr><th style="text-align: right"> Block size (pages)  </th><th style="text-align: right">  Linux dom0   </th><th style="text-align: right"> Linux domU</th></tr></thead><tbody><tr><td style="text-align: right"> 11  	        </td><td style="text-align: right"> 17.0 MB/s     </td><td style="text-align: right"> 14.5 MB/s</td></tr><tr><td style="text-align: right"> 16		        </td><td style="text-align: right"> 18.8 MB/s     </td><td style="text-align: right"> 16.3 MB/s</td></tr><tr><td style="text-align: right"> 32		        </td><td style="text-align: right"> 20.8 MB/s     </td><td style="text-align: right"> 18.6 MB/s</td></tr></tbody></table><p>So perhaps Mirage is doing pretty well already - it's about as fast as the Linux guest.
Xen seems to be the limiting factor here, because it doesn't allow us to make large enough requests.</p>
<h2 id="profiling-the-queuing-service">Profiling the queuing service</h2>
<p>Finally, I looked at applying all this new information to my queuing service.
As a baseline, <code>wget</code> reports that I can currently download from it at 4.6 MB/s, with profiling compiled in but disabled:</p>
<p><img src="/blog/images/mirage-profiling/http-download1.png" class="border center"/></p>
<p>There's some complicated copying going on because we're using the HTTP Chunked encoding, which writes the size of each chunk of data followed by the data itself, then the next chunk, etc.
Since we know the length at the start, we can use the simpler Fixed encoding.
This increases the speed to 5.2 MB/s.
It's a shame the HTTP API uses strings everywhere: we have to copy the data from the disk buffer to a string on the heap to give it to the HTTP API, which then copies it back into a new buffer to send it to the network card.
If it took a stream of buffers, we could just pass them straight through.</p>
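<p>For comparison, here is roughly what the two encodings look like on the wire. This is a standalone sketch using only the standard library, not the HTTP library's code, and the function names <code>chunked</code> and <code>fixed</code> are invented for the example. Each function returns the relevant header followed by the body:</p>
<pre><code class="ocaml">(* Chunked: each chunk is prefixed by its size in hex, and a zero-size chunk
   marks the end of the body. *)
let chunked chunks =
  let b = Buffer.create 64 in
  Buffer.add_string b "Transfer-Encoding: chunked\r\n\r\n";
  List.iter (fun c -&gt;
    (* An empty chunk would look like the terminator, so skip it. *)
    if c &lt;&gt; "" then Printf.bprintf b "%x\r\n%s\r\n" (String.length c) c
  ) chunks;
  Buffer.add_string b "0\r\n\r\n";
  Buffer.contents b

(* Fixed: the total length goes in a header and the data follows untouched. *)
let fixed chunks =
  let body = String.concat "" chunks in
  Printf.sprintf "Content-Length: %d\r\n\r\n%s" (String.length body) body

let () =
  let chunks = ["hello "; "unikernel"] in
  assert (chunked chunks =
    "Transfer-Encoding: chunked\r\n\r\n6\r\nhello \r\n9\r\nunikernel\r\n0\r\n\r\n");
  assert (fixed chunks = "Content-Length: 15\r\n\r\nhello unikernel")
</code></pre>
<p>With the fixed form there is nothing to interleave with the data, which is why it avoids some of the copying.</p>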
<p>Finally, I added the read-ahead support from the block profiling above, which increased the speed to 6.8 MB/s.
Here's the new graph, showing that we're sending packets much faster (note the change in the Y-scale):</p>
<p><img src="/blog/images/mirage-profiling/http-download2.png" class="border center"/></p>
<p>I used a queue length of 5, with 33 sectors per request. I tried increasing it to 10, but that caused more GC work.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Even the unoptimised service is faster than my current (ADSL) Internet connection, so optimising it isn't currently necessary, but it's interesting to look at performance and get a feel for where the bottlenecks are.</p>
<p>Mirage doesn't have any specific profiling support, but the fact that the whole OS is a single executable makes profiling it quite easy.
OCaml's <code>profile</code> option isn't a perfect fit for tracing because it doesn't record when a function finishes, but you can still get useful results from it.
Graphing some metric (e.g. packets sent) over time seemed the most useful way to look at the data.
I'm currently just using LibreOffice's chart tool, but I should probably find something more suitable.
It would be great to be able to zoom in easily, show durations (not just events), filter the trace display easily, etc.
I'd also like support for following Lwt threads even when they block.
Recommendations for good visualisation tools welcome!</p>
<p>Writing to the Xen console from Mirage is slow because <code>xenconsoled</code> rate limits us. Mirage still gets better performance than Linux though, and uses far less CPU (looks like Linux is just spinning). My UDP test kernel sent data faster than Linux's <code>nc</code> utility (probably because <code>nc</code> made a poor choice of payload size). Linux does very well on TCP. I don't know why it's so fast. Using Xen's TCP checksum offloading does help a bit though. SD card performance on Mirage is close to what the hardware supports when I choose the right request size and keep two requests in flight at once. It's surprising we don't manage the full speed, though. For networking and disk access, managing Xen's grant refs for the shared memory pages seems to take up a lot of time - maybe there are ways to optimise that.</p>
<p>With a few modifications (TCP checksum offload, HTTP fixed encoding, keeping multiple disk reads in flight and using optimal buffer sizes), I increased the download speed of my test service running on my ARM dev board from 2.46 MB/s to 7.24 MB/s (when compiled without profiling).
I'm sure people more familiar with Mirage will have more suggestions.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">My first unikernel</title>
    <link href="https://roscidus.com/blog/blog/2014/07/28/my-first-unikernel/"></link>
    <updated>2014-07-28T16:21:10+00:00</updated>
    <id>https://roscidus.com/blog/blog/2014/07/28/my-first-unikernel</id>
    <content type="html"><![CDATA[<p>I wanted to make a simple REST service for queuing file uploads, deployable as a virtual machine. The traditional way to do this is to download a Linux cloud image, install the software inside it, and deploy that. Instead I decided to try a <a href="http://queue.acm.org/detail.cfm?id=2566628">unikernel</a>.</p>
<p>Unikernels promise some interesting benefits. The Ubuntu 14.04 amd64-disk1.img cloud image is 243 MB unconfigured, while the unikernel ended up at just 5.2 MB (running the queue service). Ubuntu runs a large amount of C code in security-critical places, while the unikernel is almost entirely type-safe OCaml. And besides, trying new things is fun.</p>
<!-- more -->
<p>(This post also appeared on <a href="http://www.reddit.com/r/programming/comments/2c1soi/my_first_unikernel_created_in_ocaml_and_mirage/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=8109485">Hacker News</a>.)</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
<ul>
<li><a href="#a-hello-world-kernel">A hello world kernel</a>
</li>
<li><a href="#using-mirage-libraries">Using Mirage libraries</a>
</li>
<li><a href="#the-mirage-unix-libraries">The mirage-unix libraries</a>
</li>
<li><a href="#the-mirage-tool">The <code>mirage</code> tool</a>
</li>
</ul>
</li>
<li><a href="#test-case">Test case</a>
<ul>
<li><a href="#storage">Storage</a>
</li>
<li><a href="#implementation">Implementation</a>
</li>
<li><a href="#unit-testing-the-storage-system">Unit-testing the storage system</a>
</li>
<li><a href="#the-http-server">The HTTP server</a>
</li>
<li><a href="#buffered-reads">Buffered reads</a>
</li>
<li><a href="#streaming-uploads">Streaming uploads</a>
</li>
<li><a href="#buffered-writes">Buffered writes</a>
</li>
<li><a href="#upload-speed-on-xen">Upload speed on Xen</a>
</li>
<li><a href="#tcp-retransmissions">TCP retransmissions</a>
</li>
<li><a href="#adding-a-block-cache">Adding a block cache</a>
</li>
<li><a href="#replacing-fat">Replacing FAT</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>Regular readers will know that a few months ago I began a new job at Cambridge University.
Working for an author of Real World OCaml and leader of OCaml Labs, on a project building pure-OCaml distributed systems, who found me through my blog posts about learning OCaml, I thought they might want me to write some OCaml.</p>
<p>But no.
They've actually had me <a href="http://openmirage.org/blog/introducing-xen-minios-arm">porting the tiny Mini-OS kernel to ARM</a>, using a mixture of C and assembler, to let <a href="http://openmirage.org/">the Mirage unikernel</a> run on ARM devices.
Of course, I got curious and wanted to write a Mirage application for myself...</p>
<h2 id="introduction">Introduction</h2>
<p>Linux, like many popular operating systems, is a <em>multi-user</em> system.
This design dates back to the early days of computing, when a single expensive computer, running a single OS, would be shared between many users.
The goal of the kernel is to protect itself from its users, and to protect the users from each other.</p>
<p>Today, computers are cheap and many people own several.
Even when a physical computer is shared (e.g. in cloud computing), this is typically done by running multiple virtual machines, each serving a single user.
Here, protecting the OS from its (usually single) application is pointless.</p>
<p>Removing the security barrier between the kernel and the application greatly simplifies things;
we can run the whole system (kernel + application) as a single, privileged, executable - a <em>unikernel</em>.</p>
<p>And while we're rewriting everything anyway, we might as well replace C with a modern memory safe language, eliminating whole classes of bugs and security vulnerabilities, allowing decent error reporting, and providing structured data types throughout.</p>
<p>In the past, two things have made writing a completely new OS impractical:</p>
<ul>
<li>Legacy applications won't run on it.
</li>
<li>It probably won't support your hardware.
</li>
</ul>
<p>Virtualisation removes both obstacles:
legacy applications can run in their own legacy VMs, and
drivers are only needed for the virtual devices - e.g.
a single network driver and a single block driver will cover all real network cards and hard drives.</p>
<h3 id="a-hello-world-kernel">A hello world kernel</h3>
<p>The <a href="http://openmirage.org/wiki/install">mirage tutorial</a> starts by showing the easy, fully-automated way to build a unikernel.
If you want to get started quickly you may prefer to read that and skip this section, but since one of the advantages of unikernels is their relative simplicity, let's do things the &quot;hard&quot; way first to understand how it works behind the scenes.</p>
<p>Here's the normal &quot;hello world&quot; program in OCaml:</p>
<figure class="code"><figcaption><span>hw.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">print_endline</span> <span class="s2">&quot;Hello, world!&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>To compile and run as a normal application, we'd do:</p>
<pre><code>$ ocamlopt hw.ml -o hw
$ ./hw 
Hello, world!
</code></pre>
<p>How can we make a unikernel that does the equivalent?
As it turns out, the above code works unmodified (though the Mirage people might frown at you for doing it this way).
We compile <code>hw.ml</code> to a <code>hw.native.o</code> file and then link with the unikernel libraries instead of the standard C library:</p>
<pre><code>$ export OPAM_DIR=$(opam config var prefix)
$ export PKG_CONFIG_PATH=$OPAM_DIR/lib/pkgconfig
$ ocamlopt -output-obj -o hw.native.o hw.ml
$ ld -d -static -nostdlib --start-group \
    $(pkg-config --static --libs openlibm libminios-xen) \
    hw.native.o \
    $OPAM_DIR/lib/mirage-xen/libocaml.a \
    $OPAM_DIR/lib/mirage-xen/libxencaml.a \
    --end-group \
    $(gcc -print-libgcc-file-name) \
    -o hw.xen
</code></pre>
<p>We now have a kernel image, <code>hw.xen</code>, which can be booted as a VM under the <a href="http://www.xenproject.org/developers/teams/hypervisor.html">Xen hypervisor</a> (as used by Amazon, Rackspace, etc to host VMs). But first, let's look at the libraries we added:</p>
<dl><dt>openlibm</dt>
<dd>
This is a standard maths library. It provides functions such as <code>sin</code>, <code>cos</code>, etc.
</dd>
<dt>libminios-xen</dt>
<dd>
This provides the architecture-specific boot code, a <code>printk</code> function for debugging, <code>malloc</code> for allocating memory and some low-level functions for talking to Xen.
</dd>
<dt>libocaml.a</dt>
<dd>
The OCaml runtime (the garbage collector, etc).
</dd>
<dt>libxencaml.a</dt>
<dd>
OCaml bindings for libminios and some boot code.
</dd>
<dt>libgcc.a</dt>
<dd>
Support functions for code that gcc generates (actually, not needed on x86).
</dd>
</dl>
<p>To deploy the new unikernel, we create a Xen configuration file for it (here, I'm giving it 16 MB of RAM):</p>
<figure class="code"><figcaption><span>hw.xl</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="n">name</span> <span class="o">=</span> <span class="s1">&#39;hw&#39;</span>
</span><span class="line"><span class="n">kernel</span> <span class="o">=</span> <span class="s1">&#39;hw.xen&#39;</span>
</span><span class="line"><span class="n">memory</span> <span class="o">=</span> <span class="mi">16</span>
</span><span class="line"><span class="n">on_crash</span> <span class="o">=</span> <span class="s1">&#39;preserve&#39;</span>
</span><span class="line"><span class="n">on_poweroff</span> <span class="o">=</span> <span class="s1">&#39;preserve&#39;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Setting <code>on_crash</code> and <code>on_poweroff</code> to <code>preserve</code> lets us see any output or errors, which would otherwise be missed if the VM exits too quickly.</p>
<p>We can now boot our new VM:</p>
<pre><code>$ xl create -c hw.xl
Xen Minimal OS!
  start_info: 000000000009b000(VA)
    nr_pages: 0x800
  shared_inf: 0x6ee97000(MA)
     pt_base: 000000000009e000(VA)
nr_pt_frames: 0x5
    mfn_list: 0000000000097000(VA)
   mod_start: 0x0(VA)
     mod_len: 0
       flags: 0x0
    cmd_line: 
       stack: 0000000000055e00-0000000000075e00
Mirage: start_kernel
MM: Init
      _text: 0000000000000000(VA)
     _etext: 000000000003452d(VA)
   _erodata: 000000000003c000(VA)
     _edata: 000000000003e4d0(VA)
stack start: 0000000000055e00(VA)
       _end: 0000000000096d64(VA)
  start_pfn: a6
    max_pfn: 800
Mapping memory range 0x400000 - 0x800000
setting 0000000000000000-000000000003c000 readonly
skipped 0000000000001000
MM: Initialise page allocator for a8000(a8000)-800000(800000)
MM: done
Demand map pfns at 801000-2000801000.
Initialising timer interface
Initialising console ... done.
gnttab_table mapped at 0000000000801000.
xencaml: app_main_thread
getenv(OCAMLRUNPARAM) -&gt; null
getenv(CAMLRUNPARAM) -&gt; null
Unsupported function lseek called in Mini-OS kernel
Unsupported function lseek called in Mini-OS kernel
Unsupported function lseek called in Mini-OS kernel
Hello, world!
main returned 0
</code></pre>
<p>(Note: I'm testing locally by running Xen under VirtualBox. Not all of Xen's features can be used in this mode, but it works for testing unikernels. I'm also using my Git version of <code>mirage-xen</code>; the official one will display an error after printing the greeting because it expects you to provide a main loop too. The warnings about <code>lseek</code> are just OCaml trying to find the current file offsets for <code>stdin</code>, <code>stdout</code> and <code>stderr</code>.)</p>
<p>As you can see, the boot process is quite short.
Execution begins at <a href="http://xenbits.xenproject.org/gitweb/?p=xen.git;a=blob;f=extras/mini-os/arch/x86/x86_64.S;h=df3469ef4319a75cc9d4c36b51f4097897c015f2;hb=HEAD#l19"><code>_start</code></a>.
Using <code>objdump -d hw.xen</code>, you can see that this just sets up the stack pointer register and calls the C function <code>arch_init</code>:</p>
<pre><code>0000000000000000 &lt;_start&gt;:
       0:   fc                      cld    
       1:   48 8b 25 0f 00 00 00    mov    0xf(%rip),%rsp        # 17 &lt;stack_start&gt;
       8:   48 81 e4 00 00 ff ff    and    $0xffffffffffff0000,%rsp
       f:   48 89 f7                mov    %rsi,%rdi
      12:   e8 e2 bb 00 00          callq  bbf9 &lt;arch_init&gt;
</code></pre>
<p><a href='http://xenbits.xenproject.org/gitweb/?p=xen.git;a=blob;f=extras/mini-os/arch/x86/setup.c;h=5e87dd1d99014a8139d8a63b375793661e57263b;hb=HEAD#l93'>arch_init</a> (in libminios) initialises the traps and FPU and then prints <code>Xen Minimal OS!</code> and information about various addresses.
It then calls <code>start_kernel</code>.</p>
<p><a href='https://github.com/mirage/mirage-platform/blob/45814f85acca6174915acbb571146e1c4e978684/xen/runtime/xencaml/main.c#L57'>start_kernel</a> (in libxencaml) sets up a few more features (events, interrupts, malloc, time-keeping and grant tables), then calls <code>caml_startup</code>.</p>
<p><a href='https://github.com/mirage/mirage-platform/blob/45814f85acca6174915acbb571146e1c4e978684/xen/runtime/ocaml/startup.c#L202'>caml_startup</a> (in libocaml) initialises the garbage collector and calls <code>caml_program</code>, which is our <code>hw.native.o</code>.</p>
<p>We call <code>print_endline</code>, which libxencaml, as a convenience for debugging, forwards to libminios's <code>console_print</code>.</p>
<h3 id="using-mirage-libraries">Using Mirage libraries</h3>
<p>The above was a bit of a hack, which ended up just using the C console driver in libminios (one of the few things it provides, as it's needed for printk).
We can instead use the <code>mirage-console-xen</code> OCaml library, like this:</p>
<figure class="code"><figcaption><span>hw.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nc">Lwt</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Console</span><span class="p">.</span><span class="n">connect</span> <span class="s2">&quot;0&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">function</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Error</span> <span class="o">_</span> <span class="o">-&gt;</span> <span class="n">failwith</span> <span class="s2">&quot;Failed to connect to console&quot;</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Ok</span> <span class="n">default_console</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="nn">Console</span><span class="p">.</span><span class="n">log_s</span> <span class="n">default_console</span> <span class="s2">&quot;Hello, world!&quot;</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">OS</span><span class="p">.</span><span class="nn">Main</span><span class="p">.</span><span class="n">run</span> <span class="n">main</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Mirage uses the usual <code>Lwt</code> library for cooperative threading, which I wrote about last year in <a href="https://roscidus.com/blog/blog/2013/11/28/asynchronous-python-vs-ocaml/">Asynchronous Python vs OCaml</a>. <code>&gt;&gt;=</code> means to wait for the result, allowing other code to run in the meantime. Everything in Mirage is non-blocking, even looking up the console. <code>OS.Main.run</code> runs the main event loop.</p>
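<p>To see <code>&gt;&gt;=</code> in isolation, here is a minimal sketch that runs as an ordinary Unix program rather than a unikernel. It assumes the <code>lwt</code> and <code>lwt.unix</code> packages are installed; the <code>greeting</code> and <code>shout</code> functions are made up for illustration and are not part of Mirage.</p>
<pre><code class="ocaml">(* Needs the lwt and lwt.unix packages; runs as a normal Unix program. *)
open Lwt.Infix

(* A thread that sleeps briefly, then resolves to a greeting.
   While it sleeps, other Lwt threads are free to run. *)
let greeting () : string Lwt.t =
  Lwt_unix.sleep 0.01 &gt;&gt;= fun () -&gt;
  Lwt.return &quot;Hello, world!&quot;

(* [&gt;&gt;=] waits for [greeting ()] and passes its result to the
   function on the right. *)
let shout () : string Lwt.t =
  greeting () &gt;&gt;= fun s -&gt;
  Lwt.return (String.uppercase_ascii s)

let () =
  (* Lwt_main.run drives the event loop until [shout ()] resolves. *)
  print_endline (Lwt_main.run (shout ()))
</code></pre>
<p>Here <code>Lwt_main.run</code> plays roughly the role that <code>OS.Main.run</code> plays in the unikernel.</p>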
<p>Since we're using libraries, let's switch to ocamlbuild and give the dependencies in the <code>_tags</code> file, as usual for OCaml projects:</p>
<pre><code>true: warn(A), strict_sequence, package(mirage-console-xen)
</code></pre>
<p>The only unusual thing we have to do here is tell ocamlbuild not to link in the <code>Unix</code> module when we build <code>hw.native.o</code>:</p>
<pre><code>$ ocamlbuild -lflags -linkpkg,-dontlink,unix -use-ocamlfind hw.native.o
</code></pre>
<p>In the same way, we can use other libraries to access raw block devices (<a href="https://github.com/mirage/mirage-block-xen">mirage-block-xen</a>), timers (<a href="https://github.com/mirage/mirage-clock">mirage-clock-xen</a>) and network interfaces (<a href="https://github.com/mirage/mirage-net-xen/">mirage-net-xen</a>).
Other (non-Xen-specific) OCaml libraries can then be used on top of these low-level drivers.
For example, <a href="https://github.com/mirage/ocaml-fat">fat-filesystem</a> can provide a filesystem on a block device, while <a href="https://github.com/mirage/mirage-tcpip">tcpip</a> provides an OCaml TCP/IP stack on a network interface.</p>
<h3 id="the-mirage-unix-libraries">The mirage-unix libraries</h3>
<p>You may have noticed that the Xen driver libraries we used above ended in <code>-xen</code>.
In fact, each of these is just an implementation of some generic interface provided by Mirage.
For example, <a href="https://github.com/mirage/mirage/blob/master/types/V1.mli">mirage/types</a> defines the abstract <code>CONSOLE</code> interface as:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">CONSOLE</span> <span class="o">=</span> <span class="k">sig</span>
</span><span class="line">  <span class="c">(** Text console input/output operations. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">type</span> <span class="n">error</span> <span class="o">=</span> <span class="o">[</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Invalid_console</span> <span class="k">of</span> <span class="kt">string</span>
</span><span class="line">  <span class="o">]</span>
</span><span class="line">  <span class="c">(** The type representing possible errors when attaching a console. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">include</span> <span class="nc">DEVICE</span> <span class="k">with</span>
</span><span class="line">    <span class="k">type</span> <span class="n">error</span> <span class="o">:=</span> <span class="n">error</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">write</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">int</span>
</span><span class="line">  <span class="c">(** [write t buf off len] writes up to [len] chars of [String.sub buf</span>
</span><span class="line"><span class="c">      off len] to the console [t] and returns the number of bytes</span>
</span><span class="line"><span class="c">      written. Raises {!Invalid_argument} if [len &gt; buf - off]. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">write_all</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="n">io</span>
</span><span class="line">  <span class="c">(** [write_all t buf off len] is a thread that writes [String.sub buf</span>
</span><span class="line"><span class="c">      off len] to the console [t] and returns when done. Raises</span>
</span><span class="line"><span class="c">      {!Invalid_argument} if [len &gt; buf - off]. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">log</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="kt">unit</span>
</span><span class="line">  <span class="c">(** [log str] writes as much characters of [str] that can be written</span>
</span><span class="line"><span class="c">      in one write operation to the console [t], then writes</span>
</span><span class="line"><span class="c">      &quot;\r\n&quot; to it. *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">val</span> <span class="n">log_s</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="n">io</span>
</span><span class="line">  <span class="c">(** [log_s str] is a thread that writes [str ^ &quot;\r\n&quot;] in the</span>
</span><span class="line"><span class="c">      console [t]. *)</span>
</span><span class="line">
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>By linking against the <code>-unix</code> versions of libraries rather than the <code>-xen</code> ones, we can compile our code as an ordinary Unix program and run it directly.
This makes testing and debugging very easy.</p>
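<p>As a rough illustration of why this helps with testing, here is a self-contained sketch using only the OCaml standard library. <code>SIMPLE_CONSOLE</code>, <code>Stdout_console</code> and <code>Buffer_console</code> are hypothetical names, and the signature is a much-simplified stand-in for the real <code>CONSOLE</code> interface (no Lwt, no error handling).</p>
<pre><code class="ocaml">(* Hypothetical, much-simplified stand-in for the CONSOLE interface. *)
module type SIMPLE_CONSOLE = sig
  type t
  val connect : string -&gt; t
  val log : t -&gt; string -&gt; unit
end

(* One implementation prints straight to stdout. *)
module Stdout_console : SIMPLE_CONSOLE = struct
  type t = string
  let connect id = id
  let log _t msg = print_string (msg ^ &quot;\r\n&quot;)
end

(* Another records everything in memory, which is handy in tests. *)
module Buffer_console : sig
  include SIMPLE_CONSOLE
  val contents : t -&gt; string
end = struct
  type t = Buffer.t
  let connect _id = Buffer.create 64
  let log t msg = Buffer.add_string t (msg ^ &quot;\r\n&quot;)
  let contents = Buffer.contents
end

(* Choosing the implementation here plays the role of choosing
   which library to link against. *)
module Console = Buffer_console

let () =
  let c = Console.connect &quot;0&quot; in
  Console.log c &quot;Hello, world!&quot;;
  (* Because the output was captured, we can check it directly. *)
  assert (Console.contents c = &quot;Hello, world!\r\n&quot;)
</code></pre>
<p>The calls to <code>connect</code> and <code>log</code> don't care which module <code>Console</code> is bound to; only the final check relies on the in-memory version.</p>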
<p>To make sure our code is generic enough to do this, we can wrap it in a functor that takes any console module as an input:</p>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span> <span class="s2">&quot;Hello, world!&quot;</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The code that provides a Xen or Unix console and calls this goes in <code>main.ml</code>:</p>
<figure class="code"><figcaption><span>main.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nc">Lwt</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">console</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Console</span><span class="p">.</span><span class="n">connect</span> <span class="s2">&quot;0&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">function</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Error</span> <span class="o">_</span> <span class="o">-&gt;</span> <span class="n">failwith</span> <span class="s2">&quot;Failed to connect to console&quot;</span>
</span><span class="line">  <span class="o">|</span> <span class="o">`</span><span class="nc">Ok</span> <span class="n">c</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="n">c</span>
</span><span class="line">
</span><span class="line"><span class="k">module</span> <span class="nc">U</span> <span class="o">=</span> <span class="nn">Unikernel</span><span class="p">.</span><span class="nc">Main</span><span class="o">(</span><span class="nc">Console</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">OS</span><span class="p">.</span><span class="nn">Main</span><span class="p">.</span><span class="n">run</span> <span class="o">(</span><span class="n">console</span> <span class="o">&gt;&gt;=</span> <span class="nn">U</span><span class="p">.</span><span class="n">start</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="the-mirage-tool">The <code>mirage</code> tool</h3>
<p>With the platform-specific code isolated in <code>main.ml</code>, we can now use the <code>mirage</code> command-line tool to generate it automatically for the target platform.
<code>mirage</code> takes a <code>config.ml</code> configuration file and generates <code>Makefile</code> and <code>main.ml</code> based on the current platform and the arguments passed.</p>
<figure class="code"><figcaption><span>config.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">open</span> <span class="nc">Mirage</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="o">=</span> <span class="n">foreign</span> <span class="s2">&quot;Unikernel.Main&quot;</span> <span class="o">(</span><span class="n">console</span> <span class="o">@-&gt;</span> <span class="n">job</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="n">register</span> <span class="s2">&quot;hw&quot;</span> <span class="o">[</span>
</span><span class="line">    <span class="n">main</span> <span class="o">$</span> <span class="n">default_console</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><pre><code>$ mirage configure --unix
$ make
$ ./mir-hw 
Hello, world!
</code></pre>
<p>I won't describe this in detail because at this point we've reached the start of the <a href="http://openmirage.org/wiki/hello-world">official tutorial</a>, and you can read that instead.</p>
<h2 id="test-case">Test case</h2>
<p>Because <a href="http://0install.net">0install</a> is decentralised, it doesn't need a single centrally-managed repository (or several incompatible repositories, each trying to package every program, as is common with Linux distributions).
In 0install, it's possible for every developer to run their own <a href="http://0install.net/0repo.html">repository</a>, containing just their software, with cross-repository dependencies handled automatically.
But just because it's possible doesn't mean we have to go to that extreme: having medium-sized repositories, each managed by a team of people, can be very convenient, especially where package maintainers come and go.</p>
<p>The general pattern for a group repository is to have a public server that accepts new package uploads from developers, and a private (firewalled) server with the repository's GPG key, which downloads from it:</p>
<p><img src="/blog/images/0repo-multi.png" class="center"/></p>
<p>Debian uses an anonymous FTP server for its incoming queue, polling it with a cron job.
This turns out to be surprisingly complicated.
You need to handle incomplete uploads (not processing them until they're done, or deleting them eventually if they never complete), allow contributors to overwrite or delete their own partial uploads (Debian allows you to upload a GPG-signed command file, which provides some control), and so on, as well as keeping the service fully patched.
Also, the cron system can be annoying: if the package contains a mistake then it will be several minutes before the cron job discovers this and emails the packager.</p>
<p>Perhaps there are some decent systems out there to handle all this, but it seemed like a good opportunity to try making a unikernel.</p>
<p>A particularly nice feature of this test case is that it doesn't matter too much if it fails:
the repository itself will check the developer's signature on the files, so an attacker can't compromise the repository by breaking into the queue; everything in the queue is intended to become public, so we need not worry much about confidentiality; lost uploads can be easily resubmitted; and if it goes down for a bit, it just means that new software can't be added to the repository.
So, there's nothing critical about this service, which is reassuring.</p>
<h3 id="storage">Storage</h3>
<p>The <a href="https://github.com/mirage/merge-queues">merge-queues</a> library builds a queue abstraction on top of
<a href="https://github.com/mirage/irmin">Irmin</a>, a Git-inspired storage system for Mirage.
But my needs are simple, and I wanted to test the more primitive libraries first, so I decided to build my queue directly on a plain filesystem.
This was the first interface I came up with:</p>
<figure class="code"><figcaption><span>upload_queue.mli</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="k">type</span> <span class="nc">FS</span> <span class="o">=</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">FS</span> <span class="k">with</span>
</span><span class="line">  <span class="k">type</span> <span class="n">page_aligned_buffer</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">t</span> <span class="ow">and</span>
</span><span class="line">  <span class="k">type</span> <span class="n">block_device_error</span> <span class="o">=</span> <span class="nn">Fat</span><span class="p">.</span><span class="nn">Fs</span><span class="p">.</span><span class="n">block_error</span>
</span><span class="line">
</span><span class="line"><span class="c">(** An upload.</span>
</span><span class="line"><span class="c"> * To avoid loading complete uploads into RAM, we stream them</span>
</span><span class="line"><span class="c"> * between the network and the disk. *)</span>
</span><span class="line"><span class="k">type</span> <span class="n">item</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">  <span class="n">size</span> <span class="o">:</span> <span class="n">int64</span><span class="o">;</span>
</span><span class="line">  <span class="n">data</span> <span class="o">:</span> <span class="kt">string</span> <span class="nn">Lwt_stream</span><span class="p">.</span><span class="n">t</span><span class="o">;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="k">type</span> <span class="n">add_error</span> <span class="o">=</span> <span class="o">[`</span><span class="nc">Wrong_size</span> <span class="k">of</span> <span class="n">int64</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Unknown</span> <span class="k">of</span> <span class="n">exn</span><span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="k">module</span> <span class="nc">Make</span> <span class="o">:</span> <span class="k">functor</span> <span class="o">(</span><span class="nc">F</span> <span class="o">:</span> <span class="nc">FS</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="k">sig</span>
</span><span class="line">  <span class="c">(** An upload queue. *)</span>
</span><span class="line">  <span class="k">type</span> <span class="n">t</span>
</span><span class="line">
</span><span class="line">  <span class="c">(** Create a new queue, backed by a filesystem. *)</span>
</span><span class="line">  <span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="nn">F</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="n">t</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">
</span><span class="line">  <span class="k">module</span> <span class="nc">Upload</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">    <span class="c">(** Add an upload to the queue.</span>
</span><span class="line"><span class="c">     * The upload is added only once the end of the stream is</span>
</span><span class="line"><span class="c">     * reached, and only if the total size matches the size</span>
</span><span class="line"><span class="c">     * in the record.</span>
</span><span class="line"><span class="c">     * To cancel an add, just terminate the stream. *)</span>
</span><span class="line">    <span class="k">val</span> <span class="n">add</span> <span class="o">:</span>
</span><span class="line">      <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">item</span> <span class="o">-&gt;</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Ok</span> <span class="k">of</span> <span class="kt">unit</span> <span class="o">|</span> <span class="o">`</span><span class="nc">Error</span> <span class="k">of</span> <span class="n">add_error</span> <span class="o">]</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">end</span>
</span><span class="line">
</span><span class="line">  <span class="k">module</span> <span class="nc">Download</span> <span class="o">:</span> <span class="k">sig</span>
</span><span class="line">    <span class="c">(** Interface for the repository software to fetch items</span>
</span><span class="line"><span class="c">     * from the queue. Only one client may use this interface</span>
</span><span class="line"><span class="c">     * at a time, or things will go wrong. *)</span>
</span><span class="line">
</span><span class="line">    <span class="c">(** Return a fresh stream for the item at the head of the</span>
</span><span class="line"><span class="c">     * queue, without removing it. After downloading it</span>
</span><span class="line"><span class="c">     * successfully, the client should call [delete]. If the</span>
</span><span class="line"><span class="c">     * queue is empty, this blocks until an item is available. *)</span>
</span><span class="line">    <span class="k">val</span> <span class="n">peek</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">item</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">
</span><span class="line">    <span class="c">(** Delete the item previously retrieved by [peek].</span>
</span><span class="line"><span class="c">     * If the previous item has already been deleted, this does</span>
</span><span class="line"><span class="c">     * nothing, even if there are more items in the queue. *)</span>
</span><span class="line">    <span class="k">val</span> <span class="n">delete</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="nn">Lwt</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">  <span class="k">end</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Our <code>unikernel.ml</code> will use this to make a queue, backed by a filesystem.
Uploaders' HTTP POSTs will be routed to <code>Upload.add</code>, while the repository's GET and DELETE invocations go to the <code>Download</code> submodule.
<code>delete</code> is a separate operation because we want the repository to confirm that it got the item successfully before we delete it, in case of network errors.</p>
<p>Ideally, we might require that the <code>DELETE</code> comes over the same HTTP connection as the <code>GET</code> just in case we accidentally run two instances of the repository software, but that's unlikely and it's convenient to test using separate <code>curl</code> invocations.</p>
<p>We're using another functor here, <code>Upload_queue.Make</code>, so that our queue will work over any filesystem.
In theory, we can configure our unikernel with a FAT filesystem on a block device when running under Xen,
while using a regular directory when running under Linux (e.g. for testing).</p>
<p>But it doesn't work.
You can see at the top that I had to restrict Mirage's abstract <code>FS</code> type in two ways:</p>
<ul>
<li>
<p>The <code>read</code> and <code>write</code> functions in <code>FS</code> pass the data using the abstract <code>page_aligned_buffer</code> type.
Since we need to do something with the data, this isn't good enough.
I therefore declare that this must be a <code>Cstruct.t</code> (basically, an array of bytes).
This is actually OK; <a href="https://github.com/mirage/mirage-fs-unix">mirage-fs-unix</a> also uses this type.</p>
</li>
<li>
<p>One of the possible error codes from <code>FS</code> is the abstract type <code>FS.block_device_error</code>, and I can't
see any way to turn one of these into a string using the <code>FS</code> interface.
I therefore require a filesystem implementation that defines it to be <code>Fat.Fs.block_error</code>.
Obviously, this means we now only support the FAT filesystem.</p>
</li>
</ul>
<p>This doesn't prevent us from running as a normal process, because we can ask for a Unix &quot;block&quot; device (actually, just a plain <code>disk.img</code> file) and pass that to the <code>Fat</code> module, but it would be nice to have the option of using a real directory.</p>
<p>I asked about this on the mailing list - <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2014-07/msg00048.html">Mirage questions from writing a REST service</a> - and it looks like the <code>FS</code> type will change soon.</p>
<h3 id="implementation">Implementation</h3>
<p>For the curious, this initial implementation is in <a href='https://github.com/0install/0repo-queue/blob/063db53c6faf76c7a9edd18416c25459603b777d/upload_queue.ml'>upload_queue.ml</a>.</p>
<p>Internally, the module creates an in-memory queue to keep track of successful uploads.
Uploads are streamed to the disk and
when an upload completes with the declared size, the filename is added to the queue.
If the upload ends with the wrong size (probably because the connection was lost), the file is deleted.</p>
<p>But what if our VM gets rebooted?
We need to scan the file system at start up and work out which uploads are complete and which should be deleted.
My first thought was to name the files <code>NUMBER.part</code> during the upload and rename on success.
However, the <code>FS</code> interface currently lacks a <code>rename</code> method.
Instead, I write an <code>N</code> byte to the start of each file and set it to <code>Y</code> on success.
That works, but renaming would be nicer!</p>
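<p>As a rough illustration of that start-up scan (a sketch only, not the code in <code>upload_queue.ml</code>), the decision comes down to the first byte of each file.
Here the files are modelled as hypothetical in-memory <code>(name, contents)</code> pairs instead of going through the <code>FS</code> interface, and <code>classify</code> and <code>recover</code> are made-up names:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical stand-in for the files found on disk at boot:
 * (name, contents) pairs, where the first byte is the flag. *)
type verdict = Complete | Partial

(* Classify one file from its flag byte: 'Y' means the upload
 * finished; anything else is treated as a partial upload. *)
let classify contents =
  if String.length contents &gt; 0 &amp;&amp; contents.[0] = 'Y' then Complete
  else Partial

(* Split a directory listing into files to queue and files to delete. *)
let recover files =
  List.partition (fun (_name, contents) -&gt; classify contents = Complete) files

let () =
  let files = [ ("1", "Yhello"); ("2", "Nhal"); ("3", "") ] in
  let keep, drop = recover files in
  List.iter (fun (name, _) -&gt; Printf.printf "queue %s\n" name) keep;
  List.iter (fun (name, _) -&gt; Printf.printf "delete %s\n" name) drop
</code></pre></td></tr></tbody></table></div></figure>
<p>In this sketch, anything not flagged <code>Y</code> (including an empty file) counts as a partial upload and is dropped.</p>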
<p>For downloading, the <code>peek</code> function returns the item at the head of the queue.
If the queue is empty, it waits until something arrives.
The repository just makes a GET request - if something is available then it returns immediately,
otherwise the connection stays open until some data is ready, allowing the repository to respond immediately to new uploads.</p>
<h3 id="unit-testing-the-storage-system">Unit-testing the storage system</h3>
<p>Because our unikernel can run as a process, testing is easy even if you don't have a local Xen deployment.
A set of unit-tests test the upload queue module just as for any other program, and the service can be run as a normal process, listening on a normal TCP socket.
A slight annoyance here is that the generated Makefile doesn't include any rules to build the tests so you have to add them manually, and
if you regenerate the Makefile then it loses the new rule.</p>
<p>As you might expect from such a new system, testing uncovered several problems. The first (minor) problem is that when the disk becomes full, the unhelpful error reported by the filesystem is <code>Failure(&quot;Unknown error: Failure(\&quot;fault\&quot;)&quot;)</code>.</p>
<p>(I asked about this on the mailing list - <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2014-07/msg00069.html">Error handling in Mirage</a> - and there seems to be agreement that error handling should change.)</p>
<p>A more serious problem was that deleting files corrupted the FAT directory index.
I downloaded the FAT library and added a unit-test for delete, which made it easy to track the problem down (despite my lack of knowledge of FAT).
Here's the code for marking a directory entry as deleted in the FAT library:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">    <span class="k">let</span> <span class="n">b</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">sub</span> <span class="n">block</span> <span class="n">offset</span> <span class="n">sizeof</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">delta</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">create</span> <span class="n">sizeof</span> <span class="k">in</span>
</span><span class="line">    <span class="k">begin</span> <span class="k">match</span> <span class="n">unmarshal</span> <span class="n">b</span> <span class="k">with</span>
</span><span class="line">      <span class="o">|</span> <span class="nc">Lfn</span> <span class="n">lfn</span> <span class="o">-&gt;</span>
</span><span class="line">	<span class="k">let</span> <span class="n">lfn&#39;</span> <span class="o">=</span> <span class="o">{</span> <span class="n">lfn</span> <span class="k">with</span> <span class="n">lfn_deleted</span> <span class="o">=</span> <span class="bp">true</span> <span class="o">}</span> <span class="k">in</span>
</span><span class="line">	<span class="n">marshal</span> <span class="n">delta</span> <span class="o">(</span><span class="nc">Lfn</span> <span class="n">lfn&#39;</span><span class="o">)</span>
</span><span class="line">      <span class="o">|</span> <span class="nc">Dos</span> <span class="n">dos</span> <span class="o">-&gt;</span>
</span><span class="line">	<span class="k">let</span> <span class="n">dos&#39;</span> <span class="o">=</span> <span class="o">{</span> <span class="n">dos</span> <span class="k">with</span> <span class="n">deleted</span> <span class="o">=</span> <span class="bp">true</span> <span class="o">}</span> <span class="k">in</span>
</span><span class="line">	<span class="n">marshal</span> <span class="n">b</span> <span class="o">(</span><span class="nc">Dos</span> <span class="n">dos&#39;</span><span class="o">)</span>
</span><span class="line">      <span class="o">|</span> <span class="nc">End</span> <span class="o">-&gt;</span> <span class="k">assert</span> <span class="bp">false</span>
</span><span class="line">    <span class="k">end</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Update</span><span class="p">.</span><span class="n">from_cstruct</span> <span class="o">(</span><span class="nn">Int64</span><span class="p">.</span><span class="n">of_int</span> <span class="n">offset</span><span class="o">)</span> <span class="n">delta</span> <span class="o">::</span> <span class="n">acc</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It's supposed to take an entry, unmarshal it into an OCaml structure, set the <code>deleted</code> flag, and marshal the result into a new <code>delta</code> structure.
These deltas are returned and applied to the device.
The bug is a simple typo: <code>Lfn</code> (long filename) entries update correctly, but for old <code>Dos</code> ones it writes the new block to the input, not to <code>delta</code>.
The fix was simple enough (I also refactored it slightly to encourage the correct behaviour in future):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">    <span class="k">let</span> <span class="n">b</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">sub</span> <span class="n">block</span> <span class="n">offset</span> <span class="n">sizeof</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">delta</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">create</span> <span class="n">sizeof</span> <span class="k">in</span>
</span><span class="line">    <span class="n">marshal</span> <span class="n">delta</span> <span class="k">begin</span> <span class="k">match</span> <span class="n">unmarshal</span> <span class="n">b</span> <span class="k">with</span>
</span><span class="line">      <span class="o">|</span> <span class="nc">Lfn</span> <span class="n">lfn</span> <span class="o">-&gt;</span> <span class="nc">Lfn</span> <span class="o">{</span> <span class="n">lfn</span> <span class="k">with</span> <span class="n">lfn_deleted</span> <span class="o">=</span> <span class="bp">true</span> <span class="o">}</span>
</span><span class="line">      <span class="o">|</span> <span class="nc">Dos</span> <span class="n">dos</span> <span class="o">-&gt;</span> <span class="nc">Dos</span> <span class="o">{</span> <span class="n">dos</span> <span class="k">with</span> <span class="n">deleted</span> <span class="o">=</span> <span class="bp">true</span> <span class="o">}</span>
</span><span class="line">      <span class="o">|</span> <span class="nc">End</span> <span class="o">-&gt;</span> <span class="k">assert</span> <span class="bp">false</span>
</span><span class="line">    <span class="k">end</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Update</span><span class="p">.</span><span class="n">from_cstruct</span> <span class="o">(</span><span class="nn">Int64</span><span class="p">.</span><span class="n">of_int</span> <span class="n">offset</span><span class="o">)</span> <span class="n">delta</span> <span class="o">::</span> <span class="n">acc</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This demonstrates both the good and the bad of Mirage: the bug was easy to find and fix, using regular debugging tools.
I'm sure fixing a filesystem corruption bug in the Linux kernel would have been vastly more difficult.
On the other hand, Linux is rather well tested, whereas I appear to be the first person ever to try deleting a file in Mirage!</p>
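<p>To make the <code>Dos</code> slip described above concrete, here is a toy model of the two versions using plain <code>Bytes</code> in place of <code>Cstruct</code>.
Everything in it is an assumption made for illustration: the 4-byte entry, the <code>mark_deleted</code> helper standing in for unmarshal-then-marshal, and the use of <code>0xE5</code> as the deleted marker.
The buggy shape hands back a delta that was never filled in, while the fixed shape writes the result into the delta:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Toy model: an entry is 4 bytes and byte 0 is the deleted marker.
 * [mark_deleted] is a hypothetical stand-in for unmarshal + marshal. *)
let sizeof = 4

let mark_deleted ~src ~dst =
  Bytes.blit src 0 dst 0 sizeof;
  Bytes.set dst 0 '\xe5'

(* Buggy shape: the result is written back into the input,
 * so the returned delta is still blank. *)
let delete_buggy entry =
  let delta = Bytes.make sizeof '\000' in
  mark_deleted ~src:entry ~dst:entry;
  delta

(* Fixed shape: the result always goes into the fresh delta. *)
let delete_fixed entry =
  let delta = Bytes.make sizeof '\000' in
  mark_deleted ~src:entry ~dst:delta;
  delta

let () =
  let blank = Bytes.make sizeof '\000' in
  assert (Bytes.equal (delete_buggy (Bytes.of_string "FILE")) blank);
  assert (Bytes.get (delete_fixed (Bytes.of_string "FILE")) 0 = '\xe5');
  print_endline "ok"
</code></pre></td></tr></tbody></table></div></figure>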
<h3 id="the-http-server">The HTTP server</h3>
<p>This turned out to be quite simple. Here's the unikernel's <a href="https://github.com/0install/0repo-queue/blob/92c99d34104b6bea707044924f9753a9cc7f0414/unikernel.ml"><code>start</code></a> function:</p>
<figure class="code"><figcaption><span>unikernel.ml</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">F</span> <span class="o">:</span> <span class="nn">Upload_queue</span><span class="p">.</span><span class="nc">FS</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">H</span> <span class="o">:</span> <span class="nn">Cohttp_lwt</span><span class="p">.</span><span class="nc">Server</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Q</span> <span class="o">=</span> <span class="nn">Upload_queue</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">F</span><span class="o">)</span>
</span><span class="line">  <span class="o">[...]</span>
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="n">fs</span> <span class="n">http</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">write</span> <span class="o">:=</span> <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;starting queue service&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">
</span><span class="line">    <span class="nn">Q</span><span class="p">.</span><span class="n">create</span> <span class="n">fs</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">q</span> <span class="o">-&gt;</span>
</span><span class="line">
</span><span class="line">    <span class="k">let</span> <span class="n">callback</span> <span class="o">_</span><span class="n">conn_id</span> <span class="n">request</span> <span class="n">body</span> <span class="o">=</span>
</span><span class="line">      <span class="k">match</span> <span class="nn">Uri</span><span class="p">.</span><span class="n">path</span> <span class="n">request</span><span class="o">.</span><span class="nn">H</span><span class="p">.</span><span class="nn">Request</span><span class="p">.</span><span class="n">uri</span> <span class="k">with</span>
</span><span class="line">      <span class="o">|</span> <span class="s2">&quot;/uploader&quot;</span> <span class="o">-&gt;</span> <span class="n">handle_uploader</span> <span class="n">q</span> <span class="n">request</span> <span class="n">body</span>
</span><span class="line">      <span class="o">|</span> <span class="s2">&quot;/downloader&quot;</span> <span class="o">-&gt;</span> <span class="n">handle_downloader</span> <span class="n">q</span> <span class="n">request</span>
</span><span class="line">      <span class="o">|</span> <span class="n">path</span> <span class="o">-&gt;</span>
</span><span class="line">          <span class="nn">H</span><span class="p">.</span><span class="n">respond_error</span>
</span><span class="line">	    <span class="o">~</span><span class="n">status</span><span class="o">:`</span><span class="nc">Bad_request</span>
</span><span class="line">      	    <span class="o">~</span><span class="n">body</span><span class="o">:(</span><span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;Bad path &#39;%s&#39;</span><span class="se">\n</span><span class="s2">&quot;</span> <span class="n">path</span><span class="o">)</span>
</span><span class="line">	    <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">    <span class="k">let</span> <span class="n">conn_closed</span> <span class="o">_</span><span class="n">conn_id</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">      <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;connection closed&quot;</span> <span class="o">|&gt;</span> <span class="n">ignore</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line">    <span class="n">http</span> <span class="o">{</span> <span class="nn">H</span><span class="p">.</span>
</span><span class="line">      <span class="n">callback</span><span class="o">;</span>
</span><span class="line">      <span class="n">conn_closed</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line"><span class="k">end</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, our functor is extended to take a filesystem (using the restricted type required by our <code>Upload_queue</code>, as noted above) and an HTTP server module as arguments.</p>
<p>The HTTP server calls our <code>callback</code> each time it receives a request, and this dispatches <code>/uploader</code> requests to <code>handle_uploader</code> and <code>/downloader</code> ones to <code>handle_downloader</code>. These are also very simple, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">  <span class="k">let</span> <span class="n">get</span> <span class="n">q</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">Q</span><span class="p">.</span><span class="nn">Download</span><span class="p">.</span><span class="n">peek</span> <span class="n">q</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="o">{</span><span class="nn">Upload_queue</span><span class="p">.</span><span class="n">size</span><span class="o">;</span> <span class="n">data</span><span class="o">}</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">body</span> <span class="o">=</span> <span class="nn">Cohttp_lwt_body</span><span class="p">.</span><span class="n">of_stream</span> <span class="n">data</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">headers</span> <span class="o">=</span> <span class="nn">Cohttp</span><span class="p">.</span><span class="nn">Header</span><span class="p">.</span><span class="n">init_with</span>
</span><span class="line">      <span class="s2">&quot;Content-Length&quot;</span> <span class="o">(</span><span class="nn">Int64</span><span class="p">.</span><span class="n">to_string</span> <span class="n">size</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">    <span class="c">(* Adding a content-length loses the transfer-encoding</span>
</span><span class="line"><span class="c">     * for some reason, so add it back: *)</span>
</span><span class="line">    <span class="k">let</span> <span class="n">headers</span> <span class="o">=</span> <span class="nn">Cohttp</span><span class="p">.</span><span class="nn">Header</span><span class="p">.</span><span class="n">add</span> <span class="n">headers</span>
</span><span class="line">      <span class="s2">&quot;transfer-encoding&quot;</span> <span class="s2">&quot;chunked&quot;</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">H</span><span class="p">.</span><span class="n">respond</span> <span class="o">~</span><span class="n">headers</span> <span class="o">~</span><span class="n">status</span><span class="o">:`</span><span class="nc">OK</span> <span class="o">~</span><span class="n">body</span> <span class="bp">()</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">handle_downloader</span> <span class="n">q</span> <span class="n">request</span> <span class="o">=</span>
</span><span class="line">    <span class="k">match</span> <span class="nn">H</span><span class="p">.</span><span class="nn">Request</span><span class="p">.</span><span class="n">meth</span> <span class="n">request</span> <span class="k">with</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">GET</span> <span class="o">-&gt;</span> <span class="n">get</span> <span class="n">q</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">DELETE</span> <span class="o">-&gt;</span> <span class="n">delete</span> <span class="n">q</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">HEAD</span> <span class="o">|</span> <span class="o">`</span><span class="nc">PUT</span> <span class="o">|</span> <span class="o">`</span><span class="nc">POST</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">OPTIONS</span> <span class="o">|</span> <span class="o">`</span><span class="nc">PATCH</span> <span class="o">-&gt;</span> <span class="n">unsupported_method</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The other methods (<code>put</code> and <code>delete</code>) are similar.</p>
<h3 id="buffered-reads">Buffered reads</h3>
<p>Running as a <code>--unix</code> process, I initially got a download speed of
17.2 KB/s, which was rather disappointing.
Especially as Apache on the same machine gets 615 MB/s!</p>
<p>By increasing the size of the chunks I was reading from the FAT
filesystem (a disk.img file) from 512 bytes to 1 MB, I was able to
increase this to 2.83 MB/s. Removing the <code>O_DIRECT</code> flag from
<code>mirage-block-unix</code> then increased the download speed to 15 MB/s (so this is
with Linux caching the data in RAM).</p>
<p>To check the filesystem was the problem, I removed the <code>F.read</code> call
(so it would return uninitialised data instead of the actual file contents).
It then managed a very respectable 514 MB/s.
Nothing wrong with the HTTP code then.</p>
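<p>A back-of-the-envelope way to see why the chunk size matters: assuming each read call carries a roughly fixed cost (which is only an assumption), the number of calls is what hurts.
The sketch below just counts them for a hypothetical 100 MiB download; <code>reads_needed</code> is a made-up helper, not part of any Mirage library:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Number of read calls needed to fetch [file_size] bytes in pieces
 * of [chunk_size] bytes, rounding up for a final partial chunk. *)
let reads_needed ~file_size ~chunk_size =
  (file_size + chunk_size - 1) / chunk_size

let () =
  let file_size = 100 * 1024 * 1024 in  (* hypothetical 100 MiB download *)
  List.iter
    (fun chunk_size -&gt;
      Printf.printf "%8d-byte chunks: %d reads\n" chunk_size
        (reads_needed ~file_size ~chunk_size))
    [ 512; 1024 * 1024 ]
</code></pre></td></tr></tbody></table></div></figure>
<p>With these numbers, the 512-byte version makes 204800 calls where the 1 MB version makes 100.</p>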
<h3 id="streaming-uploads">Streaming uploads</h3>
<p>It all worked nicely running as a Unix process, so the next step was to deploy on Xen.
I was hoping that most of the bugs would already have been found during the Unix testing,
but in fact there were more lurking.</p>
<p>It worked for very small files, but when uploading larger files it quickly ran
out of memory on my 64-bit x86 test system. I also tried it on my 32-bit CubieTruck
ARM board, but that failed even sooner, with <code>Invalid_argument(&quot;String.create&quot;)</code> (on 32-bit
platforms, OCaml strings are limited to 16 MB).</p>
<p>In both cases, the problem was that the <a href="https://github.com/mirage/ocaml-cohttp">cohttp</a> library tried to read the entire upload in one go.
I found the <a href="https://github.com/mirage/ocaml-cohttp/blob/86394acdb580257ee78cb6976966662575b01bb0/cohttp/transfer_io.ml#L56">read</a> function in <code>Transfer_io</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">read</span> <span class="o">~</span><span class="n">len</span> <span class="n">ic</span> <span class="o">=</span>
</span><span class="line">  <span class="c">(* TODO functorise string to a bigbuffer *)</span>
</span><span class="line">  <span class="k">match</span> <span class="n">len</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span><span class="mi">0</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="nc">Done</span>
</span><span class="line">  <span class="o">|</span><span class="n">len</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">read_exactly</span> <span class="n">ic</span> <span class="n">len</span> <span class="o">&gt;&gt;=</span> <span class="k">function</span>
</span><span class="line">    <span class="o">|</span><span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="nc">Done</span>
</span><span class="line">    <span class="o">|</span><span class="nc">Some</span> <span class="n">buf</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="o">(</span><span class="nc">Final_chunk</span> <span class="n">buf</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I changed it to use <code>read</code> rather than <code>read_exactly</code> (<code>read</code> returns whatever data is available, waiting only if there isn't any at all):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">read</span> <span class="o">~</span><span class="n">remaining</span> <span class="n">ic</span> <span class="o">=</span>
</span><span class="line">  <span class="c">(* TODO functorise string to a bigbuffer *)</span>
</span><span class="line">  <span class="k">match</span> <span class="o">!</span><span class="n">remaining</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span><span class="mi">0</span> <span class="o">-&gt;</span> <span class="n">return</span> <span class="nc">Done</span>
</span><span class="line">  <span class="o">|</span><span class="n">len</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">read</span> <span class="n">ic</span> <span class="n">len</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">buf</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">remaining</span> <span class="o">:=</span> <span class="o">!</span><span class="n">remaining</span> <span class="o">-</span> <span class="nn">String</span><span class="p">.</span><span class="n">length</span> <span class="n">buf</span><span class="o">;</span>
</span><span class="line">    <span class="k">if</span> <span class="o">!</span><span class="n">remaining</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">then</span> <span class="n">return</span> <span class="o">(</span><span class="nc">Final_chunk</span> <span class="n">buf</span><span class="o">)</span>
</span><span class="line">    <span class="k">else</span> <span class="n">return</span> <span class="o">(</span><span class="nc">Chunk</span> <span class="n">buf</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I also had to change the signature to take a mutable reference (<code>remaining</code>) for the remaining data, otherwise it has no way to know when it's done (<a href="https://github.com/talex5/ocaml-cohttp/commit/030969c32d310d690b8df5929d9a5f32caee6eb5">patch</a>).</p>
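<p>The same counting-down pattern can be tried outside cohttp.
This is a minimal sketch using a blocking standard-library <code>in_channel</code> rather than Lwt, so it is not the patched <code>Transfer_io</code> code; the 4096-byte cap on each read and the early-end-of-stream branch are my additions for the example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">type chunk = Chunk of string | Final_chunk of string | Done

(* Read up to [!remaining] bytes (capped at 4096) from [ic], returning
 * whatever [input] delivers instead of waiting for the whole body. *)
let read ~remaining ic =
  match !remaining with
  | 0 -&gt; Done
  | len -&gt;
    let buf = Bytes.create (min len 4096) in
    let n = input ic buf 0 (Bytes.length buf) in
    if n = 0 then Done  (* stream ended early *)
    else begin
      remaining := !remaining - n;
      let data = Bytes.sub_string buf 0 n in
      if !remaining = 0 then Final_chunk data else Chunk data
    end

let () =
  (* Write a 10,000-byte temporary file to act as the upload body. *)
  let path = Filename.temp_file "upload" ".bin" in
  let oc = open_out_bin path in
  output_string oc (String.make 10_000 'x');
  close_out oc;
  let ic = open_in_bin path in
  let remaining = ref 10_000 in
  let rec drain total =
    match read ~remaining ic with
    | Chunk s -&gt; drain (total + String.length s)
    | Final_chunk s -&gt; total + String.length s
    | Done -&gt; total
  in
  let total = drain 0 in
  close_in ic;
  Sys.remove path;
  assert (total = 10_000);
  assert (!remaining = 0)
</code></pre></td></tr></tbody></table></div></figure>
<p>The caller never holds more than one chunk at a time, which is the property the Xen deployment needed.</p>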
<h3 id="buffered-writes">Buffered writes</h3>
<p>With the uploads now split into chunks, upload speed with <code>--unix</code> was 178 KB/s.
Batching up the chunks (which were generally 4 KB each) into a 64 KB buffer increased the speed to 2083 KB/s.
With a 1 MB buffer, I got 6386 KB/s.</p>
<p>Here's the code I used:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">page_buffer</span> <span class="o">=</span> <span class="nn">Io_page</span><span class="p">.</span><span class="n">get</span> <span class="mi">256</span> <span class="o">|&gt;</span> <span class="nn">Io_page</span><span class="p">.</span><span class="n">to_cstruct</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line"><span class="c">(* Set the first byte to N to indicate that we&#39;re not done yet.</span>
</span><span class="line"><span class="c"> * If we reboot while this flag is set, the partial upload will</span>
</span><span class="line"><span class="c"> * be deleted. *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">page_buffer_used</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">1</span> <span class="k">in</span>
</span><span class="line"><span class="nn">Cstruct</span><span class="p">.</span><span class="n">set_char</span> <span class="n">page_buffer</span> <span class="mi">0</span> <span class="sc">&#39;N&#39;</span><span class="o">;</span>
</span><span class="line"><span class="k">let</span> <span class="n">file_offset</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">flush_page_buffer</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;Flushing %d bytes to disk&quot;</span> <span class="o">!</span><span class="n">page_buffer_used</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">buffered_data</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">sub</span> <span class="n">page_buffer</span> <span class="mi">0</span> <span class="o">!</span><span class="n">page_buffer_used</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">F</span><span class="p">.</span><span class="n">write</span> <span class="n">q</span><span class="o">.</span><span class="n">fs</span> <span class="n">name</span> <span class="o">!</span><span class="n">file_offset</span> <span class="n">buffered_data</span> <span class="o">&gt;&gt;|=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="n">file_offset</span> <span class="o">:=</span> <span class="o">!</span><span class="n">file_offset</span> <span class="o">+</span> <span class="o">!</span><span class="n">page_buffer_used</span><span class="o">;</span>
</span><span class="line">  <span class="n">page_buffer_used</span> <span class="o">:=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line">  <span class="n">return</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">add_data</span> <span class="n">src</span> <span class="n">i</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">src_remaining</span> <span class="o">=</span> <span class="nn">String</span><span class="p">.</span><span class="n">length</span> <span class="n">src</span> <span class="o">-</span> <span class="n">i</span> <span class="k">in</span>
</span><span class="line">  <span class="k">if</span> <span class="n">src_remaining</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">then</span> <span class="n">return</span> <span class="bp">()</span>
</span><span class="line">  <span class="k">else</span> <span class="o">(</span>
</span><span class="line">    <span class="k">let</span> <span class="n">page_buffer_free</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">len</span> <span class="n">page_buffer</span> <span class="o">-</span> <span class="o">!</span><span class="n">page_buffer_used</span> <span class="k">in</span>
</span><span class="line">    <span class="k">let</span> <span class="n">chunk_size</span> <span class="o">=</span> <span class="n">min</span> <span class="n">page_buffer_free</span> <span class="n">src_remaining</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">Cstruct</span><span class="p">.</span><span class="n">blit_from_string</span> <span class="n">src</span> <span class="n">i</span> <span class="n">page_buffer</span> <span class="o">!</span><span class="n">page_buffer_used</span> <span class="n">chunk_size</span><span class="o">;</span>
</span><span class="line">    <span class="n">page_buffer_used</span> <span class="o">:=</span> <span class="o">!</span><span class="n">page_buffer_used</span> <span class="o">+</span> <span class="n">chunk_size</span><span class="o">;</span>
</span><span class="line">    <span class="n">lwt</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">      <span class="k">if</span> <span class="n">page_buffer_free</span> <span class="o">=</span> <span class="n">chunk_size</span> <span class="k">then</span> <span class="n">flush_page_buffer</span> <span class="bp">()</span>
</span><span class="line">      <span class="k">else</span> <span class="n">return</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">    <span class="n">add_data</span> <span class="n">src</span> <span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">chunk_size</span><span class="o">)</span>
</span><span class="line">  <span class="o">)</span> <span class="k">in</span>
</span><span class="line">
</span><span class="line"><span class="n">data</span> <span class="o">|&gt;</span> <span class="nn">Lwt_stream</span><span class="p">.</span><span class="n">iter_s</span> <span class="o">(</span><span class="k">fun</span> <span class="n">data</span> <span class="o">-&gt;</span> <span class="n">add_data</span> <span class="n">data</span> <span class="mi">0</span><span class="o">)</span> <span class="o">&gt;&gt;=</span>
</span><span class="line"><span class="n">flush_page_buffer</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Asking on the mailing list confirmed that
<a href="http://lists.xenproject.org/archives/html/mirageos-devel/2014-07/msg00054.html">Fat is not well optimised</a>.
This isn't actually a problem for my service, since it's still faster than
my Internet connection, but there's clearly more work needed here.</p>
<h3 id="upload-speed-on-xen">Upload speed on Xen</h3>
<p>Testing on my little CubieTruck board, I then got:</p>
<table class="table"><tbody><tr><td> Upload speed   </td><td> 74 KB/s</td></tr><tr><td> Download speed </td><td> 1.6 KB/s</td></tr></tbody></table><p>Hmm. To get a feel for what the board is capable of, I ran <code>nc -l -p 8080 &lt; /dev/zero</code> on the board
and <code>nc cubietruck 8080 | pv &gt; /dev/null</code> on my laptop, getting 29 MB/s.</p>
<p>Still, my unikernel is running as a guest, meaning it has the overhead of using the virtual network
interface (it has to pass the data to dom0, which then sends it over the real interface). So I installed
a Linux guest and tried from there. 47.2 MB/s. Interesting. I have no idea why it's faster than dom0!</p>
<p>I loaded up <a href="http://www.wireshark.org/">Wireshark</a> to see what was happening with the unikernel transfers.
The upload transfer mostly went fast, but stalled in the middle for 15 seconds and then for 12 seconds at the end.
Wireshark showed that the unikernel was ack'ing the packets but reducing the TCP window size, indicating that the packets weren't being processed by the application code.
The delays corresponded to the times when we were flushing the data to the SD card, which makes sense.
So, this looks like another filesystem problem (we should be able to write to the SD card much faster than this).</p>
<h3 id="tcp-retransmissions">TCP retransmissions</h3>
<p>For the download, Wireshark showed that many of the packets had incorrect TCP checksums and were having to be retransmitted.
I was already familiar with this bug from a previous mailing list discussion: <a href="http://lists.xenproject.org/archives/html/mirageos-devel/2014-07/msg00131.html">wireshark capture of failed download from mirage-www on ARM</a>.
That turned out to be <a href="http://lists.xenproject.org/archives/html/xen-users/2014-07/msg00067.html">a Linux bug</a> - the privileged dom0 code responsible for sending our virtual network packets to the real network becomes confused if two packets occupy the same physical page in memory.</p>
<p>Here's what happens:</p>
<ol>
<li>We read 1 MB of data from the disk and send it to the HTTP layer as the next chunk.
</li>
<li><a href="https://github.com/mirage/ocaml-cohttp/blob/86394acdb580257ee78cb6976966662575b01bb0/cohttp/transfer_io.ml#L48"><code>Chunked.write</code></a> does the HTTP chunking and sends it to the TCP/IP channel.
</li>
<li><a href="https://github.com/mirage/mirage-tcpip/blob/dedd5e0626a3fc8d679e60e92e5fc327c37759bd/channel/channel.ml#L181"><code>Channel.write_string</code></a> writes the HTTP output into pages (aligned 4K blocks of memory).
</li>
<li><a href="https://github.com/mirage/mirage-tcpip/blob/dedd5e0626a3fc8d679e60e92e5fc327c37759bd/tcp/pcb.ml#L476"><code>Pcb.writefn</code></a> then determines that each page is too big for a TCP packet and splits each one into smaller chunks, sharing the single underlying page:
</li>
</ol>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line">  <span class="k">let</span> <span class="k">rec</span> <span class="n">writefn</span> <span class="n">pcb</span> <span class="n">wfn</span> <span class="n">data</span> <span class="o">=</span>
</span><span class="line">    <span class="k">let</span> <span class="n">len</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">len</span> <span class="n">data</span> <span class="k">in</span>
</span><span class="line">    <span class="k">match</span> <span class="n">write_available</span> <span class="n">pcb</span> <span class="k">with</span>
</span><span class="line">    <span class="o">|</span> <span class="mi">0</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">write_wait_for</span> <span class="n">pcb</span> <span class="mi">1</span> <span class="o">&gt;&gt;</span>
</span><span class="line">      <span class="n">writefn</span> <span class="n">pcb</span> <span class="n">wfn</span> <span class="n">data</span>
</span><span class="line">    <span class="o">|</span> <span class="n">av_len</span> <span class="k">when</span> <span class="n">av_len</span> <span class="o">&lt;</span> <span class="n">len</span> <span class="o">-&gt;</span> 
</span><span class="line">      <span class="k">let</span> <span class="n">first_bit</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">sub</span> <span class="n">data</span> <span class="mi">0</span> <span class="n">av_len</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">remaining_bit</span> <span class="o">=</span> <span class="nn">Cstruct</span><span class="p">.</span><span class="n">sub</span> <span class="n">data</span> <span class="n">av_len</span> <span class="o">(</span><span class="n">len</span> <span class="o">-</span> <span class="n">av_len</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">      <span class="n">writefn</span> <span class="n">pcb</span> <span class="n">wfn</span> <span class="n">first_bit</span>  <span class="o">&gt;&gt;</span>
</span><span class="line">      <span class="n">writefn</span> <span class="n">pcb</span> <span class="n">wfn</span> <span class="n">remaining_bit</span>
</span><span class="line">    <span class="o">|</span> <span class="n">av_len</span> <span class="o">-&gt;</span> 
</span><span class="line">      <span class="n">wfn</span> <span class="o">[</span><span class="n">data</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>My original fix changed <code>mirage-net-xen</code> to wait until the first buffer had been read before sending the second one.
That fixed the retransmissions, but all the waiting meant I still only got 56 KB/s.
Instead, I changed <code>writefn</code> to copy <code>remaining_bit</code> into a new IO page, and with that I got 495 KB/s.</p>
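<p>To illustrate the difference between sharing and copying, here is a minimal, self-contained sketch using only the standard library. It is not the actual <code>mirage-tcpip</code> patch: the <code>view</code> type is a hypothetical stand-in for <code>Cstruct.t</code>, <code>Bytes</code> stands in for IO pages, and <code>split_copying</code> with its <code>mtu</code> parameter is a name invented for this example.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical stand-in for Cstruct.t: a window onto a backing buffer. *)
type view = { buf : Bytes.t; off : int; len : int }

let of_string s = { buf = Bytes.of_string s; off = 0; len = String.length s }

let to_string v = Bytes.sub_string v.buf v.off v.len

(* Like Cstruct.sub: the result shares the backing buffer with [v]. *)
let sub v off len =
  if off &lt; 0 || len &lt; 0 || off + len &gt; v.len then invalid_arg "sub";
  { v with off = v.off + off; len }

(* Copy a view into freshly allocated storage
   (the real fix copied into a new IO page instead). *)
let copy v = { buf = Bytes.sub v.buf v.off v.len; off = 0; len = v.len }

(* Split [data] into packets of at most [mtu] bytes. The first packet keeps
   the original buffer; the tail is copied before it is split further, so
   no two packets end up sharing a backing buffer. *)
let rec split_copying ~mtu data =
  if mtu &lt;= 0 then invalid_arg "split_copying";
  if data.len &lt;= mtu then [ data ]
  else
    let first = sub data 0 mtu in
    let rest = copy (sub data mtu (data.len - mtu)) in
    first :: split_copying ~mtu rest
</code></pre></td></tr></tbody></table></div></figure>
<p>For example, <code>split_copying ~mtu:4 (of_string "abcdefghij")</code> yields three packets holding <code>abcd</code>, <code>efgh</code> and <code>ij</code>, each backed by a different buffer. The sketch copies the tail on every split, which is simple but does extra work; it is only meant to show why copying avoids two packets living in the same page.</p>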
<p>Replacing the filesystem read with a simple <code>String.create</code> of the same length, I got 3.9 MB/s, showing that
the FAT filesystem was once again the limiting factor.</p>
<h3 id="adding-a-block-cache">Adding a block cache</h3>
<p>I tried adding a <a href="https://github.com/0install/0repo-queue/blob/fat-caching/block_cache.ml">block cache</a> layer between <code>mirage-block-xen</code> and <code>fat-filesystem</code>, like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">module</span> <span class="nc">Main</span> <span class="o">(</span><span class="nc">C</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">CONSOLE</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">B</span> <span class="o">:</span> <span class="nn">V1_LWT</span><span class="p">.</span><span class="nc">BLOCK</span><span class="o">)</span>
</span><span class="line">            <span class="o">(</span><span class="nc">H</span> <span class="o">:</span> <span class="nn">Cohttp_lwt</span><span class="p">.</span><span class="nc">Server</span><span class="o">)</span> <span class="o">=</span> <span class="k">struct</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">BC</span> <span class="o">=</span> <span class="nn">Block_cache</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">B</span><span class="o">)</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">F</span> <span class="o">=</span> <span class="nn">Fat</span><span class="p">.</span><span class="nn">Fs</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">BC</span><span class="o">)(</span><span class="nc">Io_page</span><span class="o">)</span>
</span><span class="line">  <span class="k">module</span> <span class="nc">Q</span> <span class="o">=</span> <span class="nn">Upload_queue</span><span class="p">.</span><span class="nc">Make</span><span class="o">(</span><span class="nc">F</span><span class="o">)</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">mem_cache_size</span> <span class="o">=</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span>  <span class="c">(* 1 MB *)</span>
</span><span class="line">
</span><span class="line">  <span class="k">let</span> <span class="n">start</span> <span class="n">c</span> <span class="n">b</span> <span class="n">http</span> <span class="o">=</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">write</span> <span class="o">:=</span> <span class="nn">C</span><span class="p">.</span><span class="n">log_s</span> <span class="n">c</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Log</span><span class="p">.</span><span class="n">info</span> <span class="s2">&quot;start in queue service&quot;</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">
</span><span class="line">    <span class="nn">BC</span><span class="p">.</span><span class="n">connect</span> <span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">mem_cache_size</span><span class="o">)</span> <span class="o">&gt;&gt;=</span> <span class="k">function</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Error</span> <span class="o">_</span> <span class="o">-&gt;</span> <span class="n">failwith</span> <span class="s2">&quot;BC.connect&quot;</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Ok</span> <span class="n">bc</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">F</span><span class="p">.</span><span class="n">connect</span> <span class="n">bc</span> <span class="o">&gt;&gt;=</span> <span class="k">function</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Error</span> <span class="o">_</span> <span class="o">-&gt;</span> <span class="n">failwith</span> <span class="s2">&quot;F.connect&quot;</span>
</span><span class="line">    <span class="o">|</span> <span class="o">`</span><span class="nc">Ok</span> <span class="n">fs</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">Q</span><span class="p">.</span><span class="n">create</span> <span class="n">fs</span> <span class="o">&gt;&gt;=</span> <span class="k">fun</span> <span class="n">q</span> <span class="o">-&gt;</span>
</span><span class="line"><span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>With this in place, upload speed remains at 76 KB/s, but the download speed increases to 1 MB/s (for a 20 MB file, which therefore doesn't fit in the cache).
This suggests that the FAT filesystem is reading the same disk sectors many times.
Even with the memory cache enlarged to cover the whole file, the download speed only increases to 1.3 MB/s,
so the FAT code must be doing some inefficient calculations too.</p>
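<p>One way to check a suspicion like this is to count how many reads actually reach the device. The following is a minimal sketch of the idea, not the real <code>block_cache.ml</code>: <code>with_cache</code>, the <code>cached</code> record and the <code>read_sector</code> parameter are hypothetical names, and unlike the real cache this one is unbounded.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical wrapper: a cached read function plus a counter of the
   reads that missed the cache and had to go to the device. *)
type 'a cached = {
  read : int -&gt; 'a;
  device_reads : unit -&gt; int;
}

(* Wrap [read_sector] with an unbounded in-memory cache keyed by sector
   number. A real cache would also need a size limit and eviction. *)
let with_cache (read_sector : int -&gt; 'a) : 'a cached =
  let cache = Hashtbl.create 64 in
  let misses = ref 0 in
  let read sector =
    match Hashtbl.find_opt cache sector with
    | Some data -&gt; data
    | None -&gt;
      incr misses;
      let data = read_sector sector in
      Hashtbl.add cache sector data;
      data
  in
  { read; device_reads = (fun () -&gt; !misses) }
</code></pre></td></tr></tbody></table></div></figure>
<p>If a caller requests many more sectors than <code>device_reads ()</code> reports, it is re-reading sectors it has already seen, which is the pattern the download numbers above hint at.</p>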
<h3 id="replacing-fat">Replacing FAT</h3>
<p>Since most of my problems seemed to be coming from using FAT, I decided to try a new approach.
I removed all the FAT code and the block cache and changed <code>upload_queue.ml</code> to write directly to the block
device.
With that (no caching), I get:</p>
<table class="table"><tbody><tr><td> Upload speed   </td><td> 2.27 MB/s</td></tr><tr><td> Download speed </td><td> 2.46 MB/s</td></tr></tbody></table><p>That's not too bad. It's faster than my Internet connection, which means that the unikernel is no longer the limiting factor.</p>
<p>Here's the new version: <a href="https://github.com/0install/0repo-queue/blob/master/upload_queue.ml"><code>upload_queue.ml</code></a>.
The big simplification comes from knowing that the queue will spend most of its time empty (another good reason to use a small VM for it).
The code has a <code>next_free_sector</code> which it advances every time an upload starts.
When the queue becomes empty and there are no uploads in progress this variable is reset back to sector 1 (sector 0 holds the index).
This does mean that we may report disk full errors to uploaders even when there is free space on the disk, but this won't happen in typical usage because the repository downloads things as soon as they're uploaded (if it does happen, it just means uploaders have to wait a couple of minutes until the repository empties the queue).</p>
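<p>The allocation scheme described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in <code>upload_queue.ml</code>: apart from <code>next_free_sector</code>, every name here (<code>queue</code>, <code>start_upload</code>, <code>finish_upload</code>, <code>remove_head</code>, the <code>sectors</code> argument) is hypothetical, and the sketch assumes the size of each upload is known when it starts.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">type queue = {
  disk_sectors : int;              (* total sectors, including the index *)
  mutable next_free_sector : int;
  mutable queued : int;            (* complete uploads waiting to be fetched *)
  mutable in_progress : int;       (* uploads still being written *)
}

let first_data_sector = 1          (* sector 0 holds the index *)

let create ~disk_sectors =
  { disk_sectors; next_free_sector = first_data_sector;
    queued = 0; in_progress = 0 }

(* Reserve [sectors] sectors for a new upload by advancing the pointer.
   Space freed behind the pointer is not reused until the next reset. *)
let start_upload q ~sectors =
  if q.next_free_sector + sectors &gt; q.disk_sectors then Error `Disk_full
  else begin
    let start = q.next_free_sector in
    q.next_free_sector &lt;- start + sectors;
    q.in_progress &lt;- q.in_progress + 1;
    Ok start
  end

(* An upload finished successfully and joins the queue. *)
let finish_upload q =
  q.in_progress &lt;- q.in_progress - 1;
  q.queued &lt;- q.queued + 1

(* The repository fetched the oldest item. Once nothing is queued or in
   progress, the whole disk can be reused from the first data sector. *)
let remove_head q =
  q.queued &lt;- q.queued - 1;
  if q.queued = 0 &amp;&amp; q.in_progress = 0 then
    q.next_free_sector &lt;- first_data_sector
</code></pre></td></tr></tbody></table></div></figure>
<p>With a 10-sector disk, two 4-sector uploads succeed and a third gets <code>Error `Disk_full</code>. It keeps failing after the first item is removed, because the pointer only resets once the queue is completely empty. That is the "disk full even though there is free space" case mentioned above.</p>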
<p>Managing the block device manually brought a few more advantages over FAT:</p>
<ol>
<li>No need to generate random file names for the uploads.
</li>
<li>No need to delete incomplete uploads (we only write the file's index entry to disk on success).
</li>
<li>The system should recover automatically from filesystem corruption because invalid entries can be detected reliably at boot time and discarded.
</li>
<li>Disk full errors are reported correctly.
</li>
<li>The queue ordering isn't lost on reboot.
</li>
</ol>
<h2 id="conclusions">Conclusions</h2>
<p>Modern operating systems are often extremely complex, but much of this is historical baggage which isn't needed on a modern system where you're running a single application as a VM under a hypervisor. Mirage allows you to create very small VMs which contain almost no C code. These VMs should be easier to write, more reliable and more secure.</p>
<p>Creating a bootable OCaml kernel is surprisingly easy, and from there adding support for extra devices is just a matter of pulling in the appropriate libraries. By programming against generic interfaces, you can create code that runs under Linux/Unix/OS X or as a virtual machine under Xen, and switch between configurations using the <code>mirage</code> tool.</p>
<p>Mirage is still very young, and I found many rough edges while writing my queuing service for 0install:</p>
<ul>
<li>While Linux provides fast, reliable filesystems as standard, Mirage currently only provides a basic FAT implementation.
</li>
<li>Linux provides caching as standard, while you have to implement this yourself on Mirage.
</li>
<li>Error reporting should be a big improvement over C's error codes, but getting friendly error messages from Mirage is currently difficult.
</li>
<li>The system has clearly been designed for high performance (the APIs generally write to user-provided buffers to avoid copying, much like C libraries do), but many areas have not yet been optimised.
</li>
<li>Buffers often have extra requirements (e.g. must be page-aligned, a single page, immutable, etc) which are not currently captured in the type system, and this can lead to run-time errors which would ideally have been detected at compile time.
</li>
</ul>
<p>However, there is <a href="http://openmirage.org/blog/announcing-mirage-20-release">a huge amount of work</a> happening on Mirage right now and it looks like all of these problems are being worked on.
If you're interested in low-level OS programming and don't want to mess about with C, Mirage is a lot of fun, and it can be useful for practical tasks already with a bit of effort.</p>
<p>There are still many areas I need to find out more about.
In particular, I want to try using the new <a href="http://openmirage.org/blog/introducing-ocaml-tls">pure-OCaml TLS stack</a> to secure the system, and the <a href="http://openmirage.org/blog/introducing-irmin">Irmin Git-like distributed, branchable storage</a> to provide the queue instead of writing it myself.
I hope to try those soon...</p>
]]></content>
  </entry>
</feed>
