<?xml version="1.0" encoding="UTF-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:base="https://roscidus.com/">
  <title>Category: ocaml | Thomas Leonard's blog</title>
  <link href="https://roscidus.com/blog/blog/categories/ocaml/atom.xml" rel="self"></link>
  <link href="https://roscidus.com/blog/"></link>
  <updated>2025-11-16T09:00:00+00:00</updated>
  <id>https://roscidus.com/blog/</id>
  <author>
    <name>Thomas Leonard</name>
  </author>
  <entry>
    <title type="html">Linux mode setting, from the comfort of OCaml</title>
    <link href="https://roscidus.com/blog/blog/2025/11/16/libdrm-ocaml/"></link>
    <updated>2025-11-16T09:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2025/11/16/libdrm-ocaml</id>
    <content type="html"><![CDATA[<p>Linux provides the KMS (Kernel Mode Setting) API to let applications query and configure display settings.
It's used by Wayland compositors and other programs that need to configure the hardware directly.
I found the C API a little verbose and hard to follow, so I made <a href="https://github.com/talex5/libdrm-ocaml">libdrm-ocaml</a>,
which lets us run commands interactively in a REPL.</p>
<!-- more -->
<p>We'll start by discovering what hardware is available and how it's currently configured,
then configure a monitor to display a simple bitmap, and then finally render a 3D animation.
The post should be a useful introduction to KMS even if you don't know OCaml.</p>
<p>(This post also appeared on <a href="https://news.ycombinator.com/item?id=45947822">Hacker News</a>.)</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#running-it-yourself">Running it yourself</a>
</li>
<li><a href="#querying-the-current-state">Querying the current state</a>
<ul>
<li><a href="#finding-devices">Finding devices</a>
</li>
<li><a href="#listing-resources">Listing resources</a>
</li>
<li><a href="#connectors">Connectors</a>
</li>
<li><a href="#modes">Modes</a>
</li>
<li><a href="#properties">Properties</a>
</li>
<li><a href="#encoders">Encoders</a>
</li>
<li><a href="#crt-controllers">CRT Controllers</a>
</li>
<li><a href="#framebuffers">Framebuffers</a>
</li>
<li><a href="#crtc-planes">CRTC planes</a>
</li>
<li><a href="#expanded-resources-diagram">Expanded resources diagram</a>
</li>
</ul>
</li>
<li><a href="#making-changes">Making changes</a>
<ul>
<li><a href="#non-atomic-mode-setting">Non-atomic mode setting</a>
</li>
<li><a href="#dumb-buffers">Dumb buffers</a>
</li>
<li><a href="#atomic-mode-setting">Atomic mode setting</a>
</li>
</ul>
</li>
<li><a href="#d-rendering">3D rendering</a>
</li>
<li><a href="#linux-vts">Linux VTs</a>
</li>
<li><a href="#debugging">Debugging</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="running-it-yourself">Running it yourself</h2>
<p>If you want to follow along, you'll need to install <a href="https://github.com/talex5/libdrm-ocaml">libdrm-ocaml</a> and an interactive REPL like <a href="https://github.com/ocaml-community/utop">utop</a>.
With Nix, you can set everything up like this:</p>
<pre><code>git clone https://github.com/talex5/libdrm-ocaml
cd libdrm-ocaml
nix develop
dune utop
</code></pre>
<p>You should see a <code>utop #</code> prompt, where you can enter OCaml expressions.
Use <code>;;</code> to tell the REPL you've finished typing and it's time to evaluate, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="mi">1</span><span class="o">+</span><span class="mi">1</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="kt">int</span> <span class="o">=</span> <span class="mi">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Alternatively, you can install things using <a href="https://opam.ocaml.org/">opam</a> (OCaml's package manager):</p>
<pre><code>opam install libdrm utop
utop
</code></pre>
<p>Then, at the utop prompt enter <code>#require &quot;libdrm&quot;;;</code> (including the leading <code>#</code>).</p>
<h2 id="querying-the-current-state">Querying the current state</h2>
<p>Before changing anything, we'll start by discovering what hardware is available.</p>
<p>I'll introduce the API as we go along, but you can check the <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/index.html">API reference docs</a>
if you want more information.</p>
<h3 id="finding-devices">Finding devices</h3>
<p>To list available graphics devices:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="n">list</span> <span class="bp">()</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="nn">Info</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line"><span class="o">[{</span><span class="n">primary_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/card0&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="n">render_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/renderD128&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="n">info</span> <span class="o">=</span> <span class="nc">PCI</span> <span class="o">{</span><span class="n">bus</span> <span class="o">=</span> <span class="o">{</span><span class="n">domain</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">bus</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span> <span class="n">dev</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">func</span> <span class="o">=</span> <span class="mi">0</span><span class="o">};</span>
</span><span class="line">              <span class="n">dev</span> <span class="o">=</span> <span class="o">{</span><span class="n">vendor_id</span> <span class="o">=</span> <span class="mh">0x1002</span><span class="o">;</span>
</span><span class="line">                     <span class="n">device_id</span> <span class="o">=</span> <span class="mh">0x67ff</span><span class="o">;</span>
</span><span class="line">                     <span class="n">subvendor_id</span> <span class="o">=</span> <span class="mh">0x1458</span><span class="o">;</span>
</span><span class="line">                     <span class="n">subdevice_id</span> <span class="o">=</span> <span class="mh">0x230b</span><span class="o">;</span>
</span><span class="line">                     <span class="n">revision_id</span> <span class="o">=</span> <span class="mh">0xff</span><span class="o">}}}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>libdrm scans the <code>/dev/dri/</code> directory looking for devices.
It uses <code>stat</code> to find the device major and minor numbers and uses the virtual <code>/sys</code> filesystem to get information about each one.
This is a PCI device, and the information corresponds to the values from <code>lspci</code>, e.g.</p>
<pre><code>$ lspci -nns 0:1:0.0
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI]
  Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev ff)
</code></pre>
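<p>A rough sketch of that lookup (an illustration, not libdrm's actual code) might look like this.
It decodes the major and minor numbers from the <code>st_rdev</code> field returned by <code>Unix.stat</code>.
The simplified bit layout and the <code>/sys/dev/char/MAJOR:MINOR</code> path are assumptions based on the usual Linux conventions,
and all the function names are made up for the example.
It needs the <code>unix</code> library (<code>#require &quot;unix&quot;;;</code> in utop):</p>
<pre><code class="ocaml">(* Decode a Linux device number (assumed layout: 12-bit major, 20-bit minor). *)
let major rdev = (rdev lsr 8) land 0xfff
let minor rdev = (rdev land 0xff) lor ((rdev lsr 12) land 0xfff00)

(* Where sysfs is assumed to describe a character device. *)
let sysfs_path rdev =
  Printf.sprintf &quot;/sys/dev/char/%d:%d&quot; (major rdev) (minor rdev)

(* Format a PCI address in the long form used by sysfs. *)
let pci_address ~domain ~bus ~dev ~func =
  Printf.sprintf &quot;%04x:%02x:%02x.%d&quot; domain bus dev func

(* The sysfs path for [path], or None if it is missing or is not a
   character device. *)
let sysfs_of_node path =
  match Unix.stat path with
  | st when st.Unix.st_kind = Unix.S_CHR -&gt; Some (sysfs_path st.Unix.st_rdev)
  | _ -&gt; None
  | exception Unix.Unix_error _ -&gt; None
</code></pre>
<p>For example, <code>pci_address ~domain:0 ~bus:1 ~dev:0 ~func:0</code> gives <code>0000:01:00.0</code>,
the long form of the <code>01:00.0</code> address that <code>lspci</code> printed.</p>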
<p>Each graphics device can have a <em>primary</em> and a <em>render</em> node.
The primary node gives full access to the device, including configuring monitors,
while the render node just allows applications to render scenes to memory.
In the <a href="https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan/">last post</a> I was using the render node to create a 3D image,
and then sending it to the Wayland compositor for display.
This time we'll be doing the display ourselves, so we need to open the primary node:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">dev</span> <span class="o">=</span> <span class="nn">Unix</span><span class="p">.</span><span class="n">openfile</span> <span class="s2">&quot;/dev/dri/card0&quot;</span> <span class="o">[</span><span class="nc">O_CLOEXEC</span><span class="o">;</span> <span class="nc">O_RDWR</span><span class="o">]</span> <span class="mi">0</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">dev</span> <span class="o">:</span> <span class="nn">Unix</span><span class="p">.</span><span class="n">file_descr</span> <span class="o">=</span> <span class="o">&lt;</span><span class="n">abstr</span><span class="o">&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>To check the driver version:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="nn">Version</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="nn">Version</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="n">version</span> <span class="o">=</span> <span class="mi">3</span><span class="o">.</span><span class="mi">61</span><span class="o">.</span><span class="mi">0</span><span class="o">;</span> <span class="n">name</span> <span class="o">=</span> <span class="s2">&quot;amdgpu&quot;</span><span class="o">;</span> <span class="n">date</span> <span class="o">=</span> <span class="s2">&quot;0&quot;</span><span class="o">;</span> <span class="n">desc</span> <span class="o">=</span> <span class="s2">&quot;AMD GPU&quot;</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If you're familiar with the C API, this corresponds to the <code>drmGetVersion</code> function,
and <code>Drm.Device.list</code> corresponds to <code>drmGetDevices2</code>;
I reorganised things a bit to make better use of OCaml's modules.</p>
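<p>The session above leaves <code>dev</code> open, which is fine in a REPL.
In a standalone program you would probably want the descriptor closed once you're done with it.
A minimal sketch, using only the standard <code>Unix</code> library
(<code>with_device</code> is a hypothetical helper, not part of libdrm-ocaml):</p>
<pre><code class="ocaml">(* Open [path], pass the descriptor to [f], and always close it
   afterwards, even if [f] raises an exception. *)
let with_device path f =
  let fd = Unix.openfile path [Unix.O_CLOEXEC; Unix.O_RDWR] 0 in
  Fun.protect ~finally:(fun () -&gt; Unix.close fd) (fun () -&gt; f fd)
</code></pre>
<p>With that, something like <code>with_device &quot;/dev/dri/card0&quot; Drm.Device.Version.get</code>
should fetch the driver version and then close the device again.</p>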
<h3 id="listing-resources">Listing resources</h3>
<p>Let's see what resources we've got to play with:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">resources</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Resources</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">resources</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Resources</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">fbs</span> <span class="o">=</span> <span class="bp">[]</span><span class="o">;</span>
</span><span class="line">   <span class="n">crtcs</span> <span class="o">=</span> <span class="o">[</span><span class="mi">57</span><span class="o">;</span> <span class="mi">60</span><span class="o">;</span> <span class="mi">63</span><span class="o">;</span> <span class="mi">66</span><span class="o">;</span> <span class="mi">69</span><span class="o">];</span>
</span><span class="line">   <span class="n">connectors</span> <span class="o">=</span> <span class="o">[</span><span class="mi">71</span><span class="o">;</span> <span class="mi">78</span><span class="o">;</span> <span class="mi">84</span><span class="o">];</span>
</span><span class="line">   <span class="n">encoders</span> <span class="o">=</span> <span class="o">[</span><span class="mi">70</span><span class="o">;</span> <span class="mi">76</span><span class="o">;</span> <span class="mi">83</span><span class="o">;</span> <span class="mi">86</span><span class="o">;</span> <span class="mi">87</span><span class="o">;</span> <span class="mi">88</span><span class="o">;</span> <span class="mi">89</span><span class="o">;</span> <span class="mi">90</span><span class="o">];</span>
</span><span class="line">   <span class="n">min_width</span><span class="o">,</span><span class="n">max_width</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">16384</span><span class="o">;</span>
</span><span class="line">   <span class="n">min_height</span><span class="o">,</span><span class="n">max_height</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">16384</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: The Kernel Mode Setting functions are in the <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/Kms/index.html">Drm.Kms</a> module.
The C API calls these functions <code>drmMode*</code>, but I found that confusing as
e.g. <code>drmModeGetResources</code> sounds like you're asking for the resources of a mode.</p>
<p>A <em>CRTC</em> is a CRT Controller, and typically controls a single monitor
(known as a <a href="https://en.wikipedia.org/wiki/Cathode_ray_tube">Cathode Ray Tube</a> for historical reasons).
<em>Framebuffers</em> provide image data to a CRTC (we create framebuffers as needed).
<em>Connectors</em> correspond to physical connectors (e.g. where you plug in a monitor cable).
An <em>Encoder</em> encodes data from the CRTC for a particular connector.</p>
<p><a href="/blog/images/libdrm/arch-simple.svg"><span class="caption-wrapper center"><img src="/blog/images/libdrm/arch-simple.svg" title="Resources diagram (simplified)" class="caption"/><span class="caption-text">Resources diagram (simplified)</span></span></a></p>
<h3 id="connectors">Connectors</h3>
<p>To save a bit of typing, I'll create an alias for the <code>Drm.Kms</code> module:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">module</span> <span class="nc">K</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nc">Kms</span><span class="o">;;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You could also <code>open Drm.Kms</code> to avoid needing any prefix, but I'll keep using <code>K</code> for clarity.</p>
<p>To get details for the first connector (the head of the list):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">resources</span><span class="o">.</span><span class="n">connectors</span><span class="o">);;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span>
</span><span class="line"> <span class="n">connector_type</span> <span class="o">=</span> <span class="nc">DisplayPort</span><span class="o">;</span>
</span><span class="line"> <span class="n">connector_type_id</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span>
</span><span class="line"> <span class="n">connection</span> <span class="o">=</span> <span class="nc">Connected</span><span class="o">;</span>
</span><span class="line"> <span class="n">mm_width</span><span class="o">,</span><span class="n">mm_height</span> <span class="o">=</span> <span class="mi">700</span><span class="o">,</span><span class="mi">390</span><span class="o">;</span>
</span><span class="line"> <span class="n">subpixel</span> <span class="o">=</span> <span class="nc">Unknown</span><span class="o">;</span>
</span><span class="line"> <span class="n">modes</span> <span class="o">=</span> <span class="o">[</span><span class="mi">3840</span><span class="n">x2160</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">30</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">29</span><span class="o">.</span><span class="mi">97</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="mi">2560</span><span class="n">x1440</span> <span class="mi">59</span><span class="o">.</span><span class="mi">95</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line">          <span class="o">...];</span>
</span><span class="line"> <span class="n">props</span> <span class="o">=</span> <span class="o">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">77</span><span class="o">;</span> <span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">5</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">6</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">4</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">34</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">35</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">36</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">37</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">72</span><span class="o">:</span><span class="mi">8</span><span class="o">;</span> <span class="mi">73</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> 
</span><span class="line">          <span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">74</span><span class="o">:</span><span class="mi">0</span><span class="o">;</span> <span class="mi">75</span><span class="o">:</span><span class="mi">15</span><span class="o">];</span>
</span><span class="line"> <span class="n">encoder_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">70</span><span class="o">;</span>
</span><span class="line"> <span class="n">encoders</span> <span class="o">=</span> <span class="o">[</span><span class="mi">70</span><span class="o">]}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This is DisplayPort connector 1 (usually called <code>DP-1</code>) and it's currently <code>Connected</code>.
The connector also says which modes are available on the connected monitor.</p>
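<p>The <code>mm_width,mm_height</code> field is the physical size the monitor reports, in millimetres.
Combined with a mode, it gives a rough idea of the pixel density.
Here is a small sketch of the arithmetic, using the numbers from the output above
(the function names are just for illustration):</p>
<pre><code class="ocaml">let mm_per_inch = 25.4

(* Pixels per inch along one axis. *)
let dpi ~pixels ~mm = float_of_int pixels *. mm_per_inch /. float_of_int mm

(* Diagonal size of the panel in inches. *)
let diagonal_inches ~mm_width ~mm_height =
  hypot (float_of_int mm_width) (float_of_int mm_height) /. mm_per_inch

(* Prints &quot;139 dpi, 31.5 inch diagonal&quot; *)
let () =
  Printf.printf &quot;%.0f dpi, %.1f inch diagonal\n&quot;
    (dpi ~pixels:3840 ~mm:700)
    (diagonal_inches ~mm_width:700 ~mm_height:390)
</code></pre>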
<p>I was lucky in that the first connector was the one I'm using,
but really we should get all the connectors and filter them to find the connected ones.
<code>List.map</code> can be used to run <code>get</code> on each of them:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">connectors</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">)</span> <span class="n">resources</span><span class="o">.</span><span class="n">connectors</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">connectors</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span> <span class="o">...};</span>
</span><span class="line">   <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">78</span><span class="o">;</span> <span class="c">(* HDMI-A-1 *)</span> <span class="o">...};</span>
</span><span class="line">   <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">84</span><span class="o">;</span> <span class="c">(* DVI-D-1 *)</span> <span class="o">...}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then to filter:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">is_connected</span> <span class="o">(</span><span class="n">c</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">=</span> <span class="o">(</span><span class="n">c</span><span class="o">.</span><span class="n">connection</span> <span class="o">=</span> <span class="nc">Connected</span><span class="o">);;</span>
</span><span class="line"><span class="k">val</span> <span class="n">is_connected</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">bool</span> <span class="o">=</span> <span class="o">&lt;</span><span class="k">fun</span><span class="o">&gt;</span>
</span><span class="line">
</span><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">connected</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="n">is_connected</span> <span class="n">connectors</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">connected</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span> <span class="o">...}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We'll investigate <code>c</code>, the first connected one:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">c</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">connected</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">c</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="c">(* DP-1 *)</span> <span class="o">...}</span>
</span></code></pre></td></tr></tbody></table></div></figure><h4 id="a-note-on-ids">A note on IDs</h4>
<p>In the libdrm C API, IDs are just integers.
To avoid mix-ups, I made them distinct types in the OCaml API.
For example, if you try to use an encoder ID as a connector ID:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">resources</span><span class="o">.</span><span class="n">encoders</span><span class="o">);;</span>
</span><span class="line">                           <span class="o">^^^^^^^^^^^^^^^^^^^^^^^^^^^^</span>
</span><span class="line"><span class="nc">Error</span><span class="o">:</span> <span class="nc">This</span> <span class="n">expression</span> <span class="n">has</span> <span class="k">type</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Encoder</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">       <span class="n">but</span> <span class="n">an</span> <span class="n">expression</span> <span class="n">was</span> <span class="n">expected</span> <span class="k">of</span> <span class="k">type</span>
</span><span class="line">         <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span>
</span><span class="line">       <span class="nc">These</span> <span class="n">two</span> <span class="n">variant</span> <span class="n">types</span> <span class="n">have</span> <span class="n">no</span> <span class="n">intersection</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Normally this is what you want, but for interactive use it's annoying that you can't just pass a plain integer, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="mi">71</span><span class="o">;;</span>
</span><span class="line">                           <span class="o">^^</span>
</span><span class="line"><span class="nc">Error</span><span class="o">:</span> <span class="nc">The</span> <span class="n">constant</span> <span class="mi">71</span> <span class="n">has</span> <span class="k">type</span> <span class="kt">int</span> <span class="n">but</span> <span class="n">an</span> <span class="n">expression</span> <span class="n">was</span> <span class="n">expected</span> <span class="k">of</span> <span class="k">type</span>
</span><span class="line">         <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>You can get any kind of ID with <code>Drm.Id.of_int</code> (e.g. <code>K.Connector.get dev (Drm.Id.of_int 71)</code>),
but that's still a bit verbose, so you might prefer to (re)define a prefix operator for it, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="o">(</span> <span class="o">!</span> <span class="o">)</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">of_int</span><span class="o">;;</span>
</span><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">!</span><span class="mi">71</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span> <span class="o">{</span><span class="n">connector_id</span> <span class="o">=</span> <span class="mi">71</span><span class="o">;</span> <span class="o">...}</span>
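</span></code></pre></td></tr></tbody></table></div></figure><p>(note: <code>!</code> is the only single-character prefix operator available in OCaml. Redefining it shadows the standard <code>ref</code> dereference operator for the rest of the session.)</p>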
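<p>The type error above comes from tagging each ID with the kind of resource it names.
Here is a minimal sketch of that idea using a phantom type parameter.
The <code>Id</code> module and the <code>describe_connector</code> and <code>describe_crtc</code> functions are stand-ins invented for illustration, and are not taken from libdrm-ocaml's source:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical stand-in for an ID module; not the real library code. *)
module Id : sig
  type 'a t                  (* the parameter is phantom: it never appears in the value *)
  val of_int : int -&gt; 'a t   (* type inference picks the tag at each use *)
  val to_int : 'a t -&gt; int
end = struct
  type 'a t = int
  let of_int x = x
  let to_int x = x
end

type connector_id = [ `Connector ] Id.t
type crtc_id = [ `Crtc ] Id.t

(* Illustrative functions that each accept only one kind of ID. *)
let describe_connector (id : connector_id) =
  Printf.sprintf "connector %d" (Id.to_int id)

let describe_crtc (id : crtc_id) =
  Printf.sprintf "CRTC %d" (Id.to_int id)

(* Redefine the prefix operator, as in the session above. *)
let ( ! ) = Id.of_int

let a = describe_connector !71   (* !71 is inferred as a connector ID *)
let b = describe_crtc !57        (* !57 is inferred as a CRTC ID *)

(* A value whose tag is already fixed cannot be passed to the wrong function.
   For example, given [let c : connector_id = !71], the call
   [describe_crtc c] is rejected by the type checker. *)
</code></pre></td></tr></tbody></table></div></figure>
<p>Because the signature hides the fact that an ID is an integer, the only way to make one from a number is <code>of_int</code>, and the tag is then fixed by whichever function first consumes it.</p>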
<h3 id="modes">Modes</h3>
<p>Modes are shown in abbreviated form in the connector output.
To see the full list:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="n">c</span><span class="o">.</span><span class="n">modes</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Mode_info</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line"><span class="o">[</span><span class="mi">3840</span><span class="n">x2160</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">30</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">29</span><span class="o">.</span><span class="mi">97</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">2560</span><span class="n">x1440</span> <span class="mi">59</span><span class="o">.</span><span class="mi">95</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1920</span><span class="n">x1200</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1920</span><span class="n">x1080</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1920</span><span class="n">x1080</span> <span class="mi">59</span><span class="o">.</span><span class="mi">94</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1600</span><span class="n">x1200</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1680</span><span class="n">x1050</span> <span class="mi">59</span><span class="o">.</span><span class="mi">95</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1600</span><span class="n">x900</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x1024</span> <span class="mi">75</span><span class="o">.</span><span class="mi">02</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x1024</span> <span class="mi">60</span><span class="o">.</span><span class="mi">02</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1440</span><span class="n">x900</span> <span class="mi">59</span><span class="o">.</span><span class="mi">89</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x800</span> <span class="mi">59</span><span class="o">.</span><span class="mi">81</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1152</span><span class="n">x864</span> <span class="mi">75</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1280</span><span class="n">x720</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">1280</span><span class="n">x720</span> <span class="mi">59</span><span class="o">.</span><span class="mi">94</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1024</span><span class="n">x768</span> <span class="mi">75</span><span class="o">.</span><span class="mi">03</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1024</span><span class="n">x768</span> <span class="mi">70</span><span class="o">.</span><span class="mi">07</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">1024</span><span class="n">x768</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">832</span><span class="n">x624</span> <span class="mi">74</span><span class="o">.</span><span class="mi">55</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">800</span><span class="n">x600</span> <span class="mi">75</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">800</span><span class="n">x600</span> <span class="mi">72</span><span class="o">.</span><span class="mi">19</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">800</span><span class="n">x600</span> <span class="mi">60</span><span class="o">.</span><span class="mi">32</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">800</span><span class="n">x600</span> <span class="mi">56</span><span class="o">.</span><span class="mi">25</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">75</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">72</span><span class="o">.</span><span class="mi">81</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">66</span><span class="o">.</span><span class="mi">67</span><span class="n">Hz</span><span class="o">;</span>
</span><span class="line"> <span class="mi">640</span><span class="n">x480</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">640</span><span class="n">x480</span> <span class="mi">59</span><span class="o">.</span><span class="mi">94</span><span class="n">Hz</span><span class="o">;</span> <span class="mi">720</span><span class="n">x400</span> <span class="mi">70</span><span class="o">.</span><span class="mi">08</span><span class="n">Hz</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: I annotated various pretty-printer functions with <code>[@@ocaml.toplevel_printer]</code>,
which causes utop to use them by default to display values of the corresponding type.
For example, showing a list of modes uses this short summary form.
Displaying an individual mode shows all the information.
Here's the first mode:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">c</span><span class="o">.</span><span class="n">modes</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Mode_info</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="n">name</span> <span class="o">=</span> <span class="s2">&quot;3840x2160&quot;</span><span class="o">;</span>
</span><span class="line"> <span class="n">typ</span> <span class="o">=</span> <span class="n">preferred</span><span class="o">+</span><span class="n">driver</span><span class="o">;</span>
</span><span class="line"> <span class="n">flags</span> <span class="o">=</span> <span class="n">phsync</span><span class="o">+</span><span class="n">nvsync</span><span class="o">;</span>
</span><span class="line"> <span class="n">stereo_mode</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line"> <span class="n">aspect_ratio</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line"> <span class="n">clock</span> <span class="o">=</span> <span class="mi">533250</span><span class="o">;</span>
</span><span class="line"> <span class="n">hdisplay</span><span class="o">,</span><span class="n">vdisplay</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">,</span><span class="mi">2160</span><span class="o">;</span>
</span><span class="line"> <span class="n">hsync_start</span> <span class="o">=</span> <span class="mi">3888</span><span class="o">;</span>
</span><span class="line"> <span class="n">hsync_end</span> <span class="o">=</span> <span class="mi">3920</span><span class="o">;</span>
</span><span class="line"> <span class="n">htotal</span> <span class="o">=</span> <span class="mi">4000</span><span class="o">;</span>
</span><span class="line"> <span class="n">hskew</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="n">vsync_start</span> <span class="o">=</span> <span class="mi">2163</span><span class="o">;</span>
</span><span class="line"> <span class="n">vsync_end</span> <span class="o">=</span> <span class="mi">2168</span><span class="o">;</span>
</span><span class="line"> <span class="n">vtotal</span> <span class="o">=</span> <span class="mi">2222</span><span class="o">;</span>
</span><span class="line"> <span class="n">vscan</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="n">vrefresh</span> <span class="o">=</span> <span class="mi">60</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="properties">Properties</h3>
<p>Some resources can also have extra <a href="https://drmdb.emersion.fr/properties">properties</a>.
Use <code>get_properties</code> to fetch them:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get_properties</span> <span class="n">dev</span> <span class="n">c</span><span class="o">.</span><span class="n">connector_id</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="nc">EDID</span> <span class="o">=</span> <span class="mi">92</span><span class="o">;</span> <span class="nc">DPMS</span> <span class="o">=</span> <span class="nc">On</span><span class="o">;</span> <span class="nc">TILE</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">link</span><span class="o">-</span><span class="n">status</span> <span class="o">=</span> <span class="nc">Good</span><span class="o">;</span> <span class="n">non</span><span class="o">-</span><span class="n">desktop</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="nc">HDR_OUTPUT_METADATA</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">scaling</span> <span class="n">mode</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">underscan</span> <span class="o">=</span> <span class="n">off</span><span class="o">;</span>
</span><span class="line"> <span class="n">underscan</span> <span class="n">hborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">underscan</span> <span class="n">vborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">max</span> <span class="n">bpc</span> <span class="o">=</span> <span class="mi">8</span><span class="o">;</span>
</span><span class="line"> <span class="nc">Colorspace</span> <span class="o">=</span> <span class="nc">Default</span><span class="o">;</span> <span class="n">vrr_capable</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">subconnector</span> <span class="o">=</span> <span class="nc">Native</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Linux only returns a subset of the properties until you enable the <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/Client_cap/index.html#val-atomic">atomic</a> feature.
Let's turn that on now:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Client_cap</span><span class="p">.</span><span class="o">(</span><span class="n">set</span> <span class="n">atomic</span><span class="o">)</span> <span class="n">dev</span> <span class="bp">true</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">(</span><span class="kt">unit</span><span class="o">,</span> <span class="nn">Unix</span><span class="p">.</span><span class="n">error</span><span class="o">)</span> <span class="n">result</span> <span class="o">=</span> <span class="nc">Ok</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>(<code>Module.(expr)</code> is a short-hand that brings all of <code>Module</code>'s symbols into scope for <code>expr</code>,
so we don't have to repeat the module name for both <code>set</code> and <code>atomic</code>)</p>
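<p>To see the local-open syntax in isolation, here is a small self-contained sketch.
The <code>Cap</code> module is a made-up stand-in, loosely shaped like <code>Client_cap</code>, and its <code>set</code> just returns a string rather than talking to a device:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Hypothetical module for illustration; not the real Drm.Client_cap. *)
module Cap = struct
  type t = Atomic | Universal_planes
  let atomic = Atomic
  let universal_planes = Universal_planes
  let set cap enabled =
    let name =
      match cap with
      | Atomic -&gt; "atomic"
      | Universal_planes -&gt; "universal_planes"
    in
    Printf.sprintf "%s=%b" name enabled
end

(* Fully qualified: the module name is repeated for each identifier. *)
let long_form = Cap.set Cap.atomic true

(* Local open: inside the parentheses, set and atomic resolve to Cap.set and Cap.atomic. *)
let short_form = Cap.(set atomic) true

(* The same thing spelled with let open. *)
let also_short = let open Cap in set atomic true
</code></pre></td></tr></tbody></table></div></figure>
<p>All three bindings evaluate the same call; the local open only changes how names are looked up inside the parentheses.</p>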
<p>And getting the properties again, we now have an extra <code>CRTC_ID</code>,
telling us which controller this connector is currently using:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">c_props</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">get_properties</span> <span class="n">dev</span> <span class="n">c</span><span class="o">.</span><span class="n">connector_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">c_props</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Connector</span> <span class="o">]</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line"><span class="o">{</span><span class="nc">EDID</span> <span class="o">=</span> <span class="mi">92</span><span class="o">;</span> <span class="nc">DPMS</span> <span class="o">=</span> <span class="nc">On</span><span class="o">;</span> <span class="nc">TILE</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">link</span><span class="o">-</span><span class="n">status</span> <span class="o">=</span> <span class="nc">Good</span><span class="o">;</span> <span class="n">non</span><span class="o">-</span><span class="n">desktop</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"> <span class="nc">HDR_OUTPUT_METADATA</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="nc">CRTC_ID</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span> <span class="n">scaling</span> <span class="n">mode</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line"> <span class="n">underscan</span> <span class="o">=</span> <span class="n">off</span><span class="o">;</span> <span class="n">underscan</span> <span class="n">hborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">underscan</span> <span class="n">vborder</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">max</span> <span class="n">bpc</span> <span class="o">=</span> <span class="mi">8</span><span class="o">;</span>
</span><span class="line"> <span class="nc">Colorspace</span> <span class="o">=</span> <span class="nc">Default</span><span class="o">;</span> <span class="n">vrr_capable</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">subconnector</span> <span class="o">=</span> <span class="nc">Native</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="encoders">Encoders</h3>
<p><a href="https://www.kernel.org/doc/html/latest/gpu/drm-kms.html">The Linux documentation</a> says:</p>
<blockquote>
<p>Those are really just internal artifacts of the helper libraries used to
implement KMS drivers. Besides that they make it unnecessarily more
complicated for userspace to figure out which connections between a CRTC and
a connector are possible, and what kind of cloning is supported, they serve
no purpose in the userspace API. Unfortunately encoders have been exposed to
userspace, hence can’t remove them at this point. Furthermore the exposed
restrictions are often wrongly set by drivers, and in many cases not powerful
enough to express the real restrictions.</p>
</blockquote>
<p>OK. Well, let's take a look anyway:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">e</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="o">(</span><span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="n">c</span><span class="o">.</span><span class="n">encoder_id</span><span class="o">);;</span>
</span><span class="line"><span class="k">val</span> <span class="n">e</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">encoder_id</span> <span class="o">=</span> <span class="mi">70</span><span class="o">;</span>
</span><span class="line">   <span class="n">encoder_type</span> <span class="o">=</span> <span class="nc">TMDS</span><span class="o">;</span>
</span><span class="line">   <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">   <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x1f</span><span class="o">;</span>
</span><span class="line">   <span class="n">possible_clones</span> <span class="o">=</span> <span class="mh">0x1</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: We need <code>Option.get</code> here because a connector might not have an encoder set yet.
Where the C API uses 0 to indicate no resource,
the OCaml API uses <code>None</code> to force us to think about that case.</p>
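<p>The mapping from C's 0 to <code>None</code> can be sketched as follows.
This uses plain integers and the made-up names <code>id_of_c</code> and <code>describe_encoder</code>; it is only meant to show the shape of the conversion, not how the library implements it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Assumed helper, for illustration only: map a raw C-style ID to an option. *)
let id_of_c (raw : int) : int option =
  if raw = 0 then None else Some raw

(* Pattern matching handles the missing case explicitly,
   where Option.get would raise Invalid_argument on None. *)
let describe_encoder raw =
  match id_of_c raw with
  | None -&gt; "no encoder attached"
  | Some id -&gt; Printf.sprintf "encoder %d" id

let attached = describe_encoder 70
let detached = describe_encoder 0
</code></pre></td></tr></tbody></table></div></figure>
<p><code>Option.get</code> is convenient in a REPL, where an exception is easy to see; in a real program a <code>match</code> like this is usually the safer choice.</p>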
<p>As the documentation says, the encoder is mainly useful to get the CRTC ID:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">crtc_id</span> <span class="o">=</span> <span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="n">e</span><span class="o">.</span><span class="n">crtc_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">crtc_id</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">57</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We could instead have got that directly from the connector using its properties:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="nn">Values</span><span class="p">.</span><span class="n">get_value_exn</span> <span class="n">c_props</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">crtc_id</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Crtc</span> <span class="o">]</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Id</span><span class="p">.</span><span class="n">t</span> <span class="n">option</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="crt-controllers">CRT Controllers</h3>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">crtc</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="n">crtc_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">crtc</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">crtc_id</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">   <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">93</span><span class="o">;</span>
</span><span class="line">   <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">   <span class="n">width</span><span class="o">,</span><span class="n">height</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">,</span><span class="mi">2160</span><span class="o">;</span>
</span><span class="line">   <span class="n">mode</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">3840</span><span class="n">x2160</span> <span class="mi">60</span><span class="o">.</span><span class="mi">00</span><span class="n">Hz</span><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>An active CRTC has a mode set (presumably from the connector's list of supported modes),
and a framebuffer with the image to be displayed.</p>
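<p>The <code>60.00Hz</code> in the mode summary can be recovered from the timing fields listed in the Modes section.
This sketch assumes <code>clock</code> is in kHz (as it is in the C <code>drmModeModeInfo</code> struct) and ignores interlacing and double-scan; the record type and function names are invented for the example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="code"><pre><code class="ocaml">(* Timing fields copied from the 3840x2160 mode shown earlier. *)
type timings = { clock_khz : int; htotal : int; vtotal : int }

let mode = { clock_khz = 533250; htotal = 4000; vtotal = 2222 }

(* Pixels per second divided by pixels per frame (including blanking). *)
let refresh_hz t =
  float_of_int (t.clock_khz * 1000) /. float_of_int (t.htotal * t.vtotal)

(* Format the rate the same way as the short mode summary. *)
let summary = Printf.sprintf "%.2fHz" (refresh_hz mode)
</code></pre></td></tr></tbody></table></div></figure>
<p>The exact value is slightly under 60 (about 59.997), which is why the integer <code>vrefresh</code> field alone can't distinguish modes such as 60.00Hz and 59.94Hz.</p>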
<p>If I keep calling <code>Crtc.get</code>, it sometimes shows framebuffer 93 and sometimes 94.
My Wayland compositor (Sway) updates one framebuffer while the other is being shown, then switches which one is displayed (double buffering).</p>
<h3 id="framebuffers">Framebuffers</h3>
<p>My CRTC is currently displaying the contents of framebuffer 93:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">fb_id</span> <span class="o">=</span> <span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="n">crtc</span><span class="o">.</span><span class="n">fb_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">fb_id</span> <span class="o">:</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Kms</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">93</span>
</span></code></pre></td></tr></tbody></table></div></figure><figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">fb</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span> <span class="n">fb_id</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">fb</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
</span><span class="line">  <span class="o">{</span><span class="n">fb_id</span> <span class="o">=</span> <span class="mi">93</span><span class="o">;</span>
</span><span class="line">   <span class="n">width</span><span class="o">,</span><span class="n">height</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">,</span><span class="mi">2160</span><span class="o">;</span>
</span><span class="line">   <span class="n">pixel_format</span><span class="o">,</span> <span class="n">modifier</span> <span class="o">=</span> <span class="nc">XR24</span><span class="o">,</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line">   <span class="n">interlaced</span> <span class="o">=</span> <span class="bp">false</span><span class="o">;</span>
</span><span class="line">   <span class="n">planes</span> <span class="o">=</span> <span class="o">[{</span><span class="n">handle</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="n">pitch</span> <span class="o">=</span> <span class="mi">15360</span><span class="o">;</span> <span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span><span class="o">}]}</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>A framebuffer has up to 4 framebuffer planes (not to be confused with CRTC planes; see later),
each of which references a <em>buffer object</em> (also known as a <em>BO</em> and referenced with a <em>GEM handle</em>).</p>
<p>This framebuffer is using the <code>XR24</code> format, where there is a single BO with 32 bits for each pixel
(8 for red, 8 green, 8 blue and 8 unused).
Other formats may use a separate buffer for each component
(or a different part of the same buffer, selected with <code>offset</code>).</p>
<p>Modern graphics cards also support format <em>modifiers</em>, but my card is too old so I just get <code>None</code>.
Linux's <a href="https://github.com/torvalds/linux/blob/master/include/uapi/drm/drm_fourcc.h">drm_fourcc.h</a> header file describes the various formats and modifiers.
Modifiers seem to be mainly used to specify the <a href="https://docs.mesa3d.org/isl/tiling.html">tiling</a>.</p>
<p>I don't have permission to see the buffer object, so its handle is shown as <code>None</code>.
The <code>pitch</code> is the number of bytes from one row to the next (also known as the <em>stride</em>).
Here, the 15360 is simply the width (3840) multiplied by the 4 bytes per pixel.</p>
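<p>As a rough illustration of how <code>pitch</code> and <code>offset</code> locate a pixel, here is a minimal sketch in plain OCaml. The helper names (<code>min_pitch</code>, <code>pixel_offset</code> and <code>xr24</code>) are made up for this example and are not part of libdrm-ocaml, and it assumes the plain linear layout (no modifier) described above:</p>

```ocaml
(* XR24 (XRGB8888) uses 4 bytes per pixel: 8 unused bits, then red, green, blue. *)
let bytes_per_pixel = 4

(* Smallest possible pitch for a row of [width] pixels. *)
let min_pitch ~width = width * bytes_per_pixel

(* Byte position of pixel (x, y) inside the buffer object.
   [offset] is where this framebuffer plane starts within the buffer. *)
let pixel_offset ~pitch ~offset ~x ~y =
  offset + (y * pitch) + (x * bytes_per_pixel)

(* Pack 8-bit colour components into one 32-bit XR24 word (x:R:G:B). *)
let xr24 ~r ~g ~b = Int32.of_int ((r lsl 16) lor (g lsl 8) lor b)

let () =
  Printf.printf "pitch for 3840 pixels: %d bytes\n" (min_pitch ~width:3840);
  Printf.printf "pixel (1,1) starts at byte %d\n"
    (pixel_offset ~pitch:15360 ~offset:0 ~x:1 ~y:1)
```

<p>A driver may report a larger pitch than <code>min_pitch</code> if it pads rows for alignment, so real code should use the pitch it is given rather than recomputing it.</p>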
<h3 id="crtc-planes">CRTC planes</h3>
<p>In fact, <code>Crtc.get</code> is an old API that only covers the basic case of a single framebuffer.
A CRTC can actually combine multiple <em>CRTC planes</em>, which for some reason aren't returned with the other resources
and must be requested separately:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">plane_ids</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">list</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">plane_ids</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">id</span> <span class="kt">list</span> <span class="o">=</span> <span class="o">[</span><span class="mi">40</span><span class="o">;</span> <span class="mi">43</span><span class="o">;</span> <span class="mi">46</span><span class="o">;</span> <span class="mi">49</span><span class="o">;</span> <span class="mi">52</span><span class="o">;</span> <span class="mi">55</span><span class="o">;</span> <span class="mi">58</span><span class="o">;</span> <span class="mi">61</span><span class="o">;</span> <span class="mi">64</span><span class="o">;</span> <span class="mi">67</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>(note: you need to enable &quot;atomic&quot; mode before requesting planes; we already did that above)</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">planes</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">get</span> <span class="n">dev</span><span class="o">)</span> <span class="n">plane_ids</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">planes</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">formats</span> <span class="o">=</span> <span class="o">[</span><span class="nc">XR24</span><span class="o">;</span> <span class="nc">AR24</span><span class="o">;</span> <span class="nc">RA24</span><span class="o">;</span> <span class="nc">XR30</span><span class="o">;</span> <span class="nc">XB30</span><span class="o">;</span> <span class="nc">AR30</span><span class="o">;</span> <span class="nc">AB30</span><span class="o">;</span> <span class="nc">XR48</span><span class="o">;</span> <span class="nc">XB48</span><span class="o">;</span> 
</span><span class="line">               <span class="nc">AR48</span><span class="o">;</span> <span class="nc">AB48</span><span class="o">;</span> <span class="nc">XB24</span><span class="o">;</span> <span class="nc">AB24</span><span class="o">;</span> <span class="nc">RG16</span><span class="o">;</span> <span class="nc">XR4H</span><span class="o">;</span> <span class="nc">AR4H</span><span class="o">;</span> <span class="nc">XB4H</span><span class="o">;</span> <span class="nc">AB4H</span><span class="o">];</span>
</span><span class="line">    <span class="n">plane_id</span> <span class="o">=</span> <span class="mi">40</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line">    <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_x</span><span class="o">,</span><span class="n">crtc_y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x10</span><span class="o">};</span>
</span><span class="line">   <span class="o">...</span>
</span><span class="line">  <span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Many of these planes aren't being used (they have no CRTC),
which we can check with a helper function:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">has_crtc</span> <span class="o">(</span><span class="n">x</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">=</span> <span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">crtc_id</span> <span class="o">&lt;&gt;</span> <span class="nc">None</span><span class="o">);;</span>
</span><span class="line"><span class="k">val</span> <span class="n">has_crtc</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">bool</span> <span class="o">=</span> <span class="o">&lt;</span><span class="k">fun</span><span class="o">&gt;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Looks like Sway is using two planes at the moment:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">active_planes</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="n">has_crtc</span> <span class="n">planes</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">active_planes</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line">  <span class="o">[{</span><span class="n">formats</span> <span class="o">=</span> <span class="o">[</span><span class="nc">XR24</span><span class="o">;</span> <span class="nc">AR24</span><span class="o">;</span> <span class="nc">RA24</span><span class="o">;</span> <span class="nc">XR30</span><span class="o">;</span> <span class="nc">XB30</span><span class="o">;</span> <span class="nc">AR30</span><span class="o">;</span> <span class="nc">AB30</span><span class="o">;</span> <span class="nc">XR48</span><span class="o">;</span> <span class="nc">XB48</span><span class="o">;</span> 
</span><span class="line">               <span class="nc">AR48</span><span class="o">;</span> <span class="nc">AB48</span><span class="o">;</span> <span class="nc">XB24</span><span class="o">;</span> <span class="nc">AB24</span><span class="o">;</span> <span class="nc">RG16</span><span class="o">;</span> <span class="nc">XR4H</span><span class="o">;</span> <span class="nc">AR4H</span><span class="o">;</span> <span class="nc">XB4H</span><span class="o">;</span> <span class="nc">AB4H</span><span class="o">];</span>
</span><span class="line">    <span class="n">plane_id</span> <span class="o">=</span> <span class="mi">52</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">    <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">93</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_x</span><span class="o">,</span><span class="n">crtc_y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x1</span><span class="o">};</span>
</span><span class="line">   <span class="o">{</span><span class="n">formats</span> <span class="o">=</span> <span class="o">[</span><span class="nc">AR24</span><span class="o">];</span>
</span><span class="line">    <span class="n">plane_id</span> <span class="o">=</span> <span class="mi">55</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">57</span><span class="o">;</span>
</span><span class="line">    <span class="n">fb_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="mi">98</span><span class="o">;</span>
</span><span class="line">    <span class="n">crtc_x</span><span class="o">,</span><span class="n">crtc_y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">x</span><span class="o">,</span><span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">;</span>
</span><span class="line">    <span class="n">possible_crtcs</span> <span class="o">=</span> <span class="mh">0x1</span><span class="o">}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>More information is available as properties:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="k">let</span> <span class="n">active_plane_ids</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">id</span> <span class="n">active_planes</span><span class="o">;;</span>
</span><span class="line"><span class="k">val</span> <span class="n">active_plane_ids</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">id</span> <span class="kt">list</span> <span class="o">=</span> <span class="o">[</span><span class="mi">52</span><span class="o">;</span> <span class="mi">55</span><span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">List</span><span class="p">.</span><span class="n">map</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">get_properties</span> <span class="n">dev</span><span class="o">)</span> <span class="n">active_plane_ids</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="o">[</span> <span class="o">`</span><span class="nc">Plane</span> <span class="o">]</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Properties</span><span class="p">.</span><span class="n">t</span> <span class="kt">list</span> <span class="o">=</span>
</span><span class="line"><span class="o">[{</span><span class="nc">CRTC_H</span> <span class="o">=</span> <span class="mi">2160</span><span class="o">;</span> <span class="nc">CRTC_ID</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span> <span class="nc">CRTC_W</span> <span class="o">=</span> <span class="mi">3840</span><span class="o">;</span> <span class="nc">CRTC_X</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="nc">CRTC_Y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line">  <span class="nc">FB_ID</span> <span class="o">=</span> <span class="mi">93</span><span class="o">;</span> <span class="nc">IN_FENCE_FD</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="o">;</span> <span class="nc">SRC_H</span> <span class="o">=</span> <span class="mi">141557760</span><span class="o">;</span> <span class="nc">SRC_W</span> <span class="o">=</span> <span class="mi">251658240</span><span class="o">;</span>
</span><span class="line">  <span class="nc">SRC_X</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="nc">SRC_Y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">rotation</span> <span class="o">=</span> <span class="o">[</span><span class="n">rotate</span><span class="o">-</span><span class="mi">0</span><span class="o">];</span> <span class="k">type</span> <span class="o">=</span> <span class="nc">Primary</span><span class="o">;</span> <span class="n">zpos</span> <span class="o">=</span> <span class="mi">0</span><span class="o">};</span>
</span><span class="line"> <span class="o">{</span><span class="nc">CRTC_H</span> <span class="o">=</span> <span class="mi">128</span><span class="o">;</span> <span class="nc">CRTC_ID</span> <span class="o">=</span> <span class="mi">57</span><span class="o">;</span> <span class="nc">CRTC_W</span> <span class="o">=</span> <span class="mi">128</span><span class="o">;</span> <span class="nc">CRTC_X</span> <span class="o">=</span> <span class="mi">3105</span><span class="o">;</span> <span class="nc">CRTC_Y</span> <span class="o">=</span> <span class="mi">1518</span><span class="o">;</span>
</span><span class="line">  <span class="nc">FB_ID</span> <span class="o">=</span> <span class="mi">98</span><span class="o">;</span> <span class="nc">IN_FENCE_FD</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="o">;</span> <span class="nc">SRC_H</span> <span class="o">=</span> <span class="mi">8388608</span><span class="o">;</span> <span class="nc">SRC_W</span> <span class="o">=</span> <span class="mi">8388608</span><span class="o">;</span> <span class="nc">SRC_X</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line">  <span class="nc">SRC_Y</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="k">type</span> <span class="o">=</span> <span class="nc">Cursor</span><span class="o">;</span> <span class="n">zpos</span> <span class="o">=</span> <span class="mi">255</span><span class="o">}]</span>
</span></code></pre></td></tr></tbody></table></div></figure><ul>
<li>Plane 52 is a <code>Primary</code> plane and is using framebuffer 93 (as we saw before).
</li>
<li>Plane 55 is a <code>Cursor</code> plane, using framebuffer 98 (and the <code>AR24</code> format, with alpha/transparency).
</li>
</ul>
<p>A plane chooses which part of the frame buffer to show (<code>SRC_X</code>, <code>SRC_Y</code>, <code>SRC_W</code> and <code>SRC_H</code>)
and where it should appear on the screen (<code>CRTC_X</code>, <code>CRTC_Y</code>, <code>CRTC_W</code> and <code>CRTC_H</code>).
The source values are in 16.16 format (i.e. shifted left 16 bits).</p>
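<p>To make the 16.16 encoding concrete, here is a small sketch in plain OCaml that converts the <code>SRC_W</code> and <code>SRC_H</code> values above back to pixels. The function names <code>to_fixed</code> and <code>of_fixed</code> are invented for this example (they are not from libdrm-ocaml), and it assumes non-negative values:</p>

```ocaml
(* Convert a whole number of pixels to 16.16 fixed point. *)
let to_fixed px = px lsl 16

(* Split a 16.16 value into its integer part and its fraction (in 1/65536ths). *)
let of_fixed v = (v lsr 16, v land 0xffff)

let () =
  (* SRC_W and SRC_H as reported for the primary plane. *)
  let w, _ = of_fixed 251658240 in
  let h, _ = of_fixed 141557760 in
  Printf.printf "source size: %dx%d\n" w h
```

<p>This prints <code>source size: 3840x2160</code>, matching <code>CRTC_W</code> and <code>CRTC_H</code>, so the primary plane is shown unscaled. Likewise, the cursor's 8388608 is 128 shifted left by 16 bits.</p>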
<p>Oddly, <code>Plane.get</code> returned <code>crtc_x,crtc_y = 0,0</code> for both planes, but
the properties show the correct cursor location (<code>CRTC_X = 3105; CRTC_Y = 1518;</code>).</p>
<p>Having the cursor on a separate plane avoids having to modify the main screen image
whenever the mouse pointer moves. This is good for latency
(especially if the GPU is busy rendering something else at the time)
and for power consumption (the GPU can stay powered down).
It also allows showing an application's buffer full screen without the compositor
needing to modify that buffer.</p>
<p>You might also have some <code>Overlay</code> planes,
which can be <a href="https://zamundaaa.github.io/wayland/2025/10/23/more-kms-offloading.html">useful for displaying video</a>.
My graphics card seems to be too old for that.</p>
<h3 id="expanded-resources-diagram">Expanded resources diagram</h3>
<p>Here's an expanded diagram showing some more possibilities:</p>
<p><a href="/blog/images/libdrm/arch.svg"><span class="caption-wrapper center"><img src="/blog/images/libdrm/arch.svg" title="Expanded resources diagram" class="caption"/><span class="caption-text">Expanded resources diagram</span></span></a></p>
<ul>
<li>Some framebuffer formats take the input data from multiple buffers.
</li>
<li>A framebuffer can be shared by multiple CRTCs (perhaps with each plane showing a different part of it).
</li>
<li>A CRTC can have multiple planes (e.g. primary and cursor).
</li>
<li>A single CRTC can show the same image on multiple monitors.
</li>
</ul>
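<p>The relationships in that list can also be written down as data. The following is only a toy model in plain OCaml: the record types, field names and IDs are hypothetical and much simpler than the real types in libdrm-ocaml.</p>

```ocaml
(* Hypothetical types for illustration only. *)
type buffer_object = { handle : int }

type framebuffer = {
  fb_id : int;
  buffers : buffer_object list;  (* one per framebuffer plane, at most 4 *)
}

type plane_type = Primary | Cursor | Overlay

type plane = { plane_id : int; plane_type : plane_type; fb : framebuffer }

type crtc = {
  crtc_id : int;
  planes : plane list;       (* e.g. primary and cursor *)
  connectors : string list;  (* monitors showing this CRTC's image *)
}

(* A framebuffer whose format takes its data from two buffers. *)
let shared_fb = { fb_id = 1; buffers = [ { handle = 1 }; { handle = 2 } ] }
let cursor_fb = { fb_id = 2; buffers = [ { handle = 3 } ] }

(* One CRTC with two planes, mirrored on two monitors. *)
let crtc_a = {
  crtc_id = 10;
  planes = [
    { plane_id = 20; plane_type = Primary; fb = shared_fb };
    { plane_id = 21; plane_type = Cursor; fb = cursor_fb };
  ];
  connectors = [ "monitor-1"; "monitor-2" ];
}

(* A second CRTC sharing the first one's framebuffer. *)
let crtc_b = {
  crtc_id = 11;
  planes = [ { plane_id = 22; plane_type = Primary; fb = shared_fb } ];
  connectors = [ "monitor-3" ];
}

(* IDs of the framebuffers a CRTC is currently scanning out. *)
let framebuffers crtc = List.map (fun p -> p.fb.fb_id) crtc.planes
```

<p>Here <code>framebuffers crtc_a</code> and <code>framebuffers crtc_b</code> both include framebuffer 1, which is the shared-framebuffer case from the list.</p>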
<h2 id="making-changes">Making changes</h2>
<p>If I try turning off the CRTC (by setting the mode to <code>None</code>) from my desktop environment it fails:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">set</span> <span class="n">dev</span> <span class="n">crtc_id</span> <span class="o">~</span><span class="n">pos</span><span class="o">:(</span><span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">)</span> <span class="o">~</span><span class="n">connectors</span><span class="o">:</span><span class="bp">[]</span> <span class="nc">None</span><span class="o">;;</span>
</span><span class="line"><span class="nc">Exception</span><span class="o">:</span> <span class="nn">Unix</span><span class="p">.</span><span class="nc">Unix_error</span><span class="o">(</span><span class="nn">Unix</span><span class="p">.</span><span class="nc">EACCES</span><span class="o">,</span> <span class="s2">&quot;drmModeSetCrtc&quot;</span><span class="o">,</span> <span class="s2">&quot;&quot;</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The reason is that I'm currently running a graphical desktop and Sway owns the device
(so my <code>dev</code> is not the DRM &quot;master&quot;):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">utop</span> <span class="o">#</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Device</span><span class="p">.</span><span class="n">is_master</span> <span class="n">dev</span><span class="o">;;</span>
</span><span class="line"><span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">false</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>That can be fixed by switching to a different <a href="https://en.wikipedia.org/wiki/Virtual_console">VT</a> (e.g. with Ctrl-Alt-F2) and running it there.
However, this will result in a second problem: I won't be able to see what I'm doing!</p>
<p>If you have a second computer then you can SSH in and test things out from there, but
for simplicity we'll leave the utop REPL at this point and write some programs instead.</p>
<p>For example, <a href="https://github.com/talex5/libdrm-ocaml/blob/main/examples/query.ml">query.ml</a> shows the information we discovered above:</p>
<pre><code>dune exec -- ./examples/query.exe
</code></pre>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">devices</span><span class="o">:</span>                              
</span><span class="line">  <span class="o">[{</span><span class="n">primary_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/card0&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="n">render_node</span> <span class="o">=</span> <span class="nc">Some</span> <span class="s2">&quot;/dev/dri/renderD128&quot;</span><span class="o">;</span>
</span><span class="line"><span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="non-atomic-mode-setting">Non-atomic mode setting</h3>
<p>Linux provides two ways to configure modes: the old <em>non-atomic</em> API and the newer <em>atomic</em> one.</p>
<p><a href="https://github.com/talex5/libdrm-ocaml/blob/main/examples/nonatomic.ml">examples/nonatomic.ml</a> contains a simple example of the older (but simpler) API.
It starts by finding a device (the first one with a primary node supporting KMS), then
finds all connected connectors (as we did above), and calls <code>show_test_page</code> on each one:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Utils</span><span class="p">.</span><span class="n">with_device</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">t</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">connected</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="nn">Utils</span><span class="p">.</span><span class="n">is_connected</span> <span class="n">t</span><span class="o">.</span><span class="n">connectors</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Utils</span><span class="p">.</span><span class="n">restoring_afterwards</span> <span class="n">t</span> <span class="o">@@</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">List</span><span class="p">.</span><span class="n">iter</span> <span class="o">(</span><span class="n">show_test_page</span> <span class="n">t</span><span class="o">)</span> <span class="n">connected</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Unix</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>restoring_afterwards</code> stores the current configuration, runs the callback,
and then puts things back to normal when that finishes (or you press Ctrl-C).</p>
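<p>The general shape of such a helper can be sketched with <code>Fun.protect</code> from the OCaml standard library. This is not the code from <code>Utils</code>: the <code>~save</code> and <code>~restore</code> arguments are hypothetical stand-ins for reading and re-applying the display configuration, and a string reference plays the part of the screen.</p>

```ocaml
(* A generic "save, run, restore" helper. [save] and [restore] are
   hypothetical stand-ins for reading and re-applying the display state. *)
let restoring_afterwards ~save ~restore fn =
  let saved = save () in
  (* Fun.protect runs [finally] whether [fn] returns normally or raises. *)
  Fun.protect ~finally:(fun () -> restore saved) fn

let () =
  (* Turn Ctrl-C (SIGINT) into the Sys.Break exception, so that the
     [finally] handler also runs when the user interrupts the program. *)
  Sys.catch_break true;
  let screen = ref "desktop" in
  restoring_afterwards
    ~save:(fun () -> !screen)
    ~restore:(fun old -> screen := old)
    (fun () -> screen := "test page");
  assert (!screen = "desktop")
```

<p>Without <code>Sys.catch_break true</code>, the default SIGINT behaviour ends the process without raising an exception, so the cleanup would be skipped. Whether the real helper handles Ctrl-C this way is an assumption of this sketch.</p>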
<p>The program waits for 2 seconds after showing the test page before exiting.</p>
<p><code>show_test_page</code> finds the CRTC (as we did above),
takes the first supported mode, creates a test framebuffer of that size,
and configures the CRTC to display it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">show_test_page</span> <span class="o">(</span><span class="n">t</span> <span class="o">:</span> <span class="nn">Resources</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">(</span><span class="n">c</span> <span class="o">:</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">t</span><span class="o">)</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="n">c</span><span class="o">.</span><span class="n">encoder_id</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">println</span> <span class="s2">&quot;%a has no encoder (skipping)&quot;</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">pp_name</span> <span class="n">c</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">encoder_id</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">match</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Encoder</span><span class="p">.</span><span class="n">get</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">encoder_id</span> <span class="k">with</span>
</span><span class="line">    <span class="o">|</span> <span class="o">{</span> <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">None</span><span class="o">;</span> <span class="o">_</span> <span class="o">}</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">println</span> <span class="s2">&quot;%a&#39;s encoder has no CRTC (skipping)&quot;</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">pp_name</span> <span class="n">c</span>
</span><span class="line">    <span class="o">|</span> <span class="o">{</span> <span class="n">crtc_id</span> <span class="o">=</span> <span class="nc">Some</span> <span class="n">crtc_id</span><span class="o">;</span> <span class="o">_</span> <span class="o">}</span> <span class="o">-&gt;</span>
</span><span class="line">      <span class="n">println</span> <span class="s2">&quot;Showing test page on %a&quot;</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Connector</span><span class="p">.</span><span class="n">pp_name</span> <span class="n">c</span><span class="o">;</span>
</span><span class="line">      <span class="k">let</span> <span class="n">mode</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">hd</span> <span class="n">c</span><span class="o">.</span><span class="n">modes</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">size</span> <span class="o">=</span> <span class="o">(</span><span class="n">mode</span><span class="o">.</span><span class="n">hdisplay</span><span class="o">,</span> <span class="n">mode</span><span class="o">.</span><span class="n">vdisplay</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">fb</span> <span class="o">=</span> <span class="nn">Test_image</span><span class="p">.</span><span class="n">create</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">size</span> <span class="k">in</span>
</span><span class="line">      <span class="nn">K</span><span class="p">.</span><span class="nn">Crtc</span><span class="p">.</span><span class="n">set</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">crtc_id</span> <span class="o">(</span><span class="nc">Some</span> <span class="n">mode</span><span class="o">)</span> <span class="o">~</span><span class="n">fb</span> <span class="o">~</span><span class="n">pos</span><span class="o">:(</span><span class="mi">0</span><span class="o">,</span><span class="mi">0</span><span class="o">)</span>
</span><span class="line">        <span class="o">~</span><span class="n">connectors</span><span class="o">:[</span><span class="n">c</span><span class="o">.</span><span class="n">connector_id</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>If the connector doesn't have a CRTC, we could find a suitable one and use that,
but for simplicity the example just skips such connectors.</p>
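<p>As a minimal sketch of what such a search could look like, here is a version using plain records rather than the libdrm-ocaml types.
At the kernel level, each encoder reports a <code>possible_crtcs</code> bitmask in which bit <em>i</em> refers to the <em>i</em>th CRTC in the device's resource list.
The <code>encoder</code> record, <code>find_crtc</code> and the <code>in_use</code> list are hypothetical names for illustration, not part of the library:</p>

```ocaml
(* Hypothetical stand-in for the information KMS reports about an encoder. *)
type encoder = {
  encoder_id : int;
  possible_crtcs : int;  (* bit i set => the i-th CRTC in the resource list is usable *)
}

(* Return the first CRTC ID that the encoder can use and that is not already taken.
   [crtcs] is the CRTC list of the device, in the order the kernel reported it. *)
let find_crtc ~crtcs ~in_use enc =
  let rec aux i = function
    | [] -> None
    | id :: rest ->
      let compatible = enc.possible_crtcs land (1 lsl i) <> 0 in
      if compatible && not (List.mem id in_use) then Some id
      else aux (i + 1) rest
  in
  aux 0 crtcs

let () =
  let crtcs = [51; 70; 89] in
  let enc = { encoder_id = 100; possible_crtcs = 0b110 } in
  match find_crtc ~crtcs ~in_use:[70] enc with
  | Some id -> Printf.printf "Encoder %d can use CRTC %d\n" enc.encoder_id id
  | None -> print_endline "No free CRTC for this encoder"
```

<p>A real implementation would also need to track which CRTCs it has already handed out to other connectors, which is what <code>in_use</code> stands in for here.</p>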
<p>To run the example (switch away from any graphical desktop first or it won't work):</p>
<pre><code>dune exec -- ./examples/nonatomic.exe
</code></pre>
<h3 id="dumb-buffers">Dumb buffers</h3>
<p>Typically the pixel data to be displayed comes from some complex rendering pipeline,
but Linux also provides <em>dumb buffers</em> for simple cases such as testing.
The <code>Test_image.create</code> function used above creates a dumb buffer with a test pattern:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">create_dumb</span> <span class="n">dev</span> <span class="n">size</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">dumb_buffer</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Buffer</span><span class="p">.</span><span class="nn">Dumb</span><span class="p">.</span><span class="n">create</span> <span class="n">dev</span> <span class="o">~</span><span class="n">bpp</span><span class="o">:</span><span class="mi">32</span> <span class="n">size</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">arr</span> <span class="o">=</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Buffer</span><span class="p">.</span><span class="nn">Dumb</span><span class="p">.</span><span class="n">map</span> <span class="n">dev</span> <span class="n">dumb_buffer</span> <span class="nc">Int32</span> <span class="k">in</span>
</span><span class="line">  <span class="k">for</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">snd</span> <span class="n">size</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span>
</span><span class="line">    <span class="k">for</span> <span class="n">col</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">fst</span> <span class="n">size</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span>
</span><span class="line">      <span class="k">let</span> <span class="n">c</span> <span class="o">=</span>
</span><span class="line">        <span class="o">(</span><span class="n">row</span> <span class="ow">land</span> <span class="mh">0xff</span><span class="o">)</span> <span class="ow">lor</span>
</span><span class="line">        <span class="o">((</span><span class="n">col</span> <span class="ow">land</span> <span class="mh">0xff</span><span class="o">)</span> <span class="ow">lsl</span> <span class="mi">8</span><span class="o">)</span> <span class="ow">lor</span>
</span><span class="line">        <span class="o">(((</span><span class="n">row</span> <span class="n">lsr</span> <span class="mi">8</span><span class="o">)</span> <span class="ow">lor</span> <span class="o">(</span><span class="n">col</span> <span class="n">lsr</span> <span class="mi">8</span><span class="o">))</span> <span class="ow">lsl</span> <span class="mi">18</span><span class="o">)</span>
</span><span class="line">      <span class="k">in</span>
</span><span class="line">      <span class="n">arr</span><span class="o">.{</span><span class="n">row</span><span class="o">,</span> <span class="n">col</span><span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Int32</span><span class="p">.</span><span class="n">of_int</span> <span class="n">c</span>
</span><span class="line">    <span class="k">done</span><span class="o">;</span>
</span><span class="line">  <span class="k">done</span><span class="o">;</span>
</span><span class="line">  <span class="n">dumb_buffer</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Dumb.create</code> allocates memory for the image data.
<code>Dumb.map</code> makes it appear in host memory as an OCaml bigarray.
The loop sets each 32-bit int in the image to some colour <code>c</code>.</p>
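<p>To see what the colour expression produces, here is a standalone sketch of the same arithmetic.
It assumes the buffer is interpreted as XR24 (XRGB8888: blue in bits 0-7, green in bits 8-15, red in bits 16-23, top byte ignored).
<code>test_colour</code> and <code>channels</code> are hypothetical helpers, not part of the example code:</p>

```ocaml
(* Same formula as in create_dumb: the low 8 bits of the row and column drive
   two channels, and the remaining high bits are shifted up into the third. *)
let test_colour ~row ~col =
  (row land 0xff) lor
  ((col land 0xff) lsl 8) lor
  (((row lsr 8) lor (col lsr 8)) lsl 18)

(* Split a 32-bit XRGB8888 value into (red, green, blue). *)
let channels c = ((c lsr 16) land 0xff, (c lsr 8) land 0xff, c land 0xff)

let () =
  List.iter (fun (row, col) ->
      let r, g, b = channels (test_colour ~row ~col) in
      Printf.printf "row=%4d col=%4d -> r=%3d g=%3d b=%3d\n" row col r g b)
    [ (0, 0); (10, 200); (300, 0); (0, 1024) ]
```

<p>Under that assumption, blue follows the row and green follows the column within each 256-pixel block, while red steps up in multiples of four as the block index increases.</p>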
<p>Then we wrap this data up as an XR24-format framebuffer with a single plane:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">create</span> <span class="n">dev</span> <span class="n">size</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">buffer</span> <span class="o">=</span> <span class="n">create_dumb</span> <span class="n">dev</span> <span class="n">size</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">planes</span> <span class="o">=</span> <span class="o">[</span><span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">v</span> <span class="n">buffer</span><span class="o">.</span><span class="n">handle</span> <span class="o">~</span><span class="n">pitch</span><span class="o">:</span><span class="n">buffer</span><span class="o">.</span><span class="n">pitch</span><span class="o">]</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">K</span><span class="p">.</span><span class="nn">Fb</span><span class="p">.</span><span class="n">add</span> <span class="n">dev</span> <span class="o">~</span><span class="n">size</span> <span class="o">~</span><span class="n">planes</span> <span class="o">~</span><span class="n">pixel_format</span><span class="o">:</span><span class="nn">Drm</span><span class="p">.</span><span class="nn">Fourcc</span><span class="p">.</span><span class="n">xr24</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="atomic-mode-setting">Atomic mode setting</h3>
<p><a href="https://github.com/talex5/libdrm-ocaml/blob/main/examples/atomic.ml">examples/atomic.ml</a> demonstrates the newer atomic API:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Utils</span><span class="p">.</span><span class="n">with_device</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">t</span> <span class="o">-&gt;</span>
</span><span class="line">  <span class="nn">Drm</span><span class="p">.</span><span class="nn">Client_cap</span><span class="p">.</span><span class="o">(</span><span class="n">set_exn</span> <span class="n">atomic</span><span class="o">)</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="bp">true</span><span class="o">;</span>
</span><span class="line">  <span class="k">let</span> <span class="n">connected</span> <span class="o">=</span> <span class="nn">List</span><span class="p">.</span><span class="n">filter</span> <span class="nn">Utils</span><span class="p">.</span><span class="n">is_connected</span> <span class="n">t</span><span class="o">.</span><span class="n">connectors</span> <span class="k">in</span>
</span><span class="line">  <span class="n">println</span> <span class="s2">&quot;Found %d connected connectors&quot;</span> <span class="o">(</span><span class="nn">List</span><span class="p">.</span><span class="n">length</span> <span class="n">connected</span><span class="o">);</span>
</span><span class="line">  <span class="k">let</span> <span class="n">free_planes</span> <span class="o">=</span> <span class="n">ref</span> <span class="o">(</span><span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">list</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">rq</span> <span class="o">=</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">create</span> <span class="bp">()</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">List</span><span class="p">.</span><span class="n">iter</span> <span class="o">(</span><span class="n">show_test_page</span> <span class="o">~</span><span class="n">free_planes</span> <span class="n">t</span> <span class="n">rq</span><span class="o">)</span> <span class="n">connected</span><span class="o">;</span>
</span><span class="line">  <span class="n">println</span> <span class="s2">&quot;Checking that commit will work...&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="k">match</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">commit</span> <span class="o">~</span><span class="n">test_only</span><span class="o">:</span><span class="bp">true</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">rq</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="k">exception</span> <span class="nn">Unix</span><span class="p">.</span><span class="nc">Unix_error</span> <span class="o">(</span><span class="n">code</span><span class="o">,</span> <span class="o">_,</span> <span class="o">_)</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">println</span> <span class="s2">&quot;Mode-setting would fail with error: %s&quot;</span> <span class="o">(</span><span class="nn">Unix</span><span class="p">.</span><span class="n">error_message</span> <span class="n">code</span><span class="o">)</span>
</span><span class="line">  <span class="o">|</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">println</span> <span class="s2">&quot;Pre-commit test passed.&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Utils</span><span class="p">.</span><span class="n">restoring_afterwards</span> <span class="n">t</span> <span class="o">@@</span> <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">commit</span> <span class="n">t</span><span class="o">.</span><span class="n">dev</span> <span class="n">rq</span><span class="o">;</span>
</span><span class="line">    <span class="nn">Unix</span><span class="p">.</span><span class="n">sleep</span> <span class="mi">2</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The steps are:</p>
<ol>
<li>Use <code>set_exn atomic</code> to enable atomic mode.
</li>
<li>Create an <em>atomic request</em> (<code>rq</code>).
</li>
<li>Use <code>show_test_page</code> to populate it with the desired property changes.
</li>
<li>(optional) Check that it will work (<code>~test_only:true</code>).
</li>
<li>Commit the changes (<code>Atomic_req.commit</code>).
</li>
</ol>
<p>The advantage here is that either all changes are successfully applied at once or nothing changes.
This avoids various problems with flickering or trying to roll back partial changes.</p>
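<p>The all-or-nothing behaviour can be illustrated with a toy model that has nothing to do with the real kernel interface.
Here the "device" is just a table of property values, <code>validate</code> is a made-up rule standing in for the kernel's checks,
and <code>commit</code> applies the request only if every change passes:</p>

```ocaml
(* A toy device: a mutable table from property name to value. *)
type device = (string, int) Hashtbl.t

(* A request is a list of pending (property, value) changes. *)
type request = (string * int) list ref

let create () : request = ref []
let add_property (rq : request) name value = rq := (name, value) :: !rq

(* Made-up validity rule standing in for the checks the kernel performs. *)
let validate (name, value) =
  if value < 0 then Error (Printf.sprintf "%s: negative value %d" name value)
  else Ok ()

let first_error changes =
  List.find_map (fun c -> match validate c with Ok () -> None | Error e -> Some e) changes

(* Check every change first; only touch the device if they all pass. *)
let commit ?(test_only = false) (dev : device) (rq : request) =
  let changes = List.rev !rq in
  match first_error changes with
  | Some msg -> Error msg
  | None ->
    if not test_only then
      List.iter (fun (name, value) -> Hashtbl.replace dev name value) changes;
    Ok ()

let () =
  let dev : device = Hashtbl.create 8 in
  Hashtbl.replace dev "CRTC_W" 640;
  let rq = create () in
  add_property rq "CRTC_W" 1920;
  add_property rq "CRTC_H" (-1);   (* invalid, so the whole request is rejected *)
  (match commit dev rq with
   | Ok () -> print_endline "Committed"
   | Error msg -> Printf.printf "Rejected: %s\n" msg);
  Printf.printf "CRTC_W is still %d\n" (Hashtbl.find dev "CRTC_W")
```

<p>Because validation happens before anything is written, a rejected request leaves the table exactly as it was, and <code>~test_only:true</code> is just the same check with the final step skipped.</p>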
<p><code>show_test_page</code> needs a couple of modifications.
First, we have to find a plane (rather than using the old <code>Crtc.set</code> which assumes a single plane),
and then we set the plane's <code>FB_ID</code> property to the new framebuffer in the request:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">add_property</span> <span class="n">rq</span> <span class="n">plane</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">fb_id</span> <span class="o">(</span><span class="nc">Some</span> <span class="n">fb</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>For the example, I actually set more properties and defined an operator to make the code a bit neater:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="o">(</span> <span class="o">.%{}&lt;-</span> <span class="o">)</span> <span class="n">obj</span> <span class="n">prop</span> <span class="n">value</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">K</span><span class="p">.</span><span class="nn">Atomic_req</span><span class="p">.</span><span class="n">add_property</span> <span class="n">rq</span> <span class="n">obj</span> <span class="n">prop</span> <span class="n">value</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">fb_id</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nc">Some</span> <span class="n">fb</span><span class="o">;</span>
</span><span class="line"><span class="c">(* Source region on frame-buffer: *)</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_x</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_y</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_w</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="o">(</span><span class="n">fst</span> <span class="n">size</span><span class="o">);</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">src_h</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="nn">Drm</span><span class="p">.</span><span class="nn">Ufixed</span><span class="p">.</span><span class="n">of_int</span> <span class="o">(</span><span class="n">snd</span> <span class="n">size</span><span class="o">);</span>
</span><span class="line"><span class="c">(* Destination region on CRTC: *)</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_x</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_y</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_w</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="n">fst</span> <span class="n">size</span><span class="o">;</span>
</span><span class="line"><span class="n">plane</span><span class="o">.%{</span> <span class="nn">K</span><span class="p">.</span><span class="nn">Plane</span><span class="p">.</span><span class="n">crtc_h</span> <span class="o">}</span> <span class="o">&lt;-</span> <span class="n">snd</span> <span class="n">size</span><span class="o">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>In libdrm-ocaml, properties are typed, so you can't forget to convert the source values to fixed point format.</p>
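<p>A rough sketch of how types can enforce this is shown below. It is a simplified stand-in, not the libdrm-ocaml implementation:
the local <code>Ufixed</code> module, the <code>prop</code> record and this <code>add_property</code> are invented for illustration.
It relies on the fact that KMS expects the <code>SRC_*</code> plane properties in 16.16 fixed point, while the <code>CRTC_*</code> ones are plain integers:</p>

```ocaml
(* 16.16 unsigned fixed point, kept abstract so a plain int cannot be used by mistake. *)
module Ufixed : sig
  type t
  val of_int : int -> t
  val to_raw : t -> int64
end = struct
  type t = int64
  let of_int i =
    if i < 0 then invalid_arg "Ufixed.of_int: negative";
    Int64.shift_left (Int64.of_int i) 16
  let to_raw t = t
end

(* A property records the OCaml type of its values and how to encode them. *)
type 'a prop = { name : string; encode : 'a -> int64 }

let src_w : Ufixed.t prop = { name = "SRC_W"; encode = Ufixed.to_raw }
let crtc_w : int prop = { name = "CRTC_W"; encode = Int64.of_int }

(* Toy request: a list of (name, raw value) pairs. *)
let add_property rq prop value = rq := (prop.name, prop.encode value) :: !rq

let () =
  let rq = ref [] in
  add_property rq src_w (Ufixed.of_int 1920);
  add_property rq crtc_w 1920;
  (* add_property rq src_w 1920 would be rejected by the type checker. *)
  List.iter (fun (name, raw) -> Printf.printf "%s = %Ld\n" name raw) (List.rev !rq)
```

<p>The type parameter on <code>prop</code> ties each property to the only kind of value it accepts, so the mistake is caught at compile time rather than showing up as a wrongly scaled image.</p>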
<h2 id="d-rendering">3D rendering</h2>
<p>The examples above use a dumb-buffer, but it's fairly simple to replace that with a Vulkan buffer.
The code in the last post <a href="https://github.com/talex5/vulkan-test/blob/5a93c76eb4e0205c63e0d0c5f7b4785cd15c208a/vulkan/swap_chain.ml#L42">exported the image memory from Vulkan</a> as a dmabuf FD and sent it to the Wayland compositor.
Now, instead of sending it we just need to import it into our device (with <code>Drm.Dmabuf.to_handle</code>)
and use that handle instead of the dumb-buffer one.</p>
<p>I added a simple <a href="https://github.com/talex5/vulkan-test/commit/731e90d1d61d78b4ac557f9861839dbcc233e06b">surface abstraction</a> to the test code, wrapping the <code>Window</code> module's API
so that the rendering code doesn't need to care whether it's rendering to a Wayland window or directly to the screen.
Then I made a <a href="https://github.com/talex5/vulkan-test/blob/3f767e64f38b75e216a530b6b9747020d958f11b/src/vt.ml">Vt module</a> implementing the new <code>Surface.t</code> type for rendering directly to a Linux VT.</p>
<p>To get the animation working, I used <code>K.Crtc.page_flip</code> to update the framebuffer (I could also have used the atomic API).
The kernel waits until the encoder has finished sending the current frame before switching to the new one,
which avoids tearing.
We also need to ask the kernel to tell us when this happens, which is done by setting the optional <code>~event</code> argument to some number.
You can read events from the device file and parse them with <a href="https://talex5.github.io/libdrm-ocaml/libdrm/Drm/Event/index.html#val-parse">Drm.Event.parse</a>.</p>
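<p>For a rough idea of what such a parser has to do, here is a hand-written sketch that decodes a page-flip completion event from raw bytes.
It assumes the kernel's <code>struct drm_event_vblank</code> layout (a <code>type</code>/<code>length</code> header followed by <code>user_data</code>, a timestamp, <code>sequence</code> and <code>crtc_id</code>)
and a little-endian machine. The type and function names are made up for illustration, and this is not the library's code:</p>

```ocaml
(* Event type codes from the kernel header drm.h. *)
let drm_event_vblank = 0x01l
let drm_event_flip_complete = 0x02l

type details = { user_data : int64; sequence : int32; crtc_id : int32 }

type event =
  | Vblank of details
  | Flip_complete of details
  | Unknown of int32

(* Decode one event starting at offset [off]; returns the event and the offset
   of the next one. Layout assumed (little-endian):
     0: u32 type     4: u32 length    8: u64 user_data
    16: u32 tv_sec  20: u32 tv_usec  24: u32 sequence  28: u32 crtc_id *)
let parse_one buf off =
  let ty = Bytes.get_int32_le buf off in
  let length = Int32.to_int (Bytes.get_int32_le buf (off + 4)) in
  let ev =
    if ty = drm_event_vblank || ty = drm_event_flip_complete then begin
      let d = {
        user_data = Bytes.get_int64_le buf (off + 8);
        sequence = Bytes.get_int32_le buf (off + 24);
        crtc_id = Bytes.get_int32_le buf (off + 28);
      } in
      if ty = drm_event_vblank then Vblank d else Flip_complete d
    end else Unknown ty
  in
  (ev, off + length)

let () =
  (* Build a fake 32-byte flip-complete event, as if read from the device file. *)
  let buf = Bytes.make 32 '\000' in
  Bytes.set_int32_le buf 0 drm_event_flip_complete;
  Bytes.set_int32_le buf 4 32l;
  Bytes.set_int64_le buf 8 42L;
  Bytes.set_int32_le buf 24 1234l;
  Bytes.set_int32_le buf 28 51l;
  match parse_one buf 0 with
  | Flip_complete d, next ->
    Printf.printf "Flip done: user_data=%Ld sequence=%ld crtc=%ld (next event at %d)\n"
      d.user_data d.sequence d.crtc_id next
  | _ -> print_endline "Some other event"
```

<p>Using the <code>length</code> field to find the next event means that unknown event types can be skipped safely when a single read returns several events.</p>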
<p>If you want to try it, this should produce an animated room:</p>
<pre><code>git clone https://github.com/talex5/vulkan-test -b kms-3d
cd vulkan-test
nix develop
make download-example
dune exec -- ./src/main.exe 10000 viking_room.obj viking_room.png
</code></pre>
<p>If run with <code>$WAYLAND_DISPLAY</code> set, it will open a Wayland window (as before),
but if run from a text console then it should render the animation directly using KMS.</p>
<h2 id="linux-vts">Linux VTs</h2>
<p>When the user switches to another virtual terminal (e.g. with Ctrl-Alt-F3),
we should call <code>Drm.Device.drop_master</code> to give up being the master,
allowing the application running on the new terminal to take over.</p>
<p>We should also switch the VT to <code>KD_GRAPHICS</code> mode while using it,
to stop the kernel trying to manage it.</p>
<p>I didn't implement either of these features, but see <a href="https://dvdhrm.wordpress.com/2013/08/24/how-vt-switching-works/">How VT-switching works</a> for details.</p>
<h2 id="debugging">Debugging</h2>
<p>If you get an unhelpful error code from the kernel (e.g. <code>EINVAL</code>), enabling debug messages is often helpful.
Writing <code>4</code> to <code>/sys/module/drm/parameters/debug</code> enables KMS debug messages, which can be seen in the <code>dmesg</code> output.
Write <code>0</code> to the file afterwards to turn the messages off again.
<code>modinfo -p drm</code> lists the various options.</p>
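<p>The value is a bitmask, so categories can be combined.
As a small sketch, this computes the number to write for a chosen set of categories, using bit values taken from the kernel's <code>drm_print.h</code>
(check <code>modinfo -p drm</code> on your own kernel, as the list may differ between versions).
The <code>category</code> type and <code>debug_mask</code> are illustrative names:</p>

```ocaml
type category = Core | Driver | Kms | Prime | Atomic | Vbl

(* Bit assigned to each debug category. *)
let bit = function
  | Core -> 0x01
  | Driver -> 0x02
  | Kms -> 0x04
  | Prime -> 0x08
  | Atomic -> 0x10
  | Vbl -> 0x20

(* Combine a list of categories into the value to write to the debug file. *)
let debug_mask cats = List.fold_left (fun acc c -> acc lor bit c) 0 cats

let () =
  Printf.printf "KMS only: %d\n" (debug_mask [Kms]);
  Printf.printf "KMS + atomic: %d\n" (debug_mask [Kms; Atomic])
```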
<h2 id="conclusions">Conclusions</h2>
<p>I hope that being able to explore the libdrm API interactively from the OCaml top-level
made it easier to learn how Linux manages displays.
As when doing <a href="https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan/">Vulkan in OCaml</a>,
a lot of the noise from C is removed and I think that the essentials of what is going on are easier to see.</p>
<p>I used <a href="https://github.com/yallop/ocaml-ctypes">ocaml-ctypes</a> for the C bindings, and this was my first time using it in &quot;stubs&quot; mode
(where it pre-generates C bindings from OCaml definitions).
This has the advantage that the C type checker checks that the definitions are correct,
and it worked well.
Dune's <a href="https://dune.readthedocs.io/en/latest/foreign-code.html#stub-generation-with-dune-ctypes">Stub Generation</a> feature generates the build rules for this semi-automatically.</p>
<p>Deciding what OCaml types to use for the C types was quite difficult.
For example, C has many different integer types (<code>int</code>, <code>long</code>, <code>uint32_t</code>, etc),
but using lots of types is more painful in OCaml where e.g. <code>+</code> only works on <code>int</code>.
I used OCaml's int type when possible, and other types only when the value might not fit
(e.g. an image size on a 32-bit platform might not fit into an OCaml int, which is one bit shorter).</p>
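<p>The size difference is easy to check from OCaml itself.
This sketch uses only the standard library; <code>fits_in_int</code> is a hypothetical helper showing the kind of test involved when deciding whether a C value can be held in a plain <code>int</code>:</p>

```ocaml
(* True if a 64-bit value can be stored in a native OCaml int without loss. *)
let fits_in_int (v : int64) =
  Int64.compare v (Int64.of_int min_int) >= 0
  && Int64.compare v (Int64.of_int max_int) <= 0

let () =
  Printf.printf "OCaml ints have %d bits here\n" Sys.int_size;
  (* The largest uint32_t value: fits in a 63-bit int, but not in a 31-bit one. *)
  Printf.printf "0xFFFFFFFF fits: %b\n" (fits_in_int 0xFFFF_FFFFL);
  (* The largest int64 value never fits in a native int. *)
  Printf.printf "Int64.max_int fits: %b\n" (fits_in_int Int64.max_int)
```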
<p>The C API is somewhat inconsistent about types.
For example, <code>drmModePageFlipTarget</code> takes a <code>uint32_t target_vblank</code> argument for the sequence number,
while <code>page_flip_handler</code> confirms the event by giving it as <code>unsigned int sequence</code>.
Meanwhile, the <code>sequence_handler</code> event gives it as <code>uint64_t sequence</code>.
I'm not sure what happens if the sequence number gets too large to fit in a 32-bit integer.</p>
<p>Anyway, I think I understand mode setting a lot better now,
and I'm getting faster at debugging graphics problems on Linux
(e.g. when <a href="https://github.com/NixOS/nixpkgs/issues/458132">element-desktop failed to start</a> recently after I updated it).</p>
<p>Thanks to the OCaml Software Foundation for sponsoring this work.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Vulkan graphics in OCaml vs C</title>
    <link href="https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan/"></link>
    <updated>2025-09-20T09:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2025/09/20/ocaml-vulkan</id>
    <content type="html"><![CDATA[<p>I convert my Vulkan test program from C to OCaml and compare the results,
then continue the Vulkan tutorial in OCaml, adding 3D, textures and depth buffering.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#introduction">Introduction</a>
</li>
<li><a href="#running-it-yourself">Running it yourself</a>
</li>
<li><a href="#the-direct-port">The direct port</a>
<ul>
<li><a href="#labelled-arguments">Labelled arguments</a>
</li>
<li><a href="#enums-and-bit-fields">Enums and bit-fields</a>
</li>
<li><a href="#optional-fields">Optional fields</a>
</li>
<li><a href="#loading-shaders">Loading shaders</a>
</li>
<li><a href="#logging">Logging</a>
</li>
<li><a href="#error-handling">Error handling</a>
</li>
</ul>
</li>
<li><a href="#refactored-version">Refactored version</a>
<ul>
<li><a href="#olivine-wrappers">Olivine wrappers</a>
</li>
<li><a href="#using-fibers--effects-for-control-flow">Using fibers / effects for control flow</a>
</li>
<li><a href="#using-the-cpu-and-gpu-in-parallel">Using the CPU and GPU in parallel</a>
</li>
<li><a href="#resizing-and-resource-lifetimes">Resizing and resource lifetimes</a>
</li>
</ul>
</li>
<li><a href="#the-3d-version">The 3D version</a>
</li>
<li><a href="#garbage-collection">Garbage collection</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>( this post also appeared on <a href="https://lobste.rs/s/pzhqdb/vulkan_graphics_ocaml_vs_c">Lobsters</a> )</p>
<h2 id="introduction">Introduction</h2>
<p>In <a href="https://roscidus.com/blog/blog/2025/06/24/graphics/">Investigating Linux graphics</a>,
I wrote a little C program to help me learn about GPUs by drawing a triangle.
But I wondered if using OCaml instead would make my life easier.
It didn't, because there were no released OCaml Vulkan bindings,
but I found some <a href="https://github.com/Octachron/olivine">unfinished ones</a> by Florian Angeletti.
The bindings are generated mostly automatically from <a href="https://github.com/KhronosGroup/Vulkan-Docs/tree/main/xml">the Vulkan XML specification</a>,
and with <a href="https://github.com/talex5/olivine/commits/blog">a bit of effort</a> I got them working well enough to
continue with the <a href="https://docs.vulkan.org/tutorial/latest/00_Introduction.html">Vulkan tutorial</a>,
which resulted in this nice <a href="https://sketchfab.com/3d-models/viking-room-a49f1b8e4f5c4ecf9e1fe7d81915ad38">Viking room</a>:</p>
<p><a href="/blog/images/vulkan-ocaml/viking.png"><span class="caption-wrapper center"><img src="/blog/images/vulkan-ocaml/viking.png" title="Vulkan tutorial in OCaml" class="caption"/><span class="caption-text">Vulkan tutorial in OCaml</span></span></a></p>
<p>In this post, I'll be looking at how the C code compares to the OCaml.
First, I did a direct line-by-line port of the C, then I refactored it to take better advantage of OCaml.</p>
<p>(Note: the Vulkan tutorial is actually <a href="https://docs.vulkan.org/tutorial/latest/_attachments/28_model_loading.cpp">using C++</a>, but I'm comparing my C version to OCaml)</p>
<h2 id="running-it-yourself">Running it yourself</h2>
<p>If you want to try it yourself (note: it requires Wayland):</p>
<pre><code>git clone https://github.com/talex5/vulkan-test -b ocaml
cd vulkan-test
nix develop
dune exec -- ./src/main.exe 200
</code></pre>
<p>As the OCaml Vulkan bindings (Olivine) are unreleased,
I included a copy of my patched version in <code>vendor/olivine</code>.
The <code>dune exec</code> command will build them automatically.</p>
<p>The <code>ocaml</code> branch above just draws one triangle.
If you want to see the 3D room pictured above, use <code>ocaml-3d</code> instead:</p>
<pre><code>git clone https://github.com/talex5/vulkan-test -b ocaml-3d
cd vulkan-test
nix develop
make download-example
dune exec -- ./src/main.exe 10000 viking_room.obj viking_room.png
</code></pre>
<h2 id="the-direct-port">The direct port</h2>
<p>Porting the code directly, line by line, was pretty straightforward:</p>
<p><a href="/blog/images/vulkan-ocaml/meld.png"><span class="caption-wrapper center"><img src="/blog/images/vulkan-ocaml/meld.png" title="Comparing the code with meld" class="caption"/><span class="caption-text">Comparing the code with meld</span></span></a></p>
<p><a href="https://github.com/talex5/vulkan-test/tree/direct-port/src">The code</a> ended up slightly shorter, but not by much:</p>
<pre><code> 28 files changed, 1223 insertions(+), 1287 deletions(-)
</code></pre>
<p>This is only approximate; sometimes I added or removed blank lines, etc.
Some things were a bit easier and others a bit harder. It mostly balanced out.</p>
<p>As an example, one thing that makes the OCaml shorter is that arrays are passed as a single item,
whereas C takes the length separately.
On the other hand, single-item arrays can be passed in C by just giving the address of the pointer,
whereas OCaml requires an array to be constructed separately.
Also, I had to include some bindings for the libdrm C library.</p>
<h3 id="labelled-arguments">Labelled arguments</h3>
<p>The OCaml bindings use labelled arguments
(e.g. the <code>VK_TRUE</code> argument in the screenshot above became <code>~wait_all:true</code> in the OCaml),
which is longer but clearer.</p>
<p>The OCaml code uses functions to create C structures, which looks pretty similar due to labels.
For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">const</span><span class="w"> </span><span class="n">VkSemaphoreGetFdInfoKHR</span><span class="w"> </span><span class="n">get_fd_info</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">sType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">semaphore</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">semaphore</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">handleType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT</span><span class="p">,</span>
</span><span class="line"><span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">get_fd_info</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Semaphore_get_fd_info_khr</span><span class="p">.</span><span class="n">make</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">~</span><span class="n">semaphore</span>
</span><span class="line">    <span class="o">~</span><span class="n">handle_type</span><span class="o">:</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">External_semaphore_handle_type_flags</span><span class="p">.</span><span class="n">sync_fd</span>
</span><span class="line"><span class="k">in</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>An advantage is that the <code>sType</code> field gets filled in automatically.</p>
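<p>As a rough sketch of how such a <code>make</code> function can work (the types here are made-up stand-ins, not Olivine's generated code),
the labelled arguments play the role of C's designated initialisers and the constructor supplies the tag itself:</p>
<pre><code class="ocaml">(* Hypothetical stand-ins for the generated Vulkan types. *)
type structure_type = Semaphore_get_fd_info_khr
type handle_type = Opaque_fd | Sync_fd

type get_fd_info = {
  s_type : structure_type;   (* always the same for this structure *)
  semaphore : int;           (* stand-in for a semaphore handle *)
  handle_type : handle_type;
}

(* The caller never mentions [s_type]; it is filled in here. *)
let make ~semaphore ~handle_type () =
  { s_type = Semaphore_get_fd_info_khr; semaphore; handle_type }

let () =
  let semaphore = 42 in
  let info = make () ~semaphore ~handle_type:Sync_fd in
  assert (info.s_type = Semaphore_get_fd_info_khr);
  assert (info.semaphore = 42)
</code></pre>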
<h3 id="enums-and-bit-fields">Enums and bit-fields</h3>
<p>Enumerations and bit-fields are namespaced, which is a lot clearer
as you can see which part is the name of the enum and which part is the particular value.
For example, <code>VK_ATTACHMENT_STORE_OP_STORE</code> becomes <code>Vkt.Attachment_store_op.Store</code>.
Also, OCaml usually knows the expected type and you can omit the module, so:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="n">VkAttachmentDescription</span><span class="w"> </span><span class="n">colorAttachment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">format</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_SAMPLE_COUNT_1_BIT</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">loadOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_LOAD_OP_CLEAR</span><span class="p">,</span><span class="w">  </span><span class="c1">// Clear framebuffer before rendering</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">storeOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_STORE_OP_STORE</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">stencilLoadOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_LOAD_OP_DONT_CARE</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">stencilStoreOp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_ATTACHMENT_STORE_OP_DONT_CARE</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">initialLayout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_IMAGE_LAYOUT_UNDEFINED</span><span class="p">,</span>
</span><span class="line"><span class="w">    </span><span class="p">.</span><span class="n">finalLayout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_IMAGE_LAYOUT_GENERAL</span><span class="p">,</span>
</span><span class="line"><span class="p">};</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">color_attachment</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Attachment_description</span><span class="p">.</span><span class="n">make</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">~</span><span class="n">format</span><span class="o">:</span><span class="n">format</span>
</span><span class="line">    <span class="o">~</span><span class="n">samples</span><span class="o">:</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Sample_count_flags</span><span class="p">.</span><span class="n">n1</span>
</span><span class="line">    <span class="o">~</span><span class="n">load_op</span><span class="o">:</span><span class="nc">Clear</span>	<span class="c">(* Clear framebuffer before rendering *)</span>
</span><span class="line">    <span class="o">~</span><span class="n">store_op</span><span class="o">:</span><span class="nc">Store</span>
</span><span class="line">    <span class="o">~</span><span class="n">stencil_load_op</span><span class="o">:</span><span class="nc">Dont_care</span>
</span><span class="line">    <span class="o">~</span><span class="n">stencil_store_op</span><span class="o">:</span><span class="nc">Dont_care</span>
</span><span class="line">    <span class="o">~</span><span class="n">initial_layout</span><span class="o">:</span><span class="nc">Undefined</span>
</span><span class="line">    <span class="o">~</span><span class="n">final_layout</span><span class="o">:</span><span class="nc">General</span>
</span><span class="line"><span class="k">in</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Bit-fields and enums get their own types (they're not just integers), so you can't use them in the wrong place
or try to combine things that aren't bit-fields (and so the <code>_BIT</code> suffix isn't needed).
One particularly striking example of the difference is that</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="p">.</span><span class="n">colorWriteMask</span><span class="w"> </span><span class="o">=</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_R_BIT</span><span class="w"> </span><span class="o">|</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_G_BIT</span><span class="w"> </span><span class="o">|</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_B_BIT</span><span class="w"> </span><span class="o">|</span>
</span><span class="line"><span class="w">    </span><span class="n">VK_COLOR_COMPONENT_A_BIT</span><span class="p">,</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="o">~</span><span class="n">color_write_mask</span><span class="o">:</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Color_component_flags</span><span class="p">.</span><span class="o">(</span><span class="n">r</span> <span class="o">+</span> <span class="n">g</span> <span class="o">+</span> <span class="n">b</span> <span class="o">+</span> <span class="n">a</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>Vkt.Color_component_flags.(...)</code> brings all the module's symbols into scope,
including the <code>+</code> operator for combining the flags.</p>
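<p>To illustrate the idea, here is a minimal sketch of a flags module with an abstract type and its own <code>+</code>.
The module is written by hand for this example (including the <code>mem</code> helper, which is an assumption); the real generated module may look quite different:</p>
<pre><code class="ocaml">module Color_component_flags : sig
  type t                         (* abstract: not interchangeable with int *)
  val r : t
  val g : t
  val b : t
  val a : t
  val ( + ) : t -&gt; t -&gt; t        (* union of two flag sets *)
  val mem : t -&gt; t -&gt; bool       (* [mem flag set] *)
end = struct
  type t = int
  let r = 1
  let g = 2
  let b = 4
  let a = 8
  let ( + ) = ( lor )
  let mem flag set = flag land set = flag
end

let () =
  let mask = Color_component_flags.(r + g + b) in
  assert Color_component_flags.(mem g mask);
  assert (not Color_component_flags.(mem a mask))
  (* [Color_component_flags.(r + 1)] is rejected by the type checker:
     inside the local open, [+] only accepts flags. *)
</code></pre>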
<h3 id="optional-fields">Optional fields</h3>
<p>The specification says which fields are optional. In C you can ignore that, but OCaml enforces it.
This can be annoying sometimes, e.g.</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="p">.</span><span class="n">blendEnable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VK_FALSE</span><span class="p">,</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="o">~</span><span class="n">blend_enable</span><span class="o">:</span><span class="bp">false</span>
</span><span class="line"><span class="o">~</span><span class="n">src_color_blend_factor</span><span class="o">:</span><span class="nc">One</span>
</span><span class="line"><span class="o">~</span><span class="n">dst_color_blend_factor</span><span class="o">:</span><span class="nc">Zero</span>
</span><span class="line"><span class="o">~</span><span class="n">color_blend_op</span><span class="o">:</span><span class="nc">Add</span>
</span><span class="line"><span class="o">~</span><span class="n">src_alpha_blend_factor</span><span class="o">:</span><span class="nc">One</span>
</span><span class="line"><span class="o">~</span><span class="n">dst_alpha_blend_factor</span><span class="o">:</span><span class="nc">Zero</span>
</span><span class="line"><span class="o">~</span><span class="n">alpha_blend_op</span><span class="o">:</span><span class="nc">Add</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>because the spec says these are all non-optional, rather than that they are only needed when blending is enabled.</p>
<p>There's a similar situation with the Wayland code:
the OCaml compiler requires you to provide a handler for all possible events.
For example, OCaml forced me to write a handler for the window <code>close</code> event
(and so closing the window works in the OCaml version, but not in the C one).
Likewise, if the compositor returns an error from <code>create_immed</code> the OCaml version logs it,
while the C version ignored the error message, because the C compiler didn't remind me about that.</p>
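<p>The same kind of reminder can be shown with a plain variant type.
This is only an analogy (the <code>event</code> type and <code>handle</code> function below are hypothetical and not how ocaml-wayland represents events),
but the effect is similar: leave out a case and the compiler reports the match as non-exhaustive.</p>
<pre><code class="ocaml">type event =
  | Configure of int * int   (* new width and height *)
  | Close

(* Deleting the [Close] case makes the compiler complain that
   this pattern match is not exhaustive. *)
let handle = function
  | Configure (w, h) -&gt; Printf.sprintf "resize to %dx%d" w h
  | Close -&gt; "shutting down"

let () =
  assert (handle (Configure (640, 480)) = "resize to 640x480");
  assert (handle Close = "shutting down")
</code></pre>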
<h3 id="loading-shaders">Loading shaders</h3>
<p>Loading the shaders was easier.
The C version has code to load the shader bytecode from disk, but in the OCaml I used <a href="https://github.com/johnwhitington/ppx_blob">ppx_blob</a>
to include it at compile time, producing a self-contained executable file:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">load_shader_module</span> <span class="n">device</span> <span class="o">[%</span><span class="n">blob</span> <span class="s2">&quot;./vert.spv&quot;</span><span class="o">]</span>
</span></code></pre></td></tr></tbody></table></div></figure><h3 id="logging">Logging</h3>
<p>OCaml has a somewhat standard logging library, so I was able to get the log messages shown as I wanted
without having to pipe the output through <code>awk</code>.
And, as a bonus, the log messages get written in the correct order now.
For example, the C libwayland logs:
<pre><code>wl_display#1.delete_id(3)
...
wl_callback#3.done(59067)
</code></pre>
<p>which appears to show a callback firing some time after it was deleted,
while <a href="https://github.com/talex5/ocaml-wayland">ocaml-wayland</a> logs:</p>
<pre><code>&lt;- wl_callback@3.done callback_data:1388855
&lt;- wl_display@1.delete_id id:3
</code></pre>
<h3 id="error-handling">Error handling</h3>
<p>The OCaml bindings return a <code>result</code> type for functions that can return errors,
using polymorphic variants to say exactly which errors can be returned by each function.
That's clever, but I found it pretty useless in practice and I followed the Olivine example code
in immediately turning every <code>Error</code> result into an exception.
You can then handle errors at a higher level (unlike the C, which just calls <code>exit</code>).
Maybe Olivine should be changed to do that itself.</p>
<p>I thought I'd been rigorous about checking for errors in the C, but I missed some places (e.g. <code>vkMapMemory</code>).
The OCaml compiler forced me to handle those too, of course.</p>
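<p>Here is a small sketch of that pattern.
The <code>map_memory</code> function, the <code>Vulkan_error</code> exception and this definition of <code>&lt;?&gt;</code> are all assumptions made up for the example;
they only show the shape of "precise result type in, exception out":</p>
<pre><code class="ocaml">exception Vulkan_error of string * string   (* operation, error *)

(* Hypothetical binding: its type lists exactly the errors it can return. *)
let map_memory size : (bytes, [ `Out_of_host_memory | `Memory_map_failed ]) result =
  if size &lt; 0 then Error `Memory_map_failed else Ok (Bytes.create size)

let error_name = function
  | `Out_of_host_memory -&gt; "out of host memory"
  | `Memory_map_failed -&gt; "memory map failed"

(* Unwrap an [Ok], or raise with the name of the failing operation. *)
let ( &lt;?&gt; ) result operation =
  match result with
  | Ok x -&gt; x
  | Error e -&gt; raise (Vulkan_error (operation, error_name e))

let () =
  let mem = map_memory 16 &lt;?&gt; "map_memory" in
  assert (Bytes.length mem = 16);
  match map_memory (-1) &lt;?&gt; "map_memory" with
  | _ -&gt; assert false
  | exception Vulkan_error (op, msg) -&gt;
    assert (op = "map_memory");
    assert (msg = "memory map failed")
</code></pre>
<p>Because the result must be unwrapped before the value can be used, forgetting the check is a type error rather than a silent bug.</p>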
<h2 id="refactored-version">Refactored version</h2>
<p>One reason to switch to OCaml was that I was finding it hard to see how all the C code fit together.
I felt that the overall structure was getting lost in the noise.
While the initial OCaml version was similar to the C,
I think <a href="https://github.com/talex5/vulkan-test/tree/ocaml/src">the refactored version</a> is quite a bit easier to read.</p>
<p>Moving code to separate files is much easier than in C.
There, you typically need to write a header file too, and then include it from the other files.
But in the OCaml I could just move e.g. <code>export_semaphore</code> to <code>export</code> in a new file called <code>semaphore.ml</code> and
refer to it as <code>Semaphore.export</code>.
Because each file gets its own namespace, you don't have to guess where functions are defined,
and you don't get naming conflicts between symbols in different files.
The build system (dune) automatically builds all modules in the correct order.</p>
<h3 id="olivine-wrappers">Olivine wrappers</h3>
<p>I added a <code>vulkan</code> directory with wrappers around the auto-generated Vulkan functions
with the aim of removing some noise.
For example, the wrappers take OCaml lists and convert them to C arrays as needed,
and raise exceptions on error instead of returning a result type.</p>
<p>Sometimes they do more, as in the case of <code>queue_submit</code>.
That took separate <code>wait_semaphores</code> and <code>wait_dst_stage_mask</code> arrays,
requiring them to be the same length.
By taking a list of tuples, the wrapper avoids the possibility of this error.
The old submit code:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">wait_semaphores</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Semaphore</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">t</span><span class="o">.</span><span class="n">image_available</span><span class="o">]</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">wait_stages</span> <span class="o">=</span> <span class="o">[</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Pipeline_stage_flags</span><span class="p">.</span><span class="n">color_attachment_output</span><span class="o">]</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">submit_info</span> <span class="o">=</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Submit_info</span><span class="p">.</span><span class="n">make</span> <span class="bp">()</span>
</span><span class="line">    <span class="o">~</span><span class="n">wait_semaphores</span>
</span><span class="line">    <span class="o">~</span><span class="n">wait_dst_stage_mask</span><span class="o">:(</span><span class="nn">A</span><span class="p">.</span><span class="n">of_list</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Pipeline_stage_flags</span><span class="p">.</span><span class="n">ctype</span> <span class="n">wait_stages</span><span class="o">)</span>
</span><span class="line">    <span class="o">~</span><span class="n">command_buffers</span><span class="o">:(</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Command_buffer</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">t</span><span class="o">.</span><span class="n">command_buffer</span><span class="o">])</span>
</span><span class="line">    <span class="o">~</span><span class="n">signal_semaphores</span><span class="o">:(</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Semaphore</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">frame_state</span><span class="o">.</span><span class="n">render_finished</span><span class="o">])</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="nn">Vkc</span><span class="p">.</span><span class="n">queue_submit</span> <span class="n">t</span><span class="o">.</span><span class="n">graphics_queue</span> <span class="bp">()</span>
</span><span class="line">  <span class="o">~</span><span class="n">submits</span><span class="o">:(</span><span class="nn">Vkt</span><span class="p">.</span><span class="nn">Submit_info</span><span class="p">.</span><span class="n">array</span> <span class="o">[</span><span class="n">submit_info</span><span class="o">])</span>
</span><span class="line">  <span class="o">~</span><span class="n">fence</span><span class="o">:</span><span class="n">t</span><span class="o">.</span><span class="n">in_flight_fence</span> <span class="o">&lt;?&gt;</span> <span class="s2">&quot;queue_submit&quot;</span><span class="o">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>becomes:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Vulkan</span><span class="p">.</span><span class="nn">Cmd</span><span class="p">.</span><span class="n">submit</span> <span class="n">device</span> <span class="n">t</span><span class="o">.</span><span class="n">command_buffer</span>
</span><span class="line">  <span class="o">~</span><span class="n">wait</span><span class="o">:[</span><span class="n">t</span><span class="o">.</span><span class="n">image_available</span><span class="o">,</span> <span class="nn">Vkt</span><span class="p">.</span><span class="nn">Pipeline_stage_flags</span><span class="p">.</span><span class="n">color_attachment_output</span><span class="o">]</span>
</span><span class="line">  <span class="o">~</span><span class="n">signal_semaphores</span><span class="o">:[</span><span class="n">frame_state</span><span class="o">.</span><span class="n">render_finished</span><span class="o">]</span>
</span><span class="line">  <span class="o">~</span><span class="n">fence</span><span class="o">:</span><span class="n">t</span><span class="o">.</span><span class="n">in_flight_fence</span><span class="o">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Sometimes the new API drops features I don't use (or don't currently understand).
For example, my new <code>submit</code> only lets you submit one command buffer at a time
(though each buffer can have many commands).</p>
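<p>The tuple trick is simple enough to show in isolation.
In this sketch <code>raw_submit</code> is a hypothetical stand-in for the low-level call with its two parallel arrays,
and <code>submit</code> is a simplified wrapper, not the actual code from the repository:</p>
<pre><code class="ocaml">(* Hypothetical stand-ins for semaphore handles and pipeline-stage flags. *)
type semaphore = Semaphore of string
type stage = Color_attachment_output | Transfer

(* The low-level call wants two parallel arrays of equal length. *)
let raw_submit ~wait_semaphores ~wait_dst_stage_mask =
  assert (Array.length wait_semaphores = Array.length wait_dst_stage_mask);
  Array.length wait_semaphores

(* The wrapper takes pairs, so the lengths cannot disagree. *)
let submit ~(wait : (semaphore * stage) list) =
  let semaphores, stages = List.split wait in
  raw_submit
    ~wait_semaphores:(Array.of_list semaphores)
    ~wait_dst_stage_mask:(Array.of_list stages)

let () =
  let n = submit ~wait:[Semaphore "image_available", Color_attachment_output] in
  assert (n = 1)
</code></pre>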
<p>I moved various generic helper functions like <code>find_memory_type</code> to the wrapper library,
getting them out of the main application code.</p>
<p>Separating out these libraries made the code longer, but I think it makes it easier to read:</p>
<pre><code> 20 files changed, 843 insertions(+), 663 deletions(-)
</code></pre>
<h3 id="using-fibers--effects-for-control-flow">Using fibers / effects for control flow</h3>
<p>The C code has a single thread with a single stack,
using callbacks to redraw when the compositor is ready.
OCaml has fibers (light-weight cooperative threads), so we can use a plain loop:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">while</span> <span class="n">t</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;</span> <span class="n">frame_limit</span> <span class="k">do</span>
</span><span class="line">  <span class="k">let</span> <span class="n">next_frame_due</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">frame</span> <span class="n">window</span> <span class="k">in</span>
</span><span class="line">  <span class="n">draw_frame</span> <span class="n">t</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Promise</span><span class="p">.</span><span class="n">await</span> <span class="n">next_frame_due</span><span class="o">;</span>
</span><span class="line">  <span class="n">t</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;-</span> <span class="n">t</span><span class="o">.</span><span class="n">frame</span> <span class="o">+</span> <span class="mi">1</span>
</span><span class="line"><span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The <code>Promise.await</code> suspends this fiber, allowing e.g. the Wayland code to handle incoming events.
I find that makes the logic easier to follow.</p>
<h3 id="using-the-cpu-and-gpu-in-parallel">Using the CPU and GPU in parallel</h3>
<p>Next I split off the input handling from the huge <code>render.ml</code> file into <a href="https://github.com/talex5/vulkan-test/tree/ocaml/src/input.ml">input.ml</a>.</p>
<p>The Vulkan tutorial creates one uniform buffer for the input data for each frame-buffer, but this seems wasteful.
I think we only need at most two: one for the GPU to read, and one for the CPU to write for the next frame,
if we want to do that in parallel.</p>
<p>To allow this parallel operation I also had to create a pair of command buffers.
The <a href="https://github.com/talex5/vulkan-test/tree/ocaml/src/duo.ml">duo.ml</a> module holds the two (input, command-buffer) jobs and swaps them on submit.</p>
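<p>The swapping idea can be sketched in a few lines.
This is a simplified illustration with invented field names, not the contents of <code>duo.ml</code>,
and it leaves out the waiting a real version would need before reusing a job:</p>
<pre><code class="ocaml">type 'job duo = {
  mutable recording : 'job;   (* the CPU fills this one in *)
  mutable in_flight : 'job;   (* the GPU may still be reading this one *)
}

let create a b = { recording = a; in_flight = b }

(* The job to prepare for the next frame. *)
let get t = t.recording

(* Submitting hands the recorded job to the GPU and recycles the other.
   A real version would first wait until the GPU has finished with [in_flight]. *)
let submit t =
  let done_recording = t.recording in
  t.recording &lt;- t.in_flight;
  t.in_flight &lt;- done_recording

let () =
  let duo = create "job A" "job B" in
  assert (get duo = "job A");
  submit duo;
  assert (get duo = "job B");
  submit duo;
  assert (get duo = "job A")
</code></pre>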
<h3 id="resizing-and-resource-lifetimes">Resizing and resource lifetimes</h3>
<p>When the window size changes we need to destroy the old swap-chain and recreate all the images, views
and framebuffers.
My C code didn't bother, and just kept things at 640x480.</p>
<p>The main problem here is how to clean up the old resources.
We could use the garbage collector, but the framebuffers are rather large and I'd like to get them freed promptly.
Also, Vulkan requires things to be freed in the correct order, which the GC wouldn't ensure.</p>
<p>I added code to free resources by having each constructor take a <code>sw</code> switch argument.
When the switch is turned off, all resources attached to it are freed.
That makes it easy to scope things to the stack: when the <code>Switch.run</code> block ends, all resources it created are freed.</p>
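<p>To make the mechanism concrete, here is a toy switch using only the standard library.
It is not the real <code>Switch</code> module (the <code>on_release</code>, <code>run</code> and <code>create</code> functions below are written just for this sketch),
but it shows resources being released in reverse order of creation when the block ends:</p>
<pre><code class="ocaml">type switch = { mutable on_release : (unit -&gt; unit) list }

(* Register a clean-up function; the most recently added runs first. *)
let on_release sw fn = sw.on_release &lt;- fn :: sw.on_release

(* Run [fn] with a fresh switch, then release everything attached to it,
   even if [fn] raises. *)
let run fn =
  let sw = { on_release = [] } in
  Fun.protect
    ~finally:(fun () -&gt; List.iter (fun release -&gt; release ()) sw.on_release)
    (fun () -&gt; fn sw)

(* Records the order in which the pretend resources are freed. *)
let log : string list ref = ref []

(* A pretend constructor: "allocates" [name] and schedules its release. *)
let create ~sw name =
  on_release sw (fun () -&gt; log := ("free " ^ name) :: !log);
  name

let () =
  run (fun sw -&gt;
    let _image = create ~sw "image" in
    let _view = create ~sw "view" in
    ());
  assert (List.rev !log = ["free view"; "free image"])
</code></pre>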
<p>But the life-cycle of the swap-chain is a little complicated.
I don't want to clutter the main application loop with the logic of adapting to size changes.
Again, OCaml's fibers system makes it easy to have multiple stacks so I have another fiber run:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">render_loop</span> <span class="n">t</span> <span class="n">duo</span> <span class="o">=</span>
</span><span class="line">  <span class="k">while</span> <span class="bp">true</span> <span class="k">do</span>
</span><span class="line">    <span class="k">let</span> <span class="n">geometry</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">geometry</span> <span class="n">t</span><span class="o">.</span><span class="n">window</span> <span class="k">in</span>
</span><span class="line">    <span class="nn">Switch</span><span class="p">.</span><span class="n">run</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">sw</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="k">let</span> <span class="n">framebuffers</span> <span class="o">=</span> <span class="n">create_swapchain</span> <span class="o">~</span><span class="n">sw</span> <span class="n">t</span> <span class="n">geometry</span> <span class="k">in</span>
</span><span class="line">    <span class="k">while</span> <span class="n">geometry</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">geometry</span> <span class="n">t</span><span class="o">.</span><span class="n">window</span> <span class="k">do</span>
</span><span class="line">      <span class="k">let</span> <span class="n">fb</span> <span class="o">=</span> <span class="nn">Vulkan</span><span class="p">.</span><span class="nn">Swap_chain</span><span class="p">.</span><span class="n">get_framebuffer</span> <span class="n">framebuffers</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">redraw_needed</span> <span class="o">=</span> <span class="n">next_as_promise</span> <span class="n">t</span><span class="o">.</span><span class="n">redraw_needed</span> <span class="k">in</span>
</span><span class="line">      <span class="k">let</span> <span class="n">job</span> <span class="o">=</span> <span class="nn">Duo</span><span class="p">.</span><span class="n">get</span> <span class="n">duo</span> <span class="k">in</span>
</span><span class="line">      <span class="n">record_commands</span> <span class="n">t</span> <span class="n">job</span> <span class="n">fb</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Duo</span><span class="p">.</span><span class="n">submit</span> <span class="n">duo</span> <span class="n">fb</span> <span class="n">job</span><span class="o">.</span><span class="n">command_buffer</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Window</span><span class="p">.</span><span class="n">attach</span> <span class="n">t</span><span class="o">.</span><span class="n">window</span> <span class="o">~</span><span class="n">buffer</span><span class="o">:</span><span class="n">fb</span><span class="o">.</span><span class="n">wl_buffer</span><span class="o">;</span>
</span><span class="line">      <span class="nn">Promise</span><span class="p">.</span><span class="n">await</span> <span class="n">redraw_needed</span>
</span><span class="line">    <span class="k">done</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>The C code created a fixed set of 4 framebuffers on each resize, but the OCaml only creates them as needed.
When dragging the window to resize, that means we may only need to create one at each size,
and when keeping a steady size, it seems I only need 3 framebuffers with Sway.</p>
<p>The main loop changes slightly so that it just triggers the <code>render_loop</code> fiber:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">while</span> <span class="n">render</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;</span> <span class="n">frame_limit</span> <span class="k">do</span>
</span><span class="line">  <span class="k">let</span> <span class="n">next_frame_due</span> <span class="o">=</span> <span class="nn">Window</span><span class="p">.</span><span class="n">frame</span> <span class="n">window</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Render</span><span class="p">.</span><span class="n">trigger_redraw</span> <span class="n">render</span><span class="o">;</span>
</span><span class="line">  <span class="nn">Promise</span><span class="p">.</span><span class="n">await</span> <span class="n">next_frame_due</span><span class="o">;</span>
</span><span class="line">  <span class="n">render</span><span class="o">.</span><span class="n">frame</span> <span class="o">&lt;-</span> <span class="n">render</span><span class="o">.</span><span class="n">frame</span> <span class="o">+</span> <span class="mi">1</span>
</span><span class="line"><span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>I'm not sure if freeing the framebuffers immediately is safe,
since in theory the GPU might still be using them if the display server requests a new frame
at a new size before the previous one has finished rendering on the GPU.
Possibly freed OCaml resources should instead get added to
a list of things to free on the C side the next time the GPU is idle.</p>
<h2 id="the-3d-version">The 3D version</h2>
<p>Although it looks a lot more impressive, the <a href="https://github.com/talex5/vulkan-test/commit/ocaml-3d">3D version</a> isn't that much more work than the 2D triangle.</p>
<p>I used the <a href="https://github.com/Chris00/ocaml-cairo">Cairo</a> library to load the PNG file with the textures
and then added a Vulkan <em>sampler</em> for it.
The shader code has to be modified to read the colour from the texture.
The most complex bit is that the texture needs to be copied
from Cairo's memory to host memory that's visible to the GPU,
and from there to fast local memory on the GPU (see <a href="https://github.com/talex5/vulkan-test/tree/ocaml-3d/src/texture.ml">texture.ml</a>).</p>
<p>Other changes needed:</p>
<ul>
<li>There's a bit of matrix stuff to position the model and project it in 3D.
</li>
<li>I added <a href="https://github.com/talex5/vulkan-test/tree/ocaml-3d/src/obj_format.ml">obj_format.ml</a> to parse the model data.
</li>
<li>The pipeline adds a depth buffer so near things obscure things behind them, regardless of the drawing order.
</li>
</ul>
<p>I didn't get my C version to do the 3D bits, but for comparison here's the Vulkan tutorial's official <a href="https://docs.vulkan.org/tutorial/latest/_attachments/28_model_loading.cpp">C++ version</a>.</p>
<h2 id="garbage-collection">Garbage collection</h2>
<p>To render smoothly at 60Hz, we have about 16ms for each frame.
You might wonder if using a garbage collector would introduce pauses and cause us to miss frames,
but this doesn't seem to be a problem.</p>
<p>In C, you can improve performance for frame-based applications by using a <a href="https://en.wikipedia.org/wiki/Region-based_memory_management">bump allocator</a>:</p>
<ol>
<li>Create a fixed buffer with enough space for every allocation needed for one frame.
</li>
<li>Allocate memory sequentially from the region, just by bumping the next-free-address pointer.
</li>
<li>At the end of each frame, reset the pointer.
</li>
</ol>
<p>This makes allocation really fast and freeing things at the end costs nothing.
Implementing this in C requires special code,
but OCaml works this way by default, allocating new values sequentially onto the <em>minor heap</em>.
At the end of each frame, we can call <code>Gc.minor</code> to reset the heap.</p>
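<p>As a minimal sketch of that pattern (this is not the code from the repository; <code>render_frame</code> is a made-up stand-in for the real per-frame work), a frame loop that forces a minor collection after every frame might look like this. It uses the standard <code>Gc.quick_stat</code> to report how many minor collections ran:</p>

```ocaml
(* Hypothetical per-frame work: builds short-lived values on the minor heap. *)
let render_frame n =
  let points = List.init 1000 (fun i -> (float_of_int i, float_of_int (i + n))) in
  List.fold_left (fun acc (x, y) -> acc +. x +. y) 0.0 points

(* Run [frames] frames, forcing a minor GC at the end of each one.
   Returns how many minor collections happened during the run. *)
let run_frames frames =
  let before = (Gc.quick_stat ()).Gc.minor_collections in
  for n = 1 to frames do
    ignore (Sys.opaque_identity (render_frame n));
    (* End of frame: nothing allocated during this frame is still needed. *)
    Gc.minor ()
  done;
  (Gc.quick_stat ()).Gc.minor_collections - before

let () = Printf.printf "minor collections: %d\n" (run_frames 60)
```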
<p><code>Gc.minor</code> scans the stack looking for pointers to values that are still in use
and copies any it finds to the <em>major heap</em>.
However, since we're at the end of the frame, the stack is pretty much empty and there's almost nothing to scan.
I captured a trace of running the 3D room version with a forced minor GC at the end of every frame:</p>
<pre><code>make &amp;&amp; eio-trace run ./_build/default/src/main.exe
</code></pre>
<p><a href="/blog/images/vulkan-ocaml/trace-3d.png"><span class="caption-wrapper center"><img src="/blog/images/vulkan-ocaml/trace-3d.png" title="Tracing the full 3D version" class="caption"/><span class="caption-text">Tracing the full 3D version</span></span></a></p>
<p>The four long grey horizontal bars are the main fibers.
From top to bottom they are:</p>
<ul>
<li>The main application loop (incrementing the frame counter and triggering the render loop fiber).
</li>
<li>An <a href="https://github.com/talex5/ocaml-wayland">ocaml-wayland</a> fiber, receiving messages from the display server
(and spawning some short-lived sending fibers).
</li>
<li>The <code>render_loop</code> fiber (sending graphics commands to the GPU).
</li>
<li>A fiber used internally by the IO system.
</li>
</ul>
<p>The green sections show when each fiber is running and the yellow background indicates when the process is sleeping.
The thin red columns indicate time spent in GC (which we're here triggering after every frame).</p>
<p>If I remove the forced <code>Gc.minor</code> after each frame then the GC happens less often,
but can take a bit longer when it does.
Still not nearly long enough to miss the deadline for rendering the frame though.</p>
<p>Collection of the major heap is done incrementally in small slices and doesn't cause any trouble.</p>
<p>So, we're only using a tiny fraction of the available time.
Also, I suspect the CPU is running in a slow power-saving mode due to all the sleeping;
if we had more work to do then it would probably speed up.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Doing Vulkan programming in OCaml has advantages (clearer code, easier refactoring),
but also disadvantages (unfinished and unreleased Vulkan bindings, some friction using a C API from OCaml,
and I had to write more support code, such as some bindings for libdrm).</p>
<p>As a C API, Vulkan is not safe and will happily segfault if passed incorrect arguments.
The OCaml bindings do not fix this, and so care is still needed.
I didn't bother about that because it wasn't a problem in practice,
and properly protecting against use-after-free will probably require some changes to OCaml
(e.g. <a href="https://github.com/ocaml/ocaml/pull/389">unmapping memory</a> isn't safe without something like the &quot;modes&quot; being prototyped in
<a href="https://oxcaml.org/">OxCaml</a>).</p>
<p>I'm slowly upstreaming my changes to Olivine; hopefully this will all be easier to use one day!</p>
]]></content>
  </entry>
  <entry>
    <title type="html">OCaml 5 performance part 2</title>
    <link href="https://roscidus.com/blog/blog/2024/07/22/performance-2/"></link>
    <updated>2024-07-22T11:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2024/07/22/performance-2</id>
    <content type="html"><![CDATA[<p>The <a href="/blog/blog/2024/07/22/performance/">last post</a> looked at using various tools to understand why an OCaml 5 program was waiting a long time for IO.
In this post, I'll be trying out some tools to investigate a compute-intensive program that uses multiple CPUs.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-problem">The problem</a>
</li>
<li><a href="#threadsanitizer">ThreadSanitizer</a>
</li>
<li><a href="#perf">perf</a>
</li>
<li><a href="#mpstat">mpstat</a>
</li>
<li><a href="#offcputime">offcputime</a>
</li>
<li><a href="#the-ocaml-garbage-collector">The OCaml garbage collector</a>
</li>
<li><a href="#statmemprof">statmemprof</a>
</li>
<li><a href="#magic-trace">magic-trace</a>
</li>
<li><a href="#tuning-gc-parameters">Tuning GC parameters</a>
</li>
<li><a href="#simplifying-further">Simplifying further</a>
</li>
<li><a href="#perf-sched">perf sched</a>
</li>
<li><a href="#olly">olly</a>
</li>
<li><a href="#magic-trace-on-the-simple-allocator">magic-trace on the simple allocator</a>
</li>
<li><a href="#perf-annotate">perf annotate</a>
</li>
<li><a href="#perf-c2c">perf c2c</a>
</li>
<li><a href="#perf-stat">perf stat</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
<li><a href="#update-2024-08-22">Update 2024-08-22</a>
</li>
</ul>
<p>Further discussion about this post can be found on <a href="https://discuss.ocaml.org/t/ocaml-5-performance/15014/15">discuss.ocaml.org</a>.</p>
<h2 id="the-problem">The problem</h2>
<p>OCaml 4 allowed running multiple &quot;system threads&quot;, but only one can have the OCaml runtime lock,
so only one can be running OCaml code at a time.
OCaml 5 allows running multiple &quot;domains&quot;, all of which can be running OCaml code at the same time
(each domain can also have multiple system threads; only one system thread can be running OCaml code per domain).</p>
<p>The <a href="https://github.com/ocurrent/ocaml-ci/">ocaml-ci</a> service provides CI for many OCaml programs,
and its first step when testing a commit is to run a solver to select compatible versions for its dependencies.
Running a solve typically only takes about a second, but it has to do it for each possible test platform,
which includes versions of the OCaml compiler from 4.02 to 4.14 and 5.0 to 5.2,
multiple architectures (32-bit and 64-bit x86, 32-bit and 64-bit ARM, PPC64 and s390x),
operating systems (Alpine, Debian, Fedora, FreeBSD, macOS, OpenSUSE and Ubuntu, in multiple versions), etc.
In total, this currently does 132 solver runs per commit being tested
(which seems too high to me, but let's ignore that for now).</p>
<p>The solves are done by <a href="https://github.com/ocurrent/solver-service">the solver-service</a>,
which runs on a couple of ARM machines with 160 cores each.
The old OCaml 4 version used to work by spawning lots of sub-processes,
but when OCaml 5 came out, I ported it to use a single process with multiple domains.
That removed the need for lots of communication logic,
and allowed sharing common data such as the package definitions.
The code got a lot shorter and simpler, and I'm told it's been much more reliable too.</p>
<p>But the performance was surprisingly bad.
Here's a graph showing how the number of solves per second scales with the number of CPUs (workers) being used:</p>
<p><a href="/blog/images/perf/solver-arm-orig.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-orig.svg" title="Processes scaling better than domains" class="caption"/><span class="caption-text">Processes scaling better than domains</span></span></a></p>
<p>The &quot;Processes&quot; line shows performance when forking multiple processes to do the work, which looks pretty good.
The &quot;Domains&quot; line shows what happens if you instead spawn domains inside a single process.</p>
<p>Note: The original service used many libraries (a mix of Eio and Lwt ones),
but to make investigation easier I simplified it by removing most of them.
The <a href="https://github.com/talex5/solver-service/tree/simplify">simplified version</a> doesn't use Eio or Lwt;
it just spawns some domains/processes and has each of them do the same solve in a loop a fixed number of times.</p>
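<p>The domains variant of such a test could be sketched roughly as follows. This is only an illustration: <code>solve</code> is a dummy CPU-bound function rather than the real solver, and the names <code>run_worker</code> and <code>run_domains</code> are made up for this sketch (the real code is in the simplified branch linked above). It needs OCaml 5 for <code>Domain</code>:</p>

```ocaml
(* Stand-in for one solver run: dummy CPU-bound work, not the real solver. *)
let solve () =
  let total = ref 0 in
  for i = 1 to 100_000 do total := !total + (i mod 7) done;
  !total

(* Each worker repeats the same solve [count] times and reports how many it did. *)
let run_worker count =
  for _ = 1 to count do ignore (Sys.opaque_identity (solve ())) done;
  count

(* Spawn [workers] domains inside one process and wait for all of them.
   Returns the total number of solves performed. *)
let run_domains ~workers ~count =
  let domains =
    List.init workers (fun _ -> Domain.spawn (fun () -> run_worker count))
  in
  List.fold_left (fun acc d -> acc + Domain.join d) 0 domains

let () = Printf.printf "%d solves done\n" (run_domains ~workers:4 ~count:10)
```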
<h2 id="threadsanitizer">ThreadSanitizer</h2>
<p>When converting a single-domain OCaml 4 program to use multiple cores, it's easy to introduce races.
OCaml has <a href="https://ocaml.org/manual/5.2/tsan.html">ThreadSanitizer</a> (TSan) support which can detect these.
To use it, install an OCaml compiler with the <code>tsan</code> option:</p>
<pre><code>$ opam switch create 5.2.0-tsan ocaml-variants.5.2.0+options ocaml-option-tsan
</code></pre>
<p>Things run a lot slower and require more memory with this compiler, but it's good to check:</p>
<pre><code>$ ./_build/default/stress/stress.exe --internal-workers=2
[...]
WARNING: ThreadSanitizer: data race (pid=133127)
  Write of size 8 at 0x7ff2b7814d38 by thread T4 (mutexes: write M88):
    #0 camlOpam_0install__Model.group_ors_1288 lib/model.ml:70 (stress.exe+0x1d2bba)
    #1 camlOpam_0install__Model.group_ors_1288 lib/model.ml:120 (stress.exe+0x1d2b47)
    ...

  Previous write of size 8 at 0x7ff2b7814d38 by thread T1 (mutexes: write M83):
    #0 camlOpam_0install__Model.group_ors_1288 lib/model.ml:70 (stress.exe+0x1d2bba)
    #1 camlOpam_0install__Model.group_ors_1288 lib/model.ml:120 (stress.exe+0x1d2b47)
    ...

  Mutex M88 (0x558368b95358) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1295 (libtsan.so.2+0x50468)
    #1 caml_plat_mutex_init runtime/platform.c:57 (stress.exe+0x4763b2)
    #2 caml_init_domains runtime/domain.c:943 (stress.exe+0x44ebfe)
    ...

  Mutex M83 (0x558368b95240) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1295 (libtsan.so.2+0x50468)
    #1 caml_plat_mutex_init runtime/platform.c:57 (stress.exe+0x4763b2)
    #2 caml_init_domains runtime/domain.c:943 (stress.exe+0x44ebfe)
    ...

  Thread T4 (tid=133132, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
    #1 caml_domain_spawn runtime/domain.c:1265 (stress.exe+0x4504c4)
    ...

  Thread T1 (tid=133129, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
    #1 caml_domain_spawn runtime/domain.c:1265 (stress.exe+0x4504c4)
    ...

SUMMARY: ThreadSanitizer: data race lib/model.ml:70 in camlOpam_0install__Model.group_ors_1288
</code></pre>
<p>The two mutexes mentioned in the output, M83 and M88, are the <code>domain_lock</code>,
used to ensure only one sys-thread runs at a time in each domain.
In this program we only have one sys-thread per domain and so can ignore them.</p>
<p>The output reveals that the solver used a global variable to generate unique IDs:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">fresh_id</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line">  <span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
</span><span class="line">    <span class="n">incr</span> <span class="n">i</span><span class="o">;</span>           <span class="c">(* model.ml:70 *)</span>
</span><span class="line">    <span class="o">!</span><span class="n">i</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>With that fixed, TSan finds no further problems (in this simplified version).
This gives us good confidence that there isn't any shared state:
TSan would report use of shared state not protected by a mutex,
and since the program was written for OCaml 4 it won't be using any mutexes.</p>
<p>That's good, because if one thread writes to a location that another reads then that requires coordination between CPUs,
which is relatively slow
(though we could still experience slow-downs due to <a href="https://en.wikipedia.org/wiki/False_sharing">false sharing</a>,
where two separate mutable items end up in the same cache line).
However, while important for correctness, it didn't make any noticeable difference to the benchmark results.</p>
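<p>The fix itself isn't shown here. One possible approach (an assumption for illustration, not necessarily the change made to the solver) is to replace the <code>ref</code> with an <code>Atomic</code> counter from the standard library:</p>

```ocaml
(* One possible fix (an assumption, not necessarily the solver's actual change):
   an atomic counter, so concurrent callers never get the same ID. *)
let fresh_id =
  let next = Atomic.make 0 in
  fun () ->
    (* [fetch_and_add] returns the old value; add 1 to match [incr i; !i]. *)
    Atomic.fetch_and_add next 1 + 1

let () =
  (* Two domains drawing IDs at once; requires OCaml 5 for [Domain]. *)
  let draw () = List.init 1000 (fun _ -> fresh_id ()) in
  let d = Domain.spawn draw in
  let mine = draw () in
  let all = mine @ Domain.join d in
  Printf.printf "%d distinct IDs\n" (List.length (List.sort_uniq compare all))
```

<p>An atomic counter still makes every domain contend on one memory location, so if IDs only need to be unique within a single solve then a counter created per solve (with no sharing at all) may be the better choice.</p>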
<h2 id="perf">perf</h2>
<p><a href="https://perf.wiki.kernel.org/">perf</a> is the obvious tool to use when facing CPU performance problems.
<code>perf record -g PROG</code> takes samples of the program's stack regularly,
so that functions that run a lot or for a long time will appear often.
<code>perf report</code> provides a UI to explore the results:</p>
<pre><code>$ perf report
  Children      Self  Command     Shared Object      Symbol
+   59.81%     0.00%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.do_solve_2283
+   59.44%     0.00%  stress.exe  stress.exe         [.] Opam_0install.Solver.solve_1428
+   59.25%     0.00%  stress.exe  stress.exe         [.] Dune.exe.Domain_worker.solve_951
+   58.88%     0.00%  stress.exe  stress.exe         [.] Dune.exe.Stress.run_worker_332
+   58.18%     0.00%  stress.exe  stress.exe         [.] Stdlib.Domain.body_735
+   57.91%     0.00%  stress.exe  stress.exe         [.] caml_start_program
+   34.39%     0.69%  stress.exe  stress.exe         [.] Stdlib.List.iter_366
+   34.39%     0.03%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.lookup_845
+   34.39%     0.09%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.process_dep_2024
+   33.14%     0.03%  stress.exe  stress.exe         [.] Zeroinstall_solver.Sat.run_solver_1446
+   27.28%     0.00%  stress.exe  stress.exe         [.] Zeroinstall_solver.Solver_core.build_problem_2092
+   26.27%     0.02%  stress.exe  stress.exe         [.] caml_call_gc
</code></pre>
<p>Looks like we're spending most of our time solving, as expected.
But this can be misleading.
Because perf only records stack traces when the code is running, it doesn't report any time the process spent sleeping.</p>
<pre><code>$ /usr/bin/time ./_build/default/stress/stress.exe --count=10 --internal-workers=7
73.08user 0.61system 0:12.65elapsed 582%CPU (0avgtext+0avgdata 596608maxresident)k
</code></pre>
<p>With 7 workers, we'd expect to see <code>700%CPU</code>, but we only see <code>582%</code>.</p>
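<p>The <code>%CPU</code> figure is just CPU time (user plus system) divided by elapsed time. A quick check of the arithmetic with the numbers from the run above (<code>cpu_percent</code> is a helper name made up for this sketch):</p>

```ocaml
(* CPU utilisation as reported by time(1): (user + system) / elapsed. *)
let cpu_percent ~user ~sys ~elapsed = 100.0 *. (user +. sys) /. elapsed

let () =
  let used = cpu_percent ~user:73.08 ~sys:0.61 ~elapsed:12.65 in
  (* Seven fully busy workers would give 700%. *)
  Printf.printf "used %d%% of an ideal 700%%\n" (truncate used)
```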
<h2 id="mpstat">mpstat</h2>
<p><a href="https://www.man7.org/linux/man-pages/man1/mpstat.1.html">mpstat</a> can show a per-CPU breakdown.
Here are a couple of one-second intervals on my machine while the solver was running:</p>
<pre><code>$ mpstat --dec=0 -P ALL 1
16:24:39     CPU    %usr   %sys %iowait    %irq   %soft  %steal   %idle
16:24:40     all      78      1       2       1       0       0      18
16:24:40       0      19      1       0       1       0       1      78
16:24:40       1      88      1       0       1       0       0      10
16:24:40       2      88      1       0       1       0       0      10
16:24:40       3      88      0       0       0       0       1      11
16:24:40       4      89      1       0       0       0       0      10
16:24:40       5      90      0       0       1       0       0       9
16:24:40       6      79      1       0       1       1       1      17
16:24:40       7      86      0      12       1       1       0       0

16:24:40     CPU    %usr   %sys %iowait    %irq   %soft  %steal   %idle
16:24:41     all      80      1       2       1       0       0      17
16:24:41       0      85      0      12       1       0       1       1
16:24:41       1      91      1       0       1       0       0       7
16:24:41       2      90      0       0       1       1       0       8
16:24:41       3      89      1       0       1       0       0       9
16:24:41       4      67      1       0       1       0       0      31
16:24:41       5      52      1       0       0       0       1      46
16:24:41       6      76      1       0       1       0       0      22
16:24:41       7      90      1       0       0       0       0       9
</code></pre>
<p>Note: I removed some columns with all zero values to save space.</p>
<p>We might expect to see 7 CPUs running at 100% and one idle CPU,
but in fact they're all moderately busy.
On the other hand, none of them spent more than 91% of its time running the solver code.</p>
<h2 id="offcputime">offcputime</h2>
<p><a href="https://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html">offcputime</a> will show why a process wasn't using a CPU
(it's like <code>offwaketime</code>, which we saw earlier, but doesn't record the waker).
Here I'm using <a href="https://www.man7.org/linux/man-pages/man1/pidstat.1.html">pidstat</a> to see all running threads and then examining one of the workers,
to avoid the problem we saw last time where the diagram included multiple threads:</p>
<pre><code>$ pidstat 1 -t
...
^C
Average:      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000     78304         -  550.50    9.41    0.00    0.00  559.90     -  stress.exe
Average:     1000         -     78305   91.09    1.49    0.00    0.00   92.57     -  |__stress.exe
Average:     1000         -     78307    8.42    0.99    0.00    0.00    9.41     -  |__stress.exe
Average:     1000         -     78308   90.59    1.49    0.00    0.00   92.08     -  |__stress.exe
Average:     1000         -     78310   90.59    1.49    0.00    0.00   92.08     -  |__stress.exe
Average:     1000         -     78312   91.09    1.49    0.00    0.00   92.57     -  |__stress.exe
Average:     1000         -     78314   89.11    1.49    0.00    0.00   90.59     -  |__stress.exe
Average:     1000         -     78316   89.60    1.98    0.00    0.00   91.58     -  |__stress.exe

$ sudo offcputime-bpfcc -f -t 78310 &gt; off-cpu
</code></pre>
<p>Note: The ARM machine's kernel was too old to run <code>offcputime</code>, so I ran this on my machine instead,
with one main domain and six workers.
As I needed good stacks for C functions too, I ran stress.exe in an Ubuntu 24.04 docker container,
as recent versions of Ubuntu compile with <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html">frame pointers by default</a>.</p>
<p>The raw output was very noisy, showing it waiting in many different places.
Looking at a few, it was clear it was mostly the GC (which can run from almost anywhere).
The output is just a text file with one line per stack-trace, and a bit of <code>sed</code> cleaned it up:</p>
<pre><code>$ sed -E 's/stress.exe;.*;(caml_call_gc|caml_handle_gc_interrupt|caml_poll_gc_work|asm_sysvec_apic_timer_interrupt|asm_sysvec_reschedule_ipi);/stress.exe;\1;/' off-cpu &gt; off-cpu-gc
$ flamegraph.pl --colors=blue off-cpu-gc &gt; off-cpu-gc.svg
</code></pre>
<p>That removes the part of the stack-trace before any of various interrupt-type functions that can be called from anywhere.
The graph is blue to indicate that it shows time when the process wasn't running.</p>
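<p>For anyone who finds the regular expression hard to read, here is roughly the same transformation written out in OCaml. It's a sketch that assumes the usual folded-stack format (<code>name;frame;...;frame count</code>); the function name <code>collapse</code> is made up for this example. Because <code>.*</code> in the <code>sed</code> pattern is greedy, it is the <em>last</em> interrupt-type frame in the line that survives, and the sketch does the same:</p>

```ocaml
(* Frames that can appear almost anywhere in a stack (same list as the sed command). *)
let markers = [
  "caml_call_gc";
  "caml_handle_gc_interrupt";
  "caml_poll_gc_work";
  "asm_sysvec_apic_timer_interrupt";
  "asm_sysvec_reschedule_ipi";
]

(* Drop every frame between the process name and the last marker frame.
   Lines without a marker frame are returned unchanged. *)
let collapse line =
  match String.split_on_char ';' line with
  | comm :: frames ->
    (* Find the suffix of [frames] starting at the last marker, if any. *)
    let rec last_marker best = function
      | [] -> best
      | frame :: rest as suffix ->
        last_marker (if List.mem frame markers then Some suffix else best) rest
    in
    (match last_marker None frames with
     | Some suffix -> String.concat ";" (comm :: suffix)
     | None -> line)
  | [] -> line

let () =
  print_endline
    (collapse "stress.exe;caml_main;solve;caml_call_gc;caml_plat_spin_wait 42")
```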
<p><a href="/blog/images/perf/off-cpu-gc.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/off-cpu-gc.svg" title="Time spent off-CPU" class="caption"/><span class="caption-text">Time spent off-CPU</span></span></a></p>
<p>There are rather a lot of traces where we missed the user stack.
However, the results seem clear enough: when our worker is waiting, it's in the garbage collector,
calling <code>caml_plat_spin_wait</code>.
This is used to sleep when a spin-lock has been spinning for too long (after 1000 iterations).</p>
<h2 id="the-ocaml-garbage-collector">The OCaml garbage collector</h2>
<p>OCaml has a <em>major heap</em> for long-lived values, plus one fixed-size <em>minor heap</em> for each domain.
New allocations are made sequentially on the allocating domain's minor heap
(which is very fast, just adjusting a pointer by the size required).</p>
<p>When the minor heap is full the program performs a <em>minor GC</em>,
moving any values that are still reachable to the major heap
and leaving the minor heap empty.</p>
<p>Garbage collection of the major heap is done in small slices so that the application doesn't pause for long,
and domains can do marking and sweeping work without needing to coordinate
(except at the very end of a major cycle, when they briefly synchronise to agree a new cycle is starting).</p>
<p>However, as minor GCs move values that other domains may be using, they do require all domains to stop.</p>
<p>Although the simplified test program doesn't use Eio, we can still use <a href="https://github.com/ocaml-multicore/eio-trace">eio-trace</a> to record GC events
(we just don't see any fibers).
Here's a screenshot of the solver running with 24 domains on the ARM machine,
showing it performing GC work (not all domains are visible in the picture):</p>
<p><a href="/blog/images/perf/solver-arm-gc-24.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-gc-24.svg" title="GC work shown in eio-trace" class="caption"/><span class="caption-text">GC work shown in eio-trace</span></span></a></p>
<!-- 12.5503s -->
<p>The orange/red parts show when the GC is running and the yellow regions show when the domain is waiting for other domains.
The thick columns with yellow edges are minor GCs,
while the thin (almost invisible) red columns without any yellow between them are major slices.
The second minor GC from the left took longer than usual because the third domain from the top took a while to respond.
It also didn't do a major slice before that; perhaps it was busy doing something, or maybe Linux scheduled a different process to run then.</p>
<p>Traces recorded by eio-trace can also be viewed in Perfetto, which shows the nesting better.
Here's a close-up of a single minor GC, corresponding to the bottom two domains from the second column from the left:</p>
<p><a href="/blog/images/perf/solver-arm-gc-24-perfetto.png"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-gc-24-perfetto.png" title="Close-up in Perfetto" class="caption"/><span class="caption-text">Close-up in Perfetto</span></span></a></p>
<ul>
<li>The domain triggering the GC (the bottom one here) enters a &quot;stw_leader&quot; (stop-the-world) phase
and waits for the other domains to stop.
</li>
<li>One by one, the other domains stop and enter &quot;stw_api_barrier&quot; until all domains have stopped.
</li>
<li>All domains perform a minor GC, clearing their minor heaps.
</li>
<li>They then enter a &quot;minor_leave_barrier&quot; phase, waiting until all domains have finished.
</li>
<li>Each domain returns to running application code.
</li>
</ul>
<p>We can now see why the solver spends so much time sleeping;
when a domain performs a minor GC, it spends most of the time waiting for other domains.</p>
<p>(the above is a slight simplification; domains may do some work on the major GC while waiting)</p>
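<p>To get a feel for why the waiting dominates, here is a toy model of a stop-the-world barrier. It is only an illustration and not how the OCaml runtime implements its barriers; <code>barrier_wait</code>, <code>busy_work</code> and <code>run</code> are names invented for this sketch. Each domain does a different amount of work before it "notices" the request, then spins until everyone has arrived, and we count how many times it spun:</p>

```ocaml
(* Toy model only: this is not how the OCaml runtime implements its barriers.
   Each domain announces its arrival, then spins until everyone has arrived.
   Returns the number of times this domain had to spin. *)
let barrier_wait arrived ~total =
  Atomic.incr arrived;
  let spins = ref 0 in
  while Atomic.get arrived < total do
    incr spins;
    Domain.cpu_relax ()
  done;
  !spins

(* Stand-in for application code that runs before a domain reaches the barrier. *)
let busy_work n =
  let x = ref 0 in
  for i = 1 to n do x := !x + i done;
  ignore (Sys.opaque_identity !x)

(* Domain [i] does [i] units of work before stopping.
   Returns each domain's spin count, in spawn order. *)
let run ~total =
  let arrived = Atomic.make 0 in
  let worker i () =
    busy_work (i * 2_000_000);
    barrier_wait arrived ~total
  in
  List.init total (fun i -> Domain.spawn (worker i))
  |> List.map Domain.join

let () =
  run ~total:4
  |> List.iteri (fun i spins -> Printf.printf "domain %d spun %d times\n" i spins)
```

<p>The domain that arrives last never spins, while the ones that arrived early burn time doing nothing useful, which is the effect visible in the traces above.</p>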
<h2 id="statmemprof">statmemprof</h2>
<p>One obvious solution to GC slowness is to produce less garbage in the first place.
To do that, we need to find out where the most costly allocations are coming from.
Tracing every memory allocation tends to make programs unusably slow,
so OCaml instead provides a <em>statistical</em> memory profiler.</p>
<p>It was temporarily removed in OCaml 5 because it needed updating for the new multicore GC,
but has recently been brought back and will be in OCaml 5.3.
There's a backport to 5.2, but <a href="https://github.com/janestreet/memtrace/pull/22#issuecomment-2199600729">I couldn't get it to work</a>,
so I just removed the domains stuff from the test and did a single-domain run on OCaml 4.14.
You need the <a href="https://github.com/janestreet/memtrace">memtrace</a> library to collect samples and <a href="https://github.com/janestreet/memtrace_viewer">memtrace_viewer</a> to view them:</p>
<pre><code>$ opam install memtrace memtrace_viewer
</code></pre>
<p>Put this at the start of the program to enable it:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Memtrace</span><span class="p">.</span><span class="n">trace_if_requested</span> <span class="o">~</span><span class="n">context</span><span class="o">:</span><span class="s2">&quot;solver-test&quot;</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then running with <code>MEMTRACE</code> set records a trace:</p>
<pre><code>$ MEMTRACE=solver.ctf ./stress.exe --count=10
Solved warm-up request in: 1.99s
Running another 10 * 1 solves...

$ memtrace-viewer solver.ctf
Processing solver.ctf...
Serving http://localhost:8080/
</code></pre>
<p><a href="/blog/images/perf/memtrace-1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/memtrace-1.png" title="The memtrace viewer UI" class="caption"/><span class="caption-text">The memtrace viewer UI</span></span></a></p>
<p>The flame graph in the middle shows functions scaled by the amount of memory they allocated.
Initially it showed two groups, one for the warm-up request and one for the 10 runs.
To simplify the display, I used the filter panel (on the left) to show only allocations after the 2 second warm-up.
We can immediately see that <code>OpamVersionCompare.compare</code> is the source of most memory use.</p>
<p>Focusing on that function shows that it performed 54.1% of all allocations.
The display now shows the allocations performed within it above it (in green),
and all the places it's called from below it (in blue):</p>
<p><a href="/blog/images/perf/memtrace-2.png"><span class="caption-wrapper center"><img src="/blog/images/perf/memtrace-2.png" title="The compare function is expensive!" class="caption"/><span class="caption-text">The compare function is expensive!</span></span></a></p>
<p>The bulk of the allocations are coming from <a href="https://github.com/ocaml/opam/blob/a1c9c34417735687fd9310e7dc5c4c177e020441/src/core/opamVersionCompare.ml#L20-L27">this loop</a>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="c">(* [skip_while_from i f w m] yields the index of the leftmost character</span>
</span><span class="line"><span class="c"> * in the string [s], starting from [i], and ending at [m], that does</span>
</span><span class="line"><span class="c"> * not satisfy the predicate [f], or [length w] if no such index exists.  *)</span>
</span><span class="line"><span class="k">let</span> <span class="n">skip_while_from</span> <span class="n">i</span> <span class="n">f</span> <span class="n">w</span> <span class="n">m</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="k">rec</span> <span class="n">loop</span> <span class="n">i</span> <span class="o">=</span>
</span><span class="line">    <span class="k">if</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m</span> <span class="k">then</span> <span class="n">i</span>
</span><span class="line">    <span class="k">else</span> <span class="k">if</span> <span class="n">f</span> <span class="n">w</span><span class="o">.[</span><span class="n">i</span><span class="o">]</span> <span class="k">then</span> <span class="n">loop</span> <span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="k">else</span> <span class="n">i</span>
</span><span class="line">  <span class="k">in</span> <span class="n">loop</span> <span class="n">i</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">skip_zeros</span> <span class="n">x</span> <span class="n">xi</span> <span class="n">xl</span> <span class="o">=</span> <span class="n">skip_while_from</span> <span class="n">xi</span> <span class="o">(</span><span class="k">fun</span> <span class="n">c</span> <span class="o">-&gt;</span> <span class="n">c</span> <span class="o">=</span> <span class="sc">&#39;0&#39;</span><span class="o">)</span> <span class="n">x</span> <span class="n">xl</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It's used when processing a version like <code>1.2.3</code> to skip any leading &quot;0&quot; characters
(so that would compare equal to <code>1.02.3</code>).
The <code>loop</code> function refers to other variables (such as <code>f</code>) from its context,
and so OCaml allocates a closure on the heap to hold these variables.
Even though each allocation is small, we have to do one for every component of every version.
And we compare versions a lot:
for every version of a package that says it requires e.g. <code>libfoo { &gt;= &quot;1.2&quot; }</code>,
we have to check the formula against every version of libfoo.</p>
<p>The solution is rather simple (and shorter than the original!):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="k">rec</span> <span class="n">skip_while_from</span> <span class="n">i</span> <span class="n">f</span> <span class="n">w</span> <span class="n">m</span> <span class="o">=</span>
</span><span class="line">  <span class="k">if</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m</span> <span class="k">then</span> <span class="n">i</span>
</span><span class="line">  <span class="k">else</span> <span class="k">if</span> <span class="n">f</span> <span class="n">w</span><span class="o">.[</span><span class="n">i</span><span class="o">]</span> <span class="k">then</span> <span class="n">skip_while_from</span> <span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="n">f</span> <span class="n">w</span> <span class="n">m</span> <span class="k">else</span> <span class="n">i</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Removing the other allocations from <code>compare</code> too reduces total memory allocations
from 21.8G to 9.6G!
The processes benchmark got about 14% faster, while the domains one was 23% faster:</p>
<p><a href="/blog/images/perf/solver-arm-no-alloc.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-arm-no-alloc.svg" title="Effect of reducing allocations. Old values are shown in grey." class="caption"/><span class="caption-text">Effect of reducing allocations. Old values are shown in grey.</span></span></a></p>
<p>A nice optimisation,
but using domains is still nowhere close to even the original version with separate processes.</p>
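<p>To see the difference for yourself, one option is to measure minor-heap allocation around each variant with <code>Gc.minor_words</code>.
This is only a sketch: <code>skip_while_from_closure</code> is a name made up here for the original shape (a local <code>loop</code> capturing <code>f</code>, <code>w</code> and <code>m</code>), and <code>is_zero</code> and <code>words_allocated</code> are hypothetical helpers.
The exact numbers will depend on the compiler version and flags (flambda, for example, may be able to remove the closure by itself).</p>
<pre><code class="ocaml">(* The original shape: the local [loop] captures [f], [w] and [m],
   so a closure is typically allocated on every call. *)
let skip_while_from_closure i f w m =
  let rec loop i =
    if i = m then i
    else if f w.[i] then loop (i + 1) else i
  in loop i

(* The replacement: everything is passed as an argument, so nothing is captured. *)
let rec skip_while_from i f w m =
  if i = m then i
  else if f w.[i] then skip_while_from (i + 1) f w m else i

let is_zero c = c = '0'

(* Words allocated on the minor heap by a million calls to [skip]. *)
let words_allocated skip =
  let before = Gc.minor_words () in
  for _i = 1 to 1_000_000 do
    ignore (Sys.opaque_identity (skip 0 is_zero "0012" 4))
  done;
  Gc.minor_words () -. before

let () =
  Printf.printf "closure: %.0f words, direct: %.0f words\n"
    (words_allocated skip_while_from_closure)
    (words_allocated skip_while_from)
</code></pre>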
<h2 id="magic-trace">magic-trace</h2>
<p>The traces above show the solver taking a long time for all domains to enter the <code>stw_api_barrier</code> phase.
What was the slow domain doing to cause that?
<code>magic-trace</code> lets us tell it when to save the ring buffer and we can use this to get detailed information.
Tracing multiple threads with magic-trace doesn't seem to work well
(each thread gets a very small buffer, they don't stop at quite the same time, and triggers don't work)
so I find it's better to trace just one thread.</p>
<p>I modified the OCaml runtime so that the leader (the domain requesting the GC) records the time.
As each domain enters <code>stw_api_barrier</code> it checks how late it is and calls a function to print a warning if it's above a threshold.
Then I attached magic-trace to one of the worker threads and told it to save a sample when that function got called:</p>
<p><a href="/blog/images/perf/gc-magic-1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/gc-magic-1.png" title="A domain being slow to join a minor GC" class="caption"/><span class="caption-text">A domain being slow to join a minor GC</span></span></a></p>
<p>In the example above,
magic-trace saved about 7ms of the history of a domain up to the point where it entered <code>stw_api_barrier</code>.
The first few ms show the solver working normally.
Then it needs to do a minor GC and tries to become the leader.
But another domain has the lock and so it spins, calling <code>handle_incoming</code> 293,711 times in a loop for 2.5ms.</p>
<p>I had a look at the code in the OCaml runtime.
When a domain wants to perform a minor GC, the steps are:</p>
<ol>
<li>Acquire <code>all_domains_lock</code>.
</li>
<li>Populate the <code>stw_request</code> global.
</li>
<li>Interrupt all domains.
</li>
<li>Release <code>all_domains_lock</code>.
</li>
<li>Wait for all domains to get the interrupt.
</li>
<li>Mark self as ready, allowing GC work to start.
</li>
<li>Do minor GC.
</li>
<li>The last domain to finish its minor GC signals <code>all_domains_cond</code> and everyone resumes.
</li>
</ol>
<p>I added some extra event reporting to the GC, showing when a domain is trying to perform a GC (<code>try</code>),
when the leader is signalling other domains (<code>signal</code>), and when a domain is sleeping waiting for something (<code>sleep</code>).
Here's what that looks like (in some places):</p>
<p><a href="/blog/images/perf/solver-try.png"><span class="caption-wrapper center"><img src="/blog/images/perf/solver-try.png" title="One sleeping domain delays all the others" class="caption"/><span class="caption-text">One sleeping domain delays all the others</span></span></a></p>
<ol>
<li>The top domain finished its minor collection quickly (as it's mostly idle and had nothing to do),
and started waiting for the other domains to finish. For some reason, this sleep call took 3ms to run.
</li>
<li>The other domains resume work. One by one, they fill their minor heaps and try to start a GC.
</li>
<li>They can't start a new GC, as the old one hasn't completely finished yet, so they spin.
</li>
<li>Eventually the top domain wakes up and finishes the previous STW section.
</li>
<li>One of the other domains immediately starts a new minor GC and the pattern repeats.
</li>
</ol>
<p>These <code>try</code> events seem useful;
the program is spending much more time stuck in GC than the original traces indicated!</p>
<p>One obvious improvement here would be for idle domains to opt out of GC.
Another would be to tell the kernel when to wake instead of using sleeps —
and I see there's a PR already:
<a href="https://github.com/ocaml/ocaml/pull/12579">OS-based Synchronisation for Stop-the-World Sections</a>.</p>
<p>Another possibility would be to let domains perform minor GCs independently.
The OCaml developers did make a version that worked that way,
but it requires changes to all C code that uses the OCaml APIs,
since a value in another domain's minor heap might move while it's running.</p>
<p>Finally, I wonder if the code could be simplified a bit using a compare-and-set instead of taking a lock to become leader.
That would eliminate the <code>try</code> state, where a domain knows another domain is the leader, but doesn't know what it wants to do.
It's also strange that there's a state where
the top domain has finished its critical section and allowed the other domains to resume,
but is not quite finished enough to let a new GC start.</p>
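<p>As a rough illustration of the compare-and-set idea (in OCaml rather than the runtime's C, and not how the runtime currently works),
becoming leader could be a single atomic operation.
The names <code>leader</code>, <code>no_leader</code>, <code>try_become_leader</code> and <code>release_leader</code> are all hypothetical:</p>
<pre><code class="ocaml">(* -1 means no GC is in progress; otherwise this holds the leader's ID. *)
let no_leader = -1
let leader = Atomic.make no_leader

(* Succeeds for exactly one caller until the leader releases. *)
let try_become_leader id = Atomic.compare_and_set leader no_leader id

(* Only the current leader can clear the slot. *)
let release_leader id = ignore (Atomic.compare_and_set leader id no_leader)

let () =
  (* Four domains race to become leader; nobody releases, so only one can win. *)
  let winners =
    List.init 4 (fun id -&gt; Domain.spawn (fun () -&gt; try_become_leader id))
    |&gt; List.map Domain.join
    |&gt; List.filter Fun.id
    |&gt; List.length
  in
  Printf.printf "%d of 4 domains became leader\n" winners
</code></pre>
<p>In a scheme like this, a domain that loses the race would know straight away that another domain is already requesting a GC, rather than spinning on a lock to find out.</p>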
<p>We can work around this problem by having the main domain do work too.
That could be a problem for interactive applications (where the main domain is running the UI and needs to respond fast),
but it should be OK for the solver service.
This was about 15% faster on my machine, but appeared to have no effect on the ARM server.
Lesson: get traces on the target machine!</p>
<h2 id="tuning-gc-parameters">Tuning GC parameters</h2>
<p>Another way to reduce the synchronisation overhead of minor GCs is to make them less frequent.
We can do that by increasing the size of the minor heap,
doing a few long GCs rather than many short ones.
The size is controlled by the <code>s</code> runtime parameter, e.g. <code>OCAMLRUNPARAM=s=8192k</code>.
On my machine, this actually makes things slower, but it's about 18% faster on the ARM server with 80 domains.</p>
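<p>The same setting can also be changed from inside the program with the standard <code>Gc</code> module, which may be handy when you can't control the environment.
A minimal sketch (the size is in words, matching the <code>8192k</code> example; <code>set_minor_heap_words</code> is a hypothetical helper).
I haven't checked how each OCaml 5 release applies this to domains other than the caller, so <code>OCAMLRUNPARAM</code> is probably the safer route for benchmarks:</p>
<pre><code class="ocaml">(* Ask the runtime for a minor heap of [words] words. *)
let set_minor_heap_words words =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = words }

let () =
  set_minor_heap_words (8192 * 1024);
  Printf.printf "minor heap: %d words\n" (Gc.get ()).Gc.minor_heap_size
</code></pre>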
<p>Here are the first few domains (from a total of 24) on the ARM server with different minor heap sizes
(both are showing 1s of execution):</p>
<p><a href="/blog/images/perf/small-heap-24.png"><span class="caption-wrapper center"><img src="/blog/images/perf/small-heap-24.png" title="The default minor heap size (256k words)" class="caption"/><span class="caption-text">The default minor heap size (256k words)</span></span></a>
<a href="/blog/images/perf/big-heap-24.png"><span class="caption-wrapper center"><img src="/blog/images/perf/big-heap-24.png" title="With a larger minor heap (8192k words)" class="caption"/><span class="caption-text">With a larger minor heap (8192k words)</span></span></a>
Note that the major slices also get fewer and larger, as they happen half way between minor slices.</p>
<p>Also, there's still a lot of variation between the time each domain spends doing GC
(despite the fact that they're all running exactly the same task), so they still end up waiting a lot.</p>
<h2 id="simplifying-further">Simplifying further</h2>
<p>This is all still pretty odd, though.
We're getting small performance increases, but still nothing like when forking.
Can the test-case be simplified further?
Yes, it turns out!
This <a href="https://gitlab.com/talex5/slow">simple function</a> takes much longer to run when using domains, compared to forking!</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">run_worker</span> <span class="n">n</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="n">n</span> <span class="o">*</span> <span class="mi">10000000</span> <span class="k">do</span>
</span><span class="line">    <span class="n">ignore</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">opaque_identity</span> <span class="o">(</span><span class="n">ref</span> <span class="bp">()</span><span class="o">))</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>ref ()</code> allocates a small block (2 words, including the header) on the minor heap.
<code>opaque_identity</code> is to make sure the compiler doesn't optimise this pointless allocation away.</p>
<p><a href="/blog/images/perf/loop-arm.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/loop-arm.svg" title="Time to run the loop on the 160-core ARM server (lower is better)" class="caption"/><span class="caption-text">Time to run the loop on the 160-core ARM server (lower is better)</span></span></a></p>
<p>Here's what I would expect here:</p>
<ol>
<li>The domains all start to fill their minor heaps. One fills it and triggers a minor GC.
</li>
<li>The triggering domain sets an indicator in each domain saying a GC is due.
None of the domains is sleeping, so the OS isn't involved in any wake-ups here.
</li>
<li>The other domains check the indicator on their next allocation,
which happens immediately since that's all they're doing.
</li>
<li>The GCs all proceed quickly, since there's nothing to scan and nothing to promote
(except possibly the current single allocation).
</li>
<li>They all resume quickly and continue.
</li>
</ol>
<p>So ideally the lines would be flat.
In practice, we may hit physical limits due to memory bandwidth, CPU temperature or kernel limitations;
I assume this is why the &quot;Processes&quot; time starts to rise eventually.
But it looks like this minor slow-down causes knock-on effects in the &quot;Domains&quot; case.</p>
<p>If I remove the allocation, then the domains and processes versions take the same amount of time.</p>
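<p>For reference, a minimal harness for the domains variant might look like this.
It assumes OCaml 5 and is not the code from the linked repository; <code>run_domains</code> and the command-line handling are made up for illustration.
Running it under <code>time</code> with different worker counts should be enough to compare against a forking version:</p>
<pre><code class="ocaml">let run_worker n =
  for _i = 1 to n * 10000000 do
    ignore (Sys.opaque_identity (ref ()))
  done

(* Run [run_worker n] in [workers] domains at once and wait for all of them. *)
let run_domains ~workers n =
  List.init workers (fun _ -&gt; Domain.spawn (fun () -&gt; run_worker n))
  |&gt; List.iter Domain.join

let () =
  (* The number of domains comes from the first argument, defaulting to 4. *)
  let workers =
    if Array.length Sys.argv &gt; 1
    then Option.value (int_of_string_opt Sys.argv.(1)) ~default:4
    else 4
  in
  run_domains ~workers 1
</code></pre>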
<h2 id="perf-sched">perf sched</h2>
<p><code>perf sched record</code> records kernel scheduling events, allowing it to show what is running on each CPU at all times.
<code>perf sched timehist</code> displays a report:</p>
<pre><code>$ sudo perf sched record -k CLOCK_MONOTONIC
^C

$ sudo perf sched timehist
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  185296.715345 [0000]  sway[175042]                        1.694      0.025      0.775 
  185296.716024 [0002]  crosvm_vcpu2[178276/178217]         0.012      0.000      2.957 
  185296.717031 [0003]  main.exe[196519]                    0.006      0.000      4.004 
  185296.717044 [0003]  rcu_preempt[18]                     4.004      0.015      0.012 
  185296.717260 [0001]  main.exe[196526]                    1.760      0.000      2.633 
  185296.717455 [0001]  crosvm_vcpu1[193502/193445]        63.809      0.015      0.194 
  ...
</code></pre>
<p>The first line here shows that <code>sway</code> needed to wait for 1.694 ms for some reason (possibly a sleep),
and then once it was due to resume, had to wait a further 0.025 ms for CPU 0 to be free. It then ran for 0.775 ms.
I decided to use <code>perf sched</code> to find out what the system was doing when a domain failed to respond quickly.</p>
<p>To make the output easier to read, I hacked eio-trace to display it on the traces.
<code>perf script -g python</code> will generate a skeleton Python script that can format all the events found in the <code>perf.data</code> file,
and I used that to convert the output to CSV.
To correlate OCaml domains with Linux threads, I also modified OCaml to report the thread ID (TID) for each new domain
(it was previously reporting the PID instead for some reason).</p>
<p>Here's a trace of the simple allocator from the previous section:</p>
<p><a href="/blog/images/perf/slow-no-affinity1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/slow-no-affinity1.png" title="eio-trace with perf sched data" class="caption"/><span class="caption-text">eio-trace with perf sched data</span></span></a></p>
<!-- 404ms -->
<p>Note: the colour of <code>stw_api_barrier</code> has changed: previously eio-trace coloured it yellow to indicate sleeping,
but now that we have the individual <code>sleep</code> events we can see exactly which part of it was sleeping.</p>
<p>The horizontal green bars show when each domain was running on the CPU.
Here, we see that most of the domains ran until they called <code>sleep</code>.
When the sleep timeout expires, the thread is ready to run again and goes on the run-queue.
Time spent waiting on the queue is shown with a black bar.</p>
<p>When switching to or from another process, the process name is shown.
Here we can see that <code>crosvm_vcpu6</code> interrupted one of our domains, making it late to respond to the GC request.</p>
<p>Here we see another odd feature of the protocol: even though the late domain was the last to be ready,
it wasn't able to start its GC even then, because only the leader is allowed to say when everyone is ready.
Several domains wake after the late one is ready and have to go back to sleep again.</p>
<p>The diagram also shows when Linux migrated our OCaml domains between CPUs.
For example:</p>
<ol>
<li>The bottom domain was initially running on CPU 0.
</li>
<li>After sleeping briefly, it spent a while waiting to resume and Linux moved it to CPU 6 (where the leader domain, idle at that point, had been running).
</li>
<li>Once there, the bottom domain slept briefly again, and again was slow to wake, getting moved to CPU 7.
</li>
</ol>
<p>Here's another example:</p>
<p><a href="/blog/images/perf/slow-no-affinity2.png"><span class="caption-wrapper center"><img src="/blog/images/perf/slow-no-affinity2.png" title="Two domains on the same CPU" class="caption"/><span class="caption-text">Two domains on the same CPU</span></span></a></p>
<ol>
<li>The bottom domain's sleep finished a while ago, and it's been stuck on the queue because it's on the same CPU as another domain.
</li>
<li>All the other domains are spinning, trying to become the leader for the next minor GC.
</li>
<li>Eventually, Linux preempts the 5th domain from the top to run the bottom domain
(the vertical green line indicates a switch between domains in the same process).
</li>
<li>The bottom domain finishes the previous minor GC, allowing the 3rd from top to start a new one.
</li>
<li>The new GC is delayed because the 5th domain is now waiting while the bottom domain spins.
</li>
<li>Eventually the bottom domain sleeps, allowing 5 to join and the GC starts.
</li>
</ol>
<p>I tried using the <a href="https://github.com/haesbaert/ocaml-processor">processor</a> package to pin each domain to a different CPU.
That cleaned up the traces a fair bit, but didn't make much difference to the runtime on my machine.</p>
<p>I also tried using <a href="https://www.man7.org/linux/man-pages/man1/chrt.1.html">chrt</a> to run the program as a high-priority &quot;real-time&quot; task,
which also didn't seem to help.
I wrote a <code>bpftrace</code> script to report if one of our domains was ready to resume and the scheduler instead ran something else.
That showed various things.
Often Linux was migrating something else out of the way and we had to wait for that,
but there were also some kernel tasks that seemed to be even higher priority, such as GPU drivers or uring workers.
I suspect to make this work you'd need to set the affinity of all the other processes to keep them away from the cores being used
(but that wouldn't work in this example because I'm using all of them!).
Come to think of it, running a CPU intensive task on every CPU at realtime priority was a dumb idea;
had it worked I wouldn't have been able to do anything else with the computer!</p>
<h2 id="olly">olly</h2>
<p>Exploring the scheduler behaviour was interesting, and might be needed for latency-sensitive tasks,
but how often do migrations and delays really cause trouble?
The slow GCs are interesting, but there are also sections like this where everything is going smoothly,
and minor GCs take less than 4 microseconds:</p>
<p><a href="/blog/images/perf/slow-no-affinity3.png"><span class="caption-wrapper center"><img src="/blog/images/perf/slow-no-affinity3.png" title="GCs going well" class="caption"/><span class="caption-text">GCs going well</span></span></a></p>
<p><a href="https://github.com/tarides/runtime_events_tools/">olly</a> can be used to get summary statistics:</p>
<pre><code>$ olly gc-stats './_build/default/stress/stress.exe --count=6 --internal-workers=24'
...
Solved 144 requests in 25.44s (0.18s/iter) (5.66 solves/s)

Execution times:
Wall time (s):	28.17
CPU time (s):	1.66
GC time (s):	169.88
GC overhead (% of CPU time):	10223.84%

GC time per domain (s):
Domain0: 	0.47
Domain1: 	9.34
Domain2: 	6.90
Domain3: 	6.97
Domain4: 	6.68
Domain5: 	6.85
Domain6: 	6.59
...
</code></pre>
<p>10223.84% GC overhead sounds like a lot, but I think this is misleading, for a few reasons:</p>
<ol>
<li>The CPU time looks wrong. <code>time</code> reports about 6 minutes, which sounds more likely.
</li>
<li>GC time (as we've seen) includes time spent sleeping, while CPU time doesn't.
</li>
<li>It doesn't include time spent trying to become a GC leader.
</li>
</ol>
<p>To double-check, I modified eio-trace to report GC statistics for a saved trace:</p>
<pre><code>Solved 144 requests in 26.84s (0.19s/iter) (5.36 solves/s)
...

$ eio-trace gc-stats trace.fxt
./trace.fxt:

Ring  GC/s     App/s    Total/s   %GC
  0   10.255   19.376   29.631    34.61
  1    7.986   10.201   18.186    43.91
  2    8.195   10.648   18.843    43.49
  3    9.521   14.398   23.919    39.81
  4    9.775   16.537   26.311    37.15
  5    8.084   10.635   18.719    43.19
  6    7.977   10.356   18.333    43.51
...
 24    7.920   10.802   18.722    42.30

All  213.332  308.578  521.910    40.88

Note: all times are wall-clock and so include time spent blocking.
</code></pre>
<p>It ran slightly slower under eio-trace, perhaps because recording a trace file is more work than maintaining some counters,
but it's similar.
So this indicates that with 24 domains GC is taking about 40% of the total time (including time spent sleeping).</p>
<p>But something doesn't add up, on my machine at least:</p>
<ul>
<li>With processes, the simple allocator test's main process spends 2% of its time in GC and takes 2.4s to run.
</li>
<li>With domains, the main domain spends 20% of its time in GC and takes 8.2s.
</li>
</ul>
<p>Even if that 20% were removed completely, it should only save 20% of the 8.2s.
So with domains, the code must be running more slowly even when it's not in the GC.</p>
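<p>To put rough numbers on that, using only the figures above
(<code>non_gc_time</code> is a hypothetical helper, and the percentages are approximate):</p>
<pre><code class="ocaml">(* Time left over once the fraction spent in GC is removed. *)
let non_gc_time ~total ~gc_fraction = total *. (1. -. gc_fraction)

let () =
  let processes = non_gc_time ~total:2.4 ~gc_fraction:0.02 in
  let domains = non_gc_time ~total:8.2 ~gc_fraction:0.20 in
  Printf.printf "processes: %.2fs, domains: %.2fs (%.1fx)\n"
    processes domains (domains /. processes)
</code></pre>
<p>That's roughly 2.4s of non-GC time with processes against about 6.6s with domains, for the same work.</p>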
<h2 id="magic-trace-on-the-simple-allocator">magic-trace on the simple allocator</h2>
<p>I tried running magic-trace to see what it was doing outside of the GC.
Since it wasn't calling any functions, it didn't show anything, but we can fix that:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">foo</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="mi">100</span> <span class="k">do</span>
</span><span class="line">    <span class="n">ignore</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">opaque_identity</span> <span class="o">(</span><span class="n">ref</span> <span class="bp">()</span><span class="o">))</span>
</span><span class="line">  <span class="k">done</span>
</span><span class="line"><span class="o">[@@</span><span class="n">inline</span> <span class="n">never</span><span class="o">]</span> <span class="o">[@@</span><span class="n">local</span> <span class="n">never</span><span class="o">]</span> <span class="o">[@@</span><span class="n">specialise</span> <span class="n">never</span><span class="o">]</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">run_worker</span> <span class="n">n</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="n">n</span> <span class="o">*</span> <span class="mi">100000</span> <span class="k">do</span>
</span><span class="line">    <span class="n">foo</span> <span class="bp">()</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here we do blocks of 100 allocations in a function called <code>foo</code>.
The annotations are to ensure the compiler doesn't inline it.
The trace was surprisingly variable!</p>
<p><a href="/blog/images/perf/foo-magic.png"><span class="caption-wrapper center"><img src="/blog/images/perf/foo-magic.png" title="magic-trace of foo between GCs" class="caption"/><span class="caption-text">magic-trace of foo between GCs</span></span></a></p>
<p>I see times for <code>foo</code> ranging from 50ns to around 750ns!</p>
<p>Note: the extra <code>foo</code> call above was probably due to a missed end event somewhere.</p>
<h2 id="perf-annotate">perf annotate</h2>
<p>I ran <code>perf record</code> on the simplified version:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">foo</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">for</span> <span class="o">_</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">to</span> <span class="mi">100</span> <span class="k">do</span>
</span><span class="line">    <span class="n">ignore</span> <span class="o">(</span><span class="nn">Sys</span><span class="p">.</span><span class="n">opaque_identity</span> <span class="o">(</span><span class="n">ref</span> <span class="bp">()</span><span class="o">))</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here the code is simple enough that we don't need stack-traces (so no <code>-g</code>):</p>
<pre><code>$ sudo perf record ./_build/default/main.exe
$ sudo perf annotate

       │    camlDune__exe__Main.foo_273():
       │      mov  $0x3,%eax
  0.04 │      cmp  $0xc9,%rax
       │    ↓ jg   39
  7.34 │ d:   sub  $0x10,%r15
 13.37 │      cmp  (%r14),%r15
  0.09 │    ↓ jb   3f
  0.21 │16:   lea  0x8(%r15),%rbx
 70.26 │      movq $0x400,-0x8(%rbx)
  6.66 │      movq $0x1,(%rbx)
  0.73 │      mov  %rax,%rbx
  0.00 │      add  $0x2,%rax
  0.01 │      cmp  $0xc9,%rbx
  0.66 │    ↑ jne  d
  0.28 │39:   mov  $0x1,%eax
  0.34 │    ← ret
  0.00 │3f: → call caml_call_gc
       │    ↑ jmp  16
</code></pre>
<p>The code starts by (pointlessly) checking if 1 &gt; 100 in case it can skip the whole loop.
After being disappointed, it:</p>
<ol>
<li>Decreases <code>%r15</code> (<code>young_ptr</code>) by 0x10 (two words).
</li>
<li>Checks if that's now below <code>young_limit</code>, calling <code>caml_call_gc</code> if so to clear the minor heap.
</li>
<li>Writes 0x400 to the first newly-allocated word (the block header, indicating 1 word of data).
</li>
<li>Writes 1 to the second word, which represents <code>()</code>.
</li>
<li>Increments the loop counter and loops, unless we're at the end.
</li>
<li>Returns <code>()</code>.
</li>
</ol>
<p>Looks like we spent most of the time (77%) writing the block, which makes sense.
Reading <code>young_limit</code> took 13% of the time, which seems reasonable too.
If there were contention between domains, we'd expect to see it here.</p>
<p>The output looked similar whether using domains or processes.</p>
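<p>The constants in that listing follow from OCaml's value representation, which can be checked from OCaml itself.
In this sketch <code>tag_int</code> and <code>header</code> are hypothetical helpers, and the header layout (size in words shifted left by 10 bits, colour bits zero, tag in the low byte) is assumed to be the standard 64-bit one:</p>
<pre><code class="ocaml">(* An OCaml int n is stored as the machine word 2n+1. *)
let tag_int n = (n lsl 1) lor 1

(* Block header: size in words from bit 10 upwards, tag in the low 8 bits
   (the colour bits in between are left at 0). *)
let header ~wosize ~tag = (wosize lsl 10) lor tag

let () =
  assert (tag_int 1 = 0x3);       (* initial loop counter *)
  assert (tag_int 100 = 0xc9);    (* loop bound *)
  assert (tag_int 0 = 0x1);       (* the word written for () *)
  assert ((Obj.magic () : int) = 0);
  (* [ref ()] is a block with tag 0 and one field. *)
  let r = Obj.repr (Sys.opaque_identity (ref ())) in
  assert (Obj.size r = 1 &amp;&amp; Obj.tag r = 0);
  assert (header ~wosize:1 ~tag:0 = 0x400)
</code></pre>
<p>This is also why the loop adds 2 to <code>%rax</code> each time: that increments the tagged counter by one.</p>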
<h2 id="perf-c2c">perf c2c</h2>
<p>To double-check, I also tried <code>perf c2c</code>.
This reports on cache-to-cache transfers, where two CPUs are accessing the same memory,
which requires the processors to communicate and is therefore relatively slow.</p>
<pre><code>$ sudo perf c2c record
^C

$ sudo perf c2c report
  Load Operations                   :      11898
  Load L1D hit                      :       4140
  Load L2D hit                      :         93
  Load LLC hit                      :       3750
  Load Local HITM                   :        251
  Store Operations                  :     116386
  Store L1D Hit                     :     104763
  Store L1D Miss                    :      11622
...
# ----- HITM -----  ------- Store Refs ------  ------- CL --------                      ---------- cycles ----------    Total       cpu                                    Shared                       
# RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A    Off  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol    Object      Source:Line  Node
...
      7        0        7        4        0        0      0x7f90b4002b80
  ----------------------------------------------------------------------
    0.00%  100.00%    0.00%    0.00%    0.00%    0x0     0       1            0x44a704         0       144       107        8         1  [.] Dune.exe.Main.foo_273       main.exe  main.ml:7        0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x0     0       1            0x4ba7b9         0         0         0        1         1  [.] caml_interrupt_all_signal_  main.exe  domain.c:318     0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x0     0       1            0x4ba7e2         0         0       323       49         1  [.] caml_reset_young_limit      main.exe  domain.c:1658    0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x8     0       1            0x4ce94d         0         0         0        1         1  [.] caml_empty_minor_heap_prom  main.exe  minor_gc.c:622   0
    0.00%    0.00%   25.00%    0.00%    0.00%    0x8     0       1            0x4ceed2         0         0         0        1         1  [.] caml_alloc_small_dispatch   main.exe  minor_gc.c:874   0
</code></pre>
<p>This shows a list of cache lines (memory addresses) and how often we loaded from a modified address.
There's a lot of information here and I don't understand most of it.
But I think the above is saying that address 0x7f90b4002b80 (<code>young_limit</code>, at offset 0) was accessed by these places across domains:</p>
<ul>
<li><code>main.ml:7</code> (<code>ref ()</code>) checks against <code>young_limit</code> to see if we need to call into the GC.
</li>
<li><code>domain.c:318</code> sets the limit to <code>UINTNAT_MAX</code> to signal that another domain wants a GC.
</li>
<li><code>domain.c:1658</code> sets it back to <code>young_trigger</code> after being signalled.
</li>
</ul>
<p>The same cacheline was also accessed at offset 8, which contains <code>young_ptr</code> (address of last allocation):</p>
<ul>
<li><code>minor_gc.c:622</code> sets <code>young_ptr</code> to <code>young_end</code> after a GC.
</li>
<li><code>minor_gc.c:874</code> adjusts <code>young_ptr</code> to re-do the allocation that triggered the GC.
</li>
</ul>
<p>This indicates false sharing: <code>young_ptr</code> only gets accessed from one domain but it's in the same cache line as <code>young_limit</code>.</p>
<p>The main thing is that the counts are all very low, indicating that this doesn't happen often.</p>
<p>I tried adding an <code>incr x</code> on a global variable in the loop, and got some more operations reported.
But using <code>Atomic.incr</code> massively increased the number of records:</p>
<table class="table"><thead><tr><th> </th><th style="text-align: right">    Original </th><th style="text-align: right">  incr     </th><th style="text-align: right"> Atomic.incr</th></tr></thead><tbody><tr><td>Load Operations    </td><td style="text-align: right">     11,898  </td><td style="text-align: right">  25,860   </td><td style="text-align: right">  2,658,364</td></tr><tr><td>Load L1D hit       </td><td style="text-align: right">      4,140  </td><td style="text-align: right">  15,181   </td><td style="text-align: right">    326,236</td></tr><tr><td>Load L2D hit       </td><td style="text-align: right">         93  </td><td style="text-align: right">     163   </td><td style="text-align: right">        295</td></tr><tr><td>Load LLC hit       </td><td style="text-align: right">      3,750  </td><td style="text-align: right">   3,173   </td><td style="text-align: right">  2,321,704</td></tr><tr><td>Load Local HITM    </td><td style="text-align: right">        251  </td><td style="text-align: right">     299   </td><td style="text-align: right">  2,317,885</td></tr><tr><td>Store Operations   </td><td style="text-align: right">    116,386  </td><td style="text-align: right"> 462,162   </td><td style="text-align: right">  3,909,500</td></tr><tr><td>Store L1D Hit      </td><td style="text-align: right">    104,763  </td><td style="text-align: right"> 389,492   </td><td style="text-align: right">  3,908,947</td></tr><tr><td>Store L1D Miss     </td><td style="text-align: right">     11,622  </td><td style="text-align: right">  72,667   </td><td style="text-align: right">        550</td></tr></tbody></table><p>See <a href="https://joemario.github.io/blog/2016/09/01/c2c-blog/">C2C - False Sharing Detection in Linux Perf</a> for more information about all this.</p>
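<p>For reference, the two variants could look something like this.
It's a guess at the shape of the experiment rather than the exact code used; <code>x</code>, <code>counter</code>, <code>foo_incr</code> and <code>foo_atomic</code> are names invented here:</p>
<pre><code class="ocaml">(* A plain global counter: domains race on it, but each update is an ordinary load and store. *)
let x = ref 0

(* An atomic counter: every update is an atomic read-modify-write on memory shared by all domains. *)
let counter = Atomic.make 0

let foo_incr () =
  for _i = 1 to 100 do
    ignore (Sys.opaque_identity (ref ()));
    incr x
  done

let foo_atomic () =
  for _i = 1 to 100 do
    ignore (Sys.opaque_identity (ref ()));
    Atomic.incr counter
  done
</code></pre>
<p>With several domains running <code>foo_incr</code>, updates to <code>x</code> can be lost, so its final value isn't reliable; that doesn't matter here, since only the memory traffic is of interest.</p>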
<h2 id="perf-stat">perf stat</h2>
<p><code>perf stat</code> shows statistics about a process.
I ran it with <code>-I 1000</code> to collect one-second samples.
Here are two samples from the test case on my machine,
one when it was running processes and one while it was using domains:</p>
<pre><code>$ perf stat -I 1000

# Processes
      8,032.71 msec cpu-clock         #    8.033 CPUs utilized
         2,475      context-switches  #  308.115 /sec
            51      cpu-migrations    #    6.349 /sec
            44      page-faults       #    5.478 /sec
35,268,665,452      cycles            #    4.391 GHz
48,673,075,188      instructions      #    1.38  insn per cycle
 9,815,905,270      branches          #    1.222 G/sec
    48,986,037      branch-misses     #    0.50% of all branches

# Domains
      8,008.11 msec cpu-clock         #    8.008 CPUs utilized
        10,970      context-switches  #    1.370 K/sec
           133      cpu-migrations    #   16.608 /sec
           232      page-faults       #   28.971 /sec
34,606,498,021      cycles            #    4.321 GHz
25,120,741,129      instructions      #    0.73  insn per cycle
 5,028,578,807      branches          #  627.936 M/sec
    24,402,161      branch-misses     #    0.49% of all branches
</code></pre>
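<p>As a sanity check, the &quot;insn per cycle&quot; column is just instructions divided by cycles.
Here is a quick Python sketch using the two samples above (the names are mine, not anything <code>perf</code> provides):</p>
<pre><code># (instructions, cycles) copied from the two perf stat samples above
samples = {
    'processes': (48_673_075_188, 35_268_665_452),
    'domains': (25_120_741_129, 34_606_498_021),
}

def insn_per_cycle(instructions, cycles):
    # Average number of instructions retired per clock cycle.
    return instructions / cycles

for name, (insns, cycles) in samples.items():
    print(name, round(insn_per_cycle(insns, cycles), 2))
</code></pre>
<p>Both runs burn about the same number of cycles, so the drop in this ratio is what accounts for the difference in instructions executed.</p>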
<p>We're doing a lot more context switches with domains, as expected due to the sleeps,
and we're executing far fewer instructions, which isn't surprising.
Reporting the counts for individual CPUs gets more interesting though:</p>
<pre><code>$ sudo perf stat -I 1000 -e instructions -Aa
# Processes
     1.000409485 CPU0        5,106,261,160      instructions
     1.000409485 CPU1        2,746,012,554      instructions
     1.000409485 CPU2       14,235,084,764      instructions
     1.000409485 CPU3        7,545,940,906      instructions
     1.000409485 CPU4        2,605,655,333      instructions
     1.000409485 CPU5        6,023,131,238      instructions
     1.000409485 CPU6        2,860,656,865      instructions
     1.000409485 CPU7        8,195,416,048      instructions
     2.001406580 CPU0        5,674,686,033      instructions
     2.001406580 CPU1        2,774,756,912      instructions
     2.001406580 CPU2       12,231,014,682      instructions
     2.001406580 CPU3        8,292,824,909      instructions
     2.001406580 CPU4        2,592,461,540      instructions
     2.001406580 CPU5        7,182,922,668      instructions
     2.001406580 CPU6        2,742,731,223      instructions
     2.001406580 CPU7        7,219,186,119      instructions
     3.002394302 CPU0        4,676,179,731      instructions
     3.002394302 CPU1        2,773,345,921      instructions
     3.002394302 CPU2       13,236,080,365      instructions
     3.002394302 CPU3        5,142,640,767      instructions
     3.002394302 CPU4        2,580,401,766      instructions
     3.002394302 CPU5       13,600,129,246      instructions
     3.002394302 CPU6        2,667,830,277      instructions
     3.002394302 CPU7        4,908,168,984      instructions

$ sudo perf stat -I 1000 -e instructions -Aa
# Domains
     1.002680009 CPU0        3,134,933,139      instructions
     1.002680009 CPU1        3,140,191,650      instructions
     1.002680009 CPU2        3,155,579,241      instructions
     1.002680009 CPU3        3,059,035,269      instructions
     1.002680009 CPU4        3,102,718,089      instructions
     1.002680009 CPU5        3,027,660,263      instructions
     1.002680009 CPU6        3,167,151,483      instructions
     1.002680009 CPU7        3,214,267,081      instructions
     2.003692744 CPU0        3,009,806,420      instructions
     2.003692744 CPU1        3,015,194,636      instructions
     2.003692744 CPU2        3,093,562,866      instructions
     2.003692744 CPU3        3,005,546,617      instructions
     2.003692744 CPU4        3,067,126,726      instructions
     2.003692744 CPU5        3,042,259,123      instructions
     2.003692744 CPU6        3,073,514,980      instructions
     2.003692744 CPU7        3,158,786,841      instructions
     3.004694851 CPU0        3,069,604,047      instructions
     3.004694851 CPU1        3,063,976,761      instructions
     3.004694851 CPU2        3,116,761,158      instructions
     3.004694851 CPU3        3,045,677,304      instructions
     3.004694851 CPU4        3,101,053,228      instructions
     3.004694851 CPU5        2,973,005,489      instructions
     3.004694851 CPU6        3,109,177,113      instructions
     3.004694851 CPU7        3,158,349,130      instructions
</code></pre>
<p>In the domains case all CPUs are doing roughly the same amount of work.
But when running separate processes the CPUs differ wildly!
Over the last 1-second interval, for example, CPU5 executed 5.3 times as many instructions as CPU4.
And indeed, some of the test processes are finishing much sooner than the others,
even though they all do the same work.</p>
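<p>To put a number on the imbalance, this small Python sketch compares the busiest and least busy CPU in the last interval of each run
(the counts are copied from the output above; <code>spread</code> is just a helper name I made up):</p>
<pre><code># Per-CPU instruction counts (CPU0 to CPU7) from the last 1-second interval
last_interval = {
    'processes': [4_676_179_731, 2_773_345_921, 13_236_080_365, 5_142_640_767,
                  2_580_401_766, 13_600_129_246, 2_667_830_277, 4_908_168_984],
    'domains': [3_069_604_047, 3_063_976_761, 3_116_761_158, 3_045_677_304,
                3_101_053_228, 2_973_005_489, 3_109_177_113, 3_158_349_130],
}

def spread(counts):
    # Ratio between the busiest and the least busy CPU.
    return max(counts) / min(counts)

for name, counts in last_interval.items():
    print(name, round(spread(counts), 2))
</code></pre>
<p>Note that <code>-a</code> counts everything running on each CPU, not just the benchmark, so this is only a rough measure.</p>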
<p>Setting <code>/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference</code> to <code>performance</code> didn't make it faster,
but setting it to <code>power</code> (power-saving mode) did make the processes benchmark much slower,
while having little effect on the domains case!</p>
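<p>If you want to check what your own machine is set to, something like this Python sketch should do
(it assumes the sysfs files above exist, which depends on the CPU frequency driver; on other systems it just prints an empty dictionary):</p>
<pre><code>import glob

def read_epp():
    # Map each cpufreq policy directory to its current preference.
    prefs = {}
    pattern = '/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference'
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                prefs[path.split('/')[-2]] = f.read().strip()
        except OSError:
            pass  # the file may be unreadable on some systems
    return prefs

print(read_epp())
</code></pre>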
<p>So I <em>think</em> what's happening here with separate processes is that
the CPU is boosting the performance of one or two cores at a time,
allowing them to make lots of progress.</p>
<p>But with domains this doesn't happen, either because no domain runs long enough before sleeping to trigger the boost,
or because, as soon as it does, it has to stop and wait for the other domains for a GC and loses the boost.</p>
<h2 id="conclusions">Conclusions</h2>
<p>The main profiling and tracing tools used were:</p>
<ul>
<li><code>perf</code> to take samples of CPU use, find hot functions and hot instructions within them,
record process scheduling, look at hardware counters, and find sources of cache contention.
</li>
<li><code>statmemprof</code> to find the source of allocations.
</li>
<li><code>eio-trace</code> to visualise GC events and as a generic canvas for custom visualisations.
</li>
<li><code>magic-trace</code> to see very detailed traces of recent activity when something goes wrong.
</li>
<li><code>olly</code> to report on GC statistics.
</li>
<li><code>bpftrace</code> for quick experiments about kernel behaviour.
</li>
<li><code>offcputime</code> to see why a process is sleeping.
</li>
</ul>
<p>I think OCaml 5's runtime events tracing was the star of the show here, making it much easier to see what was going on with GC,
especially in combination with <code>perf sched</code>.
<code>statmemprof</code> is also an essential tool for OCaml, and I'll be very glad to get it back with OCaml 5.3.
I think I need to investigate <code>perf</code> more; I'd never used many of these features before.
Though it is important to use it with <code>offcputime</code> etc to check you're not missing samples due to sleeping.</p>
<p>Unlike the previous post's example, where the cause was pretty obvious and led to a massive easy speed-up,
this one took a lot of investigation and revealed several problems, none of which seem very easy to fix.
I'm also a lot less confident that I really understand what's happening here, but here is a summary of my current guess:</p>
<ul>
<li>OCaml applications typically allocate lots of short-lived values.
</li>
<li>With a single domain this isn't much of a problem; minor GCs are fast.
With multiple domains however we have to wait for every domain to enter the
GC, and then wait again for them all to exit.
</li>
<li>This can be very fast (4 microseconds or so per GC),
but if one domain is late due to OS scheduling then it can be much longer
(several ms in some cases).
</li>
<li>When a domain needs to wait for another it spins for a bit and then sleeps.
If the other domain runs on the same CPU then spinning delays it from running.
On the other hand, sleeping introduces longer delays and can cause the CPU to slow down.
</li>
<li>Idle domains are currently expensive.
An idle domain requires a syscall to wake it, and often causes all the other domains to sleep waiting for it.
When the idle domain does wake, it still can't start the GC and has to wait again for the leader.
</li>
<li>If the leader gets suspended while holding the lock, all the other domains will spin waiting for it (without ever sleeping).
This time isn't accounted for in the GC events reported by OCaml 5.2.
</li>
</ul>
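<p>To see why adding domains makes slow collections more likely, here is a toy model in Python.
It is my own simplification, not something taken from the OCaml runtime:
it assumes each domain independently has a small chance <code>p_late</code> (a made-up parameter) of being descheduled when a minor GC starts,
and uses round figures of 4 microseconds for a fast GC and 3ms for a slow one.</p>
<pre><code>def expected_sync_us(domains, p_late, fast_us=4.0, late_us=3000.0):
    # Probability that at least one domain arrives late at the barrier.
    p_any_late = 1 - (1 - p_late) ** domains
    # Every domain waits for the slowest, so one late arrival costs them all.
    return fast_us + (late_us - fast_us) * p_any_late

for domains in (1, 8, 160):
    print(domains, round(expected_sync_us(domains, p_late=0.001), 1))
</code></pre>
<p>Even a rare delay on any one domain ends up dominating the average once enough domains have to synchronise.</p>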
<p>Since the sleeping mechanism will be changing in OCaml 5.3,
it would probably be worthwhile checking how that performs too.
I think there are some opportunities to improve the GC, such as letting idle domains opt out of GC after one collection,
and it looks like there are opportunities to reduce the amount of synchronisation done
(e.g. by letting late arrivers start the GC without having to wait for the leader,
or using a lock-free algorithm for becoming leader).</p>
<p>For the solver, it would be good to try experimenting with CPU affinity to keep a subset of the 160 cores reserved for the solver.
Increasing the minor heap size and doing work in the main domain should also reduce the overhead of GC,
and improving the version compare function in the opam library would greatly reduce the need for it.
And if my goal was really to make it fast (rather than to improve multicore OCaml and its tooling)
then I'd probably switch it back to using processes.</p>
<p>Finally, it was really useful that both of these blog posts examined performance regressions,
so I knew it must be possible to go faster.
Without a good idea of how fast something should be, it's easy to give up too early.</p>
<p>Anyway, I hope you found some useful new tool in these posts!</p>
<h2 id="update-2024-08-22">Update 2024-08-22</h2>
<p>I reported above that using <code>chrt</code> to make the process high priority didn't help on my machine.
It also didn't help on the ARM server using the real service.
However, it <em>did</em> help <em>a lot</em> when running the simplified version of the solver on the ARM server!</p>
<p>Some more investigation showed that the real service had an additional problem:
it spawned a <code>git log</code> subprocess after every solve, and this was causing all the domains to pause
for about 50ms during the fork operation.
<a href="https://github.com/ocurrent/solver-service/pull/79">Use OCaml code to find the oldest commit</a>
eliminated the problem (and is also faster, as it can cache the history, although that doesn't matter much).</p>
<p>Here are some benchmarks of various combinations of fixes:</p>
<ul>
<li>
<p><code>Processes, sched-other</code> is the original version using multiple processes.
Oddly, using <code>sched-rr</code> to make it high-priority actually slows this one down!</p>
</li>
<li>
<p><code>Domains, sched-other</code> is the original version using domains.
The results are noisy but using <code>sched-rr</code> doesn't seem to help.</p>
</li>
<li>
<p><code>ocaml-git</code> indicates that the <code>git log</code> subprocess has been replaced by OCaml code.</p>
</li>
<li>
<p><code>new opam</code> indicates using the latest Git version of the opam library,
which includes my <a href="https://github.com/ocaml/opam/pull/6144">Reduce allocations in opamVersionCompare</a>
as well as Kate's <a href="https://github.com/ocaml/opam/pull/5518">Speedup OpamVersionCompare by 25%</a>.</p>
<p><a href="/blog/images/perf/real-service.png"><span class="caption-wrapper center"><img src="/blog/images/perf/real-service.png" title="Performance of the full service" class="caption"/><span class="caption-text">Performance of the full service</span></span></a></p>
</li>
</ul>
<p>And with that, the new multicore solver service is finally faster than the old process-based one!</p>
]]></content>
  </entry>
  <entry>
    <title type="html">OCaml 5 performance problems</title>
    <link href="https://roscidus.com/blog/blog/2024/07/22/performance/"></link>
    <updated>2024-07-22T10:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2024/07/22/performance</id>
    <content type="html"><![CDATA[<p>Linux and OCaml provide a huge range of tools for investigating performance problems.
In this post I try using some of them to understand a network performance problem.
In <a href="/blog/blog/2024/07/22/performance-2/">part 2</a>, I'll investigate a problem in a CPU-intensive multicore program.</p>
<!-- more -->
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-problem">The problem</a>
</li>
<li><a href="#time">time</a>
</li>
<li><a href="#eio-trace">eio-trace</a>
</li>
<li><a href="#strace">strace</a>
</li>
<li><a href="#bpftrace">bpftrace</a>
</li>
<li><a href="#tcpdump">tcpdump</a>
</li>
<li><a href="#ss">ss</a>
</li>
<li><a href="#offwaketime">offwaketime</a>
</li>
<li><a href="#magic-trace">magic-trace</a>
</li>
<li><a href="#summary-script">Summary script</a>
</li>
<li><a href="#fixing-it">Fixing it</a>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<h2 id="the-problem">The problem</h2>
<p>While porting <a href="https://github.com/mirage/capnp-rpc">capnp-rpc</a> from <a href="https://github.com/ocsigen/lwt/">Lwt</a> to <a href="https://github.com/ocaml-multicore/eio">Eio</a>,
to take advantage of OCaml 5's new effects system,
I tried running the benchmark to see if it got any faster:</p>
<pre><code>$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44933.359573 # The old Lwt version
echo_bench.exe: [INFO] rate = 511.963565   # The (buggy) Eio version
</code></pre>
<p>The benchmark records the number of echo RPCs per second.
Clearly, something is very wrong here!
In fact, the new version was so slow I had to reduce the number of iterations so it would finish.</p>
<h2 id="time">time</h2>
<p>The old <code>time</code> command can immediately give us a hint:</p>
<pre><code>$ /usr/bin/time ./echo_bench.exe
1.85user 0.42system 0:02.31elapsed 98%CPU  # Lwt
0.16user 0.05system 0:01.95elapsed 11%CPU  # Eio (buggy)
</code></pre>
<p>(many shells provide their own <code>time</code> built-in with different output formats; I'm using <code>/usr/bin/time</code> here)</p>
<p><code>time</code>'s output shows time spent in user-mode (running the application's code on the CPU),
time spent in the kernel, and the total wall-clock time.
Both versions ran for around 2 seconds (doing a different number of iterations),
but the Lwt version was using the CPU 98% of the time, while the Eio version was mostly sleeping.</p>
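<p>The CPU percentage is roughly the user and system time divided by the elapsed time.
A quick check in Python (<code>time</code> itself works from more precise values than the ones it prints, so it may differ by a percent):</p>
<pre><code>def cpu_percent(user, system, elapsed):
    # Share of wall-clock time spent on the CPU, as a rounded percentage.
    return round(100 * (user + system) / elapsed)

print('Lwt:', cpu_percent(1.85, 0.42, 2.31))
print('Eio:', cpu_percent(0.16, 0.05, 1.95))
</code></pre>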
<h2 id="eio-trace">eio-trace</h2>
<p><a href="https://github.com/ocaml-multicore/eio-trace">eio-trace</a> can be used to see what an Eio program is doing.
Tracing is always available (you don't need to recompile the program to get it).</p>
<pre><code>$ eio-trace run -- ./echo_bench.exe
</code></pre>
<p><code>eio-trace run</code> runs the command and displays the trace in a window.
You can also use <code>eio-trace record</code> to save a trace and examine it later.</p>
<p><a href="/blog/images/perf/capnp-eio-slow-many.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow-many.png" title="Trace of slow benchmark (12 concurrent requests)" class="caption"/><span class="caption-text">Trace of slow benchmark (12 concurrent requests)</span></span></a></p>
<p>The benchmark runs 12 test clients at once, making it a bit noisy.
To simplify things, I set it to run only one client:</p>
<p><a href="/blog/images/perf/capnp-eio-slow.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow.png" title="Trace of slow benchmark (one request at a time)" class="caption"/><span class="caption-text">Trace of slow benchmark (one request at a time)</span></span></a></p>
<p>I've zoomed the image to show the first four iterations.
The first is so quick it's not really visible, but the next three take about 40ms each.
The yellow regions labelled &quot;suspend-domain&quot; show when the program is sleeping, waiting for an event from Linux.
Each horizontal bar is a fiber (a light-weight thread). From top to bottom they are:</p>
<ul>
<li>Three rows for the test client:
<ul>
<li>The main application fiber performing the RPC call (mostly awaiting responses).
</li>
<li>The network's write fiber, sending outgoing messages (mostly waiting for something to send).
</li>
<li>The network's read fiber, reading incoming messages (mostly waiting for something to read).
</li>
</ul>
</li>
<li>Four rows for the server:
<ul>
<li>A loop accepting new incoming TCP connections.
</li>
<li>A short-lived fiber that accepts the new connection, then short-lived fibers each handling one request.
</li>
<li>The server's network write fiber.
</li>
<li>The server's network read fiber.
</li>
</ul>
</li>
<li>One fiber owned by Eio itself (used to wake up the event loop in some situations).
</li>
</ul>
<p>This trace immediately raises a couple of questions:</p>
<ul>
<li>
<p>Why is there a 40ms delay in each iteration of the test loop?</p>
</li>
<li>
<p>Why does the program briefly wake up in the middle of the first delay, do nothing, and return to sleep?
(notice the extra &quot;suspend-domain&quot; at the top)</p>
</li>
</ul>
<p>Zooming in on a section between the delays, let's see what it's doing when it's not sleeping:</p>
<p><a href="/blog/images/perf/capnp-eio-slow-zoom1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow-zoom1.png" title="Zoomed in on the active part" class="caption"/><span class="caption-text">Zoomed in on the active part</span></span></a></p>
<p>After a 40ms delay, the server's read fiber receives the next request (the running fiber is shown in green).
The read fiber spawns a fiber to handle the request, which finishes quickly, starts the next read,
and then the write fiber transmits the reply.</p>
<p>The client's read fiber gets the reply, the write fiber outputs a message, then the application fiber runs
and another message is sent.
The server reads something (presumably the first message, though it happens after the client has sent both),
then there is another long 40ms delay, then (far off the right of the image) the pattern repeats.</p>
<p>To get more context in the trace,
I <a href="https://ocaml-multicore.github.io/eio/eio/Eio/Private/Trace/index.html#val-log">configured</a>
the logging library to write the (existing) debug-level log messages to the trace buffer too:</p>
<p><a href="/blog/images/perf/capnp-eio-slow-zoom1-debug.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-eio-slow-zoom1-debug.png" title="With log messages" class="caption"/><span class="caption-text">With log messages</span></span></a></p>
<p>Log messages tend to be a bit long for the trace display, so they overlap and you have to zoom right in to read them,
but they do help navigate.
With this, I can see that the first client write is &quot;Send finish&quot; and the second is &quot;Calling Echo.ping&quot;.</p>
<p>Looks like we're not buffering the output, so it's doing two separate writes rather than combining them.
That's a little inefficient, and if you've done much network programming,
you probably already know why this might cause a 40ms delay,
but let's pretend we don't know so we can play with a few more tools...</p>
<h2 id="strace">strace</h2>
<p><a href="https://github.com/strace/strace">strace</a> can be used to trace interactions between applications and the Linux kernel
(<code>-tt -T</code> shows when each call was started and how long it took):</p>
<pre><code>$ strace -tt -T ./echo_bench.exe
...
11:38:58.079200 write(2, &quot;echo_bench.exe: [INFO] Accepting&quot;..., 73) = 73 &lt;0.000008&gt;
11:38:58.079253 io_uring_enter(4, 4, 0, 0, NULL, 8) = 4 &lt;0.000032&gt;
11:38:58.079341 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000020&gt;
11:38:58.079408 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000021&gt;
11:38:58.079471 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000018&gt;
11:38:58.079525 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000019&gt;
11:38:58.079580 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 &lt;0.000013&gt;
11:38:58.079611 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 &lt;0.000009&gt;
11:38:58.079637 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffc1661a480, 24) = -1 ETIME (Timer expired) &lt;0.018913&gt;
11:38:58.098669 futex(0x5584542b767c, FUTEX_WAKE_PRIVATE, 1) = 1 &lt;0.000105&gt;
11:38:58.098889 futex(0x5584542b7690, FUTEX_WAKE_PRIVATE, 1) = 1 &lt;0.000059&gt;
11:38:58.098976 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 &lt;0.021355&gt;
</code></pre>
<p>On Linux, Eio defaults to using the <a href="https://github.com/axboe/liburing">io_uring</a> mechanism for submitting work to the kernel.
<code>io_uring_enter(4, 2, 0, 0, NULL, 8) = 2</code> means we asked to submit 2 new operations to the ring on FD 4,
and the kernel accepted them.</p>
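<p>The first four arguments are the ring's FD, the number of operations to submit, the minimum number of completions to wait for, and some flags.
Here is a rough Python helper for labelling them when reading a trace
(the function is just something I made up for illustration, and it only handles simple successful calls like the one above):</p>
<pre><code>def parse_io_uring_enter(line):
    # Split 'io_uring_enter(args) = result' into its pieces.
    call, _, result = line.rpartition(' = ')
    args = call[call.index('(') + 1 : call.rindex(')')].split(', ')
    return {
        'fd': int(args[0]),
        'to_submit': int(args[1]),
        'min_complete': int(args[2]),
        'flags': args[3],
        'result': int(result),
    }

print(parse_io_uring_enter('io_uring_enter(4, 2, 0, 0, NULL, 8) = 2'))
</code></pre>
<p>So the calls with <code>0, 1</code> as the second and third arguments submit nothing and wait for at least one completion.</p>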
<p>The call at <code>11:38:58.079637</code> timed out after 19ms.
It then woke up some <a href="https://www.man7.org/linux/man-pages/man2/futex.2.html">futexes</a> and then waited again, getting woken up after a further 21ms (for a total of 40ms).</p>
<p>Futexes are used to coordinate between system threads.
<code>strace -f</code> will follow all spawned threads (and processes), not just the main one:</p>
<pre><code>$ strace -T -f ./echo_bench.exe
...
[pid 48451] newfstatat(AT_FDCWD, &quot;/etc/resolv.conf&quot;, {st_mode=S_IFREG|0644, st_size=40, ...}, 0) = 0 &lt;0.000011&gt;
...
[pid 48451] futex(0x561def43296c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY &lt;unfinished ...&gt;
...
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffe1d5d1c90, 24) = -1 ETIME (Timer expired) &lt;0.018899&gt;
[pid 48449] futex(0x561def43296c, FUTEX_WAKE_PRIVATE, 1) = 1 &lt;0.000106&gt;
[pid 48451] &lt;... futex resumed&gt;)        = 0 &lt;0.019981&gt;
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 &lt;unfinished ...&gt;
...
[pid 48451] exit(0)                     = ?
[pid 48451] +++ exited with 0 +++
[pid 48449] &lt;... io_uring_enter resumed&gt;) = 0 &lt;0.021205&gt;
...
</code></pre>
<p>The benchmark connects to <code>&quot;127.0.0.1&quot;</code> and Eio uses <code>getaddrinfo</code> to look up addresses (we can't use uring for this).
Since <code>getaddrinfo</code> can block for a long time, Eio creates a new system thread (pid 48451) to handle it
(we can guess this thread is doing name resolution because we see it read <code>resolv.conf</code>).</p>
<p>As creating system threads is a little slow, Eio keeps the thread around for a bit after it finishes in case it's needed again.
The timeout is when Eio decides that the thread isn't needed any longer and asks it to exit.
So this isn't relevant to our problem (and only happens on the first 40ms delay, since we don't look up any further addresses).</p>
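<p>The general pattern of pushing a blocking lookup onto a worker thread looks something like this Python sketch.
This is only an illustration of the idea, not Eio's actual implementation;
port 7000 is the one the benchmark's server uses, and <code>resolve_in_thread</code> is a name I invented:</p>
<pre><code>import socket
from concurrent.futures import ThreadPoolExecutor

# A single worker thread that is kept around between lookups.
pool = ThreadPoolExecutor(max_workers=1)

def resolve_in_thread(host, port):
    # Returns a future; the caller can keep doing other work meanwhile.
    return pool.submit(socket.getaddrinfo, host, port, type=socket.SOCK_STREAM)

future = resolve_in_thread('127.0.0.1', 7000)
addresses = [info[4] for info in future.result()]
print(addresses)
</code></pre>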
<p>However, strace doesn't tell us what the uring operations were, or their return values.
One option is to switch to the <code>posix</code> backend (which is the default on non-Linux systems).
In fact, it's a good idea with any performance problem to check if it still happens with a different backend:</p>
<pre><code>$ EIO_BACKEND=posix strace -T -tt ./echo_bench.exe
...
11:53:52.935976 writev(7, [{iov_base=&quot;\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0&quot;..., iov_len=40}], 1) = 40 &lt;0.000170&gt;
11:53:52.936308 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, {tv_sec=0, tv_nsec=0}, NULL, 8) = 1 ([{fd=8, revents=POLLIN}], left {tv_sec=0, tv_nsec=0}) &lt;0.000044&gt;
11:53:52.936500 writev(7, [{iov_base=&quot;\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0&quot;..., iov_len=136}], 1) = 136 &lt;0.000055&gt;
11:53:52.936831 readv(8, [{iov_base=&quot;\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0&quot;..., iov_len=4096}], 1) = 40 &lt;0.000056&gt;
11:53:52.937516 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, NULL, NULL, 8) = 1 ([{fd=8, revents=POLLIN}]) &lt;0.038972&gt;
11:53:52.977751 readv(8, [{iov_base=&quot;\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0&quot;..., iov_len=4096}], 1) = 136 &lt;0.000398&gt;
</code></pre>
<p>(to reduce clutter, I removed calls that returned <code>EAGAIN</code> and <code>ppoll</code> calls that returned 0 ready descriptors)</p>
<p>The problem still occurs, and now we can see the two writes:</p>
<ul>
<li>The client writes 40 bytes to its end of the socket (FD 7), after which the server's end (FD 8) is ready for reading (<code>revents=POLLIN</code>).
</li>
<li>The client then writes another 136 bytes.
</li>
<li>The server reads 40 bytes and then uses <code>ppoll</code> to await further data.
</li>
<li>After 39ms, <code>ppoll</code> says FD 8 is now ready, and the server reads the other 136 bytes.
</li>
</ul>
<h2 id="bpftrace">bpftrace</h2>
<p>Alternatively, we can trace uring operations using <a href="https://github.com/bpftrace/bpftrace">bpftrace</a>.
bpftrace is a little scripting language similar to awk,
except that instead of editing a stream of characters,
it live-patches the running Linux kernel.
Apparently this is safe to run in production
(and I haven't managed to crash my kernel with it yet).</p>
<p>Here is a list of uring tracepoints we can probe:</p>
<pre><code>$ sudo bpftrace -l 'tracepoint:io_uring:*'
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_cqe_overflow
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_defer
tracepoint:io_uring:io_uring_fail_link
tracepoint:io_uring:io_uring_file_get
tracepoint:io_uring:io_uring_link
tracepoint:io_uring:io_uring_local_work_run
tracepoint:io_uring:io_uring_poll_arm
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_req_failed
tracepoint:io_uring:io_uring_short_write
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_task_add
tracepoint:io_uring:io_uring_task_work_run
</code></pre>
<p><code>io_uring_complete</code> looks promising:</p>
<pre><code>$ sudo bpftrace -vl tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_complete
    void * ctx
    void * req
    u64 user_data
    int res
    unsigned cflags
    u64 extra1
    u64 extra2
</code></pre>
<p>Here's a script to print out the time, process, operation name and result for each completion:</p>
<figure class="code"><figcaption><span>uringtrace.bt</span></figcaption><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nc">BEGIN</span> <span class="o">{</span>
</span><span class="line">  <span class="o">@</span><span class="n">op</span><span class="o">[</span><span class="nc">IORING_OP_NOP</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;NOP&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="o">@</span><span class="n">op</span><span class="o">[</span><span class="nc">IORING_OP_READV</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;READV&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="o">...</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="n">tracepoint</span><span class="o">:</span><span class="n">io_uring</span><span class="o">:</span><span class="n">io_uring_complete</span> <span class="o">{</span>
</span><span class="line">  <span class="o">$</span><span class="n">req</span> <span class="o">=</span> <span class="o">(</span><span class="k">struct</span> <span class="n">io_kiocb</span> <span class="o">*)</span> <span class="n">args</span><span class="o">-&gt;</span><span class="n">req</span><span class="o">;</span>
</span><span class="line">  <span class="n">printf</span><span class="o">(</span><span class="s2">&quot;%dms: %s: %s %d</span><span class="se">\n</span><span class="s2">&quot;</span><span class="o">,</span>
</span><span class="line">    <span class="n">elapsed</span> <span class="o">/</span> <span class="mf">1e6</span><span class="o">,</span>
</span><span class="line">    <span class="n">comm</span><span class="o">,</span>
</span><span class="line">    <span class="o">@</span><span class="n">op</span><span class="o">[$</span><span class="n">req</span><span class="o">-&gt;</span><span class="n">opcode</span><span class="o">],</span>
</span><span class="line">    <span class="n">args</span><span class="o">-&gt;</span><span class="n">res</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="nc">END</span> <span class="o">{</span>
</span><span class="line">  <span class="n">clear</span><span class="o">(@</span><span class="n">op</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></tbody></table></div></figure><pre><code>$ sudo bpftrace uringtrace.bt
Attaching 3 probes...
...
1743ms: echo_bench.exe: WRITE_FIXED 40
1743ms: echo_bench.exe: READV 40
1743ms: echo_bench.exe: WRITE_FIXED 136
1783ms: echo_bench.exe: READV 136
</code></pre>
<p>In this output, the order is slightly different:
we see the server's read get the 40 bytes before the client sends the rest,
but we still see the 40ms delay between the completion of the second write and the corresponding read.
The change in order is because we're seeing when the kernel knew the read was complete,
not when the application found out about it.</p>
<h2 id="tcpdump">tcpdump</h2>
<p>An obvious step with any networking problem is to look at the packets going over the network.
<a href="https://www.tcpdump.org/">tcpdump</a> can be used to capture packets, which can be displayed on the console or in a GUI with <a href="https://www.wireshark.org/">wireshark</a>.</p>
<pre><code>$ sudo tcpdump -n -ttttt -i lo
...
...041330 IP ...37640 &gt; ...7000: Flags [P.], ..., length 40
...081975 IP ...7000 &gt; ...37640: Flags [.], ..., length 0
...082005 IP ...37640 &gt; ...7000: Flags [P.], ..., length 136
...082071 IP ...7000 &gt; ...37640: Flags [.], ..., length 0
</code></pre>
<p>Here we see the client (on port 37640) sending 40 bytes to the server (port 7000),
and the server replying with an ACK (with no payload) 40ms later.
After getting the ACK, the client socket sends the remaining 136 bytes.</p>
<p>So while the application made the two writes in quick succession,
TCP waited before sending the second one.
Searching for &quot;delayed ack 40ms&quot; will turn up an explanation.</p>
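<p>If you want to play with this outside of capnp-rpc, here is a minimal Python sketch of the same write-write-read pattern over loopback.
It's based on my assumption that the delay comes from the sender holding back the second small write until the first is ACKed,
while the receiver delays its ACK; setting <code>TCP_NODELAY</code> on the client is the usual way to test that.
The message sizes match the trace above, but everything else (names, the 2-byte reply, the number of rounds) is made up for the example,
and the timings will vary between machines and kernels:</p>
<pre><code>import socket
import threading
import time

MSG1, MSG2 = b'a' * 40, b'b' * 136   # sizes taken from the trace above

def recv_exact(sock, n):
    # Read exactly n bytes (raises if the peer closes early).
    buf = b''
    while len(buf) != n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed')
        buf += chunk
    return buf

def serve(listener, rounds):
    # Wait for both messages each round, then send a tiny reply.
    conn, _ = listener.accept()
    with conn:
        for _ in range(rounds):
            recv_exact(conn, len(MSG1) + len(MSG2))
            conn.sendall(b'ok')

def mean_round_trip_ms(nodelay, rounds=20):
    listener = socket.create_server(('127.0.0.1', 0))
    server = threading.Thread(target=serve, args=(listener, rounds), daemon=True)
    server.start()
    client = socket.create_connection(listener.getsockname())
    if nodelay:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.monotonic()
    for _ in range(rounds):
        client.sendall(MSG1)   # first small write goes out at once
        client.sendall(MSG2)   # second may be held until the first is ACKed
        recv_exact(client, 2)
    elapsed = time.monotonic() - start
    client.close()
    server.join()
    listener.close()
    return 1000 * elapsed / rounds

if __name__ == '__main__':
    print('default socket: %.2f ms per round' % mean_round_trip_ms(False))
    print('TCP_NODELAY:    %.2f ms per round' % mean_round_trip_ms(True))
</code></pre>
<p>Combining the two writes into one <code>sendall</code> is the other obvious thing to try.</p>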
<h2 id="ss">ss</h2>
<p><a href="https://www.man7.org/linux/man-pages/man8/ss.8.html">ss</a> displays socket statistics.
<code>ss -tin</code> shows all TCP sockets (<code>-t</code>) with internals (<code>-i</code>):</p>
<pre><code>$ ss -tin 'sport = 7000 or dport = 7000'
State   Recv-Q   Send-Q  Local Address:Port  Peer Address:Port
ESTAB   0        0       127.0.0.1:7000      127.0.0.1:56224
 ato:40 lastsnd:34 lastrcv:34 lastack:34
ESTAB   0        176     127.0.0.1:56224     127.0.0.1:7000
 ato:40 lastsnd:34 lastrcv:34 lastack:34 unacked:1 notsent:136
</code></pre>
<p>There's a lot of output here; I've removed the irrelevant bits.
<code>ato:40</code> says there's a 40ms timeout for &quot;delay ack mode&quot;.
<code>lastsnd</code>, etc, say that nothing had happened for 34ms when this information was collected.
<code>unacked</code> and <code>notsent</code> aren't documented in the man-page,
but I guess they mean that the client (now port 56224) is waiting for 1 packet to be ack'd and has 136 bytes waiting until then.</p>
<p>The client socket still has both messages (176 bytes total) in its queue;
it can't forget about the first message until the server confirms receiving it,
since the client might need to send it again if it got lost.</p>
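<p>The numbers line up with what we saw earlier.
A throwaway Python check (it only copes with the trimmed <code>key:value</code> fields shown above, not full <code>ss</code> output):</p>
<pre><code>def parse_ss_info(line):
    # Turn 'ato:40 lastsnd:34' into {'ato': 40, 'lastsnd': 34}.
    return {key: int(value) for key, value in (f.split(':') for f in line.split())}

client = parse_ss_info('ato:40 lastsnd:34 lastrcv:34 lastack:34 unacked:1 notsent:136')
first_message = 40   # bytes, from the first write in the strace output
print(client['notsent'] + first_message)   # compare with Send-Q
</code></pre>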
<p>This doesn't quite lead us to the solution, though.</p>
<h2 id="offwaketime">offwaketime</h2>
<p><a href="https://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html">offwaketime</a> records why a program stopped using the CPU, and what caused it to resume:</p>
<pre><code>$ sudo offwaketime-bpfcc -f -p (pgrep echo_bench.exe) &gt; wakes
$ flamegraph.pl --colors=chain wakes &gt; wakes.svg
</code></pre>
<p><a href="/blog/images/perf/wakes.svg"><span class="caption-wrapper center"><img src="/blog/images/perf/wakes.svg" title="Time spent suspended along with wakeup reason" class="caption"/><span class="caption-text">Time spent suspended along with wakeup reason</span></span></a></p>
<p><code>offwaketime</code> records a stack-trace when a process is suspended (shown at the bottom and going up)
and pairs it with the stack-trace of the thread that caused it to be resumed (shown above it and going down).</p>
<p>The taller column on the right shows Eio being woken up due to TCP data being received from the network,
confirming that it was the TCP ACK that got things going again.</p>
<p>The shorter column on the left was unexpected, and the <code>[UNKNOWN]</code> in the stack is annoying
(probably C code compiled without frame pointers).
<code>gdb</code> gets a better stack trace.
It turned out to be OCaml's tick thread, which wakes every 50ms to prevent one sys-thread from hogging the CPU:</p>
<pre><code>$ strace -T -e pselect6 -p (pgrep echo_bench.exe) -f
strace: Process 20162 attached with 2 threads
...
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) &lt;0.050441&gt;
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) &lt;0.050318&gt;
</code></pre>
<p>Having multiple threads shown on the same diagram is a bit confusing.
I should probably have used <code>-t</code> to focus only on the main one.</p>
<p>Also, note that when using profiling tools that record the OCaml stack,
it's useful to compile with frame pointers enabled.
To install e.g. OCaml 5.2.0 with frame pointers enabled, use:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="sh"><span class="line">$<span class="w"> </span>opam<span class="w"> </span>switch<span class="w"> </span>create<span class="w"> </span><span class="m">5</span>.2.0-fp<span class="w"> </span>ocaml-variants.5.2.0+options<span class="w"> </span>ocaml-option-fp
</span></code></pre></td></tr></tbody></table></div></figure><h2 id="magic-trace">magic-trace</h2>
<p><a href="https://magic-trace.org/">magic-trace</a> allows capturing a short trace of everything the CPUs were doing just before some event.
It uses Intel Processor Trace to have the CPU record all control flow changes (calls, branches, etc) to a ring-buffer,
with fairly low overhead (2% to 10%, due to extra memory bandwidth needed).
When something interesting happens, we save the buffer and use it to reconstruct the recent history.</p>
<p>Normally we'd need to set up a trigger to grab the buffer at the right moment,
but since this program is mostly idle it doesn't record much
and I just attached at a random point and immediately pressed Ctrl-C to grab a snapshot and detach:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="sh"><span class="line">$<span class="w"> </span>sudo<span class="w"> </span>magic-trace<span class="w"> </span>attach<span class="w"> </span>-multi-thread<span class="w"> </span>-trace-include-kernel<span class="w"> </span><span class="se">\</span>
</span><span class="line"><span class="w">    </span>-p<span class="w"> </span><span class="o">(</span>pgrep<span class="w"> </span>echo_bench.exe<span class="o">)</span>
</span><span class="line"><span class="o">[</span><span class="w"> </span>Attached.<span class="w"> </span>Press<span class="w"> </span>Ctrl-C<span class="w"> </span>to<span class="w"> </span>stop<span class="w"> </span>recording.<span class="w"> </span><span class="o">]</span>
</span><span class="line">^C
</span></code></pre></td></tr></tbody></table></div></figure><p>As before, we see 40ms periods of waiting, with bursts of activity between them:</p>
<p><a href="/blog/images/perf/capnp-magic-1.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-magic-1.png" title="Magic trace showing 40ms delays" class="caption"/><span class="caption-text">Magic trace showing 40ms delays</span></span></a></p>
<p>The output is a bit messed up because magic-trace doesn't understand that there are multiple OCaml fibers here,
each with their own stack. It also doesn't seem to know that exceptions unwind the stack.</p>
<p>In each 40ms column, <code>Eio_posix.Flow.single_read</code> (3rd line from top) tried to do a read
with <code>readv</code>, which got <code>EAGAIN</code> and called <code>Sched.next</code> to switch to the next fiber.
Since there was nothing left to run, the Eio scheduler called <code>ppoll</code>.
Linux didn't have anything ready for this process,
and called the <code>schedule</code> kernel function to switch to another process.</p>
<p>I recorded an eio-trace at the same time, to see the bigger picture.
Here's the eio-trace zoomed in to show the two client writes (just before the 40ms wait),
with the relevant bits of the magic-trace stack pasted below them:</p>
<p><a href="/blog/images/perf/capnp-magic-2.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-magic-2.png" title="Zoomed in on the two client writes, showing eio-trace and magic-trace output together" class="caption"/><span class="caption-text">Zoomed in on the two client writes, showing eio-trace and magic-trace output together</span></span></a></p>
<p>We can see the OCaml code calling <code>writev</code>, entering the kernel, <code>tcp_write_xmit</code> being called to handle it,
writing the IP packet to the network and then, because this is the loopback interface, the network receive logic
handling the packet too.
The second call is much shorter; <code>tcp_write_xmit</code> returns quickly without sending anything.</p>
<p>Note: I used the <code>eio_posix</code> backend here so it's easier to correlate the kernel operations to the application calls
(uring queues them up and runs them later).
The <a href="https://github.com/koonwen/uring-trace">uring-trace</a> project should make this easier in future, but doesn't integrate with eio-trace yet.</p>
<p>Zooming in further, it's easy to see the difference between the two calls to <code>tcp_write_xmit</code>:</p>
<p><a href="/blog/images/perf/tcp_write_xmit.png"><span class="caption-wrapper center"><img src="/blog/images/perf/tcp_write_xmit.png" title="The start of the first tcp_write_xmit and the whole of the second" class="caption"/><span class="caption-text">The start of the first tcp_write_xmit and the whole of the second</span></span></a>
Looking at the source for <a href="https://github.com/torvalds/linux/blob/v6.6/net/ipv4/tcp_output.c#L2727-L2731"><code>tcp_write_xmit</code></a>,
we finally find the magic word &quot;<a href="https://en.wikipedia.org/wiki/Nagle's_algorithm">nagle</a>&quot;!</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="c"><span class="line"><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">tcp_nagle_test</span><span class="p">(</span><span class="n">tp</span><span class="p">,</span><span class="w"> </span><span class="n">skb</span><span class="p">,</span><span class="w"> </span><span class="n">mss_now</span><span class="p">,</span>
</span><span class="line"><span class="w">			     </span><span class="p">(</span><span class="n">tcp_skb_is_last</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span><span class="w"> </span><span class="n">skb</span><span class="p">)</span><span class="w"> </span><span class="o">?</span>
</span><span class="line"><span class="w">			      </span><span class="nl">nonagle</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">TCP_NAGLE_PUSH</span><span class="p">))))</span>
</span><span class="line"><span class="w">	</span><span class="k">break</span><span class="p">;</span>
</span></code></pre></td></tr></tbody></table></div></figure><h2 id="summary-script">Summary script</h2>
<p>Having identified a load of interesting events,
I wrote <a href="/blog/data/perf/summary-posix.bt">summary-posix.bt</a>, a bpftrace script to summarise them.
This includes log messages written by the application (by tracing <code>write</code> calls to stderr),
reads and writes on the sockets,
and probes on various kernel functions that I'd noticed in the magic-trace output or while reading the kernel source.</p>
<p>The output is specialised to this application (for example, TCP segments sent to port 7000
are displayed as &quot;to server&quot;, while others are &quot;to client&quot;).
I think this is a useful way to double-check my understanding, and any fix:</p>
<pre><code>$ sudo bpftrace summary-posix.bt
[...]
844ms: server: got ping request; sending reply
844ms: server reads from socket (EAGAIN)
844ms: server: writev(96 bytes)
844ms:   tcp_write_xmit (to client, nagle-on, packets_out=0)
844ms:   tcp_v4_send_check: sending 96 bytes to client
844ms: tcp_v4_rcv: got 96 bytes
844ms:   timer_start (tcp_delack_timer, 40 ms)
844ms: client reads 96 bytes from socket
844ms: client: enqueue finish message
844ms: client: enqueue ping call
844ms: client reads from socket (EAGAIN)
844ms: client: writev(40 bytes)
844ms:   tcp_write_xmit (to server, nagle-on, packets_out=0)
844ms:   tcp_v4_send_check: sending 40 bytes to server
845ms: tcp_v4_rcv: got 40 bytes
845ms:   timer_start (tcp_delack_timer, 40 ms)
845ms: client: writev(136 bytes)
845ms:   tcp_write_xmit (to server, nagle-on, packets_out=1)
845ms: server reads 40 bytes from socket
845ms: server reads from socket (EAGAIN)
885ms: tcp_delack_timer_handler (ACK to client)
885ms:   tcp_v4_send_check: sending 0 bytes to client
885ms: tcp_delack_timer_handler (ACK to server)
885ms: tcp_v4_rcv: got 0 bytes
885ms:   tcp_write_xmit (to server, nagle-on, packets_out=0)
885ms:   tcp_v4_send_check: sending 136 bytes to server
</code></pre>
<ol>
<li>The server replies to a ping request, sending a 96 byte reply.
Nagle is on, but nothing is awaiting an ACK (<code>packets_out=0</code>) so it gets sent immediately.
</li>
<li>The client receives the data. It starts a 40ms timer to send an ACK for it.
</li>
<li>The client enqueues a &quot;finish&quot; message, followed by another &quot;ping&quot; request.
</li>
<li>The client's write fiber sends the 40 byte &quot;finish&quot; message.
Nothing is awaiting an ACK (<code>packets_out=0</code>) so the kernel sends it immediately.
</li>
<li>The client sends the 136 byte ping request. As the previous message (the &quot;finish&quot;) hasn't been ACK'd, it isn't sent yet.
</li>
<li>The server receives the 40 byte finish message.
</li>
<li>40ms pass. The server's delayed ACK timer fires and it sends the ACK to the client.
</li>
<li>The client's delayed ACK timer fires, but there's nothing to do (it sent the ACK with the &quot;finish&quot;).
</li>
<li>The client socket gets the ACK for its &quot;finish&quot; message and sends the delayed ping request.
</li>
</ol>
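<p>The steps above can be reproduced with a toy model.
This is only a sketch of the rule as seen in the trace (a small write is held back while earlier data is unacknowledged);
it ignores full-sized segments, timers and everything else the kernel does.
The <code>sender</code> type and the <code>create</code>, <code>transmit</code>, <code>write</code> and <code>ack</code> functions are invented for illustration:</p>
<pre><code class="ocaml">(* Toy model of a Nagle-style sender. All names here are hypothetical. *)
type sender = {
  packets_out : int ref;  (* segments sent but not yet acknowledged *)
  notsent : int ref;      (* bytes queued, waiting for an ACK *)
  wire : int list ref;    (* sizes of segments sent so far, newest first *)
}

let create () = { packets_out = ref 0; notsent = ref 0; wire = ref [] }

let transmit s len =
  s.wire := len :: !(s.wire);
  incr s.packets_out

(* A small write goes out at once only if nothing is awaiting an ACK. *)
let write s len =
  if !(s.packets_out) = 0 then transmit s len
  else s.notsent := !(s.notsent) + len

(* An ACK clears the in-flight data and releases whatever was held back. *)
let ack s =
  s.packets_out := 0;
  if !(s.notsent) > 0 then begin
    transmit s !(s.notsent);
    s.notsent := 0
  end

let () =
  let client = create () in
  write client 40;    (* "finish": sent immediately *)
  write client 136;   (* ping: held, as the "finish" is unacknowledged *)
  Printf.printf "sent before ACK: %d segment(s)\n" (List.length !(client.wire));
  ack client;         (* the delayed ACK from the server finally arrives *)
  Printf.printf "sent after ACK: %d segment(s)\n" (List.length !(client.wire))
</code></pre>
<p>In this model the 136 byte request only reaches the wire when <code>ack</code> is called,
which in the real trace is 40ms after the &quot;finish&quot; message was sent.</p>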
<h2 id="fixing-it">Fixing it</h2>
<p>The problem seemed clear: while porting from Lwt to Eio I'd lost the output buffering.
So I looked at the Lwt code to see how it did it and... it doesn't! So how was it working?</p>
<p>As I did with Eio, I set the Lwt benchmark's concurrency to 1 to simplify it for tracing,
and discovered that Lwt with 1 client thread has exactly the same problem as the Eio version.
Well, that's embarrassing!
But why is Lwt fast with 12 client threads?</p>
<p>With only minor changes (e.g. <code>write</code> vs <code>writev</code>), the summary script above also worked for tracing the Lwt version.
With 1 or 2 client threads, Lwt is slow, but with 3 it's fairly fast.
The delay only happens if the client sends a &quot;finish&quot; message when the server has no replies queued up
(otherwise the finish message unblocks the replies, which carry the ACK to the client immediately).
So, it works mostly by fluke!
Lwt just happens to schedule the threads in such a way that Nagle's algorithm mostly doesn't trigger with 12 concurrent requests.</p>
<p>Anyway, adding buffering to the Eio version fixed the problem:</p>
<p><a href="/blog/images/perf/capnp-before.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-before.png" title="Before" class="caption"/><span class="caption-text">Before</span></span></a>
<a href="/blog/images/perf/capnp-after.png"><span class="caption-wrapper center"><img src="/blog/images/perf/capnp-after.png" title="After (same scale)" class="caption"/><span class="caption-text">After (same scale)</span></span></a></p>
<p>An interesting thing to notice here is that not only did the long delay go away,
but the CPU operations while it was active were faster too!
I think the reason is that the CPU goes into power-saving mode during the long delays.
<code>cpupower monitor</code> shows my CPUs running at around 1 GHz with the old code and
around 4.7 GHz when running the new version.</p>
<p>Here are the results for the fixed version:</p>
<pre><code>$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44425.962625 # The old Lwt version
echo_bench.exe: [INFO] rate = 59653.451934 # The fixed Eio version
</code></pre>
<p>60k RPC requests per second doesn't seem that impressive, but at least it's faster than the old version,
which is good enough for now! There's clearly scope for improvement here (for example, the buffering I
added is quite inefficient, making two extra copies of every message, as the framing library copies it from
a cstruct to a string, and then I have to copy the string back to a cstruct for the kernel).</p>
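<p>For anyone unfamiliar with the technique, here is a minimal sketch of the kind of write buffering described in this section.
It is not the actual change I made: <code>write_to_socket</code> is a stand-in that just records what it is given,
and <code>send_unbuffered</code> and <code>send_buffered</code> are made-up names.</p>
<pre><code class="ocaml">(* Stand-in for the socket: records each chunk handed to the kernel. *)
let writes : string list ref = ref []
let write_to_socket data = writes := data :: !writes

(* Unbuffered: one write per message, so Nagle can delay the second one. *)
let send_unbuffered msgs = List.iter write_to_socket msgs

(* Buffered: collect everything that is ready, then write it in one go. *)
let send_buffered msgs =
  let buf = Buffer.create 256 in
  List.iter (Buffer.add_string buf) msgs;
  if Buffer.length buf > 0 then write_to_socket (Buffer.contents buf)

let () =
  let finish = String.make 40 'f' and ping = String.make 136 'p' in
  send_buffered [finish; ping];
  Printf.printf "%d write(s), %d bytes\n"
    (List.length !writes) (String.length (List.hd !writes))
</code></pre>
<p>With both messages in a single write, the kernel never sees a second small write arriving while the first is unacknowledged.</p>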
<h2 id="conclusions">Conclusions</h2>
<p>There are lots of great tools available to help understand why something is running slowly (or misbehaving),
and since programmers usually don't have much time for profiling,
a little investigation will often turn up something interesting!
Even when things are working correctly, these tools are a good way to learn more about how things work.</p>
<p><code>time</code> will quickly tell you if the program is taking lots of time in application code, in the kernel, or just sleeping.
If the problem is sleeping, <code>offcputime</code> and <code>offwaketime</code> can tell you why it was waiting and what woke it in the end.
My own <code>eio-trace</code> tool will give a quick visual overview of what an Eio application is doing.
<code>strace</code> is great for tracing interactions between applications and the kernel,
but it doesn't help much when the application is using uring.
To fix that, you can either switch to the <code>eio_posix</code> backend or use <code>bpftrace</code> with the uring tracepoints.
<code>tcpdump</code>, <code>wireshark</code> and <code>ss</code> are all useful to examine network problems specifically.</p>
<p>I've found <code>bpftrace</code> to be really useful for all kinds of tasks.
Being able to write quick one-liners or short scripts gives it great flexibility.
Since the scripts run in the kernel you can also filter and aggregate data efficiently
without having to pass it all to userspace, and you can examine any kernel data structures.
We didn't need that here because the program was running so slowly, but it's great for many problems.
In addition to using well-defined tracepoints,
it can also probe any (non-inlined) function in the kernel or the application.
Using it to create a &quot;summary script&quot; to confirm a problem and its solution also seems useful,
though this is the first time I've tried doing that.</p>
<p><code>magic-trace</code> is great for getting really detailed function-by-function tracing through the application and kernel.
Its ability to report the last few ms of activity after you notice a problem is extremely useful
(though not needed in this example).
It would be really useful if you could trigger magic-trace from a bpftrace script, but I didn't see a way to do that.</p>
<p>However, it was surprisingly difficult to get any of the tools to point directly
at the combination of Nagle's algorithm with delayed ACKs as the cause of this common problem!</p>
<p>This post was mainly focused on what was happening in the kernel.
In <a href="/blog/blog/2024/07/22/performance-2/">part 2</a>, I'll investigate a CPU-intensive problem instead.</p>
]]></content>
  </entry>
  <entry>
    <title type="html">Lambda Capabilities</title>
    <link href="https://roscidus.com/blog/blog/2023/04/26/lambda-capabilities/"></link>
    <updated>2023-04-26T10:00:00+00:00</updated>
    <id>https://roscidus.com/blog/blog/2023/04/26/lambda-capabilities</id>
    <content type="html"><![CDATA[<p>&quot;Is this software safe?&quot; is a question software engineers should be able to answer,
but doing so can be difficult.
Capabilities offer an elegant solution, but seem to be little known among functional programmers.
This post is an introduction to capabilities in the context of ordinary programming
(using plain functions, in the style of the lambda calculus).</p>
<!-- more -->
<p>Even if you're not interested in security,
capabilities provide a useful way to understand programs;
when trying to track down buggy behaviour,
it's very useful to know that some component <em>couldn't</em> have been the problem.</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#the-problem">The Problem</a>
</li>
<li><a href="#option-1-security-as-a-separate-concern">Option 1: Security as a separate concern</a>
</li>
<li><a href="#option-2-purity">Option 2: Purity</a>
</li>
<li><a href="#option-3-capabilities">Option 3: Capabilities</a>
<ul>
<li><a href="#attenuation">Attenuation</a>
</li>
<li><a href="#web-server-example">Web-server example</a>
</li>
<li><a href="#use-at-different-scales">Use at different scales</a>
</li>
<li><a href="#key-points">Key points</a>
</li>
</ul>
</li>
<li><a href="#practical-considerations">Practical considerations</a>
<ul>
<li><a href="#plumbing-capabilities-everywhere">Plumbing capabilities everywhere</a>
</li>
<li><a href="#levels-of-support">Levels of support</a>
</li>
<li><a href="#running-on-a-traditional-os">Running on a traditional OS</a>
</li>
<li><a href="#use-with-existing-security-mechanisms">Use with existing security mechanisms</a>
</li>
<li><a href="#thread-local-storage">Thread-local storage</a>
</li>
<li><a href="#symlinks">Symlinks</a>
</li>
<li><a href="#time-and-randomness">Time and randomness</a>
</li>
<li><a href="#power-boxes">Power boxes</a>
</li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
</li>
</ul>
<p>( this post also appeared on <a href="https://www.reddit.com/r/ProgrammingLanguages/comments/130an3z/lambda_capabilities/">Reddit</a>, <a href="https://news.ycombinator.com/item?id=35723557">Hacker News</a> and <a href="https://lobste.rs/s/uyj3vj/lambda_capabilities">Lobsters</a> )</p>
<h2 id="the-problem">The Problem</h2>
<p>We have some application (for example, a web-server) that we want to run.
The application is many thousands of lines long and depends on dozens of third-party libraries,
which get updated on a regular basis.
I would like to be able to check, quickly and easily, that the application cannot do any of these things:</p>
<ul>
<li>Delete my files.
</li>
<li>Append a line to my <code>~/.ssh/authorized_keys</code> file.
</li>
<li>Act as a relay, allowing remote machines to attack other computers on my local network.
</li>
<li>Send telemetry to a third party.
</li>
<li>Anything else bad that I forget to think about.
</li>
</ul>
<p>For example, here are some of the OCaml packages I use just to generate this blog:</p>
<p><a href="/blog/images/lambda-caps/blog-deps.svg"><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/blog-deps.svg" title="Dependency graph for this blog" class="caption"/><span class="caption-text">Dependency graph for this blog</span></span></a></p>
<p>Having to read every line of every version of each of these packages in order to decide whether it's safe
to generate the blog clearly isn't practical.</p>
<p>I'll start by looking at traditional solutions to this problem, using e.g. containers or VMs,
and then show how to do better using capabilities.</p>
<h2 id="option-1-security-as-a-separate-concern">Option 1: Security as a separate concern</h2>
<p>A common approach to access control treats securing software as a separate activity to writing it.
Programmers write (insecure) software, and a security team writes a policy saying what it can do.
Examples include firewalls, containers, virtual machines, seccomp policies, SELinux and AppArmor.</p>
<p>The great advantage of these schemes is that security can be applied after the software is written, treating it as a black box.
However, they come with many problems:</p>
<dl><dt>Confused deputy problem</dt>
<dd>
<p>Some actions are OK for one use but not for another.</p>
<p>For example, if the client of a web-server requests <code>https://example.com/../../etc/httpd/server-key.pem</code>
then we don't want the server to read this file and send it to them.
But the server does need to read this file for other reasons, so the policy must allow it.</p>
</dd>
<dt>Coarse-grained controls</dt>
<dd>
<p>All the modules making up the program are treated the same way,
even though you probably trust some more than others.</p>
<p>For example, we might trust the TLS implementation with the server's private key, but not the templating engine,
and I know the modules I wrote myself are not malicious.</p>
</dd>
<dt>Even well-typed programs go wrong</dt>
<dd>
<p>Programming in a language with static types is supposed to ensure that if the program compiles then it won't crash.
But the security policy can cause the program to fail even though it passed the compiler's checks.</p>
<p>For example, the server might sometimes need to send an email notification.
If it didn't do that while the security policy was being written, then that will be blocked.
Or perhaps the web-server didn't even have a notification system when the policy was written,
but has since been updated.</p>
</dd>
<dt>Policy language limitations</dt>
<dd>
<p>The security configuration is written in a new language, which must be learned.
It's usually not worth learning this just for one program,
so the people who write the program struggle to write the policy.
Also, the policy language often cannot express the desired policy,
since it may depend on concepts unique to the program
(e.g. controlling access based on a web-app user's ID, rather than local Unix user ID).</p>
</dd>
</dl>
<p>All of the above problems stem from trying to separate security from the code.
If the code were fully correct, we wouldn't need the security layer.
Checking that code is fully correct is hard,
but maybe there are easy ways to check automatically that it does at least satisfy our security requirements...</p>
<h2 id="option-2-purity">Option 2: Purity</h2>
<p>One way to prevent programs from performing unwanted actions is to prevent <em>all</em> actions.
In pure functional languages, such as Haskell, the only way to interact with the outside world is to return the action you want to perform from <code>main</code>. For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="haskell"><span class="line"><span class="nf">f</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Int</span><span class="w"> </span><span class="ow">-&gt;</span><span class="w"> </span><span class="kt">String</span>
</span><span class="line"><span class="nf">f</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="nf">main</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">IO</span><span class="w"> </span><span class="nb">()</span>
</span><span class="line"><span class="nf">main</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="n">putStr</span><span class="w"> </span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="mi">42</span><span class="p">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Even if we don't look at the code of <code>f</code>, we can be sure it only returns a <code>String</code> and performs no other actions
(assuming <a href="https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/safe_haskell.html">Safe Haskell</a> is being used).
Assuming we trust <code>putStr</code>, we can be sure this program will only output a string to stdout and not perform any other actions.</p>
<p>However, writing only pure code is quite limiting. Also, we still need to audit all IO code.</p>
<h2 id="option-3-capabilities">Option 3: Capabilities</h2>
<p>Consider this code (written in a small OCaml-like functional language, where <code>ref n</code> allocates a new memory location
initially containing <code>n</code>, and <code>!x</code> reads the current value of <code>x</code>):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span> <span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">5</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">y</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">10</span> <span class="k">in</span>
</span><span class="line">  <span class="n">f</span> <span class="n">x</span><span class="o">;</span>
</span><span class="line">  <span class="k">assert</span> <span class="o">(!</span><span class="n">y</span> <span class="o">=</span> <span class="mi">10</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Can we be sure that the assert won't fail, without knowing the definition of <code>f</code>?
Assuming the language doesn't provide unsafe backdoors (such as OCaml's <code>Obj.magic</code>), we can.
<code>f x</code> cannot change <code>y</code>, because <code>f x</code> does not have access to <code>y</code>.</p>
<p>So here is an access control system, built in to the lambda calculus itself!
At first glance this might not look very promising.
For example, while <code>f</code> doesn't have access to <code>y</code>, it does have access to any global variables defined before <code>f</code>.
It also, typically, has access to the file-system and network,
which are effectively globals too.</p>
<p>To make this useful, we ban global variables.
Then any top-level function like <code>f</code> can only access things passed to it explicitly as arguments.
Avoiding global variables is usually considered good practice, and some systems ban them for other reasons anyway
(for example, safe Rust doesn't allow unsynchronised global mutable state, as it wouldn't be able to prevent races accessing it from multiple threads).</p>
<p>Returning to the Haskell example above (but now in OCaml syntax),
it looks like this in our capability system:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">f</span> <span class="n">x</span> <span class="o">=</span> <span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">ch</span> <span class="o">=</span> <span class="n">output_string</span> <span class="n">ch</span> <span class="o">(</span><span class="n">f</span> <span class="mi">42</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Since <code>f</code> is a top-level function, we know it does not close over any mutable state, and our <code>42</code> argument is pure data.
Therefore, the call <code>f 42</code> does not have access to, and therefore cannot affect,
any pre-existing state (including the filesystem).
Internally, it can use mutation (creating arrays, etc),
but it has nowhere to store any mutable values and so they will get GC'd after it returns.
<code>f</code> therefore appears as a pure function, and calling it multiple times will always give the same result,
just as in the Haskell version.</p>
<p><code>output_string</code> is also a top-level function, closing over no mutable state.
However, the function resulting from evaluating <code>output_string ch</code> is not top-level,
and without knowing anything more about it we should assume it has full access to the output channel <code>ch</code>.</p>
<p>If <code>main</code> is invoked with standard output as its argument, it may output a message to it,
but cannot affect other pre-existing state.</p>
<p>In this way, we can reason about the pure parts of our code as easily as with Haskell,
but we can also reason about the parts with side-effects.
Haskell's purity is just a special case of a more general rule:
the effects of a (top-level) function are bounded by its arguments.</p>
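<p>Here is a tiny runnable illustration of that rule in real OCaml, using a <code>Buffer.t</code> to stand in for the output channel.
The body of <code>f</code> is made up (the examples above leave it unspecified),
and since OCaml itself doesn't ban globals, the guarantee only holds here because <code>f</code> and <code>main</code> happen not to use any:</p>
<pre><code class="ocaml">(* [f] uses mutation internally, but nothing mutable escapes from it. *)
let f x =
  let total = ref 0 in
  for i = 1 to x do total := !total + i done;
  string_of_int !total

(* [main] can only affect the buffer it is given. *)
let main out = Buffer.add_string out (f 42)

let () =
  let a = Buffer.create 16 and b = Buffer.create 16 in
  main a;
  (* [main] was never given [b], so it cannot have written to it. *)
  assert (Buffer.length b = 0);
  (* Calling [f] again gives the same result as before. *)
  assert (Buffer.contents a = f 42);
  print_endline (Buffer.contents a)
</code></pre>
<p>The first assertion needs no knowledge of what <code>f</code> does; it follows from the fact that <code>b</code> was never passed to anything.</p>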
<h3 id="attenuation">Attenuation</h3>
<p>So far, we've been thinking about what values are reachable through other values.
For example, the set of ref-cells that can be modified by <code>f x</code> is bounded by
the union of the set of ref cells reachable from the closure <code>f</code>
with the set of ref cells reachable from <code>x</code>.</p>
<p>One powerful aspect of capabilities is that we can use functions to implement whatever access controls we want.
For example, let's say we only want <code>f</code> to be able to set the ref-cell, but not read it.
We can just pass it a suitable function:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="n">x</span> <span class="o">:=</span> <span class="n">v</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="n">f</span> <span class="n">set</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or perhaps we only want to allow storing positive integers:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">if</span> <span class="n">v</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">then</span> <span class="n">set</span> <span class="n">v</span>
</span><span class="line">  <span class="k">else</span> <span class="n">invalid_arg</span> <span class="s2">&quot;Positive values only!&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or we can allow access to be revoked:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ref</span> <span class="o">(</span><span class="nc">Some</span> <span class="n">set</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">match</span> <span class="o">!</span><span class="n">r</span> <span class="k">with</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">Some</span> <span class="n">fn</span> <span class="o">-&gt;</span> <span class="n">fn</span> <span class="n">v</span>
</span><span class="line">  <span class="o">|</span> <span class="nc">None</span> <span class="o">-&gt;</span> <span class="n">invalid_arg</span> <span class="s2">&quot;Access revoked!&quot;</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="o">...</span>
</span><span class="line"><span class="n">r</span> <span class="o">:=</span> <span class="nc">None</span>		<span class="c">(* Revoke *)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or we could limit the number of times it can be used:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">used</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">if</span> <span class="o">!</span><span class="n">used</span> <span class="o">&lt;</span> <span class="mi">3</span> <span class="k">then</span> <span class="o">(</span><span class="n">incr</span> <span class="n">used</span><span class="o">;</span> <span class="n">set</span> <span class="n">v</span><span class="o">)</span>
</span><span class="line">  <span class="k">else</span> <span class="n">invalid_arg</span> <span class="s2">&quot;Quota exceeded&quot;</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or log each time it is used, tagged with a label that's meaningful to us
(e.g. the function to which we granted access):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">log</span> <span class="o">=</span> <span class="n">ref</span> <span class="bp">[]</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">set</span> <span class="n">name</span> <span class="n">v</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">Printf</span><span class="p">.</span><span class="n">sprintf</span> <span class="s2">&quot;%S set it to %d&quot;</span> <span class="n">name</span> <span class="n">v</span> <span class="k">in</span>
</span><span class="line">  <span class="n">log</span> <span class="o">:=</span> <span class="n">msg</span> <span class="o">::</span> <span class="o">!</span><span class="n">log</span><span class="o">;</span>
</span><span class="line">  <span class="n">set</span> <span class="n">v</span>
</span><span class="line"><span class="k">in</span>
</span><span class="line"><span class="n">f</span> <span class="o">(</span><span class="n">set</span> <span class="s2">&quot;f&quot;</span><span class="o">);</span>
</span><span class="line"><span class="n">g</span> <span class="o">(</span><span class="n">set</span> <span class="s2">&quot;g&quot;</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Or all of the above.</p>
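<p>As a rough sketch of "all of the above", the wrappers can be written as ordinary functions that take a <code>set</code> capability and return a weaker one, and then composed.
The helper names (<code>positive_only</code>, <code>revocable</code>, <code>with_quota</code>) are made up for this illustration:</p>

```ocaml
(* Hypothetical helpers: each takes a [set] capability and returns a weaker one. *)
let positive_only set v =
  if v > 0 then set v
  else invalid_arg "Positive values only!"

let revocable set =
  let r = ref (Some set) in
  let set v =
    match !r with
    | Some fn -> fn v
    | None -> invalid_arg "Access revoked!"
  in
  let revoke () = r := None in
  (set, revoke)

let with_quota n set =
  let used = ref 0 in
  fun v ->
    if !used < n then (incr used; set v)
    else invalid_arg "Quota exceeded"

let () =
  let x = ref 0 in
  let set v = x := v in
  (* Stack the restrictions; the caller keeps [revoke] for itself. *)
  let set, revoke = revocable (with_quota 2 (positive_only set)) in
  set 5;
  assert (!x = 5);
  (* Rejected by [positive_only], so [x] is unchanged. *)
  (try set (-1) with Invalid_argument _ -> ());
  assert (!x = 5);
  revoke ();
  (* After revocation, nothing gets through. *)
  (try set 7 with Invalid_argument _ -> ());
  assert (!x = 5)
```

<p>The order of wrapping is a design choice: here a rejected negative value still counts against the quota, because <code>with_quota</code> sits outside <code>positive_only</code>.</p>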
<p>In these examples, our function <code>f</code> never got direct access (permission) to <code>x</code>, yet was still able to affect it.
Therefore, in capability systems people often talk about &quot;authority&quot; rather than permission.
Roughly speaking, the <em>authority</em> of a subject is the set of actions that the subject could cause to happen,
now or in the future, on currently-existing resources.
Since it's only things that <em>might</em> happen, and we don't want to read all the code to find out exactly what
it might do, we're usually only interested in getting an upper-bound on a subject's authority,
to show that it <em>can't</em> do something.</p>
<p>The examples here all used a single function.
We may want to allow multiple operations on a single value (e.g. getting and setting a ref-cell),
and the usual techniques are available for doing that (e.g. having the function take the operation as its first argument,
or collecting separate functions together in a record, module or object).</p>
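<p>For instance, here is a minimal sketch of the record approach.
The <code>cell</code> type and the <code>make_cell</code> and <code>read_only</code> helpers are invented for this example:</p>

```ocaml
(* A hypothetical record type bundling the operations we choose to expose. *)
type cell = {
  get : unit -> int;
  set : int -> unit;
}

(* The ref-cell itself is only reachable through the two closures. *)
let make_cell initial =
  let x = ref initial in
  { get = (fun () -> !x); set = (fun v -> x := v) }

(* Attenuate: same reads, but every write is refused. *)
let read_only cell =
  { cell with set = (fun _ -> invalid_arg "Read-only!") }

let () =
  let cell = make_cell 0 in
  let view = read_only cell in
  cell.set 42;
  (* The view observes writes made through the full capability... *)
  assert (view.get () = 42);
  (* ...but cannot make any itself. *)
  assert (try view.set 1; false with Invalid_argument _ -> true);
  assert (cell.get () = 42)
```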
<h3 id="web-server-example">Web-server example</h3>
<p>Let's look at a more realistic example.
Here's a simple web-server (we are defining the <code>main</code> function, which takes two arguments):</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">net</span> <span class="n">htdocs</span> <span class="o">=</span>
</span><span class="line">  <span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>To use it, we pass it access to some network (<code>net</code>) and a directory tree with the content (<code>htdocs</code>).
Immediately we can see that this server does not access any part of the file-system outside of <code>htdocs</code>,
but that it may use the network. Here's a picture of the situation:</p>
<p><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/web1.svg" title="Initial reference graph" class="caption"/><span class="caption-text">Initial reference graph</span></span></p>
<p>Notes on reading the diagram:</p>
<ul>
<li>The diagram shows a model of the reference graph, where each node represents some value (function, record, tuple, etc)
or aggregated group of values.
</li>
<li>An arrow from A to B indicates the possibility that some value in the group A holds a reference to
some value in the group B.
</li>
<li>The model is typically an <em>over-approximation</em>, so the lack of an arrow from A to B means that no such reference
exists, while the presence of an arrow just means we haven't ruled it out.
</li>
<li>Orange nodes here represent OCaml values.
</li>
<li>White boxes are directories.
They include all contained files and subdirectories, except those shown separately.
I've pulled out <code>htdocs</code> so we can see that <code>app</code> doesn't have access to the rest of <code>home</code>.
Just for emphasis, I also show <code>.ssh</code> separately.
I'm assuming here that a directory doesn't give access to its parent,
so <code>htdocs</code> can only be used to read files within that sub-tree.
</li>
<li><code>net</code> represents the network and everything else connected to it.
</li>
<li>In most operating systems, directories exist in the kernel's address space,
and so you cannot have a direct reference to them.
That's not a problem, but for now you may find it easier to imagine a system where the kernel and applications
are all a single program, in a single programming language.
</li>
<li>This diagram represents the state at a particular moment in time (when starting the application).
We could also calculate and show all the references that might ever come to exist,
given what we know about the behaviour of <code>app</code> and <code>net</code>.
Since we don't yet know anything about either,
we would have to assume that <code>app</code> might give <code>net</code> access to <code>htdocs</code> and to itself.
</li>
</ul>
<p>So, the diagram above shows the application <code>app</code> has been given references to <code>net</code> and to <code>htdocs</code> as arguments.</p>
<p>Looking at our checklist from the start:</p>
<ul>
<li>It can't delete all my files, but it might delete the ones in <code>htdocs</code>.
</li>
<li>It can't edit <code>~/.ssh/authorized_keys</code>.
</li>
<li>It might act as a relay, allowing remote machines to attack other computers on my local network.
</li>
<li>It might send telemetry to a third party.
</li>
</ul>
<p>We can read the body of the function to learn more:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">net</span> <span class="n">htdocs</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">socket</span> <span class="o">=</span> <span class="nn">Net</span><span class="p">.</span><span class="n">listen</span> <span class="n">net</span> <span class="o">(`</span><span class="nc">Tcp</span> <span class="mi">8080</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">handler</span> <span class="o">=</span> <span class="n">static_files</span> <span class="n">htdocs</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="n">handler</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Note: <code>Net.listen net</code> is typical OCaml style for performing the <code>listen</code> operation on <code>net</code>.
We could also have used a record and written <code>net.listen</code> instead, which may look more familiar to some readers.</p>
<p>Here's an updated diagram, showing the moment when <code>Http.serve</code> is called.
The <code>app</code> group has been opened to show <code>socket</code> and <code>handler</code> separately:</p>
<p><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/web2.svg" title="After reading the code of main" class="caption"/><span class="caption-text">After reading the code of main</span></span></p>
<p>We can see that the code in the HTTP library can only access the network via <code>socket</code>,
and can only access <code>htdocs</code> by using <code>handler</code>.
Assuming <code>Net.listen</code> is trustworthy (we'll normally trust the platform's networking layer),
it's clear that the application doesn't make outbound connections,
since <code>net</code> is used only to create a listening socket.</p>
<p>To know what the application might do to <code>htdocs</code>, we only have to read the definition of <code>static_files</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">static_files</span> <span class="n">dir</span> <span class="n">request</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Path</span><span class="p">.</span><span class="n">load</span> <span class="o">(</span><span class="n">dir</span> <span class="o">/</span> <span class="n">request</span><span class="o">.</span><span class="n">path</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Now we can see that the application doesn't change any files; it only uses <code>htdocs</code> to read them.</p>
<p>Finally, expanding <code>Http.serve</code>:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">serve</span> <span class="n">socket</span> <span class="n">handle_request</span> <span class="o">=</span>
</span><span class="line">  <span class="k">while</span> <span class="bp">true</span> <span class="k">do</span>
</span><span class="line">    <span class="k">let</span> <span class="n">conn</span> <span class="o">=</span> <span class="nn">Net</span><span class="p">.</span><span class="n">accept</span> <span class="n">socket</span> <span class="k">in</span>
</span><span class="line">    <span class="n">handle_connection</span> <span class="n">conn</span> <span class="n">handle_request</span>
</span><span class="line">  <span class="k">done</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>We see that <code>handle_connection</code> has no way to share telemetry information between connections,
given that <code>handle_request</code> never stores anything.</p>
<p>We can tell these things after only looking at the code for a few seconds, even though dozens of libraries are being used.
In particular, we didn't have to read <code>handle_connection</code> or any of the HTTP parsing logic.</p>
<p>Now let's enable TLS. For this, we will require a configuration directory containing the server's key:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="o">~</span><span class="n">tls_config</span> <span class="n">net</span> <span class="n">htdocs</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">socket</span> <span class="o">=</span> <span class="nn">Net</span><span class="p">.</span><span class="n">listen</span> <span class="n">net</span> <span class="o">(`</span><span class="nc">Tcp</span> <span class="mi">8443</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">tls_socket</span> <span class="o">=</span> <span class="nn">Tls</span><span class="p">.</span><span class="n">wrap</span> <span class="o">~</span><span class="n">tls_config</span> <span class="n">socket</span> <span class="k">in</span>
</span><span class="line">  <span class="k">let</span> <span class="n">handler</span> <span class="o">=</span> <span class="n">static_files</span> <span class="n">htdocs</span> <span class="k">in</span>
</span><span class="line">  <span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">tls_socket</span> <span class="n">handler</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>OCaml syntax note: I used <code>~</code> to make <code>tls_config</code> a named argument; we wouldn't want to get this directory confused with <code>htdocs</code>!</p>
<p>We can see that only the TLS library gets access to the key.
The HTTP library interacts only with the TLS socket, which presumably does not reveal it.</p>
<p><span class="caption-wrapper center"><img src="/blog/images/lambda-caps/web3.svg" title="Updated graph showing TLS" class="caption"/><span class="caption-text">Updated graph showing TLS</span></span></p>
<p>Notice too how this fixes the problem we had with our original policy enforcement system.
There, an attacker could request <code>https://example.com/../tls_config/server.key</code> and the HTTP server might send the key.
But here, the handler cannot do that even if it wants to.
When <code>handler</code> loads a file, it does so via <code>htdocs</code>, which does not have access to <code>tls_config</code>.</p>
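<p>To make that concrete, here is a toy in-memory model of directory capabilities.
It is only a sketch: the <code>node</code> type and <code>load</code> function are invented for this illustration and are not how a real file-system (or Eio) is implemented.
A directory capability is just the sub-tree it refers to, so there is nothing for <code>..</code> to resolve to:</p>

```ocaml
(* A toy in-memory file-system, for illustration only. *)
type node =
  | File of string
  | Dir of (string * node) list

(* Resolve [path] (a list of components) starting from [node].
   There is no operation that walks to a parent, so ".." is just a
   name that doesn't exist in any directory. *)
let rec load node path =
  match node, path with
  | File data, [] -> Some data
  | Dir entries, name :: rest ->
    (match List.assoc_opt name entries with
     | Some child -> load child rest
     | None -> None)
  | _ -> None

let htdocs = Dir [ "index.html", File "Hello" ]

let home = Dir [
  "htdocs", htdocs;
  "tls_config", Dir [ "server.key", File "SECRET" ];
]

(* The handler forgets to sanitise the path, as in the example above. *)
let static_files dir request_path =
  load dir (String.split_on_char '/' request_path)

let () =
  assert (static_files htdocs "index.html" = Some "Hello");
  (* The attack fails: [htdocs] has no route to [tls_config]. *)
  assert (static_files htdocs "../tls_config/server.key" = None);
  (* Only a holder of [home] can reach the key. *)
  assert (static_files home "tls_config/server.key" = Some "SECRET")
```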
<p>The above server has pretty good security properties,
even though we didn't make any special effort to write secure code.
Security-conscious programmers will try to wrap powerful capabilities (like <code>net</code>)
with less powerful ones (like <code>socket</code>) as early as possible, making the code easier to understand.
A programmer uninterested in readability is likely to mix in more irrelevant code you have to skip through,
but even so it shouldn't take too long to track down where things like <code>net</code> and <code>htdocs</code> end up.
And even if they spread them throughout their entire application,
at least you avoid having to read all the libraries too!</p>
<p>By contrast, consider a more traditional (non-capability) style.
We start with:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">htdocs</span> <span class="o">=</span> <span class="o">...</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, <code>htdocs</code> would be a plain string rather than a reference to a directory,
and the network would be reached through a global.
We can't tell anything about what this server could do from looking at this one line,
and even if we expand it, we won't be able to tell what all the functions it calls do, either.
We will end up having to follow every function call recursively through all of the server's
dependencies, and our analysis will be out of date as soon as any of them changes.</p>
<h3 id="use-at-different-scales">Use at different scales</h3>
<p>We've seen that we can create an over-approximation of the reference graph by looking at just a small part of the code,
and then get a closer bound on the possible effects as needed
by expanding groups of values until we can prove the desired property.
For example, to prove that the application didn't modify <code>htdocs</code>, we followed <code>htdocs</code> by expanding <code>main</code> and then <code>static_files</code>.</p>
<p>Within a single process, a capability is a reference (pointer) to another value in the process's memory.
However, the diagrams also included arrows (capabilities) to things outside of the process, such as directories.
We can regard these as references to privileged proxy functions in the process that make calls to the OS kernel,
or (at a higher level of abstraction) we can consider them to be capabilities to the external resources themselves.</p>
<p>It is possible to build capability operating systems (in fact, this was the first use of capabilities).
Just as we needed to ban global variables to make a safe programming language,
we need to ban global namespaces to make a capability operating system.
For example, on FreeBSD this is done (on a per-process basis) by invoking the <a href="https://man.freebsd.org/cgi/man.cgi?query=cap_enter">cap_enter</a> system call.</p>
<p>We can zoom out even further, and consider a network of computers.
Here, an arrow between machines represents some kind of (unforgeable) network address or connection.
At the IP level, any process can connect to any address, but a capability system can be implemented on top.
<a href="http://www.erights.org/elib/distrib/captp/index.html">CapTP</a> (the Capability Transport Protocol) was an early system for this, but
<a href="https://capnproto.org/rpc.html">Cap'n Proto</a> (Capabilities and Protocols) is the modern way to do it.</p>
<p>So, thinking in terms of capabilities, we can zoom out to look at the security properties of the whole network,
yet still be able to expand groups as needed right down to the level of individual closures in a process.</p>
<h3 id="key-points">Key points</h3>
<ul>
<li>
<p>Library code can be imported and called without it getting access to any pre-existing state,
except that given to it explicitly. There is no &quot;ambient authority&quot; available to the library.</p>
</li>
<li>
<p>A function's side-effects are bounded by its arguments.
We can understand (get a bound on) the behaviour of a function call just by looking at it.</p>
</li>
<li>
<p>If <code>a</code> has access to <code>b</code> and to <code>c</code>, then <code>a</code> can introduce them (e.g. by performing the function call <code>b c</code>).
Note that there is no capability equivalent to making something &quot;world readable&quot;;
to perform an introduction,
you need access to both the resource being granted and to the recipient (&quot;only connectivity begets connectivity&quot;).</p>
</li>
<li>
<p>Instead of passing the <em>name</em> of a resource, we pass a capability reference (pointer) to it,
thereby proving that we have access to it and sharing that access (&quot;no designation without authority&quot;).</p>
</li>
<li>
<p>The caller of a function decides what it should access, and can provide restricted access by wrapping
another capability, or substituting something else entirely.</p>
<p>I am sometimes unable to install a messaging app on my phone because it requires me to grant it
access to my address book.
A capability system should never say &quot;This application requires access to the address book. Continue?&quot;;
it should say &quot;This application requires access to <em>an</em> address book; which would you like to use?&quot;.</p>
</li>
<li>
<p>A capability must behave the same way regardless of who uses it.
When we do <code>f x</code>, <code>f</code> can perform exactly the same operations on <code>x</code> that we can.</p>
<p>It is tempting to add a traditional policy language alongside capabilities for &quot;extra security&quot;,
saying e.g. &quot;<code>f</code> cannot write to <code>x</code>, even if it has a reference to it&quot;.
However, apart from being complicated and annoying,
this creates an incentive for <code>f</code> to smuggle <code>x</code> to another context with more powers.
This is the root cause of many real-world attacks, such as click-jacking or cross-site request forgery,
where a URL permits an attack if a victim visits it, but not if the attacker does.
One of the great benefits of capability systems is that you don't need to worry that someone is trying to trick you
into doing something that you can do but they can't,
because your ability to access the resource they give you comes entirely from them in the first place.</p>
</li>
</ul>
<p>All of the above follow naturally from using functions in the usual way, while avoiding global variables.</p>
<h2 id="practical-considerations">Practical considerations</h2>
<p>The above discussion argues that capabilities would have been a good way to build systems in an ideal world.
But given that most current operating systems and programming languages have not been designed this way,
how useful is this approach?
I'm currently working on <a href="https://github.com/ocaml-multicore/eio">Eio</a>, an IO library for OCaml, and using these principles to guide the design.
Here are a few thoughts about applying capabilities to a real system.</p>
<h3 id="plumbing-capabilities-everywhere">Plumbing capabilities everywhere</h3>
<p>A lot of people worry about cluttering up their code by having to pass things explicitly everywhere.
This is actually not much of a problem, for a couple of reasons:</p>
<ol>
<li>
<p>We already do this with most things anyway.
If your program uses a database, you probably establish a connection to it at the start and pass the connection around as needed.
You probably also pass around open file handles, configuration settings, HTTP connection pools, arrays, queues, ref-cells, etc.
Handling &quot;the file-system&quot; and &quot;the network&quot; the same way as everything else isn't a big deal.</p>
</li>
<li>
<p>You can often bundle up a capability with something else.
For example, a web-server will likely let the user decide which directory to serve,
so you're already passing around a pathname argument.
Passing a path capability instead is no extra work.</p>
</li>
</ol>
<p>Consider a request handler that takes the address of a Redis server:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="o">(</span><span class="n">handle_request</span> <span class="n">redis_url</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It might seem that by using capabilities we'd need to pass the network in here too:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="o">(</span><span class="n">handle_request</span> <span class="n">net</span> <span class="n">redis_url</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>This is both messy and unnecessary.
Instead, <code>handle_request</code> can take a function for connecting to Redis:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Http</span><span class="p">.</span><span class="n">serve</span> <span class="n">socket</span> <span class="o">(</span><span class="n">handle_request</span> <span class="n">redis</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Then there is only one argument to pass around again.
Instead of writing the connection logic in <code>handle_request</code>, we write the same logic outside and just pass in the function.
And now someone looking at the code can see &quot;the handler can connect to Redis&quot;,
rather than the less precise &quot;the handler accesses the network&quot;.
Of course, if Redis required more than one configuration setting then you'd probably already be doing it this way.</p>
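<p>Here is one way that bundling might look, as a self-contained sketch.
The <code>net</code> and <code>conn</code> types below are stand-ins I've made up so the example can run without a real network or Redis client; the point is only that <code>handle_request</code> receives a single <code>redis</code> function:</p>

```ocaml
(* Stand-ins for illustration only: a "network" is modelled as a function
   from an address to a connection, and a connection as a lookup function. *)
type conn = string -> string option
type net = { connect : string -> conn }

(* The handler only ever sees [redis], which connects to one service. *)
let handle_request (redis : unit -> conn) key =
  let conn = redis () in
  match conn key with
  | Some v -> "200 " ^ v
  | None -> "404"

(* Start-up code does the bundling: [net] and the URL go in, and a
   narrower capability comes out. *)
let make_handler net redis_url =
  let redis () = net.connect redis_url in
  handle_request redis

let () =
  let fake_net = {
    connect = (fun addr ->
      if addr = "redis://db:6379" then
        (fun key -> if key = "greeting" then Some "hello" else None)
      else failwith ("unreachable: " ^ addr))
  } in
  let handler = make_handler fake_net "redis://db:6379" in
  assert (handler "greeting" = "200 hello");
  assert (handler "missing" = "404")
```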
<p>The main problematic case is providing <em>defaults</em>.
For example, a TLS library might allow us to specify the location of the system's certificate store,
but it would like to provide a default (e.g. <code>/etc/ssl/certs/</code>).
This is particularly important if the default location varies by platform.
If the TLS library decides the location, then we must give it (read-only at least) access to the whole system!
We may just decide to trust the library, or we might separate out the default paths into a trusted package.</p>
<h3 id="levels-of-support">Levels of support</h3>
<p>Ideally, our programming language would provide a secure implementation of capabilities that we could depend on.
That would allow running untrusted code safely and protect us from compromised packages.
However, converting a non-capability language to a capability-secure one isn't easy,
and isn't likely to happen any time soon for OCaml
(but see <a href="https://www.hpl.hp.com/techreports/2006/HPL-2006-116.pdf">Emily</a> for an old proof-of-concept).</p>
<p>Even without that, though, capabilities help to protect non-malicious code from malicious inputs.
For example, the request handler above forgot to sanitise the URL path from the remote client,
but it still can't access anything outside of <code>htdocs</code>.</p>
<p>And even if we don't care about security at all, capabilities make it easy to see what a program does;
they make it easy to test programs by replacing OS resources with mocks;
and preventing access to globals helps to avoid race conditions,
since two functions that access the same resource must be explicitly introduced.</p>
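<p>As a small illustration of the testing point, here is a function that receives its clock and output sink as arguments.
The names <code>greet</code>, <code>now</code> and <code>out</code> are hypothetical, chosen just for this sketch:</p>

```ocaml
(* [greet] gets its output sink and clock as arguments; both names are
   illustrative, not part of any library. *)
let greet ~now ~out name =
  out (Printf.sprintf "[%d] Hello, %s!\n" (now ()) name)

(* Production wiring would pass real resources... *)
let _production () =
  greet ~now:(fun () -> int_of_float (Sys.time ())) ~out:print_string "world"

(* ...while a test passes mocks and inspects them afterwards. *)
let () =
  let buf = Buffer.create 16 in
  greet ~now:(fun () -> 42) ~out:(Buffer.add_string buf) "test";
  assert (Buffer.contents buf = "[42] Hello, test!\n")
```

<p>Nothing about <code>greet</code> had to change to make it testable; the test simply chose different arguments.</p>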
<h3 id="running-on-a-traditional-os">Running on a traditional OS</h3>
<p>A capability OS would let us run a program's <code>main</code> function and provide the capabilities it wanted directly,
but most systems don't work like that.
Instead, each program requires a small trusted entrypoint that has the full privileges of the process.
In Eio, an application will typically start something like this:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="nn">Eio_main</span><span class="p">.</span><span class="n">run</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">env</span> <span class="o">-&gt;</span>
</span><span class="line"><span class="k">let</span> <span class="n">net</span> <span class="o">=</span> <span class="nn">Eio</span><span class="p">.</span><span class="nn">Stdenv</span><span class="p">.</span><span class="n">net</span> <span class="n">env</span> <span class="k">in</span>
</span><span class="line"><span class="k">let</span> <span class="n">fs</span> <span class="o">=</span> <span class="nn">Eio</span><span class="p">.</span><span class="nn">Stdenv</span><span class="p">.</span><span class="n">fs</span> <span class="n">env</span> <span class="k">in</span>
</span><span class="line"><span class="nn">Eio</span><span class="p">.</span><span class="nn">Path</span><span class="p">.</span><span class="n">with_open_dir</span> <span class="o">(</span><span class="n">fs</span> <span class="o">/</span> <span class="s2">&quot;/srv/www&quot;</span><span class="o">)</span> <span class="o">@@</span> <span class="k">fun</span> <span class="n">htdocs</span> <span class="o">-&gt;</span>
</span><span class="line"><span class="n">main</span> <span class="n">net</span> <span class="n">htdocs</span>
</span></code></pre></td></tr></tbody></table></div></figure><p><code>Eio_main.run</code> starts the Eio event loop and then runs the callback.
The <code>env</code> argument gives full access to the process's environment.
Here, the callback extracts network and filesystem access from this,
gets access to just &quot;/srv/www&quot; from <code>fs</code>,
and then calls the <code>main</code> function as before.</p>
<p>Note that <code>Eio_main.run</code> itself is not a capability-safe function (it magics up <code>env</code> from nothing).
A capability-enforcing compiler would flag this bit up as needing to be audited manually.</p>
<h3 id="use-with-existing-security-mechanisms">Use with existing security mechanisms</h3>
<p>Maybe you're not convinced by all this capability stuff.
Traditional security systems are more widely available, better tested, and approved by your employer,
and you want to use those instead.
Still, to write the policy, you're going to need a list of resources the program might access.
Looking at the above code, we can immediately see that the policy need allow access only to the &quot;/srv/www&quot; directory,
and so we could call e.g. <a href="https://man.openbsd.org/unveil">unveil</a> here.
And if <code>main</code> later changes to use TLS,
the type-checker will let us know to update this code to provide the TLS configuration
and we'll know to update the policy at the same time.</p>
<p>If you want to drop privileges, such a program also makes it easy to see when it's safe to do that.
For example, looking at <code>main</code> we can see that <code>net</code> is never used after creating the socket,
so we don't need the <code>bind</code> system call after that,
and we never need <code>connect</code>.
We know, for instance, that this program isn't hiding an XML parser that needs to download schema files to validate documents.</p>
<h3 id="thread-local-storage">Thread-local storage</h3>
<p>In addition to global and local variables, systems often allow us to attach data to threads as a sort of middle ground.
This could allow unexpected interactions. For example:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span> <span class="k">in</span>
</span><span class="line"><span class="n">f</span> <span class="n">x</span><span class="o">;</span>
</span><span class="line"><span class="n">g</span> <span class="bp">()</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>Here, we'd expect that <code>g</code> doesn't have access to <code>x</code>, but <code>f</code> could pass it using thread-local storage.
To prevent that, Eio instead provides <a href="https://ocaml-multicore.github.io/eio/eio/Eio/Fiber/index.html#val-with_binding">Fiber.with_binding</a>,
which runs a function with a binding but then puts things back how they were before returning,
so <code>f</code> can't make changes that are still active when <code>g</code> runs.</p>
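<p>To make the scoping rule concrete, here is a toy, single-threaded model of such a binding.
It is only a sketch using a plain <code>ref</code> and <code>Fun.protect</code>, not Eio's implementation;
the names <code>create_key</code>, <code>get</code> and <code>with_binding</code> just mirror the Eio ones, and <code>secret</code> is made up for the example.</p>

```ocaml
(* Toy model only: one global slot per key, no fibers involved. *)
type 'a key = 'a option ref

let create_key () : 'a key = ref None
let get (k : 'a key) : 'a option = !k

(* Run [fn] with [k] bound to [v], then restore the previous value,
   even if [fn] raises. *)
let with_binding (k : 'a key) (v : 'a) (fn : unit -> 'b) : 'b =
  let old = !k in
  k := Some v;
  Fun.protect ~finally:(fun () -> k := old) fn

(* Hypothetical key that [f] tries to use to smuggle [x] to [g]. *)
let secret : int ref key = create_key ()

let f x =
  with_binding secret x (fun () ->
      (* The binding is visible here, inside the call... *)
      assert (Option.is_some (get secret)))

(* ...but it has been undone by the time [g] runs. *)
let g () = get secret

let () =
  let x = ref 0 in
  f x;
  assert (Option.is_none (g ()))
```

<p>Because the only way to set a binding is for the duration of a callback, <code>with_binding</code> cannot leave anything behind for later code to find.</p>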
<p>This also allows people who don't want capabilities to disable the whole system easily:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="k">let</span> <span class="n">everything</span> <span class="o">=</span> <span class="nn">Fiber</span><span class="p">.</span><span class="n">create_key</span> <span class="bp">()</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">f</span> <span class="bp">()</span> <span class="o">=</span>
</span><span class="line">  <span class="k">let</span> <span class="n">env</span> <span class="o">=</span> <span class="nn">Option</span><span class="p">.</span><span class="n">get</span> <span class="o">(</span><span class="nn">Fiber</span><span class="p">.</span><span class="n">get</span> <span class="n">everything</span><span class="o">)</span> <span class="k">in</span>
</span><span class="line">  <span class="o">...</span>
</span><span class="line">
</span><span class="line"><span class="k">let</span> <span class="n">main</span> <span class="n">env</span> <span class="o">=</span>
</span><span class="line">  <span class="nn">Fiber</span><span class="p">.</span><span class="n">with_binding</span> <span class="n">everything</span> <span class="n">env</span> <span class="n">f</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It looks like <code>f ()</code> doesn't have access to anything, but in fact it can recover <code>env</code> and get access to everything!
However, anyone trying to understand the code will start following <code>env</code> from the main entrypoint
and will then see that it got put in fiber-local storage.
They then at least know that they must read all the code to understand anything about what it can do.</p>
<p>More usefully, this mechanism allows us to make just a few things ambiently available.
For example, we don't want to have to plumb stderr through to a function every time we want to do some <code>printf</code> debugging,
so it makes sense to provide a tracing function this way (and Eio does this by default).
Tracing allows all components to write debug messages, but it doesn't let them read them.
Therefore, it doesn't provide a way for components to communicate with each other.</p>
<p>It might be tempting to use <code>Fiber.with_binding</code> to restrict access to part of a program
(e.g. giving an HTTP server network access this way),
but note that this is a non-capability way to do things,
and suffers the same problems as traditional security systems,
separating designation from authority.
In particular, supposedly sandboxed code in other parts of the application
can try to escape by tricking the HTTP server part into running a callback function for them.
But fiber-local storage is fine for things to which you don't care to restrict access.</p>
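<p>The callback problem can be sketched with a toy model.
This is not Eio code: <code>network</code> is a plain global slot standing in for a fiber-local binding,
and <code>http_server</code> and <code>plugin_hook</code> are hypothetical names invented for the illustration.</p>

```ocaml
(* Toy stand-in for a fiber-local binding holding "network access". *)
let network : string option ref = ref None

(* Bind the network for the duration of [fn], then restore the old value. *)
let with_network name fn =
  let old = !network in
  network := Some name;
  Fun.protect ~finally:(fun () -> network := old) fn

(* The trusted part: it runs with the network bound, and calls a hook
   that some other part of the application supplied. *)
let http_server ~on_request =
  with_network "real-network" (fun () -> on_request "/index.html")

(* The supposedly sandboxed part: nobody ever passed it the network. *)
let stolen = ref None
let plugin_hook _path = stolen := !network

let () =
  (* Run directly, the plugin sees no network. *)
  plugin_hook "/direct";
  assert (!stolen = None);
  (* But when the server runs the hook, it runs inside the binding. *)
  http_server ~on_request:plugin_hook;
  assert (!stolen = Some "real-network")
```

<p>With an explicit argument, the hook would only get the network if the server chose to pass it along.</p>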
<h3 id="symlinks">Symlinks</h3>
<p>Symlinks are a bit of a pain! If I have a capability reference to a directory, it's useful to know that I can only access things beneath that directory. But the directory may contain a symlink that points elsewhere.</p>
<p>One option would be to say that a symlink is a capability itself, but this means that you could only create symlinks to things you can access yourself, and this is quite a restriction. For example, you might be forbidden from extracting a tarball because <code>tar</code> didn't have permission to access the target of a symlink it wanted to create.</p>
<p>The other option is to say that symlinks are just strings, and it's up to the user to interpret them.
This is the approach FreeBSD uses. When you use a system call like <code>openat</code>,
you pass a capability to a base directory and a string path relative to that.
In the case of our web-server, we'd use a capability for <code>htdocs</code>, but use strings to reference things inside it, allowing the server to follow symlinks within that sub-tree, but not outside.</p>
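<p>Here is a small in-memory model of that idea, as a sketch only.
It is not how FreeBSD or Eio implement it; the <code>node</code> type, the <code>lookup</code> function and the example tree are all made up for illustration.
A symlink is stored as a plain string and is only interpreted while resolving a path relative to the base directory we were given.</p>

```ocaml
(* Toy in-memory file-system. *)
type node =
  | File of string
  | Dir of (string * node) list
  | Symlink of string (* just a string; interpreted when followed *)

(* [lookup root path] resolves [path] relative to [root] and refuses to
   leave [root]. [stack] lists the nodes entered so far, innermost first;
   [fuel] bounds the number of symlinks followed. *)
let lookup root path =
  let rec walk stack parts fuel =
    match stack, parts with
    | [], _ -> Error "empty stack"
    | node :: _, [] -> Ok node
    | _, ("" | ".") :: rest -> walk stack rest fuel
    | (File _ | Symlink _) :: _, _ :: _ -> Error "not a directory"
    | [ _ ], ".." :: _ -> Error "escapes base directory"
    | _ :: parents, ".." :: rest -> walk parents rest fuel
    | Dir entries :: _, name :: rest -> (
        match List.assoc_opt name entries with
        | None -> Error "not found"
        | Some (Symlink target) ->
            if fuel = 0 then Error "too many symlinks"
            else if String.starts_with ~prefix:"/" target then
              Error "escapes base directory"
            else walk stack (String.split_on_char '/' target @ rest) (fuel - 1)
        | Some node -> walk (node :: stack) rest fuel)
  in
  if String.starts_with ~prefix:"/" path then Error "escapes base directory"
  else walk [ root ] (String.split_on_char '/' path) 10

(* Hypothetical contents of htdocs. *)
let htdocs =
  Dir
    [ ("index.html", File "hello");
      ("uploads",
       Dir
         [ ("latest", Symlink "../index.html");
           ("evil", Symlink "../../etc/passwd") ]);
      ("abs", Symlink "/etc/passwd") ]

let () =
  (* A symlink that stays within the sub-tree can be followed. *)
  assert (lookup htdocs "uploads/latest" = Ok (File "hello"));
  (* Symlinks or paths that would leave the sub-tree are rejected. *)
  assert (lookup htdocs "uploads/evil" = Error "escapes base directory");
  assert (lookup htdocs "abs" = Error "escapes base directory");
  assert (lookup htdocs "../secret" = Error "escapes base directory")
```

<p>A real system has to do this check in the kernel (or very carefully in user space), since the tree can change while the path is being resolved.</p>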
<p>The main problem is that it makes the API a bit confusing. Consider:</p>
<figure class="code"><div class="highlight"><table><tbody><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="ocaml"><span class="line"><span class="n">save_to</span> <span class="o">(</span><span class="n">htdocs</span> <span class="o">/</span> <span class="s2">&quot;uploads&quot;</span><span class="o">)</span>
</span></code></pre></td></tr></tbody></table></div></figure><p>It might look like <code>save_to</code> is only getting access to the &quot;uploads&quot; directory,
but in Eio it actually gets access to the whole of <code>htdocs</code>.
If you want to restrict access, you have to do that explicitly
(as we did when creating <code>htdocs</code> from <code>fs</code>).</p>
<p>The advantage, however, is that we don't break software that relies on symlinks.
Also, restricting access is quite expensive on some systems (FreeBSD has the handy <code>O_BENEATH</code> open flag,
and Linux has <code>RESOLVE_BENEATH</code>, but not all systems provide this), so it might not be a good default.
I'm not completely satisfied with the current API, though.</p>
<h3 id="time-and-randomness">Time and randomness</h3>
<p>It is also possible to use capabilities to restrict access to time and randomness.
The security benefits here are less clear.
Tracking access to time can be useful in preventing side-channel attacks that depend on measuring time accurately,
but controlling access to randomness makes it difficult to e.g. randomise hash functions to
help prevent denial-of-service attacks.</p>
<p>However, controlling access to these does have the advantage of making code deterministic by default,
which is a great benefit, especially for expect-style testing.
Your top-level test function is called with no arguments, and therefore has no access to non-determinism,
instead creating deterministic mocks to use with the code under test.
You can then just record a good trace of a test's operations and check that it doesn't change.</p>
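<p>As a rough sketch of what that looks like, assume some code under test that receives its clock and random source as arguments.
Everything here (<code>make_token</code>, <code>mock_clock</code>, <code>mock_random</code>) is hypothetical and uses only the standard library:</p>

```ocaml
(* Hypothetical code under test: it can only observe time and randomness
   through the functions it is given. *)
let make_token ~now ~random_int () =
  Printf.sprintf "%d-%04d" (int_of_float (now ())) (random_int 10_000)

(* A mock clock that starts at [start] and advances by one second per call. *)
let mock_clock start =
  let t = ref start in
  fun () ->
    let v = !t in
    t := v +. 1.0;
    v

(* A mock random source that replays a fixed list of values
   (the requested bound is ignored). *)
let mock_random values =
  let remaining = ref values in
  fun _bound ->
    match !remaining with
    | [] -> failwith "mock_random: exhausted"
    | x :: rest ->
        remaining := rest;
        x

(* The test takes no arguments, so it has no source of non-determinism:
   it gets the same trace on every run. *)
let () =
  let now = mock_clock 1000.0 in
  let random_int = mock_random [ 7; 42 ] in
  assert (make_token ~now ~random_int () = "1000-0007");
  assert (make_token ~now ~random_int () = "1001-0042")
```

<p>In production, the entrypoint would pass in the real clock and random number generator instead.</p>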
<h3 id="power-boxes">Power boxes</h3>
<p>Interactive applications that load and save files present a small problem:
since the user might load or save anywhere, it seems they need access to the whole file-system.
The solution is a &quot;powerbox&quot;.
The powerbox has access to the file-system and the rest of the application only has access to the powerbox.
When the application wants to save a file, it asks the powerbox, which pops up a GUI asking the user to choose the location.
Then it opens the file and passes that back to the application.</p>
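<p>A minimal sketch of the pattern, with the file-system modelled as a hash table and the GUI as a callback.
The types and names (<code>file</code>, <code>powerbox</code>, <code>make_powerbox</code>, <code>app</code>) are invented for this example and don't correspond to any real API:</p>

```ocaml
(* What the application gets back: access to one file, nothing more. *)
type file = { write : string -> unit }

(* The only thing the application holds: a way to ask the user. *)
type powerbox = { save_dialog : unit -> file option }

(* The powerbox closes over the whole file-system [fs] and over [ask_user],
   which stands in for the GUI dialog (returning [None] on cancel). *)
let make_powerbox ~fs ~ask_user =
  {
    save_dialog =
      (fun () ->
        match ask_user () with
        | None -> None
        | Some path -> Some { write = (fun data -> Hashtbl.replace fs path data) });
  }

(* The application never sees [fs] or the chosen path. *)
let app powerbox doc =
  match powerbox.save_dialog () with
  | Some file ->
      file.write doc;
      true
  | None -> false

let () =
  let fs = Hashtbl.create 8 in
  let pb = make_powerbox ~fs ~ask_user:(fun () -> Some "/home/user/notes.txt") in
  assert (app pb "hello");
  assert (Hashtbl.find_opt fs "/home/user/notes.txt" = Some "hello")
```

<p>The user's choice in the dialog is what grants the authority; the application itself never needs more than the one file it is handed.</p>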
<h2 id="conclusions">Conclusions</h2>
<p>Currently-popular security mechanisms are complex and have many shortcomings.
Yet, the lambda calculus already contains an excellent security mechanism,
and making use of it requires little more than avoiding global variables.</p>
<p>This is known as &quot;capability-based security&quot;.
The word &quot;capabilities&quot; has also been used for several unrelated concepts (such as &quot;POSIX capabilities&quot;),
and for clarity much of the community rebranded a while back as &quot;Object Capabilities&quot;,
but this can make it seem irrelevant to functional programmers.
In fact, I wrote this blog post because several OCaml programmers have asked me what the point of capabilities is.
I was expecting it to be quite short (basically: applying functions to arguments good, global variables bad),
but it's got quite long; it seems there is a fair bit that follows from this simple idea!</p>
<p>Instead of seeing security as an extra layer that runs separately from the code and tries to guess what it meant to do,
capabilities fit naturally into the language.
The key difference with traditional security is that
the ability to do something depends on the reference used to do it, not on the identity of the caller.
This way of thinking about security works not only for controlling access to resources within a single program,
but also for controlling interactions between processes running on a machine, and between machines on a network.
We can group together resources and zoom out to see the overall picture, or expand groups to zoom in and get a closer
bound on the behaviour.</p>
<p>Even ignoring security, a key question is: what can a function do?
Should a function call be able to do anything at all that the process can do,
or should its behaviour be bounded in some way that is obvious just by looking at it?
If we say that you must read the source code of a function to see what it does, then this applies recursively:
we must also read all the functions that it calls, and so on.
To understand the <code>main</code> function, we end up having to read the code of every library it uses!</p>
<p>If you want to read more,
the <a href="http://habitatchronicles.com/2017/05/what-are-capabilities/">What Are Capabilities?</a> blog post provides a good overview;
Part II of <a href="https://papers.agoric.com/papers/robust-composition/abstract/">Robust Composition</a> contains a longer explanation;
<a href="https://srl.cs.jhu.edu/pubs/SRL2003-02.pdf">Capability Myths Demolished</a> does a good job of enumerating security properties provided by capabilities;
my own <a href="https://roscidus.com/blog/about/#the-serscis-access-modeller">SERSCIS Access Modeller</a> paper shows how to analyse systems
where some components have unknown behaviour; and, for historical interest, see
Dennis and Van Horn's 1966 <a href="https://dl.acm.org/doi/pdf/10.1145/365230.365252">Programming Semantics for Multiprogrammed Computations</a>, which introduced the idea.</p>
]]></content>
  </entry>
</feed>
