Vulkan graphics in OCaml vs C - Thomas Leonard's blog

I convert my Vulkan test program from C to OCaml and compare the results, then continue the Vulkan tutorial in OCaml, adding 3D, textures and depth buffering.

Table of Contents

Introduction
Running it yourself
The direct port
Refactored version
The 3D version
Garbage collection
Conclusions

(This post also appeared on Lobsters.)

Introduction

In Investigating Linux graphics, I wrote a little C program to help me learn about GPUs by drawing a triangle. But I wondered if using OCaml instead would make my life easier. It didn't, because there were no released OCaml Vulkan bindings, but I found some unfinished ones by Florian Angeletti. The bindings are generated mostly automatically from the Vulkan XML specification, and with a bit of effort I got them working well enough to continue with the Vulkan tutorial, which resulted in this nice Viking room:

Vulkan tutorial in OCaml

In this post, I'll be looking at how the C code compares to the OCaml. First, I did a direct line-by-line port of the C, then I refactored it to take better advantage of OCaml.

(Note: the Vulkan tutorial actually uses C++, but I'm comparing my C version to OCaml.)

Running it yourself

If you want to try it yourself (note: it requires Wayland):

git clone https://github.com/talex5/vulkan-test -b ocaml
cd vulkan-test
nix develop
dune exec -- ./src/main.exe 200

As the OCaml Vulkan bindings (Olivine) are unreleased, I included a copy of my patched version in vendor/olivine. The dune exec command will build them automatically.

The ocaml branch above just draws one triangle. If you want to see the 3D room pictured above, use ocaml-3d instead:

git clone https://github.com/talex5/vulkan-test -b ocaml-3d
cd vulkan-test
nix develop
make download-example
dune exec -- ./src/main.exe 10000 viking_room.obj viking_room.png

The direct port

Porting the code directly, line by line, was pretty straightforward:

Comparing the code with meld

The code ended up slightly shorter, but not by much:

 28 files changed, 1223 insertions(+), 1287 deletions(-)

This is only approximate; sometimes I added or removed blank lines, etc. Some things were a bit easier and others a bit harder. It mostly balanced out.

As an example, one thing that makes the OCaml shorter is that arrays are passed as a single item, whereas C takes the length separately. On the other hand, single-item arrays can be passed in C by just giving the address of the pointer, whereas OCaml requires an array to be constructed separately. Also, I had to include some bindings for the libdrm C library.

Labelled arguments

The OCaml bindings use labelled arguments (e.g. the VK_TRUE argument in the screenshot above became ~wait_all:true in the OCaml), which is longer but clearer.

The OCaml code uses functions to create C structures, which looks pretty similar due to labels. For example:

const VkSemaphoreGetFdInfoKHR get_fd_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
    .semaphore = semaphore,
    .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
};

becomes:

let get_fd_info = Vkt.Semaphore_get_fd_info_khr.make ()
    ~semaphore
    ~handle_type:Vkt.External_semaphore_handle_type_flags.sync_fd
in

An advantage is that the sType field gets filled in automatically.
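As a rough illustration of how a constructor function can fill in the tag field, here is a minimal sketch. The types and the make function below are hypothetical stand-ins written for this example; they are not Olivine's generated code, which builds real C structures.

```ocaml
(* Hypothetical stand-ins for the generated types; not Olivine's definitions. *)
type structure_type = Semaphore_get_fd_info_khr

type handle_type = Opaque_fd | Sync_fd

type semaphore_get_fd_info = {
  s_type : structure_type;   (* never supplied by the caller *)
  semaphore : int;           (* placeholder for a semaphore handle *)
  handle_type : handle_type;
}

(* The unit argument marks the end of the labelled arguments. *)
let make ~semaphore ~handle_type () =
  { s_type = Semaphore_get_fd_info_khr; semaphore; handle_type }

let semaphore = 42

(* [~semaphore] is shorthand for [~semaphore:semaphore]. *)
let get_fd_info = make () ~semaphore ~handle_type:Sync_fd
```

Because the caller never mentions s_type, it cannot be set to the wrong value or forgotten.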

Enums and bit-fields

Enumerations and bit-fields are namespaced, which is a lot clearer as you can see which part is the name of the enum and which part is the particular value. For example, VK_ATTACHMENT_STORE_OP_STORE becomes Vkt.Attachment_store_op.Store. Also, OCaml usually knows the expected type and you can omit the module, so:

VkAttachmentDescription colorAttachment = {
    .format = format,
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,  // Clear framebuffer before rendering
    .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
    .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout = VK_IMAGE_LAYOUT_GENERAL,
};

becomes

let color_attachment = Vkt.Attachment_description.make ()
    ~format:format
    ~samples:Vkt.Sample_count_flags.n1
    ~load_op:Clear	(* Clear framebuffer before rendering *)
    ~store_op:Store
    ~stencil_load_op:Dont_care
    ~stencil_store_op:Dont_care
    ~initial_layout:Undefined
    ~final_layout:General
in
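Omitting the module works because OCaml resolves a constructor from the type it expects at that position. A minimal sketch, using two hypothetical modules that both define Dont_care (these are not the real bindings):

```ocaml
(* Two hypothetical enum modules that share a constructor name. *)
module Load_op = struct
  type t = Load | Clear | Dont_care
end

module Store_op = struct
  type t = Store | Dont_care
end

type attachment = { load_op : Load_op.t; store_op : Store_op.t }

let make ~(load_op : Load_op.t) ~(store_op : Store_op.t) () =
  { load_op; store_op }

(* Neither module is opened here: each constructor is looked up in the
   type expected by the labelled argument it is passed to. *)
let color_attachment = make () ~load_op:Clear ~store_op:Dont_care
```

Writing ~load_op:Store here would be rejected, since Load_op.t has no Store constructor.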

Bit-fields and enums get their own types (they're not just integers), so you can't use them in the wrong place or try to combine things that aren't bit-fields (and so the _BIT suffix isn't needed). One particularly striking example of the difference is that

.colorWriteMask =
    VK_COLOR_COMPONENT_R_BIT |
    VK_COLOR_COMPONENT_G_BIT |
    VK_COLOR_COMPONENT_B_BIT |
    VK_COLOR_COMPONENT_A_BIT,

becomes

~color_write_mask:Vkt.Color_component_flags.(r + g + b + a)

The Vkt.Color_component_flags.(...) brings all the module's symbols into scope, including the + operator for combining the flags.
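To show how a flags module of this shape can be built, here is a minimal sketch. It assumes an integer representation hidden behind an abstract type; the mem and to_int functions are added for the example, and Olivine's generated modules may differ in detail.

```ocaml
(* Hypothetical flags module; the signature hides the int representation. *)
module Color_component_flags : sig
  type t                        (* abstract: not interchangeable with int *)
  val r : t
  val g : t
  val b : t
  val a : t
  val ( + ) : t -> t -> t       (* union of two flag sets *)
  val mem : t -> t -> bool      (* [mem x set]: are all bits of [x] in [set]? *)
  val to_int : t -> int
end = struct
  type t = int
  let r = 1
  let g = 2
  let b = 4
  let a = 8
  let ( + ) = ( lor )
  let mem x set = x land set = x
  let to_int t = t
end

(* Inside the local open, [+] is the module's operator, not integer addition. *)
let mask = Color_component_flags.(r + g + b + a)
let red_only = Color_component_flags.r
```

Outside the module, an expression such as mask + 1 does not type-check, which is what stops flags being mixed with plain integers or with other flag types.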

Optional fields

The specification says which fields are optional. In C you can ignore that, but OCaml enforces it. This can be annoying sometimes, e.g.

.blendEnable = VK_FALSE,

becomes

~blend_enable:false
~src_color_blend_factor:One
~dst_color_blend_factor:Zero
~color_blend_op:Add
~src_alpha_blend_factor:One
~dst_alpha_blend_factor:Zero
~alpha_blend_op:Add

because the spec says these are all non-optional, rather than that they are only needed when blending is enabled.

There's a similar situation with the Wayland code: the OCaml compiler requires you to provide a handler for all possible events. For example, OCaml forced me to write a handler for the window close event (and so closing the window works in the OCaml version, but not in the C one). Likewise, if the compositor returns an error from create_immed the OCaml version logs it, while the C version ignored the error message, because the C compiler didn't remind me about that.
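One way a compiler can enforce "a handler for every event" is to make the handlers fields of a record, since a record expression must define every field. The sketch below uses that approach with a hypothetical pair of events; it is not ocaml-wayland's actual API.

```ocaml
(* Hypothetical event set for a window; not ocaml-wayland's real interface. *)
type toplevel_handlers = {
  on_configure : width:int -> height:int -> unit;
  on_close : unit -> unit;
}

let closed = ref false
let size = ref (640, 480)

(* Leaving out [on_close] here is a compile-time error
   ("Some record fields are undefined: on_close"). *)
let handlers = {
  on_configure = (fun ~width ~height -> size := (width, height));
  on_close = (fun () -> closed := true);
}

(* Simulate the compositor sending both events. *)
let () =
  handlers.on_configure ~width:800 ~height:600;
  handlers.on_close ()
```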

Loading shaders

Loading the shaders was easier. The C version has code to load the shader bytecode from disk, but in the OCaml I used ppx_blob to include it at compile time, producing a self-contained executable file:

load_shader_module device [%blob "./vert.spv"]

Logging

OCaml has a somewhat standard logging library, so I was able to get the log messages shown as I wanted without having to pipe the output through awk. And, as a bonus, the log messages now get written in the correct order. For example, the C libwayland logs:

wl_display#1.delete_id(3)
...
wl_callback#3.done(59067)

which appears to show a callback firing some time after it was deleted, while ocaml-wayland logs:

<- wl_callback@3.done callback_data:1388855
<- wl_display@1.delete_id id:3

Error handling

The OCaml bindings return a result type for functions that can return errors, using polymorphic variants to say exactly which errors can be returned by each function. That's clever, but I found it pretty useless in practice and I followed the Olivine example code in immediately turning every Error result into an exception. You can then handle errors at a higher level (unlike the C, which just calls exit). Maybe Olivine should be changed to do that itself.

I thought I'd been rigorous about checking for errors in the C, but I missed some places (e.g. vkMapMemory). The OCaml compiler forced me to handle those too, of course.
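The pattern of turning every Error into an exception can be sketched with a small operator. The definition of <?> below, the Vulkan_error exception, the error set and fake_map_memory are all assumptions made for this example; the real code may define them differently.

```ocaml
(* Hypothetical error set; the real bindings give each function its own. *)
type error = [ `Out_of_host_memory | `Out_of_device_memory | `Device_lost ]

exception Vulkan_error of string * error

(* Unwrap an [Ok] value, or raise with the name of the failing call. *)
let ( <?> ) result name =
  match result with
  | Ok x -> x
  | Error e -> raise (Vulkan_error (name, e))

(* Stand-in for a binding that returns a result. *)
let fake_map_memory ~fail : (string, error) result =
  if fail then Error `Device_lost else Ok "mapped"

(* Success: the value is unwrapped. *)
let mapped = fake_map_memory ~fail:false <?> "map_memory"

(* Failure: handled once, at a higher level, instead of at every call. *)
let outcome =
  try fake_map_memory ~fail:true <?> "map_memory"
  with Vulkan_error (name, _) -> "failed in " ^ name
```

Forgetting the <?> leaves a result where a plain value is expected, so the mistake shows up as a type error rather than a silently ignored return code.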

Refactored version

One reason to switch to OCaml was that I was finding it hard to see how all the C code fit together. I felt that the overall structure was getting lost in the noise. While the initial OCaml version was similar to the C, I think the refactored version is quite a bit easier to read.

Moving code to separate files is much easier than in C. There, you typically need to write a header file too, and then include it from the other files. But in the OCaml I could just move e.g. export_semaphore to export in a new file called semaphore.ml and refer to it as Semaphore.export. Because each file gets its own namespace, you don't have to guess where functions are defined, and you don't get naming conflicts between symbols in different files. The build system (dune) automatically builds all modules in the correct order.

Olivine wrappers

I added a vulkan directory with wrappers around the auto-generated Vulkan functions with the aim of removing some noise. For example, the wrappers take OCaml lists and convert them to C arrays as needed, and raise exceptions on error instead of returning a result type.

Sometimes they do more, as in the case of queue_submit. That took separate wait_semaphores and wait_dst_stage_mask arrays, requiring them to be the same length. By taking a list of tuples, the wrapper avoids the possibility of this error. The old submit code:

let wait_semaphores = Vkt.Semaphore.array [t.image_available] in
let wait_stages = [Vkt.Pipeline_stage_flags.color_attachment_output] in
let submit_info = Vkt.Submit_info.make ()
    ~wait_semaphores
    ~wait_dst_stage_mask:(A.of_list Vkt.Pipeline_stage_flags.ctype wait_stages)
    ~command_buffers:(Vkt.Command_buffer.array [t.command_buffer])
    ~signal_semaphores:(Vkt.Semaphore.array [frame_state.render_finished])
in
Vkc.queue_submit t.graphics_queue ()
  ~submits:(Vkt.Submit_info.array [submit_info])
  ~fence:t.in_flight_fence <?> "queue_submit";

becomes:

Vulkan.Cmd.submit device t.command_buffer
  ~wait:[t.image_available, Vkt.Pipeline_stage_flags.color_attachment_output]
  ~signal_semaphores:[frame_state.render_finished]
  ~fence:t.in_flight_fence;
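The idea behind taking a list of tuples can be sketched as follows: the wrapper accepts pairs and derives the two parallel arrays itself, so they cannot differ in length. The types and the split_wait helper here are placeholders invented for the example, not the wrapper's real code.

```ocaml
(* Placeholder types standing in for Vulkan handles and stage flags. *)
type semaphore = Semaphore of int
type stage = Color_attachment_output | Top_of_pipe

(* Each semaphore arrives paired with its stage, so a length mismatch
   between the two arrays is unrepresentable in the input. *)
let split_wait (wait : (semaphore * stage) list) =
  let semaphores, stages = List.split wait in
  (Array.of_list semaphores, Array.of_list stages)

let wait_semaphores, wait_dst_stage_mask =
  split_wait [ (Semaphore 1, Color_attachment_output); (Semaphore 2, Top_of_pipe) ]
```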

Sometimes the new API drops features I don't use (or don't currently understand). For example, my new submit only lets you submit one command buffer at a time (though each buffer can have many commands).

I moved various generic helper functions like find_memory_type to the wrapper library, getting them out of the main application code.

Separating out these libraries made the code longer, but I think it makes it easier to read:

 20 files changed, 843 insertions(+), 663 deletions(-)

Using fibers / effects for control flow

The C code has a single thread with a single stack, using callbacks to redraw when the compositor is ready. OCaml has fibers (light-weight cooperative threads), so we can use a plain loop:

while t.frame < frame_limit do
  let next_frame_due = Window.frame window in
  draw_frame t;
  Promise.await next_frame_due;
  t.frame <- t.frame + 1
done

The Promise.await suspends this fiber, allowing e.g. the Wayland code to handle incoming events. I find that makes the logic easier to follow.
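To give a feel for how a fiber can suspend in the middle of a loop, here is a toy cooperative scheduler built directly on OCaml 5's Effect module. It is only a sketch: the Await_frame effect, fork and run_queue are invented for this example, and a real fiber library such as the one used in this post does far more (promises, I/O, cancellation).

```ocaml
(* Requires OCaml 5. A toy scheduler, not a real fiber library. *)
open Effect
open Effect.Deep

type _ Effect.t += Await_frame : unit Effect.t

let log = ref []
let run_queue : (unit -> unit) Queue.t = Queue.create ()

(* Run [f] as a fiber. When it performs [Await_frame], park its
   continuation on the run queue and return to the caller. *)
let fork f =
  match_with f ()
    { retc = (fun () -> ());
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Await_frame ->
          Some (fun (k : (a, unit) continuation) ->
            Queue.push (fun () -> continue k ()) run_queue)
        | _ -> None) }

(* A plain loop that suspends once per frame. *)
let render_fiber () =
  for frame = 1 to 3 do
    log := Printf.sprintf "draw %d" frame :: !log;
    perform Await_frame
  done

(* Another fiber that gets to run while the first is suspended. *)
let event_fiber () =
  for _ = 1 to 3 do
    log := "handle events" :: !log;
    perform Await_frame
  done

let () =
  fork render_fiber;
  fork event_fiber;
  while not (Queue.is_empty run_queue) do
    (Queue.pop run_queue) ()
  done
```

Each fiber keeps its own stack across the suspension, which is why the loop counter survives without being stored in a callback's state.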

Using the CPU and GPU in parallel

Next I split off the input handling from the huge render.ml file into input.ml.

The Vulkan tutorial creates one uniform buffer for the input data for each frame-buffer, but this seems wasteful. I think we only need at most two: one for the GPU to read, and one for the CPU to write for the next frame, if we want to do that in parallel.

To allow this parallel operation I also had to create a pair of command buffers. The duo.ml module holds the two (input, command-buffer) jobs and swaps them on submit.
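The swapping can be sketched with a two-slot structure. This is a hypothetical simplification: the real duo.ml holds uniform buffers and command buffers, and actual code must also make sure the GPU has finished with a job (for example by waiting on a fence) before the CPU writes to it again.

```ocaml
(* Hypothetical sketch of a pair of jobs that swap roles on submit. *)
type 'a duo = {
  mutable cpu : 'a;   (* job being prepared for the next frame *)
  mutable gpu : 'a;   (* job most recently handed to the GPU *)
}

let create a b = { cpu = a; gpu = b }

(* The job the CPU may currently write to. *)
let get t = t.cpu

(* Hand the prepared job to the GPU and take back the other one. *)
let submit t =
  let submitted = t.cpu in
  t.cpu <- t.gpu;
  t.gpu <- submitted;
  submitted

let duo = create "job A" "job B"
let first = submit duo     (* job A goes to the GPU *)
let next = get duo         (* the CPU now prepares job B *)
```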

Resizing and resource lifetimes

When the window size changes we need to destroy the old swap-chain and recreate all the images, views and framebuffers. My C code didn't bother, and just kept things at 640x480.

The main problem here is how to clean up the old resources. We could use the garbage collector, but the framebuffers are rather large and I'd like to get them freed promptly. Also, Vulkan requires things to be freed in the correct order, which the GC wouldn't ensure.

I added code to free resources by having each constructor take a sw switch argument. When the switch is turned off, all resources attached to it are freed. That makes it easy to scope things to the stack: when the Switch.run block ends, all resources it created are freed.
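The mechanism can be sketched with a toy switch that records cleanup hooks and runs them, newest first, when the block ends. This is an illustration only: the switch type, on_release, run and create_resource below are written for this example and are not the library's implementation.

```ocaml
(* Toy switch: a list of cleanup hooks, newest first. *)
type switch = { mutable on_release : (unit -> unit) list }

let on_release sw f = sw.on_release <- f :: sw.on_release

(* Run [f sw]; afterwards run the hooks in reverse order of registration,
   even if [f] raised an exception. *)
let run f =
  let sw = { on_release = [] } in
  Fun.protect (fun () -> f sw)
    ~finally:(fun () -> List.iter (fun hook -> hook ()) sw.on_release)

let freed = ref []

(* Hypothetical constructor: registers its own cleanup with the switch. *)
let create_resource ~sw name =
  on_release sw (fun () -> freed := name :: !freed);
  name

let () =
  run (fun sw ->
    let _swapchain = create_resource ~sw "swapchain" in
    let _view = create_resource ~sw "image_view" in
    let _framebuffer = create_resource ~sw "framebuffer" in
    ())

(* Order in which the resources were freed. *)
let free_order = List.rev !freed
```

Releasing in reverse order of creation means a framebuffer is destroyed before the image view it depends on, which addresses the ordering requirement without relying on the GC.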

But the life-cycle of the swap-chain is a little complicated. I don't want to clutter the main application loop with the logic of adapting to size changes. Again, OCaml's fibers system makes it easy to have multiple stacks so I have another fiber run:

let render_loop t duo =
  while true do
    let geometry = Window.geometry t.window in
    Switch.run @@ fun sw ->
    let framebuffers = create_swapchain ~sw t geometry in
    while geometry = Window.geometry t.window do
      let fb = Vulkan.Swap_chain.get_framebuffer framebuffers in
      let redraw_needed = next_as_promise t.redraw_needed in
      let job = Duo.get duo in
      record_commands t job fb;
      Duo.submit duo fb job.command_buffer;
      Window.attach t.window ~buffer:fb.wl_buffer;
      Promise.await redraw_needed
    done
  done

The C code created a fixed set of 4 framebuffers on each resize, but the OCaml only creates them as needed. When dragging the window to resize that means we may only need to create one at each size, and when keeping a steady size, it seems I only need 3 framebuffers with Sway.

The main loop changes slightly so that it just triggers the render_loop fiber:

while render.frame < frame_limit do
  let next_frame_due = Window.frame window in
  Render.trigger_redraw render;
  Promise.await next_frame_due;
  render.frame <- render.frame + 1
done

I'm not sure if freeing the framebuffers immediately is safe, since in theory the GPU might still be using them if the display server requests a new frame at a new size before the previous one has finished rendering on the GPU. Possibly freed OCaml resources should instead get added to a list of things to free on the C side the next time the GPU is idle.

The 3D version

Although it looks a lot more impressive, the 3D version isn't that much more work than the 2D triangle.

I used the Cairo library to load the PNG file with the textures and then added a Vulkan sampler for it. The shader code has to be modified to read the colour from the texture. The most complex bit is that the texture needs to be copied from Cairo's memory to host memory that's visible to the GPU, and from there to fast local memory on the GPU (see texture.ml).

Other changes needed:

There's a bit of matrix stuff to position the model and project it in 3D.
I added obj_format.ml to parse the model data.
The pipeline adds a depth buffer so near things obscure things behind them, regardless of the drawing order.
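For a sense of what parsing the model data involves, here is a minimal sketch of a parser for a few kinds of Wavefront OBJ line. It is hypothetical and deliberately incomplete: the real obj_format.ml may be structured quite differently, and this version assumes every face index has a texture coordinate (v/vt).

```ocaml
(* Hypothetical minimal OBJ line parser. *)
type line =
  | Vertex of float * float * float   (* v x y z *)
  | Tex_coord of float * float        (* vt u v *)
  | Face of (int * int) list          (* f v/vt ... with 1-based indices *)
  | Other                             (* comments, normals, etc. are ignored *)

(* Parse "v/vt" or "v/vt/vn", keeping the vertex and texture indices. *)
let parse_index s =
  match String.split_on_char '/' s with
  | v :: vt :: _ -> (int_of_string v, int_of_string vt)
  | _ -> failwith ("unsupported face index: " ^ s)

let parse_line s =
  let words =
    String.split_on_char ' ' (String.trim s)
    |> List.filter (fun w -> w <> "")
  in
  match words with
  | [ "v"; x; y; z ] ->
    Vertex (float_of_string x, float_of_string y, float_of_string z)
  | [ "vt"; u; v ] -> Tex_coord (float_of_string u, float_of_string v)
  | "f" :: indices -> Face (List.map parse_index indices)
  | _ -> Other
```

Since OBJ indices start at 1, a loader built on this would subtract one before indexing into its vertex arrays.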

I didn't get my C version to do the 3D bits, but for comparison here's the Vulkan tutorial's official C++ version.

Garbage collection

To render smoothly at 60Hz, we have about 16ms for each frame. You might wonder if using a garbage collector would introduce pauses and cause us to miss frames, but this doesn't seem to be a problem.

In C, you can improve performance for frame-based applications by using a bump allocator:

Create a fixed buffer with enough space for every allocation needed for one frame.
Allocate memory just by allocating sequentially in the region (bumping the next-free-address pointer).
At the end of each frame, reset the pointer.

This makes allocation really fast and freeing things at the end costs nothing. Implementing this in C requires special code, but OCaml works this way by default, allocating new values sequentially onto the minor heap. At the end of each frame, we can call Gc.minor to reset the heap.
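A minimal sketch of the end-of-frame collection, using only the standard Gc module. The draw_frame function here is a stand-in that just allocates some short-lived values; it is not the renderer's code.

```ocaml
let frames = 5

(* Stand-in for per-frame work: allocates short-lived values on the minor heap. *)
let draw_frame n =
  let vertices = List.init 1000 (fun i -> float_of_int (i * n)) in
  List.fold_left ( +. ) 0.0 vertices

(* Number of minor collections that happened while rendering. *)
let collections =
  let before = (Gc.quick_stat ()).Gc.minor_collections in
  for frame = 1 to frames do
    ignore (draw_frame frame);
    (* End of frame: nothing allocated by this frame is still referenced. *)
    Gc.minor ()
  done;
  (Gc.quick_stat ()).Gc.minor_collections - before

let () =
  Printf.printf "minor collections during %d frames: %d\n" frames collections
```

The count should be at least one per frame; it can be higher if a frame allocates enough to fill the minor heap before the explicit call.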

Gc.minor scans the stack looking for pointers to values that are still in use and copies any it finds to the major heap. However, since we're at the end of the frame, the stack is pretty much empty and there's almost nothing to scan. I captured a trace of running the 3D room version with a forced minor GC at the end of every frame:

make && eio-trace run ./_build/default/src/main.exe

Tracing the full 3D version

The four long grey horizontal bars are the main fibers. From top to bottom they are:

The main application loop (incrementing the frame counter and triggering the render loop fiber).
An ocaml-wayland fiber, receiving messages from the display server (and spawning some short-lived sending fibers).
The render_loop fiber (sending graphics commands to the GPU).
A fiber used internally by the IO system.

The green sections show when each fiber is running and the yellow background indicates when the process is sleeping. The thin red columns indicate time spent in GC (which we're here triggering after every frame).

If I remove the forced Gc.minor after each frame then the GC happens less often, but can take a bit longer when it does. Still not nearly long enough to miss the deadline for rendering the frame though.

Collection of the major heap is done incrementally in small slices and doesn't cause any trouble.

So, we're only using a tiny fraction of the available time. Also, I suspect the CPU is running in a slow power-saving mode due to all the sleeping; if we had more work to do then it would probably speed up.

Conclusions

Doing Vulkan programming in OCaml has advantages (clearer code, easier refactoring), but also disadvantages (unfinished and unreleased Vulkan bindings, some friction using a C API from OCaml, and I had to write more support code, such as some bindings for libdrm).

As a C API, Vulkan is not safe and will happily segfault if passed incorrect arguments. The OCaml bindings do not fix this, and so care is still needed. I didn't bother about that because it wasn't a problem in practice, and properly protecting against use-after-free will probably require some changes to OCaml (e.g. unmapping memory isn't safe without something like the "modes" being prototyped in OxCaml).

I'm slowly upstreaming my changes to Olivine; hopefully this will all be easier to use one day!