OCaml: the bugs so far - Thomas Leonard's blog

OCaml's static typing allows it to detect many problems at compile-time. Still, some bugs slip though. In this post, I go over each discovered bug that made it into a Git commit and try to work out why it happened and whether it could have been prevented.

Note: As this post is entirely about bugs, it may appear rather negative. So let me say first that, overall, I've been very impressed with the reliability of the OCaml code: I'd have expected to find more bugs than this in 27,806 lines of new code!

Table of Contents

Methodology
Core OCaml issues
Lwt-related bugs
Curl-related bugs
- Failing to reset Curl connections
- Error with cancelled downloads
GTK bugs
- Sorted treeview iter mix-up
- Crashes with GtkIconView
Logic errors
Python bugs
Summary

( This post is part of a series in which I am converting 0install from Python to OCaml, learning OCaml as I go. The code is at GitHub/0install. )

Methodology

I've gone back through the Git commit log and selected all the ones that say they fix a bug in the comment, starting from when I merged the initial OCaml code to master (on 2013-07-03). It's possible that I sneakily fixed some bugs while making other changes, but this should cover most of them. Any bug that made it into a released version of 0install should certainly have its own commit because it would have to go on the release branch. I also included a few "near misses" (bugs I spotted before committing, but which I could plausibly have missed).

In a number of cases, I wrote and committed the new OCaml code first, and then ported the Python unit-tests in a later commit and discovered the bug that way (so some of these could never have made it into an actual release). Compile-time bugs have been ignored (e.g. code that didn't compile on older versions of OCaml); I'm only interested in run-time errors here.

I've classified each bug as follows:

Inexperience: This bug was caused by my inexperience with OCaml. Making proper use of OCaml's features would avoid this class of bug entirely.
Third-party: Caused by a bug in a library I was using (and hopefully now fixed). Could only have been discovered by testing.
Poor API: This bug was my fault, but could have been avoided if the library had a better API.
Warning needed: My fault, but the compiler could have detected the problem and issued a warning.
Testing only: I only know how to find such bugs through testing. Similar bugs could happen again.

Core OCaml issues

Note: I'm grouping the bugs by the library the code was interacting with, regardless of whether that library was at fault. This section is for bugs that occurred when just using OCaml itself and its standard library.

Out-of-range integers

Everything seemed to be working nicely on my Arch system, but the first run on a Debian VM gave this unhelpful error:

Failure "int_of_string"

I was trying to store a Unix timestamp in an int. Unlike Python, OCaml's integers are limited precision, having 1 bit less than the machine word size. The 32-bit VM only had 31-bit integers and the time value was out of range for it.

This was entirely my fault. OCaml always uses floats to represent times, and they work fine. I was just converting to ints to be consistent with the Python code.

However, the error message is very poor. I replaced all calls to int_of_string with my own version, which at least displays the number it was trying to convert. This should make debugging any similar problems easier in future.

Type: Inexperience

Fails to start on Windows

Windows kept complaining that my program didn't exist and to check that the path was correct, even when I was double-clicking on the executable in Explorer! Turns out, Windows refuses to run binaries with certain character sequences in their names ("instal" being one such sequence). See Vista doesn't start application called "install" w/o being elevated.

Solution: you have to embed an XML "manifest" in Windows binaries to avoid this behaviour. Would be nice if OCaml did that automatically for you.

Type: Third-party

Spawning a daemon fails on Windows

Windows doesn't have fork, so the usual double-fork trick doesn't work. Solution: Use create_process on Windows.

Would be nice if OCaml grouped all the POSIX-only functions together and made you check which platform you were on. Then you'd know when you were using platform-specific functions. e.g.

match Sys.platform_specific_ops with
| POSIX ops -> ops#fork ...
| Windows ops -> ops#spawn ...

Type: Poor API

0install select ignores `--refresh`

I forget to handle the Refresh case for the "select" command. Different commands need to handle different subsets of the options. I was using a plain variant (enum) type and throwing an exception if I got an option I wasn't expecting:

Support.Argparse.iter_options extra_options (function
  | ShowXML -> select_opts.xml <- true
  | _ -> raise_safe "Unknown option"
)

(Note: Several people have asked why I used a default match case here. It's needed because there are many options that don't apply to the "select" command. The option parser makes sure that each sub-command's handler function is only called with options it claims to support.)

Solution: I switched from plain variants to polymorphic variants and removed the default case. Now, the type-checker verifies at compile-time that each subcommand handles all its options:

Support.Argparse.iter_options extra_options (function
  | `ShowXML -> select_opts.xml <- true
  | `Refresh -> select_opts.refresh <- true
)

See Option Handling With OCaml Polymorphic Variants for a write-up of that.

Type: Inexperience

Not found errors

When printing diagnostics about a failed solve, we check each interface to see if it has a replacement that it conflicts with. e.g. the new "Java" interface replaces (and conflicts with) the old "Java 6" interface. But if the conflicting interface wasn't used in the solve, we'd crash with:

Exception: Not_found

I use a lot of maps with strings as the keys. I therefore created a StringMap module in my common.ml file like this:

module StringMap = Map.Make(String)

StringMap.find raises Not_found if the key isn't found, which is never what you want. These exceptions are awkward to deal with and it's easy to forget to handle them.

A nice solution is to replace the definition with:

module StringMap = struct
  include Map.Make(String)

  let find key map =
    try Some (find key map)
    with Not_found -> None
end

This redefines the find method to return an option type. Now you can't do a StringMap.find without the compiler forcing you to consider the case of the key not being there.

Would be nice if the OCaml standard library did this. Perhaps providing a Map.get function with the new behaviour and deprecating Map.find?

Type: Poor API

Octal value

I used 0700 instead of 0o700 to set a file mode. Would be nice if OCaml warned about decimals that start with 0, as Python 3 does.

Type: Warning needed

Parsing a path as a URL

This didn't actually get committed, but it's interesting anyway. Downloaded feeds are signed with GPG keys, which are trusted only for their particular domains. At one point, I used Trust.domain_from_url feed.url to get the domain. It was defined as:

let domain_from_url url = ...

However, feeds come in different types: there are remote feeds with URLs, and local feeds with local paths (there are also virtual "distribution" feeds representing the output from the distribution's package manager).

I was trying to get the trust domain for all feeds, not just remote ones where it makes sense.

Once again, the solution was to use polymorphic variants. The three different types of feed get three different constructors. A method (such as domain_from_url) that only works on remote feeds is declared as:

let domain_from_url (`remote_feed url) = ...

Then, it's impossible to call it without first ensuring you've got a remote feed URL.

This change also improves the type-safety of many other parts of the code (e.g. you can't try to download a local feed now either), and uncovered another bug: you couldn't use the GUI to set the stability rating for a distribution-provided implementation, because one of the functions used only worked for non-distribution feeds.

Type: Inexperience (x2)

Interrupted waitpid

The Unix.waitpid function can raise EINTR if the system call is interrupted, although the documentation doesn't mention this. It would be nice if OCaml would automatically restart the call in this case (as Python's subprocess module does).

Type: Poor API

HTTP redirects with data cause corrupted downloads

We download to a temporary file. If we get an HTTP redirect, we truncate the file and try the new address. However, ftruncate doesn't reset the position in the file. So, if the redirecting site sent any data in its reply, you'd get that many zeroes at the start of the download. As with waitpid, OCaml's behaviour is standard POSIX, but not mentioned in the OCaml documentation.

Solution: seek_out ch 0.

Also, I updated the test server used in the unit-tests to send back some data when doing a redirect.

Type: Testing only

Setting wrong mtime

Unix.utimes is supposed to set the mtime and atime of a file to the given values. However, the behaviour is a little odd:

When 1 <= time < infinity, it sets it to the requested time.
When 0 <= time < 1 however, it sets it to the current time instead.

That's a problem for us, because we often use "0" as the time for files which don't have a timestamp, and the time is part of the secure hashes we calculate.

Solution: I wrote a C function to allow setting the time to whatever value you like.

This bug didn't make it into a commit because I hit it while writing a test script (I was trying to reset a timestamp file to time zero), and the unit-tests would have caught it if not, but it's still a poor API. Not only does it fail to use a variant type to handle different cases, but it chooses a magic value that's a valid input!

Or, rather than using a variant type for these two cases, it could just drop the magic current time feature completely - it's easy enough to read the current time and pass it explicitly if you need it. That would make the code clearer too.

(note: the documentation does say "A time of 0.0 is interpreted as the current time", but it's easy to forget this if it wasn't relevant the first time you read the docs)

Type: Poor API

Strict sequences

This didn't make it into a commit, but it's interesting anyway. Simplified version of the problem:

let trace fn =
  print_endline "START";
  fn ();
  print_endline "END"

let () =
  trace (fun () -> Printf.printf "Hello %s!\n")

This prints:

START
END

Why is there no warning? You might expect OCaml would infer the type of fn as unit -> unit and then complain that the function we pass has the wrong type (unit -> string -> unit).

In fact, although OCaml warns if you ignore the result of a function that doesn't return unit, it's not actually an error. So it actually infers the type of fn as (unit -> 'a), and it compiles fine.

Solution: always compile with -strict-sequence (or put true: strict_sequence in your _tags file)

Type: Inexperience

Lwt process fails with empty string

When spawning a process, the Lwt docs say you can pass an empty string as the binary to run and it will search for the first argument in $PATH for you. However, that behaviour was added only in Lwt 2.4 and using the older version in Debian it failed at runtime with a confusing error.

Probably I should have been reading the old version of the docs (which the web-site helpfully lets you do).

I'm classifying this as a poor API because it was caused by using "" as a magic value, rather than defining a new constructor function.

Type: Poor API

EPIPE on Windows

On Windows, we couldn't read the output of GnuPG. This was due to a bug in Lwt, which they quickly fixed:

Lwt_io.read fails on Windows with EPIPE

Type: Third-party

Deleting temporary files

We download various files to a temporary directory. In some cases, they weren't being deleted afterwards.

Solution: the downloader now takes a mandatory Lwt switch and deletes the file when the switch is turned off. Callers just have to wrap the download call in a try ... finally block, like this:

let switch = Lwt_switch.create () in
try_lwt
  match_lwt downloader#download ~switch key_url with
  | `network_failure msg -> raise_safe "%s" msg
  | `aborted_by_user -> raise Aborted
  | `tmpfile tmpfile -> U.read_file system tmpfile |> G.import_key system
finally
  Lwt_switch.turn_off switch

To make this completely foolproof, you'd need something like the linear types from Rust or ATS, but this is good enough for me.

Type: Inexperience

Race shutting down test HTTP server

Some of the unit tests run a simple HTTP server. When the test is over, they use Lwt.cancel to kill it. However, it appears that this call is unreliable: depending on exactly what the server is doing at the time it might just ignore it and continue.

Solution: we both cancel the task and set a boolean flag, which we test just before calling accept. If we're in accept at the time of the cancel, the thread will abort correctly. If it's anywhere else, it may continue handling the current request, but will quit as soon as it finishes and checks the flag.

Would perhaps be nice if Lwt remembered that an attempt was made to cancel the thread during a non-cancellable operation, and killed it at the next opportunity.

Type: Poor API

A related race occurred if we spawned a child process while handling an HTTP request, because the child would inherit the client socket and it would never get closed.

Solution: Use Lwt_unix.set_close_on_exec connection as soon as the connection is accepted.

Note that both these hacks should be race-free, because Lwt is cooperatively multi-threaded (e.g. we can't spawn a subprocess between accepting a connection and marking it close_on_exec). I think.

Ideally, when spawning a child process you'd specify the file descriptors you wanted it to inherit explicitly (Go does this, but really it needs to be at the POSIX level).

Type: Testing only

(although these bugs only occurred in the unit-tests, I'm including them because they could just as easily appear in the main code)

Reactive event handler gets garbage collected

The OCaml D-BUS bindings use functional reactive programming to report property values. The idea is that you get an object representing the continuously-varying value of the property, rather than a particular sample of it. Then you can handle the signal as a whole (for example, you can get the "progress" signal from a PackageKit transaction and pass it to a GUI progress monitor widget, so that the widget always shows the current progress). You can build up chains of signal processors. For example, you might transform a "bytes downloaded" signal into a "percentage complete" one.

The technique seems to come from Haskell. Being purely functional, it's always safe to garbage collect a signal if no-one is holding a reference to it.

However, OCaml is not purely functional. You might easily want to evaluate a signal handler for its side-effects. I created a handler to monitor the transaction status signal to see when it was finished, and attached the resulting signal to a Lwt_switch. My plan was that the switch would keep it alive until it fired. That didn't work, because there was a subtle circular reference in the whole scheme, and OCaml would sometimes garbage-collect the handler and the switch. Then the process would ignore the finished event and appear to hang. I asked on StackOverflow and got some suggestions:

How to stop OCaml garbage collecting my reactive event handler?

The solution seems to be to keep references to all active signals in a global variable. Rather messy.

Type: Testing only

Stuck reading output from subprocess

When using the Lwt_glib module to integrate with the GTK mainloop, the HUP response from poll is ignored. This means that it will call poll endlessly in a tight loop. Patch

Type: Third-party

Downloads never complete

When using Lwt_glib, downloads may never complete. This is because OCaml, like Python, has a global lock and Lwt_glib fails to release it when calling poll. Therefore, no other thread (such as the download thread) can make progress while the main thread is waiting (e.g. for the download thread to finish). Patch

Type: Third-party

Event loop doesn't work on OS X

Lwt_glib passes -1000 to poll to mean "no timeout". This works on Linux, but not on BSD-type systems. Patch

Type: Third-party

Failing to reset Curl connections

For efficiency, Curl encourages the reuse of connections. However, I forgot to reset some parameters (max file size and expected modification time). If the next call didn't use them, it would reuse the old values and possibly fail.

Newer versions of ocurl have a reset function, which avoids these problems.

Type: Poor API

Error with cancelled downloads

Downloads were sometimes failing with this confusing error:

easy handled already used in multi handle

It happened when reusing connections (which Curl encourages, for efficiency). There was no direct way to cancel a download, so I handled cancellation by closing the channel the download was writing to. Then, next time some data arrived, my write callback would fail to write the new data and throw an exception, aborting the download. It turned out that this was leaving the connection in an invalid state.

Solution: return 0 from the handler instead of throwing an exception.

Ideally, ocurl should catch exceptions from callbacks and allow the C code to clean up properly. Now fixed.

Type: Third-party

GTK bugs

Sorted treeview iter mix-up

(I caught this before committing it, but it's a nasty bug that could easily be missed. It was present for a while in the original Python version.)

The cache explorer dialog allows you to delete implementations from the cache by selecting an item and pressing the Delete button. It also allows you to sort the table by clicking on the column headings. However, if you sort the table and then delete something, it deletes the wrong thing!

To make a sortable table (which is just a special case of a tree to GTK), you first create an underlying (unsorted) list model, then wrap it with a sorted model, then pass that to the display widget (GtkTreeView), like so:

let model = GTree.list_store cols
let sorted_model = GTree.model_sort model
let view = GTree.view ~model:sorted_model ()

To do things with the model, you pass it a GtkTreeIter, which says which item you want to act on, e.g.

model#remove iter

The trouble is, sorted and unsorted GtkTreeIters both have the same type, so you can easily pass an iterator of the sorted model as an argument to the unsorted model. Then it will act on the wrong item. If the view isn't sorted then everything works fine, so you might not notice the problem while testing.

Solution: I created a new module for unsorted lists. The implementation (unsorted_list.ml) just proxies calls to the real code:

unsorted_list.ml

type t = GTree.list_store
type iter = Gtk.tree_iter

let list_store cols = GTree.list_store cols
let clear (model:t) = model#clear ()
let model_sort model = GTree.model_sort model
let get_iter_first (model:t) = model#get_iter_first
let set (model:t) ~(row:iter) ~column value = model#set ~row ~column value
let get (model:t) ~(row:iter) ~column = model#get ~row ~column
let remove (model:t) (row:iter) = model#remove row
let convert_iter_to_child_iter (model:GTree.model_sort) (iter:Gtk.tree_iter) = model#convert_iter_to_child_iter iter
let append (model:t) = model#append ()
let iter_next (model:t) (row:iter) = model#iter_next row

However, the interface (unsorted_list.mli) makes the types t (the model) and iter (its GtkTreeIters) abstract, so that code outside of the module isn't allowed to know their real types:

unsorted_list.mli

type t
type iter
val list_store : GTree.column_list -> t
val clear : t -> unit
val model_sort : t -> GTree.model_sort
val get_iter_first : t -> iter option
val set : t -> row:iter -> column:'a GTree.column -> 'a -> unit
val get : t -> row:iter -> column:'a GTree.column -> 'a
val remove : t -> iter -> bool
val convert_iter_to_child_iter : GTree.model_sort -> Gtk.tree_iter -> iter
val append : t -> iter
val iter_next : t -> iter -> bool

Now it's impossible to mix up sorted and unsorted types:

let model = Unsorted_list.list_store cols     (* Unsorted_list.t *)
let sorted_model = Unsorted_list.model_sort model (* GTree.model *)
let view = GTree.view ~model:sorted_model ()

It's still possible to mix up iterators in some cases (e.g. between two different instances of a sorted model), but that's a much less likely mistake to make.

Another way to solve the problem would be to bundle the owning model with each iterator, but that would be a big change to how the underlying GTK library works. And ATS could solve this easily using its dependant types, by declaring the iterator type as iter(m) ("iterator of model at address m"), linking models to iterators in the type system.

Type: Poor API

Crashes with GtkIconView

As with GtkTreeView, you can make a sorted GtkIconView with a pair of models. For some reason, clearing the underlying model didn't clear the sorted version, and repopulating it corrupted memory:

Program received signal SIGSEGV, Segmentation fault.

Solution: since there is no UI to let the user change the sort column, I just removed the sorted model and sorted the underlying model myself in OCaml. I guess this is probably a GTK bug.

Type: Third-party

A second crashing bug with GtkIconView is caused by a bug in lablgtk. The C wrapper get_path_at_pos returns a tree_path option (None if you clicked in a blank area), but the OCaml declaration says it returns a plain tree_path.

Solution: use Obj.magic to do an unchecked cast to the correct type:

let path_unsafe = view#get_path_at_pos x y in
let path : Gtk.tree_path option = Obj.magic path_unsafe in

(reported as segfault due to GtkIconView type mismatch)

Two interesting things about this bug:

Even the low-level GTK bindings are presumably not generated automatically. If they were, this kind of mismatch surely couldn't happen.
An OCaml optional pointer doesn't have the same representation as a non-optional pointer. If it did, the code wouldn't crash. This suggests OCaml is being a bit inefficient about option types, which is disappointing.

Type: Third-party

Logic errors

The remaining bugs aren't very interesting, but I've included them for completeness:

Failed to handle ambiguous short options (e.g. -r)
Missing tab completion for 0install add
0install run ignores --gui option
Infinite loop handling recursive dependencies
Allow absolute local paths in local implementations
Default command for --source should be compile, not run
Allow machine:null in JSON response
Typo: scripts start #! not !#
Report an error if we need to confirm keys during a headless background update
Wrong attribute name in overrides XML file
Allow executing selections with no command but an absolute main
Handle local-path on <group> elements
Wrong attribute name when saving user overrides
Typo: "https://" not "https://'"
Don't try to use GUI in --dry-run mode
Support old version of selections XML format
Support old selections without a from-feed attribute
Don't send an empty GetDetails request to PackageKit
Cope with dpkg-query returning a non-zero exit status
Detect Python even if python-gobject is missing
Race when cancelling downloads

Type: Testing only (x21)

Python bugs

Just for interest, here are the Python bugs discovered over the same period (it doesn't make sense to compare bug counts, because these are bugs in mature code, often several years old, not just-written new code).

I think these would be impossible or unlikely in OCaml (the problem would be detected at compile time):

Error setting machine type for Cygwin packages (type error)
When loading selections from a file, convert last-check-mtime attribute to an int (type error)
UnicodeError extracting or generating a manifest for archives with non-ASCII file names (no encoding of file names in OCaml :-)
Crash when specifying bold text (PyGTK API change)
Broken clipboard handling in cache explorer (PyGTK API change)
Broken filtering in cache explorer (PyGTK API change)
Broken cache explorer menu when using a filter (model mix up; could avoid with abstract types, as above)
Fails running an app when the master feed is no longer cached (would be forced to handle None case in OCaml)

These would likely still be bugs in OCaml:

Fix mtime check in selections.get_unavailable_selections for native packages
Always return False for native packages in needs_download if include_packages is False
Always use the prefix "xml" for the XML namespace
Don't abort solving just because a local feed is missing
Escape tooltips in cache explorer
32-bit size limit in cache explorer

This was a third-party bug:

Workaround for PyGTK garbage collection bug

Summary

Despite the newness of the code, the bug-rate has been surprisingly low so far. Of the (detected) bugs that did make it past the compiler, about a sixth were due to bugs in third-party libraries, another sixth could have been avoided with better third-party APIs and a sixth we due to my inexperience with OCaml. For the remaining half, more testing is still the only way I can see to find such bugs.

It's a shame that OCaml seems to have no system for deprecating old APIs. This means that poor API choices made years ago are still causing trouble today. It would be good if OCaml could flag functions as being there for historical reasons only, and issue a compiler warning if you used them. I do, however, like the fact that they stay around - breaking existing code (as Python 3 did) is not the solution either!

Two of the bugs ("Deleting temporary files" and "Reactive event handler gets garbage collected") could have been avoided if OCaml had linear types, but I have reasonable solutions to both. The XML / JSON handling bugs could have been avoided by using proper schemas, but such schemas didn't exist (my fault).

Overall, I'm pretty happy with the bug rate so far. No doubt more bugs will be discovered as the new code makes its way into the distributions and gains more users, but I think this code will be easier to maintain than the Python code, and much less likely to break silently due to changes in the language or third-party libraries.

Methodology

Core OCaml issues

Out-of-range integers

Fails to start on Windows

Spawning a daemon fails on Windows

0install select ignores --refresh

Not found errors

Octal value

Parsing a path as a URL

Interrupted waitpid

HTTP redirects with data cause corrupted downloads

Setting wrong mtime

Strict sequences

Lwt-related bugs

Lwt process fails with empty string

EPIPE on Windows

Deleting temporary files

Race shutting down test HTTP server

Reactive event handler gets garbage collected

Stuck reading output from subprocess

Downloads never complete

Event loop doesn't work on OS X

Curl-related bugs

Failing to reset Curl connections

Error with cancelled downloads

GTK bugs

Sorted treeview iter mix-up

Crashes with GtkIconView

Logic errors

Python bugs

Summary

0install select ignores `--refresh`