OCaml’s static typing allows it to detect many problems at compile-time. Still, some bugs slip though. In this post, I go over each discovered bug that made it into a Git commit and try to work out why it happened and whether it could have been prevented.
Note: As this post is entirely about bugs, it may appear rather negative. So let me say first that, overall, I’ve been very impressed with the reliability of the OCaml code: I’d have expected to find more bugs than this in 27,806 lines of new code!
Table of Contents
- Core OCaml issues
- Lwt-related bugs
- Curl-related bugs
- GTK bugs
- Logic errors
- Python bugs
I’ve gone back through the Git commit log and selected all the ones that say they fix a bug in the comment, starting from when I merged the initial OCaml code to master (on 2013-07-03). It’s possible that I sneakily fixed some bugs while making other changes, but this should cover most of them. Any bug that made it into a released version of 0install should certainly have its own commit because it would have to go on the release branch. I also included a few “near misses” (bugs I spotted before committing, but which I could plausibly have missed).
In a number of cases, I wrote and committed the new OCaml code first, and then ported the Python unit-tests in a later commit and discovered the bug that way (so some of these could never have made it into an actual release). Compile-time bugs have been ignored (e.g. code that didn’t compile on older versions of OCaml); I’m only interested in run-time errors here.
I’ve classified each bug as follows:
- This bug was caused by my inexperience with OCaml. Making proper use of OCaml’s features would avoid this class of bug entirely.
- Caused by a bug in a library I was using (and hopefully now fixed). Could only have been discovered by testing.
- Poor API
- This bug was my fault, but could have been avoided if the library had a better API.
- Warning needed
- My fault, but the compiler could have detected the problem and issued a warning.
- Testing only
- I only know how to find such bugs through testing. Similar bugs could happen again.
Core OCaml issues
Note: I’m grouping the bugs by the library the code was interacting with, regardless of whether that library was at fault. This section is for bugs that occurred when just using OCaml itself and its standard library.
Everything seemed to be working nicely on my Arch system, but the first run on a Debian VM gave this unhelpful error:
I was trying to store a Unix timestamp in an
int. Unlike Python, OCaml’s integers are limited precision, having 1 bit less than the machine word size. The 32-bit VM only had 31-bit integers and the time value was out of range for it.
This was entirely my fault. OCaml always uses floats to represent times, and they work fine. I was just converting to ints to be consistent with the Python code.
However, the error message is very poor. I replaced all calls to
int_of_string with my own version, which at least displays the number it was trying to convert. This should make debugging any similar problems easier in future.
Fails to start on Windows
Windows kept complaining that my program didn’t exist and to check that the path was correct, even when I was double-clicking on the executable in Explorer! Turns out, Windows refuses to run binaries with certain character sequences in their names (“instal” being one such sequence). See Vista doesn’t start application called “install” w/o being elevated.
Solution: you have to embed an XML “manifest” in Windows binaries to avoid this behaviour. Would be nice if OCaml did that automatically for you.
Spawning a daemon fails on Windows
Windows doesn’t have
fork, so the usual double-fork trick doesn’t work.
create_process on Windows.
Would be nice if OCaml grouped all the POSIX-only functions together and made you check which platform you were on. Then you’d know when you were using platform-specific functions. e.g.
1 2 3
Type: Poor API
0install select ignores
I forget to handle the
Refresh case for the “select” command.
Different commands need to handle different subsets of the options. I was using a plain variant (enum) type and throwing an exception if I got an option I wasn’t expecting:
1 2 3 4
(Note: Several people have asked why I used a default match case here. It’s needed because there are many options that don’t apply to the “select” command. The option parser makes sure that each sub-command’s handler function is only called with options it claims to support.)
Solution: I switched from plain variants to polymorphic variants and removed the default case. Now, the type-checker verifies at compile-time that each subcommand handles all its options:
1 2 3 4
See Option Handling With OCaml Polymorphic Variants for a write-up of that.
Not found errors
When printing diagnostics about a failed solve, we check each interface to see if it has a replacement that it conflicts with. e.g. the new “Java” interface replaces (and conflicts with) the old “Java 6” interface. But if the conflicting interface wasn’t used in the solve, we’d crash with:
I use a lot of maps with strings as the keys. I therefore created a
StringMap module in my
common.ml file like this:
Not_found if the key isn’t found, which is never what you want. These exceptions are awkward to deal with and it’s easy to forget to handle them.
A nice solution is to replace the definition with:
1 2 3 4 5 6 7
This redefines the
find method to return an option type. Now you can’t do a
StringMap.find without the compiler forcing you to consider the case of the key not being there.
Would be nice if the OCaml standard library did this. Perhaps providing a
Map.get function with the new behaviour and deprecating
Type: Poor API
0700 instead of
0o700 to set a file mode. Would be nice if OCaml warned about decimals that start with 0, as Python 3 does.
Type: Warning needed
Parsing a path as a URL
This didn’t actually get committed, but it’s interesting anyway. Downloaded feeds are signed with GPG keys, which are trusted only for their particular domains. At one point, I used
Trust.domain_from_url feed.url to get the domain. It was defined as:
However, feeds come in different types: there are remote feeds with URLs, and local feeds with local paths (there are also virtual “distribution” feeds representing the output from the distribution’s package manager).
I was trying to get the trust domain for all feeds, not just remote ones where it makes sense.
Once again, the solution was to use polymorphic variants. The three different types of feed get three different constructors. A method (such as
domain_from_url) that only works on remote feeds is declared as:
Then, it’s impossible to call it without first ensuring you’ve got a remote feed URL.
This change also improves the type-safety of many other parts of the code (e.g. you can’t try to download a local feed now either), and uncovered another bug: you couldn’t use the GUI to set the stability rating for a distribution-provided implementation, because one of the functions used only worked for non-distribution feeds.
Type: Inexperience (x2)
Unix.waitpid function can raise
EINTR if the system call is interrupted, although the documentation doesn’t mention this. It would be nice if OCaml would automatically restart the call in this case (as Python’s
subprocess module does).
Type: Poor API
HTTP redirects with data cause corrupted downloads
We download to a temporary file. If we get an HTTP redirect, we truncate the file and try the new address. However,
ftruncate doesn’t reset the position in the file. So, if the redirecting site sent any data in its reply, you’d get that many zeroes at the start of the download. As with
waitpid, OCaml’s behaviour is standard POSIX, but not mentioned in the OCaml documentation.
seek_out ch 0.
Also, I updated the test server used in the unit-tests to send back some data when doing a redirect.
Type: Testing only
Setting wrong mtime
Unix.utimes is supposed to set the mtime and atime of a file to the given values. However, the behaviour is a little odd:
- When 1 <= time < infinity, it sets it to the requested time.
- When 0 <= time < 1 however, it sets it to the current time instead.
That’s a problem for us, because we often use “0” as the time for files which don’t have a timestamp, and the time is part of the secure hashes we calculate.
Solution: I wrote a C function to allow setting the time to whatever value you like.
This bug didn’t make it into a commit because I hit it while writing a test script (I was trying to reset a timestamp file to time zero), and the unit-tests would have caught it if not, but it’s still a poor API. Not only does it fail to use a variant type to handle different cases, but it chooses a magic value that’s a valid input!
Or, rather than using a variant type for these two cases, it could just drop the magic current time feature completely - it’s easy enough to read the current time and pass it explicitly if you need it. That would make the code clearer too.
(note: the documentation does say “A time of 0.0 is interpreted as the current time”, but it’s easy to forget this if it wasn’t relevant the first time you read the docs)
Type: Poor API
This didn’t make it into a commit, but it’s interesting anyway. Simplified version of the problem:
1 2 3 4 5 6 7
Why is there no warning? You might expect OCaml would infer the type of
unit -> unit and then complain that the function we pass has the wrong type (
unit -> string -> unit).
In fact, although OCaml warns if you ignore the result of a function that doesn’t return unit, it’s not actually an error. So it actually infers the type of
fn as (
unit -> 'a), and it compiles fine.
Solution: always compile with
-strict-sequence (or put
true: strict_sequence in your
Lwt process fails with empty string
When spawning a process, the Lwt docs say you can pass an empty string as the binary to run and it will search for the first argument in
$PATH for you. However, that behaviour was added only in Lwt 2.4 and using the older version in Debian it failed at runtime with a confusing error.
Probably I should have been reading the old version of the docs (which the web-site helpfully lets you do).
I’m classifying this as a poor API because it was caused by using
"" as a magic value, rather than defining a new constructor function.
Type: Poor API
EPIPE on Windows
On Windows, we couldn’t read the output of GnuPG. This was due to a bug in Lwt, which they quickly fixed:
Deleting temporary files
We download various files to a temporary directory. In some cases, they weren’t being deleted afterwards.
Solution: the downloader now takes a mandatory Lwt switch and deletes the file when the switch is turned off. Callers just have to wrap the download call in a
try ... finally block, like this:
1 2 3 4 5 6 7 8
To make this completely foolproof, you’d need something like the linear types from Rust or ATS, but this is good enough for me.
Race shutting down test HTTP server
Some of the unit tests run a simple HTTP server. When the test is over, they use
Lwt.cancel to kill it. However, it appears that this call is unreliable: depending on exactly what the server is doing at the time it might just ignore it and continue.
Solution: we both cancel the task and set a boolean flag, which we test just before calling
accept. If we’re in
accept at the time of the cancel, the thread will abort correctly. If it’s anywhere else, it may continue handling the current request, but will quit as soon as it finishes and checks the flag.
Would perhaps be nice if Lwt remembered that an attempt was made to cancel the thread during a non-cancellable operation, and killed it at the next opportunity.
Type: Poor API
A related race occurred if we spawned a child process while handling an HTTP request, because the child would inherit the client socket and it would never get closed.
Lwt_unix.set_close_on_exec connection as soon as the connection is accepted.
Note that both these hacks should be race-free, because Lwt is cooperatively multi-threaded (e.g. we can’t spawn a subprocess between accepting a connection and marking it
close_on_exec). I think.
Ideally, when spawning a child process you’d specify the file descriptors you wanted it to inherit explicitly (Go does this, but really it needs to be at the POSIX level).
Type: Testing only
(although these bugs only occurred in the unit-tests, I’m including them because they could just as easily appear in the main code)
Reactive event handler gets garbage collected
The OCaml D-BUS bindings use functional reactive programming to report property values. The idea is that you get an object representing the continuously-varying value of the property, rather than a particular sample of it. Then you can handle the signal as a whole (for example, you can get the “progress” signal from a PackageKit transaction and pass it to a GUI progress monitor widget, so that the widget always shows the current progress). You can build up chains of signal processors. For example, you might transform a “bytes downloaded” signal into a “percentage complete” one.
The technique seems to come from Haskell. Being purely functional, it’s always safe to garbage collect a signal if no-one is holding a reference to it.
However, OCaml is not purely functional. You might easily want to evaluate a signal handler for its side-effects.
I created a handler to monitor the transaction status signal to see when it was finished, and attached the resulting signal to a
My plan was that the switch would keep it alive until it fired.
That didn’t work, because there was a subtle circular reference in the whole scheme, and OCaml would sometimes garbage-collect the handler and the switch. Then the process would ignore the finished event and appear to hang. I asked on StackOverflow and got some suggestions:
The solution seems to be to keep references to all active signals in a global variable. Rather messy.
Type: Testing only
Stuck reading output from subprocess
When using the
Lwt_glib module to integrate with the GTK mainloop, the
HUP response from
poll is ignored. This means that it will call
poll endlessly in a tight loop.
Downloads never complete
Lwt_glib, downloads may never complete. This is because OCaml, like Python, has a global lock and
Lwt_glib fails to release it when calling
poll. Therefore, no other thread (such as the download thread) can make progress while the main thread is waiting (e.g. for the download thread to finish).
Event loop doesn’t work on OS X
poll to mean “no timeout”. This works on Linux, but not on BSD-type systems.
Failing to reset Curl connections
For efficiency, Curl encourages the reuse of connections. However, I forgot to reset some parameters (max file size and expected modification time). If the next call didn’t use them, it would reuse the old values and possibly fail.
Newer versions of ocurl have a
reset function, which avoids these problems.
Type: Poor API
Error with cancelled downloads
Downloads were sometimes failing with this confusing error:
easy handled already used in multi handle
It happened when reusing connections (which Curl encourages, for efficiency). There was no direct way to cancel a download, so I handled cancellation by closing the channel the download was writing to. Then, next time some data arrived, my write callback would fail to write the new data and throw an exception, aborting the download. It turned out that this was leaving the connection in an invalid state.
Solution: return 0 from the handler instead of throwing an exception.
Ideally, ocurl should catch exceptions from callbacks and allow the C code to clean up properly. Now fixed.
Sorted treeview iter mix-up
(I caught this before committing it, but it’s a nasty bug that could easily be missed. It was present for a while in the original Python version.)
The cache explorer dialog allows you to delete implementations from the cache by selecting an item and pressing the Delete button. It also allows you to sort the table by clicking on the column headings. However, if you sort the table and then delete something, it deletes the wrong thing!
To make a sortable table (which is just a special case of a tree to GTK), you first create an underlying (unsorted) list model, then wrap it with a sorted model, then pass that to the display widget (GtkTreeView), like so:
1 2 3
To do things with the model, you pass it a GtkTreeIter, which says which item you want to act on, e.g.
The trouble is, sorted and unsorted GtkTreeIters both have the same type, so you can easily pass an iterator of the sorted model as an argument to the unsorted model. Then it will act on the wrong item. If the view isn’t sorted then everything works fine, so you might not notice the problem while testing.
Solution: I created a new module for unsorted lists. The implementation (
unsorted_list.ml) just proxies calls to the real code:
1 2 3 4 5 6 7 8 9 10 11 12 13
However, the interface (
unsorted_list.mli) makes the types
t (the model) and
GtkTreeIters) abstract, so that code outside of the module isn’t allowed to know their real types:
1 2 3 4 5 6 7 8 9 10 11 12
Now it’s impossible to mix up sorted and unsorted types:
1 2 3
It’s still possible to mix up iterators in some cases (e.g. between two different instances of a sorted model), but that’s a much less likely mistake to make.
Another way to solve the problem would be to bundle the owning model with each iterator, but that would be a big change to how the underlying GTK library works.
And ATS could solve this easily using its dependant types, by declaring the iterator type as
iter(m) (“iterator of model at address m”), linking models to iterators in the type system.
Type: Poor API
Crashes with GtkIconView
As with GtkTreeView, you can make a sorted GtkIconView with a pair of models. For some reason, clearing the underlying model didn’t clear the sorted version, and repopulating it corrupted memory:
Program received signal SIGSEGV, Segmentation fault.
Solution: since there is no UI to let the user change the sort column, I just removed the sorted model and sorted the underlying model myself in OCaml. I guess this is probably a GTK bug.
A second crashing bug with GtkIconView is caused by a bug in lablgtk.
The C wrapper
get_path_at_pos returns a
tree_path option (
None if you clicked in a blank area), but the OCaml declaration says it returns a plain
Obj.magic to do an unchecked cast to the correct type:
(reported as segfault due to GtkIconView type mismatch)
Two interesting things about this bug:
- Even the low-level GTK bindings are presumably not generated automatically. If they were, this kind of mismatch surely couldn’t happen.
- An OCaml optional pointer doesn’t have the same representation as a non-optional pointer. If it did, the code wouldn’t crash. This suggests OCaml is being a bit inefficient about option types, which is disappointing.
The remaining bugs aren’t very interesting, but I’ve included them for completeness:
- Failed to handle ambiguous short options (e.g.
- Missing tab completion for
- Infinite loop handling recursive dependencies
- Allow absolute local paths in local implementations
- Default command for
machine:nullin JSON response
- Typo: scripts start
- Report an error if we need to confirm keys during a headless background update
- Wrong attribute name in overrides XML file
- Allow executing selections with no command but an absolute main
- Handle local-path on
- Wrong attribute name when saving user overrides
- Don’t try to use GUI in
- Support old version of selections XML format
- Support old selections without a
- Don’t send an empty GetDetails request to PackageKit
- Cope with dpkg-query returning a non-zero exit status
- Detect Python even if python-gobject is missing
- Race when cancelling downloads
Type: Testing only (x21)
Just for interest, here are the Python bugs discovered over the same period (it doesn’t make sense to compare bug counts, because these are bugs in mature code, often several years old, not just-written new code).
I think these would be impossible or unlikely in OCaml (the problem would be detected at compile time):
- Error setting machine type for Cygwin packages (type error)
- When loading selections from a file, convert
last-check-mtimeattribute to an int (type error)
UnicodeErrorextracting or generating a manifest for archives with non-ASCII file names (no encoding of file names in OCaml :-)
- Crash when specifying bold text (PyGTK API change)
- Broken clipboard handling in cache explorer (PyGTK API change)
- Broken filtering in cache explorer (PyGTK API change)
- Broken cache explorer menu when using a filter (model mix up; could avoid with abstract types, as above)
- Fails running an app when the master feed is no longer cached (would be forced to handle None case in OCaml)
These would likely still be bugs in OCaml:
- Fix mtime check in
selections.get_unavailable_selectionsfor native packages
- Always return False for native packages in
- Always use the prefix “xml” for the XML namespace
- Don’t abort solving just because a local feed is missing
- Escape tooltips in cache explorer
- 32-bit size limit in cache explorer
This was a third-party bug:
- Workaround for PyGTK garbage collection bug
Despite the newness of the code, the bug-rate has been surprisingly low so far. Of the (detected) bugs that did make it past the compiler, about a sixth were due to bugs in third-party libraries, another sixth could have been avoided with better third-party APIs and a sixth we due to my inexperience with OCaml. For the remaining half, more testing is still the only way I can see to find such bugs.
It’s a shame that OCaml seems to have no system for deprecating old APIs. This means that poor API choices made years ago are still causing trouble today. It would be good if OCaml could flag functions as being there for historical reasons only, and issue a compiler warning if you used them. I do, however, like the fact that they stay around - breaking existing code (as Python 3 did) is not the solution either!
Two of the bugs (“Deleting temporary files” and “Reactive event handler gets garbage collected”) could have been avoided if OCaml had linear types, but I have reasonable solutions to both. The XML / JSON handling bugs could have been avoided by using proper schemas, but such schemas didn’t exist (my fault).
Overall, I’m pretty happy with the bug rate so far. No doubt more bugs will be discovered as the new code makes its way into the distributions and gains more users, but I think this code will be easier to maintain than the Python code, and much less likely to break silently due to changes in the language or third-party libraries.