In the first post, I took a brief look at the programming languages ATS, C#, Go, Haskell, OCaml, Python and Rust to try to decide which would be the best language in which to write 0install (which is currently implemented in Python). Now it's time to eliminate a few candidates and look in more detail at the others.
Last time, I converted 4 lines of Python code from 0install into each language. This time I'm converting 576 lines, so this should give a more accurate picture of how each performs for a real-world task.
As before, note that I'm a beginner in these languages. Please post corrections or suggestions of better ways of doing things in the comments. Thanks.
(this post also appeared on reddit and on Hacker News, where there are more comments)
Table of Contents
- Conclusions from last time
- Test case
- Syntax
- The "run" module
- Data structures
- Variants
- Using the data structures
- Handling XML
- Building lists
- String processing
- Finding resources
- API docs
- Speed
- Conclusions
Conclusions from last time
Based on the initial evaluation and feedback (thanks everyone!), I'm not going to look any further at these:
- ATS
-
I'm glad I included ATS. Its excellent performance and tiny binary put the other languages into perspective. Still, it's too hard to use, and makes it too difficult to separate safe code from unsafe code.
- C#
-
Although C# is widely used, has an excellent cross-platform bytecode format and many libraries available, it is too large and too slow to start for 0install (to the people who suggested profiling: when even "Hello World" is too slow, there isn't much you can do).
- Go
-
Although many Go users complained that Go's score was unfairly low, they didn't seem to disagree that it was the worst of the candidates for our requirements, only about by how much it was the worst. To summarise the discussion briefly:
-
Go is good because errors are handled where they occur. Maybe, but ATS, Haskell, OCaml and Rust do that too, and they help you get it right.
-
You can write pretty reliable code in Go. No doubt, since you can do it in C too. But maybe I can write even better code with less effort using the other languages?
-
It's OK to ignore errors silently in some places, because handling all errors would clutter the code up too much. This seems like a trade-off the other languages don't require me to make.
-
- Rust
-
Rust has excellent safety and a familiar imperative syntax. It also has excellent support for shared libraries, which I didn't understand when I wrote the previous post (although I don't feel too bad about this as it seems that almost no-one else in the Rust community understood it either). Speed is OK but not great, though likely to improve. Rust's main weakness is its immaturity. The language is likely to change in incompatible ways in the near future and there are few libraries available (for example, there is no XML library for Rust). It will be a few years before this is usable in production (and the developers make no secret of this).
So, here are the remaining candidates and a summary of the conclusions from last time:
- Haskell (7.6.3)
- Haskell is fast, but it has problems with shared libraries: libraries are only compatible when compiled by the exact same version of the compiler. Its pure functional style may make it very difficult to convert the existing code to Haskell. Diagnostics (e.g. getting stack-traces) may be a problem too.
- OCaml (4.00.1)
- OCaml is also fast and had good safety, but its support for shared libraries seems limited. Getting good diagnostics from production code may be tricky, as enabling stack-traces has a performance cost (OCaml code assumes exceptions are cheap).
- Python (2.7.5 and 3.3.2)
-
Python is much slower than the other candidates, but it does have the advantages of an excellent standard library,
easy distribution (no need to make platform-specific binaries), being the language we're
currently using, and being very well known. But it has no static type
checking, which means a lot of work writing unit-tests for even trivial code (e.g. testing
__repr__
methods and logging in obscure error paths to make sure it won't crash due to a typo).
Several people said that D has improved a lot in the last few years. I didn't have time to look at it carefully again, though. It's a nice language, but probably too low-level/unsafe for us. For example, it's trivial to make it segfault by dereferencing a null pointer.
Test case
0install collects information about all available versions of a program and its libraries from around the web. Then it runs a solver to determine the best valid combination and writes the results to an XML selections document. For an app, foo, the current selections can be found in ~/.config/0install.net/apps/foo/selections.xml.
When it's time to run an application, we read this XML, set up a suitable environment for the process and then exec it. As before, this should happen as quickly as possible.
The test program can be given either the name of an app (e.g. "foo") or the path to a selections document.
This task involves:
- Using the XDG basedir spec to find the current selections and the cached software.
- Parsing XML into a suitable data structure.
- Manipulating pathnames, files, directories and symlinks.
- Updating environment variables based on the requested bindings.
- String manipulation.
- Creating launcher executables (as in the previous test).
We can't go over every line in this post, so I'll just highlight the most interesting bits. The full version in each language is in this GitHub repository. If you get bored, there's a summary at the end.
Syntax
Python
I guess most people are familiar with Python. It's clear, simple and straight-forward. It uses indentation to see when a block ends, which means that the structure of a program is exactly what it looks like. Here's a sample of the Python main module:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Note: The real Python code uses the "optparse" module to handle option parsing and generate help text. However,
because the Haskell/OCaml code developed here will be used as a front-end to the real Python, we don't want any
special handling. For example, if invoked with --help
they should fall back to the Python version instead of
handling it themselves. So proper option parsing isn't part of this task.
OCaml
OCaml syntax is very compact, but somewhat prone to confusing syntax errors. Often an error reported at a particular line means you used ";" rather than ";;" 20 or 30 lines earlier. Here's the OCaml main module:
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Some useful things to know when reading the examples:
let a = b in ...
assigns a variable (well, constant), likea = b; ...
in Python.let foo a b = ...
defines a function, likedef foo(a, b): ...
in Python.foo (a + 1) b
means to callfoo
with two arguments, likefoo(a + 1, b)
in Python.foo a
creates a partially applied function, likefunctools.partial(foo, a)
in Python.()
means no data (you can think of it as an empty tuple).- Module names start with a capital letter (e.g.
Run.execute_selections
).
We assign the result of Run.execute_selections
to ()
just to check that it didn't return anything (so if it
did, we'd get a type error rather than just ignoring it).
The syntax tends to emphasise the body of functions while minimising the signature, which I'm not convinced is a good thing. Consider:
1 2 3 4 5 6 7 8 |
|
What is the argument b
here? What is the return type? These things are inferred by the compiler.
b
is an element
because ZI.get_attribute_opt
takes an element
as its last argument. The
return type is env_source
, because InsertPath
and Value
are constructors for env_source
objects.
You can include the types, but generally you don't and that makes it hard to know the type of a function just by looking at its definition in the source code.
Update: ygrek points out that most text editors can tell you the type of an OCaml identifier, e.g. see these instructions for Vim.
It also tends to bulk out the code: e.g. Map.find key map
rather than the more
object-oriented map.find key
. OCaml does support objects if you want them, but
the normal style seems to be to avoid them. On the other hand, it does make it very easy to
see which bit of code is being called, which isn't so obvious in an object-oriented style.
One thing to watch out for in OCaml is that, unlike Python and Haskell, it doesn't look at your indentation. Consider the following code (this is a simplified version of a mistake I actually made):
1 2 3 4 5 6 7 8 9 |
|
This code never prints "End", because it treats that as part of the "many" branch, even with all warnings enabled. It would be nice if it would issue a warning if a single block (e.g. the last match case) contains lines with different levels of indentation.
Haskell
Like Python, Haskell uses whitespace for indentation and generally avoids semi-colons and braces. Here's the main function in Haskell:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Some things you should know to help you read the examples:
::
means "has type". Functions typically declare their type at the top.myfunc :: String -> Int -> Bool
means thatmyFunc
takes aString
argument and then anInt
argument and returns aBool
.do
notation means that the expressions inside are connected together in some type-specific way (this is quite confusing). Sincemain
has the typeIO ()
, these statements are connected by the "IO monad", which does them one at a time, interacting with the system for each one. If the do block had a different type, it might do something completely different.a $ b $ c d
meansa (b (c d))
(without them it means((a b) c ) d
). It helps to avoid having lots of close brackets.
The "run" module
Once we've loaded the selections XML document, the basic steps to execute the program are:
- For each selection, get the path to its directory (for packages not provided by the distribution).
- Scan the XML and collect all the bindings (things we need to set up).
- Ensure the launcher helper utility is installed (see last post).
- Set up environment variables requested in the bindings.
- Set up any launchers requested in the bindings.
- Build the argv for the new process and exec it.
OCaml
This is all very straight-forward:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Haskell
This is rather complicated. Haskell functions can't have side effects (like, say, creating a launcher or execing a
process). Instead, the function returns an IO
request to main
, which returns it to whatever is driving Haskell.
The IO request is basically a request to perform an operation and a callback to invoke when done. To avoid this becoming
a syntax nightmare, Haskell's do
notation lets you write it in an imperative style:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
For example, take the first line: origEnv <- getEnvironment
. getEnvironment
is a request to get the current environment.
It has the type IO [(String, String)]
- a request for a list of (name, value) mappings. The rest of the do block is
effectively a function which takes an argument origEnv
, of type [(String, String)]
(i.e. an actual list of mappings).
When this is returned by main, the system gets the mappings and then calls this function with the results. The same thing then
happens with the second line, and so on.
There's one tricky bit: resolvePath
takes a single selection and finds where it's stored (which requires access to the filesystem).
It returns an IO request to check whether some paths exist. But when we loop over all the selections, we get a list of IO operations, not an
IO operation. So you need to use mapM
(monadic map), which turns a list of IO requests into an IO request for a list. That's a lot
of thinking to do something quite simple.
Doing IO in Haskell is hard. Here's another example (reading an environment variable):
1 2 3 4 5 6 |
|
I want to do a getEnv
operation and return Nothing
if the variable isn't set. At first, I tried calling getEnv
and
catching the exception using the normal exception handing function. That doesn't work. The reason is that getEnv
doesn't
throw an exception; it successfully returns an IO request for an environment variable. You have to use the special tryIOError
function to get a request for either an environment variable or an error, and then pattern-match on that.
The benefit of all this extra work is that you can instantly see which functions do (or rather request) IO by looking at their type signature. Take the XML parsing function, for example:
1
|
|
Just by looking at the type, we can see that it does no IO (and, therefore, isn't vulnerable to attacks which might try to trick
it into loading local files while parsing the DTD). It also, incidentally, suggests that we're not going the get any useful error
message if it doesn't parse: it will just return Nothing
.
Data structures
We need to store the user's configured search paths, following the XDG Base Directory Specification. We need a list of directories to search for configuration settings, cached data (e.g. downloads) and "data" (which we use mainly for binaries we've compiled).
Python
For storing general records, Python provides a choice of classes and tuples. Here's a typical class and an example of creating an instance and accessing a field:
1 2 3 4 5 6 7 8 9 10 |
|
When calling any Python function or constructor, you can use the (optional) name = value
syntax so you don't have to
remember the order of the arguments.
Python also provides named tuples, which saves some boilerplate code when you just want pure data with no methods attached. The syntax isn't great, though, and you can't do pattern matching on the names, only by remembering the order:
1 2 3 4 5 6 7 8 9 10 |
|
OCaml
OCaml provides classes, unnamed tuples and records (essentially named tuples). A record would be the obvious choice here:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Curiously, there's no need to tell it the type of the records you're building. It works it out from the field names. However, this does mean that you can't define two different record types with the same field name in the same module. Also, if you want to access a field from a record defined in a different module, you need to qualify the field name, e.g.
1
|
|
There are also "polymorphic variants" which allow the same field name to be used in different structures, but I haven't tried using them. The manual notes that the compiler can do a better job of finding errors with plain variants, however.
A big win over Python is the ability to pattern-match on field names. e.g. to extract the cache
and config
fields:
1 2 3 4 |
|
Update: if you compile with warnings on (and you should), it will complain that you're ignoring the data
field. This is a really useful check because if you add a new field later it will tell you all the places that might need updating. You can use data = _
to ignore that field explicitly, or just _
to ignore all remaining fields.
Haskell
Haskell doesn't have classes (in the Python / OCaml sense), but it does provide named tuples:
1 2 3 4 5 6 7 8 9 10 11 |
|
Note that the syntax for accessing a field is field record
, not record.field
as in other languages.
Like OCaml, you can't use the same field name in different structures. Pattern matching works:
1 2 3 4 |
|
However, it doesn't have OCaml's short-cut for the common case where the field you
want to match has the same name as the variable you want to store it in (to use Python terms). In fact,
doing that is strongly discouraged, because if you matched with config = config
, then that would
shadow the config
function used to access the record.
Variants
Sometimes, a value can be one of several sub-types. Bindings in 0install are a good example of this:
-
There are two types of binding: an
EnvironmentBinding
sets an environment variable to tell a program where some resource is, while anExecutableBinding
gives the program an executable launcher for another program. -
An
EnvironmentBinding
can be used to find a path within the selected component, or to provide a constant value. -
An
EnvironmentBinding
can affect its variable in three different ways: it can append the new value to the end of the old value, prepend it at the start, or replace the old value completely. -
An
ExecutableBinding
can store the launcher's location in a variable or add the launcher's directory to the application's$PATH
.
Here's an example of an environment binding which prepends a package's lib
directory to CLASSPATH
:
1
|
|
The code that parses the XML and generates a list of bindings needs to store different values depending on which kind it is. For example, "append" and "prepend" bindings let you specify optional separator and default values, while "replace" ones don't.
Then, the code that applies the bindings needs to handle each of the separate cases. We'd like to make sure that we didn't forget to handle any case, and that we don't try to access a field that's only defined for a different case.
Python
The traditional object-oriented way to handle this is with subclasses (e.g.
ExtendEnvironment
with default
and separator
fields, and
ReplaceEnvironment
without; InsertPath
and Value
, etc). However, that's a lot of classes to
write, so 0install actually does everything in just two classes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
|
Note the use of strings (PREPEND
, APPEND
, REPLACE
) in place of a proper enum, as Python
doesn't have them.
OCaml
Here's a definition in OCaml. Variants use |
to separate the various possibilities:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
That's far more useful than the Python. It accurately describes the possible combinations, and is clear
about the types and which bits are optional. Using varname
and filepath
as aliases for string
doesn't
add any type safety, but it does make the signatures easier to read and gives better error messages.
Note that the extra |
on the first line after type binding
isn't strictly necessary, but it helps to
line things up.
Haskell
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
This is essentially the same as the OCaml, except that I used tuples rather than records in Binding
,
because handling records is more awkward in Haskell due to the pattern matching problems noted above.
Using tuples (which I could have done in OCaml too) makes the definitions shorter, because the
definition of a tuple can be done in-line instead of with a separate structure.
deriving Show
causes Haskell to automatically generate code to convert these types to strings,
which is handy for debugging (and sadly missing from OCaml).
Using the data structures
To apply the bindings, a runner module needs to collect all the bindings and then:
- Process each
EnvironmentBinding
, updating the environment. - Process each
ExecutableBinding
, using the new environment to create the launchers.
Here, we'll look at the code to process an EnvironmentBinding
.
Python
Here's the Python code for applying an <environment> binding:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
|
The code for getting the value to append to is a bit messy. We're trying to say:
- Use the current value of the environment variable. Or, if not set:
- Use the
default
from the <environment> element. Or, if not set: - Use the built-in default value for this variable.
Python often makes this easy with its a or b or c
syntax, but had to use the longer if
syntax for
this case because or
treats the both None
and the empty string as false,
whereas we want to treat an empty string as a valid value.
1 2 3 4 |
|
Actually, I noticed while writing this post that I got that wrong in the Python (I used the shorter or
in one
place) and had to fix it; a common mistake in Python.
The two binding types must be handled differently. In the run code, we (rather messily) check the type to decide what to do:
1 2 3 4 |
|
This isn't a very object-oriented way to do it. But it made more sense to put the logic for handling the bindings all together in the run module rather than in the module which defines the data types (which I prefer to keep side-effect free). Also, the rule that we need to process all the EnvironmentBindings before all of the ExecutableBindings can't easily go in the classes themselves.
So, the existing Python code is really pretty poor. We're using strings to simulate enums (simple variants),
a single class with a load of if
statements in place of variants for the different ways of setting an
environment variable, and messy isinstance
checks to let us keep the logic for applying bindings together
in the run
module. If we add or change binding classes, there's several places we need to check, and no
static checking to help us. Let's see if the other languages can help us do better...
OCaml
Here's the code to apply an EnvironmentBinding
in OCaml:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
It's not bad, and it's nice to see all the different cases laid out.
By the way let do_env_binding env impls = function ...
is a convenient way to pattern match on the
last (unnamed) argument. It means the same as let do_env_binding env impls binding = match binding with ...
.
I initially thought that the Python version was easier to read. On reflection, however, I think it's more subtle. It's easier to see what the Python version does, but it's not easy to see that what it does is correct. By contrast, it's easy to see that the OCaml version handles every case (the compiler checks this), and you can just check that each individual case is handled correctly.
As in the Python, getting old_value
is a bit messy as there's no null coalescing operator in OCaml:
1 2 3 4 5 6 7 |
|
Haskell
And here is the code to apply an EnvironmentBinding
in Haskell:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
That's a good bit shorter than both the Python and the OCaml, because I was able to use
a `mplus` b
handle the defaults easily. That's "monadic plus", in case you
were wondering whether mplus is a bit of a silly name.
Handling XML
We need to parse the XML into some kind of internal representation. Originally, 0install parsed the XML into custom classes, but it turns out that we often want to write XML back out again, preserving attributes and elements we didn't understand. So we've been slowly moving towards using generic XML trees as the in-memory representation, which saves having to write code to serialise our data structures as XML (which then requires ensuring that they're consistent with the parsing code).
Python
Python's standard library includes a selection of XML parsers: minidom, pulldom, ElementTree, expat and SAX. Disappointingly, the documentation says that none of them is safe with untrusted data. 0install uses the low-level "expat" parser, which isn't in the vulnerabilities table, so hopefully we're OK (many of the vulnerabilities are denial of service attacks, which isn't a big problem for us).
We build our own "qdom" tree structure rather than use the more standard
minidom module because import xml.dom.minidom
takes too long to be used in this
speed-critical code (or at least, it did when I wrote the qdom code, I haven't tested it recently).
One of the nice things about writing in the other languages is not having to worry about speed all
the time.
OCaml
OCaml doesn't include an XML parser in the standard libraries, so I Googled for one. The first one I found was Xml-Light, but it turned out that it didn't support XML namespaces. Then I tried the PXP parser, which is enormous, and parsed the test document I gave it incorrectly (but they have fixed that now - thanks!). Finally, I tried Xmlm, which is small and works.
Xmlm doesn't generate a tree structure, just events (like SAX), so you have to build your own structure. That could be a problem in some cases (it's convenient to have a standard data structure in case you want to pass documents between modules). On the other hand, we already use our own document structure ("qdom") in Python and the standard "DOM" interface is awkward anyway.
Xmlm suggests the following structure:
1
|
|
However, this is quite annoying to process, because every time someone gives you a tree
you have
to pattern match and handle the case of them giving you a D
(a text node). Instead, I used
ElementTree's trick of attaching text to element nodes:
1 2 3 4 5 6 7 |
|
With this, it's easy to iterate over all the elements and ignore text, and you don't have to worry about getting two text nodes next to each other.
You generally don't need text_before
unless you're using mixed content, but having it here means we don't
lose any data if we read a document and then write it out again.
It was convenient to have the structure be mutable while building it, and in other code we may want to manipulate nodes, so I marked most of the fields as mutable.
I made a serious mistake in my first attempt at pattern matching on elements. I wanted to match the elements
<arg>
or <for-each>
in the 0install namespace:
1 2 3 4 5 6 |
|
This is the worst kind of mistake: the kind that seems to work fine when you test it. I made the same mistake in Haskell, but
luckily I decided to enable warnings (ghc -Wall
) and spotted the problem. The code above doesn't check that the namespace is
equal to xmlns_feed
. Instead, it creates a new xmlns_feed
binding with whatever the namespace actually was. Obvious in
hindsight (note: turning on warnings in OCaml also catches the mistake, because it complains that xmlns_feed
is unused).
So, how can we fix this? Dealing with namespaces all the time when processing XML is annoying even when you get it right, so I decided to make a helper interface for doing queries in a particular namespace. Of course, I'd like this to generalise to other namespaces. I found an OCaml feature called "functors", which seem to be basically module templates. Here's how I made a namespace-specific query interface:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
Then I create a specialised version of this module for our namespace in constants.ml
:
1 2 3 4 5 |
|
With this, ZI.map
applies a function to all elements in the 0install namespace and with the given name. ZI.tag
returns the
local name for 0install elements or Nothing
for others, etc. Now the original code becomes:
1 2 3 4 |
|
Much better!
Haskell
Again, there's a choice of library. I went for Text.XML.Light
which seems to work well.
Dealing with namespaces is fairly painful; I'm not convinced that I'm using it right. Here's how I find the <runner> element inside a <command>, for example:
1 2 3 |
|
Using pattern matching doesn't seem to improve things:
1 2 3 4 |
|
I didn't find any way to create a namespace-specific interface as in OCaml (Haskell "functors" do something different).
Testing it on malformed XML, it sometimes just returns Nothing, which is rather unhelpful, and sometimes it successfully parses it anyway! For reference, here is a test document that Haskell loads happily:
1 2 |
|
Python and OCaml, by contrast, both detect the problem and report the location of the error.
Building lists
We need to walk the XML tree and return all the bindings, in document order. The places where a binding may be found are:
<selection> / [binding]
<selection> / [dependency] / [binding]
<selection> / <command> / [binding]
<selection> / <command> / [dependency] / [binding]
For bindings in a dependency, we need to know the dependency's interface. For other bindings, we want the selection's own interface.
Python
Python's easy syntax for sets provides one way to do it, and its append
method on lists makes it easy to collect the results:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
I liked the ZI namespace query thing I made in OCaml so much, I made a Python class to do the same thing.
OCaml
I originally wrote this in a purely functional style, threading the results list through all the functions using fold_left
.
That was pretty messy. Then I decided to take advantage of OCaml's support for imperative programming and just mutate a single
list (bindings
), as in the Python, which worked much better.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
Note the use of named arguments (process ~deps:true
) to avoid confusion between the two boolean arguments.
This works like process(deps = true)
in Python.
A strange aspect of OCaml is that you generally write loops backwards, first declaring the code to
handle each item and then calling an iter
function to say what you want to loop over. OCaml does have
a for
loop, but it's one of those old-fashioned BBC BASIC style ones that just updates a counter in
that particular way that programmers never actually want to do.
Note also that in OCaml you add to the start of a list, not the end, so we need to reverse it as the last step.
Update: You can use the pipe operator (|>
) to make loops easier to write. It lets you write the input to a function first:
1 2 3 |
|
Haskell
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
Nice and short again, but again it requires some explanation. The do
expressions here (unlike the previous one)
have the list type, so they do the lines of the block in a way appropriate for lists. For lists, x <- items
means
to run the rest of the do block once for each item in the list items
, producing a list for each one, and then join
the resulting lists together end-to-end. So these do blocks are actually nested loops.
There are no named arguments, so we just have to remember what the True
and False
mean here.
Also, the |
in the matches doesn't separate alternatives (as in OCaml), but instead gives a condition which
must also be true for it to match, like when
in OCaml.
String processing
We need to expand environment variables in arguments. In the XML, it looks like this:
1
|
|
Python
1 2 |
|
I guess that's cheating, since I picked Python's preferred syntax when I designed the XML format. Here's how to do it without using Template:
1 2 3 4 5 6 7 8 9 10 |
|
Easy. Notice the handy r"..."
syntax for raw strings, which avoids the need to escape backslashes everywhere.
Oh, and I seem to have written an OCaml-style backwards loop here. Guess they're not unique to OCaml after all.
OCaml
The regex is looking a bit ugly, but still pretty good. Oddly, instead of getting some kind of MatchResult
object
passed to expand
, we just get the original string and then pass that to the global match
function. I guess that's
just how the underlying library works.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
Haskell
This was hard, but stackoverflow to the rescue!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
Finding resources
When shipping stand-alone binaries or 0install packages, it's useful if the program can find other resources in its own code directory. In our case, we need to find the launcher program so we can symlink to it.
Python
In Python, this is really easy. Every module has an attribute __file__
which has the location of that module:
1
|
|
OCaml
When called as a binary, we can get the path from argv[0] and use that. If we're being used as a library then
we're either installed as a system package (and can look in some hard-coded path) or being run as a 0install
dependency (and can get 0install to set an environment variable for us). Easy enough, but abspath
(and
realpath
) are missing from the standard library.
1 2 3 4 5 6 |
|
Haskell
Wow. I have no idea how this works.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
|
API docs
All three languages make it easy to generate cross-referenced API docs. I didn't bother to fill in many strings for the OCaml and Haskell, but the linked sample pages show what it looks like:
Python and Haskell require you to document the types in the code if you want them to appear in the docs. The Python syntax for this is a bit awkward. OCaml infers the types automatically.
Speed
My test case is running 0release (which has the largest selections.xml of all
my apps), except that I edited the selections so it runs /bin/echo at the end
rather than /usr/bin/python2. Otherwise we're just measuring the speed of Python.
Running echo
directly with a trivial (hard-coded) C program takes 2-3 ms, so that's the
theoretical best time.
Language | Time / ms | Lines of code | Non-blank chars |
---|---|---|---|
(baseline) | 2 | - | - |
OCaml | 7 | 678 | 17,201 |
Haskell | 14 | 543 | 16,818 |
Python (test) | 45 / 76 | 576 | 14,685 |
Python (real) | 64 / 109 | 21,427 | 609,082 |
Enabling stack traces in OCaml (OCAMLRUNPARAM=b
) increased the time from 7ms to 8ms.
"Python (test)" is a version I wrote for this comparison to get a feel for the number of lines. It doesn't do as much checking as the real version and lacks some features. "Python (real)" is the regular "0launch" command. The first number is the time with Python 2, the second with Python 3. It's interesting how much faster the test version is!
By the way, if you're thinking "parsing XML just to set up environment variables is inefficient; I'll just
use a shell script to start my program", a /bin/sh
script which just execs "echo" takes 10 ms.
Conclusions
All three languages are easy to read, avoiding verbose boilerplate code, and ending up fairly similar in length. No doubt an expert in OCaml or Haskell would write much shorter code than I did. OCaml's syntax is simple, but the indentation doesn't necessarily match up with OCaml's interpretation of the code, which is a concern. In Python and Haskell, the compiler always sees the same structure as the programmer.
Haskell provides all kinds of special notations for writing shorter code.
This means that reading other people's Haskell code can be very difficult.
Also, the same structure (e.g. do
) can mean
wildly different things in different contexts. OCaml syntax is easy to learn and easy to understand.
Many things that are simple in Python or OCaml become very complex in Haskell. In this example: reading an environment
variable that may be unset, doing any kind of IO, reporting errors in XML documents, search-and-replace in strings and
reading argv
. And although not part of this test, I think the converting 0install's solver to Haskell would be very difficult.
Haskell's ability to control where IO happens is useful, but other languages (e.g. E and (hopefully) OCaml with Emily) achieve the same thing without all the complexity.
Python's data structures are very limited and error prone. OCaml and Haskell record and variant types significantly improve the code. OCaml makes working with record types easier, while Haskell simplifies debugging by making it easy to convert structures to strings.
XML handling was easiest in OCaml, though I did have to make my own tree structure. Python's XML libraries aren't safe and Haskell's doesn't report (or even detect, in some cases) errors.
Walking the XML tree and building a list was easy in all three languages. In Python and OCaml, because they let us mutate a single list while walking the tree, and in Haskell thanks to its List monad which handles everything behind the scenes.
A major benefit of OCaml and Haskell is the ease of refactoring. In Python, once the code is working it's best not to change things, in case you break something that isn't covered by the unit-tests. In OCaml and Haskell, you can rename a function, delete old code or change a data structure and rely on the compiler to check that everything's still OK. The static typing in OCaml and Haskell also gives me more confidence that that various untested error reporting code paths will actually work. Untested Python code is buggy code, generally.
I'm not sure how stable the OCaml library APIs are (I assume they don't change much), but it's clear that Haskell's APIs change frequently: code samples and documentation I found on the 'net were frequently out of date and wouldn't compile without changes. Python is generally pretty good, if you overlook the massive changes in Python 3.
In the previous (trivial) test, OCaml and Haskell had very similar performance, but here OCaml clearly moves into the lead. Python is far behind, and only getting slower with the move to Python 3. Being able to write code without worrying about speed all the time is very liberating!
The two statically typed languages didn't require much debugging, except that OCaml's find
function throws an exception if the
key isn't found rather than using an option
type (as in Haskell), which can lead to non-obvious errors. Turning on stack-traces
in OCaml makes it easy to track these down however, and making my own wrappers for find
and similar should mean it won't be a problem
in general.
The big surprise for me in these tests was how little you lose going from Python to OCaml. You still have classes, objects, functions, mutable data and low-level access to the OS, all with an easy and concise syntax, but you gain type checking, much better data structures and a huge amount of speed for no additional effort. Why aren't more people using it?
Although Haskell and OCaml look more similar to each other than to Python, this is just syntax. In fact, OCaml and Python are conceptually pretty similar, while Haskell is different in almost every way.
Overall, binary compatibility (i.e. difficulty of making cross-platform releases of the library, which is currently easy with Python) is my main remaining concern. But while further testing is required (particularly for networking, threading and GUI support), so far I'd be happy to move to OCaml.