Linux and OCaml provide a huge range of tools for investigating performance problems. In this post I try using some of them to understand a network performance problem. In part 2, I'll investigate a problem in a CPU-intensive multicore program.
Table of Contents
- The problem
- time
- eio-trace
- strace
- bpftrace
- tcpdump
- ss
- offwaketime
- magic-trace
- Summary script
- Fixing it
- Conclusions
The problem
While porting capnp-rpc from Lwt to Eio, to take advantage of OCaml 5's new effects system, I tried running the benchmark to see if it got any faster:
$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44933.359573 # The old Lwt version
echo_bench.exe: [INFO] rate = 511.963565 # The (buggy) Eio version
The benchmark records the number of echo RPCs per second. Clearly, something is very wrong here! In fact, the new version was so slow I had to reduce the number of iterations so it would finish.
time
The old time
command can immediately give us a hint:
$ /usr/bin/time ./echo_bench.exe
1.85user 0.42system 0:02.31elapsed 98%CPU # Lwt
0.16user 0.05system 0:01.95elapsed 11%CPU # Eio (buggy)
(many shells provide their own time
built-in with different output formats; I'm using /usr/bin/time
here)
time
's output shows time spent in user-mode (running the application's code on the CPU),
time spent in the kernel, and the total wall-clock time.
Both versions ran for around 2 seconds (doing a different number of iterations),
but the Lwt version was using the CPU 98% of the time, while the Eio version was mostly sleeping.
eio-trace
eio-trace can be used to see what an Eio program is doing. Tracing is always available (you don't need to recompile the program to get it).
$ eio-trace run -- ./echo_bench.exe
eio-trace run
runs the command and displays the trace in a window.
You can also use eio-trace record
to save a trace and examine it later.
The benchmark runs 12 test clients at once, making it a bit noisy. To simplify things, I set it to run only one client:
I've zoomed the image to show the first four iterations. The first is so quick it's not really visible, but the next three take about 40ms each. The yellow regions labelled "suspend-domain" show when the program is sleeping, waiting for an event from Linux. Each horizontal bar is a fiber (a light-weight thread). From top to bottom they are:
- Three rows for the test client:
  - The main application fiber performing the RPC call (mostly awaiting responses).
  - The network's write fiber, sending outgoing messages (mostly waiting for something to send).
  - The network's read fiber, reading incoming messages (mostly waiting for something to read).
- Four rows for the server:
  - A loop accepting new incoming TCP connections.
  - A short-lived fiber that accepts the new connection, then short-lived fibers each handling one request.
  - The server's network write fiber.
  - The server's network read fiber.
- One fiber owned by Eio itself (used to wake up the event loop in some situations).
This trace immediately raises a couple of questions:
- Why is there a 40ms delay in each iteration of the test loop?
- Why does the program briefly wake up in the middle of the first delay, do nothing, and return to sleep? (notice the extra "suspend-domain" at the top)
Zooming in on a section between the delays, let's see what it's doing when it's not sleeping:
After a 40ms delay, the server's read fiber receives the next request (the running fiber is shown in green). The read fiber spawns a fiber to handle the request, which finishes quickly, starts the next read, and then the write fiber transmits the reply.
The client's read fiber gets the reply, the write fiber outputs a message, then the application fiber runs and another message is sent. The server reads something (presumably the first message, though it happens after the client had sent both), then there is another long 40ms delay, then (far off the right of the image) the pattern repeats.
To get more context in the trace, I configured the logging library to write the (existing) debug-level log messages to the trace buffer too:
Log messages tend to be a bit long for the trace display, so they overlap and you have to zoom right in to read them, but they do help navigate. With this, I can see that the first client write is "Send finish" and the second is "Calling Echo.ping".
Looks like we're not buffering the output, so it's doing two separate writes rather than combining them. That's a little inefficient, and if you've done much network programming you probably already know why it might cause a 40ms delay, but let's pretend we don't, so we can play with a few more tools...
strace
strace can be used to trace interactions between applications and the Linux kernel
(-tt -T
shows when each call was started and how long it took):
$ strace -tt -T ./echo_bench.exe
...
11:38:58.079200 write(2, "echo_bench.exe: [INFO] Accepting"..., 73) = 73 <0.000008>
11:38:58.079253 io_uring_enter(4, 4, 0, 0, NULL, 8) = 4 <0.000032>
11:38:58.079341 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000020>
11:38:58.079408 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000021>
11:38:58.079471 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000018>
11:38:58.079525 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000019>
11:38:58.079580 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000013>
11:38:58.079611 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000009>
11:38:58.079637 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffc1661a480, 24) = -1 ETIME (Timer expired) <0.018913>
11:38:58.098669 futex(0x5584542b767c, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000105>
11:38:58.098889 futex(0x5584542b7690, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000059>
11:38:58.098976 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.021355>
On Linux, Eio defaults to using the io_uring mechanism for submitting work to the kernel.
io_uring_enter(4, 2, 0, 0, NULL, 8) = 2
means we asked to submit 2 new operations to the ring on FD 4,
and the kernel accepted them.
The call at 11:38:58.079637
timed out after 19ms.
It then woke up some futexes and then waited again, getting woken up after a further 21ms (for a total of 40ms).
Futexes are used to coordinate between system threads.
strace -f
will follow all spawned threads (and processes), not just the main one:
$ strace -T -f ./echo_bench.exe
...
[pid 48451] newfstatat(AT_FDCWD, "/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=40, ...}, 0) = 0 <0.000011>
...
[pid 48451] futex(0x561def43296c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
...
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS|IORING_ENTER_EXT_ARG, 0x7ffe1d5d1c90, 24) = -1 ETIME (Timer expired) <0.018899>
[pid 48449] futex(0x561def43296c, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000106>
[pid 48451] <... futex resumed>) = 0 <0.019981>
[pid 48449] io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 <unfinished ...>
...
[pid 48451] exit(0) = ?
[pid 48451] +++ exited with 0 +++
[pid 48449] <... io_uring_enter resumed>) = 0 <0.021205>
...
The benchmark connects to "127.0.0.1"
and Eio uses getaddrinfo
to look up addresses (we can't use uring for this).
Since getaddrinfo
can block for a long time, Eio creates a new system thread (pid 48451) to handle it
(we can guess this thread is doing name resolution because we see it read resolv.conf
).
As creating system threads is a little slow, Eio keeps the thread around for a bit after it finishes in case it's needed again. The timeout is when Eio decides that the thread isn't needed any longer and asks it to exit. So this isn't relevant to our problem (and only happens on the first 40ms delay, since we don't look up any further addresses).
However, strace doesn't tell us what the uring operations were, or their return values.
One option is to switch to the posix
backend (the default on other Unix systems).
In fact, with any performance problem it's a good idea to check whether it still happens with a different backend:
$ EIO_BACKEND=posix strace -T -tt ./echo_bench.exe
...
11:53:52.935976 writev(7, [{iov_base="\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0"..., iov_len=40}], 1) = 40 <0.000170>
11:53:52.936308 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, {tv_sec=0, tv_nsec=0}, NULL, 8) = 1 ([{fd=8, revents=POLLIN}], left {tv_sec=0, tv_nsec=0}) <0.000044>
11:53:52.936500 writev(7, [{iov_base="\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0"..., iov_len=136}], 1) = 136 <0.000055>
11:53:52.936831 readv(8, [{iov_base="\0\0\0\0\4\0\0\0\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0"..., iov_len=4096}], 1) = 40 <0.000056>
11:53:52.937516 ppoll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=4, events=POLLIN}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 9, NULL, NULL, 8) = 1 ([{fd=8, revents=POLLIN}]) <0.038972>
11:53:52.977751 readv(8, [{iov_base="\0\0\0\0\20\0\0\0\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0"..., iov_len=4096}], 1) = 136 <0.000398>
(to reduce clutter, I removed calls that returned EAGAIN
and ppoll
calls that returned 0 ready descriptors)
The problem still occurs, and now we can see the two writes:
- The client writes 40 bytes to its end of the socket (FD 7), after which the server's end (FD 8) is ready for reading (revents=POLLIN).
- The client then writes another 136 bytes.
- The server reads 40 bytes and then uses ppoll to await further data.
- After 39ms, ppoll says FD 8 is now ready, and the server reads the other 136 bytes.
bpftrace
Alternatively, we can trace uring operations using bpftrace. bpftrace is a little scripting language similar to awk, except that instead of editing a stream of characters, it live-patches the running Linux kernel. Apparently this is safe to run in production (and I haven't managed to crash my kernel with it yet).
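For example, this illustrative one-liner (not needed for the investigation, but it shows the awk-like probe / filter / action structure) counts the writev system calls made by the benchmark, printing the total when you press Ctrl-C:
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_writev /comm == "echo_bench.exe"/ { @writev_calls = count(); }'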
Here is a list of uring tracepoints we can probe:
$ sudo bpftrace -l 'tracepoint:io_uring:*'
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_cqe_overflow
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_defer
tracepoint:io_uring:io_uring_fail_link
tracepoint:io_uring:io_uring_file_get
tracepoint:io_uring:io_uring_link
tracepoint:io_uring:io_uring_local_work_run
tracepoint:io_uring:io_uring_poll_arm
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_req_failed
tracepoint:io_uring:io_uring_short_write
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_task_add
tracepoint:io_uring:io_uring_task_work_run
io_uring_complete
looks promising:
$ sudo bpftrace -vl tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_complete
void * ctx
void * req
u64 user_data
int res
unsigned cflags
u64 extra1
u64 extra2
Here's a script to print out the time, process, operation name and result for each completion:
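A minimal sketch of such a script (the tracepoint field names can vary between kernel versions; check them with bpftrace -vl as above):

// uringtrace.bt (a sketch): print elapsed time, process, operation name
// and result for each uring completion.
BEGIN {
  @start = nsecs;
  // A few IORING_OP_* values, from include/uapi/linux/io_uring.h
  @name[1] = "READV";
  @name[5] = "WRITE_FIXED";
}

tracepoint:io_uring:io_uring_submit_req {
  // Remember each request's opcode so we can name it when it completes
  @op[args->req] = args->opcode;
}

tracepoint:io_uring:io_uring_complete {
  $opcode = @op[args->req];
  printf("%dms: %s: %s %d\n",
         (nsecs - @start) / 1000000, comm, @name[$opcode], args->res);
  delete(@op[args->req]);
}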
$ sudo bpftrace uringtrace.bt
Attaching 3 probes...
...
1743ms: echo_bench.exe: WRITE_FIXED 40
1743ms: echo_bench.exe: READV 40
1743ms: echo_bench.exe: WRITE_FIXED 136
1783ms: echo_bench.exe: READV 136
In this output, the order is slightly different: we see the server's read get the 40 bytes before the client sends the rest, but we still see the 40ms delay between the completion of the second write and the corresponding read. The change in order is because we're seeing when the kernel knew the read was complete, not when the application found out about it.
tcpdump
An obvious step with any networking problem is to look at the packets going over the network. tcpdump can be used to capture packets, which can be displayed on the console or in a GUI with wireshark.
$ sudo tcpdump -n -ttttt -i lo
...
...041330 IP ...37640 > ...7000: Flags [P.], ..., length 40
...081975 IP ...7000 > ...37640: Flags [.], ..., length 0
...082005 IP ...37640 > ...7000: Flags [P.], ..., length 136
...082071 IP ...7000 > ...37640: Flags [.], ..., length 0
Here we see the client (on port 37640) sending 40 bytes to the server (port 7000), and the server replying with an ACK (with no payload) 40ms later. After getting the ACK, the client socket sends the remaining 136 bytes.
Here we can see that while the application made the two writes in quick succession, TCP waited before sending the second one. Searching for "delayed ack 40ms" will turn up an explanation.
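A common way to avoid this interaction is to disable Nagle's algorithm with the TCP_NODELAY socket option (the eventual fix below is buffering instead, but as a sketch of the option, using OCaml's Unix module):

(* Sketch: disable Nagle's algorithm on a connected TCP socket.
   [sock] is assumed to be the Unix.file_descr for the connection. *)
let disable_nagle sock =
  Unix.setsockopt sock Unix.TCP_NODELAY true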
ss
ss displays socket statistics.
ss -tin
shows all TCP sockets (-t
) with internals (-i
):
$ ss -tin 'sport = 7000 or dport = 7000'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.1:7000 127.0.0.1:56224
ato:40 lastsnd:34 lastrcv:34 lastack:34
ESTAB 0 176 127.0.0.1:56224 127.0.0.1:7000
ato:40 lastsnd:34 lastrcv:34 lastack:34 unacked:1 notsent:136
There's a lot of output here; I've removed the irrelevant bits.
ato:40
says there's a 40ms timeout for "delay ack mode".
lastsnd
, etc, say that nothing had happened for 34ms when this information was collected.
unacked and notsent aren't documented in the man-page, but I guess they mean that the client (now port 56224) is waiting for 1 packet to be ACK'd and has 136 bytes waiting until then.
The client socket still has both messages (176 bytes total) in its queue; it can't forget about the first message until the server confirms receiving it, since the client might need to send it again if it got lost.
This doesn't quite lead us to the solution, though.
offwaketime
offwaketime records why a program stopped using the CPU, and what caused it to resume:
$ sudo offwaketime-bpfcc -f -p (pgrep echo_bench.exe) > wakes
$ flamegraph.pl --colors=chain wakes > wakes.svg
offwaketime
records a stack-trace when a process is suspended (shown at the bottom and going up)
and pairs it with the stack-trace of the thread that caused it to be resumed (shown above it and going down).
The taller column on the right shows Eio being woken up due to TCP data being received from the network, confirming that it was the TCP ACK that got things going again.
The shorter column on the left was unexpected, and the [UNKNOWN]
in the stack is annoying
(probably C code compiled without frame pointers).
gdb
gets a better stack trace.
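For example, attaching and dumping every thread's backtrace:
$ gdb -p (pgrep echo_bench.exe)
(gdb) thread apply all bt
(gdb) detach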
It turned out to be OCaml's tick thread, which wakes every 50ms to prevent one sys-thread from hogging the CPU:
$ strace -T -e pselect6 -p (pgrep echo_bench.exe) -f
strace: Process 20162 attached with 2 threads
...
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) <0.050441>
[pid 20173] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout) <0.050318>
Having multiple threads shown on the same diagram is a bit confusing.
I should probably have used -t
to focus only on the main one.
Also, note that when using profiling tools that record the OCaml stack, it's useful to compile with frame pointers enabled. To install e.g. OCaml 5.2.0 with frame pointers enabled, use:
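(something like the following; the exact opam package names for the frame-pointer option may differ)
$ opam switch create 5.2.0-fp ocaml-variants.5.2.0+options ocaml-option-fp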
magic-trace
magic-trace allows capturing a short trace of everything the CPUs were doing just before some event. It uses Intel Processor Trace to have the CPU record all control flow changes (calls, branches, etc) to a ring-buffer, with fairly low overhead (2% to 10%, due to extra memory bandwidth needed). When something interesting happens, we save the buffer and use it to reconstruct the recent history.
Normally we'd need to set up a trigger to grab the buffer at the right moment, but since this program is mostly idle it doesn't record much and I just attached at a random point and immediately pressed Ctrl-C to grab a snapshot and detach:
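(an invocation along these lines does the job; check magic-trace attach's help text for the exact flags)
$ magic-trace attach -pid (pgrep echo_bench.exe)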
As before, we see 40ms periods of waiting, with bursts of activity between them:
The output is a bit messed up because magic-trace doesn't understand that there are multiple OCaml fibers here, each with their own stack. It also doesn't seem to know that exceptions unwind the stack.
In each 40ms column, Eio_posix.Flow.single_read
(3rd line from top) tried to do a read
with readv
, which got EAGAIN
and called Sched.next
to switch to the next fiber.
Since there was nothing left to run, the Eio scheduler called ppoll
.
Linux didn't have anything ready for this process,
and called the schedule
kernel function to switch to another process.
I recorded an eio-trace at the same time, to see the bigger picture. Here's the eio-trace zoomed in to show the two client writes (just before the 40ms wait), with the relevant bits of the magic-trace stack pasted below them:
We can see the OCaml code calling writev
, entering the kernel, tcp_write_xmit
being called to handle it,
writing the IP packet to the network and then, because this is the loopback interface, the network receive logic
handling the packet too.
The second call is much shorter; tcp_write_xmit
returns quickly without sending anything.
Note: I used the eio_posix
backend here so it's easier to correlate the kernel operations to the application calls
(uring queues them up and runs them later).
The uring-trace project should make this easier in future, but doesn't integrate with eio-trace yet.
Zooming in further, it's easy to see the difference between the two calls to tcp_write_xmit
:
Looking at the source for tcp_write_xmit
,
we finally find the magic word "nagle"!
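The check in question (paraphrasing net/ipv4/tcp_output.c; the exact code differs between kernel versions) is where tcp_write_xmit decides whether a segment may be sent now:

if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
                             (tcp_skb_is_last(sk, skb) ?
                              nonagle : TCP_NAGLE_PUSH))))
        break;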
Summary script
Having identified a load of interesting events, I wrote summary-posix.bt, a bpftrace script to summarise them.
This includes log messages written by the application (by tracing write
calls to stderr),
reads and writes on the sockets,
and various probed kernel functions seen in the magic-trace output and when reading the kernel source.
The output is specialised to this application (for example, TCP segments sent to port 7000 are displayed as "to server", while others are "to client"). I think this is a useful way to double-check my understanding, and any fix:
$ sudo bpftrace summary-posix.bt
[...]
844ms: server: got ping request; sending reply
844ms: server reads from socket (EAGAIN)
844ms: server: writev(96 bytes)
844ms: tcp_write_xmit (to client, nagle-on, packets_out=0)
844ms: tcp_v4_send_check: sending 96 bytes to client
844ms: tcp_v4_rcv: got 96 bytes
844ms: timer_start (tcp_delack_timer, 40 ms)
844ms: client reads 96 bytes from socket
844ms: client: enqueue finish message
844ms: client: enqueue ping call
844ms: client reads from socket (EAGAIN)
844ms: client: writev(40 bytes)
844ms: tcp_write_xmit (to server, nagle-on, packets_out=0)
844ms: tcp_v4_send_check: sending 40 bytes to server
845ms: tcp_v4_rcv: got 40 bytes
845ms: timer_start (tcp_delack_timer, 40 ms)
845ms: client: writev(136 bytes)
845ms: tcp_write_xmit (to server, nagle-on, packets_out=1)
845ms: server reads 40 bytes from socket
845ms: server reads from socket (EAGAIN)
885ms: tcp_delack_timer_handler (ACK to client)
885ms: tcp_v4_send_check: sending 0 bytes to client
885ms: tcp_delack_timer_handler (ACK to server)
885ms: tcp_v4_rcv: got 0 bytes
885ms: tcp_write_xmit (to server, nagle-on, packets_out=0)
885ms: tcp_v4_send_check: sending 136 bytes to server
- The server replies to a ping request, sending a 96 byte reply. Nagle is on, but nothing is awaiting an ACK (packets_out=0) so it gets sent immediately.
- The client receives the data. It starts a 40ms timer to send an ACK for it.
- The client enqueues a "finish" message, followed by another "ping" request.
- The client's write fiber sends the 40 byte "finish" message. Nothing is awaiting an ACK (packets_out=0) so the kernel sends it immediately.
- The client sends the 136 byte ping request. As the last message hasn't been ACK'd, it isn't sent yet.
- The server receives the 40 byte finish message.
- 40ms pass. The server's delayed ACK timer fires and it sends the ACK to the client.
- The client's delayed ACK timer fires, but there's nothing to do (it sent the ACK with the "finish").
- The client socket gets the ACK for its "finish" message and sends the delayed ping request.
Fixing it
The problem seemed clear: while porting from Lwt to Eio I'd lost the output buffering. So I looked at the Lwt code to see how it did it and... it doesn't! So how was it working?
As I did with Eio, I set the Lwt benchmark's concurrency to 1 to simplify it for tracing, and discovered that Lwt with 1 client thread has exactly the same problem as the Eio version. Well, that's embarrassing! But why is Lwt fast with 12 client threads?
With only minor changes (e.g. write
vs writev
), the summary script above also worked for tracing the Lwt version.
With 1 or 2 client threads, Lwt is slow, but with 3 it's fairly fast.
The delay only happens if the client sends a "finish" message when the server has no replies queued up
(otherwise the finish message unblocks the replies, which carry the ACK to the client immediately).
So, it works mostly by fluke!
Lwt just happens to schedule the threads in such a way that Nagle's algorithm mostly doesn't trigger with 12 concurrent requests.
Anyway, adding buffering to the Eio version fixed the problem:
An interesting thing to notice here is that not only did the long delay go away,
but the CPU operations while it was active were faster too!
I think the reason is that the CPU goes into power-saving mode during the long delays.
cpupower monitor
shows my CPUs running at around 1 GHz with the old code and
around 4.7 GHz when running the new version.
Here are the results for the fixed version:
$ ./echo_bench.exe
echo_bench.exe: [INFO] rate = 44425.962625 # The old Lwt version
echo_bench.exe: [INFO] rate = 59653.451934 # The fixed Eio version
60k RPC requests per second doesn't seem that impressive, but at least it's faster than the old version, which is good enough for now! There's clearly scope for improvement here (for example, the buffering I added is quite inefficient, making two extra copies of every message, as the framing library copies it from a cstruct to a string, and then I have to copy the string back to a cstruct for the kernel).
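For the record, the general shape of the fix is just to batch up the small writes so they reach the kernel together. A minimal sketch of the idea using Eio.Buf_write (not the actual capnp-rpc patch) looks like this:

(* Sketch: coalesce small messages using Eio's buffered writer.
   [flow] is any Eio sink, e.g. the client's TCP connection. *)
let send_all flow messages =
  Eio.Buf_write.with_flow flow (fun w ->
    List.iter (fun msg -> Eio.Buf_write.string w msg) messages)
(* Buf_write queues the bytes and flushes them together, so the "finish"
   and the following "ping" can leave in a single write instead of
   tripping over Nagle's algorithm. *)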
Conclusions
There are lots of great tools available to help understand why something is running slowly (or misbehaving), and since programmers usually don't have much time for profiling, a little investigation will often turn up something interesting! Even when things are working correctly, these tools are a good way to learn more about how things work.
time
will quickly tell you if the program is taking lots of time in application code, in the kernel, or just sleeping.
If the problem is sleeping, offcputime
and offwaketime
can tell you why it was waiting and what woke it in the end.
My own eio-trace
tool will give a quick visual overview of what an Eio application is doing.
strace
is great for tracing interactions between applications and the kernel,
but it doesn't help much when the application is using uring.
To fix that, you can either switch to the eio_posix
backend or use bpftrace
with the uring tracepoints.
tcpdump
, wireshark
and ss
are all useful to examine network problems specifically.
I've found bpftrace
to be really useful for all kinds of tasks.
Being able to write quick one-liners or short scripts gives it great flexibility.
Since the scripts run in the kernel you can also filter and aggregate data efficiently
without having to pass it all to userspace, and you can examine any kernel data structures.
We didn't need that here because the program was running so slowly, but it's great for many problems.
In addition to using well-defined tracepoints,
it can also probe any (non-inlined) function in the kernel or the application.
I also think using it to create a "summary script" to confirm a problem and its solution seems useful,
though this is the first time I've tried doing that.
magic-trace
is great for getting really detailed function-by-function tracing through the application and kernel.
Its ability to report the last few ms of activity after you notice a problem is extremely useful
(though not needed in this example).
It would be really useful if you could trigger magic-trace from a bpftrace script, but I didn't see a way to do that.
However, it was surprisingly difficult to get any of the tools to point directly at the combination of Nagle's algorithm with delayed ACKs as the cause of this common problem!
This post was mainly focused on what was happening in the kernel. In part 2, I'll investigate a CPU-intensive problem instead.