We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
When you first join an already existing project as a developer, you will often find yourself asking “What is this component doing?”, “How does this algorithm work?” or “What is the point of this function/class?”. Before you start digging into a complicated nest of functions and function calls, allow us to introduce an easier way to get a glimpse of what your program is doing and where it is spending time: speedscope.app.
Speedscope.app is a JavaScript application that runs in your browser and visualizes the behavior of your program. It can also be installed locally. It supports many programming languages: JavaScript, Ruby, Python, Go, Rust, C/C++, and .NET languages. Here we give an overview of its features; for information on how to collect data for the visualization, we refer you to the speedscope website.
The basic idea behind speedscope
The idea behind speedscope is simple: record the execution of your program using a profiler1 and then use speedscope to visualize what it was doing. The visualizations are very nice; here is an example of one:

This graph visualizes the conversion of a .mp4 file to a .ts file in ffmpeg. For those who are unfamiliar with ffmpeg, it is a command-line tool used to convert between video formats. FFmpeg is written in C, and we used Linux’s perf profiler to collect the data for speedscope. The above screenshot is taken from the full example here.
The image displays call stacks taken between 242 ms and 258 ms since the program started executing. We can observe, for instance, that at time 244 ms from the program start the call stack looked like this: main() -> transcode() -> reap_filters() -> do_video_out() -> avcodec_send_frame_() -> ...
In this example the call stacks are arranged in the order they were collected. Depending on how we arrange the call stacks, we can get different information. So what kind of information is available?
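To make the difference between arrangements concrete, here is a small, language-neutral sketch in Python. This is not how speedscope is implemented; the sample stacks and function names are made up, loosely echoing the ffmpeg example. In Time Order the samples simply stay in collection order, while an aggregated view merges identical stacks and sorts them by frequency.

```python
from collections import Counter

# Hypothetical stack samples, one per profiler tick, root -> leaf.
samples = [
    ("main", "transcode", "reap_filters"),
    ("main", "transcode", "process_input_packet"),
    ("main", "transcode", "reap_filters"),
    ("main", "transcode", "reap_filters"),
]

# Time Order: samples are kept exactly in collection order.
time_order = samples

# Aggregated (Left Heavy-like): identical stacks are merged
# and sorted by how often they were observed.
left_heavy = Counter(samples).most_common()

print(left_heavy[0])  # the most frequently observed stack, with its count
```

Time Order preserves the story of the run; the aggregated arrangement throws the story away in exchange for a clear picture of where the samples concentrate.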
Time Order: What my program is doing in time
In the upper left corner there are three buttons; one of them is Time Order. This option arranges call stacks in time, as they were recorded. The main benefit of this option is that you can see what your program was doing as it was running: which functions were running at the beginning when the program started, which were running in the middle when the program was doing useful work, and which were running before the program terminated.

While navigating through the chart, if you zoom out completely to look at the full program execution, these charts look messy (as in the image above). The devil is in the details here: you need to explore, zoom in, and scroll left and right to find the repeating patterns. It requires a bit of manual investigation, but it is much easier than reading the code or the textual profiler output.
In the example of ffmpeg, the program spends most of its running time in the transcode function, but most of the work is not done there; it is done in other functions that transcode calls. If we zoom in and do a bit of left and right scrolling, we can see the pattern:

In this image, where we zoomed in on the time between 630 ms and 670 ms, we see that the program is calling the same two functions in a loop: reap_filters and process_input_packet. From the names, we can guess that one function produces the data for the other.
Click here to visit the example and play around with it. IMPORTANT NOTICE: Initially, when you open the example, you will see a graph where all call stacks are marked with ??? ([unknown]). This is because you are viewing the profile of the perf thread, the profiler thread that was running to record ffmpeg. To display the example, you need to select the ffmpeg thread: move your mouse to the upper-middle part of speedscope, and a list will open where you can select the ffmpeg_g thread.
Left Heavy: Where is my program spending time
The Left Heavy view sorts the call stacks by how frequently they appear in the recording. This view lets you see which functions are taking the most time. The functionality is very similar to the flame graphs we already talked about, but I think speedscope is easier to use and has more features.
In our ffmpeg example, the Left Heavy view looks like this:

The call stacks are now sorted by frequency, not by time. By looking at this graph, we see that the function transcode is doing the bulk of the work. To do it, it calls two functions, reap_filters and process_input_packet. If we wanted to speed up the conversion, the most obvious candidates would be the functions called by encode_thread, estimate_motion_thread or decode_slice.
Sandwich: Execution Statistics
The Sandwich view is similar to the Left Heavy view in that it tells you which functions are taking the most time, but it uses a table instead of a chart. Here is a screenshot of the Sandwich mode for our ffmpeg example.

The table lists all functions with their execution times. Note that it has two columns, Total and Self. The Total column tells you how much time your program spent in the function and its children. The Self column tells you how much time your program spent in the function itself, excluding the time spent in its children. In the above image, we can see that diff_pixels_c spent 140.88 ms in itself. Since its total is also 140.88 ms, the function didn’t call other functions. On the other hand, the function encode_thread executed for a total of 475.08 ms, but spent only 47.13 ms doing its own work; the rest of the time was spent in its children.
If you are focusing on performance improvement, you want to dig into functions that have a high Self value first.
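The Total/Self relationship can be expressed in a few lines of Python. This is a hedged sketch of how a sampling profiler derives the two columns from stack samples, not speedscope’s actual implementation; the sample stacks mirror the example above, and the counts stand in for time.

```python
from collections import Counter

def total_and_self(samples):
    """Compute Total/Self sample counts per function.

    Total: samples in which the function appears anywhere on the stack.
    Self:  samples in which the function is the leaf (currently running).
    set() makes each function count at most once per sample for Total,
    even under recursion.
    """
    total = Counter()
    self_ = Counter()
    for stack in samples:          # each stack is root -> leaf
        for fn in set(stack):
            total[fn] += 1
        self_[stack[-1]] += 1      # the leaf is doing its own work
    return total, self_

samples = [
    ("main", "encode_thread", "diff_pixels_c"),
    ("main", "encode_thread", "diff_pixels_c"),
    ("main", "encode_thread"),
    ("main", "reap_filters"),
]
total, self_ = total_and_self(samples)
print(total["encode_thread"], self_["encode_thread"])   # 3 1
print(total["diff_pixels_c"], self_["diff_pixels_c"])   # 2 2
```

Note that diff_pixels_c has Self equal to Total, just like in the screenshot: it never appears with anything below it on the stack.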
Practical Considerations About Speedscope
Multithreaded programs
Profiling and understanding multithreaded programs is more difficult than single-threaded programs2. By just looking at the graph speedscope produces, we can see which functions are taking the most time. But for multithreaded programs that is not enough. What is missing is inter-thread dependencies: a state where one thread is blocked, waiting for another thread to complete.
Profiler and data collection
We used the perf profiler on Linux to collect the data for ffmpeg and display it in speedscope. The process is fairly simple and is documented in speedscope’s manual. But a few words need to be said about data collection, because the quality of the collected data determines the quality of the graph.
First, we collected the profile information using the following commands:
$ perf record --call-graph lbr -F 4000 ./program_to_execute arg1 arg2
$ perf script -i perf.data | speedscope -
Notice the two command-line switches, --call-graph and -F. The --call-graph switch tells perf how to collect call stacks during the program’s execution. There are basically three ways:
- fp – FP is short for frame pointer. This is a low-overhead way of collecting profile samples. perf uses the frame pointer register to reconstruct the function call stack, but for this to work, your program needs to be compiled with -fno-omit-frame-pointer3. This compiler switch is not enabled by default, but if you are compiling for x86-64, I highly encourage you to enable it in all your builds. Inlined functions will not be displayed in the profile. There is no limit on the size of the call stack.
- dwarf – This type of sample collection is much more computationally intensive. The profiler copies a chunk of the stack with each sample (each stack sample is 8 kB large). This can lead to a huge profiler output file that takes a lot of time to analyze. Also, because only 8 kB of the stack is collected per sample, if the function stack is larger than that, parts of the call stack will be missing4. Yet the information collected this way is the most precise, since it comes from the debugging information present in the binary. To use it, the program should be compiled with debug symbols (the -g option in GCC and Clang).
- lbr – LBR is short for last branch records, a special set of registers that perf can use to reconstruct the function call stack without debugging info and without frame pointers. It is fast, but it only works on recent Intel processors. It has a limit on stack depth, but this didn’t create problems in our testing.
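To illustrate what fp unwinding does, here is a toy Python model of walking a frame-pointer chain. Real unwinding reads machine registers and process memory; here a dictionary stands in for stack memory, and the addresses, layout, and function names are entirely made up for illustration. Each “frame” holds the two words a compiled function keeps when frame pointers are preserved: the caller’s saved frame pointer and the return address.

```python
# Toy stack "memory": fp_addr -> (saved_caller_fp, symbol of the caller).
# A saved fp of 0 terminates the chain (outermost frame).
memory = {
    0x7f00: (0x0,    None),         # main's frame, end of chain
    0x7e00: (0x7f00, "main"),       # transcode, called from main
    0x7d00: (0x7e00, "transcode"),  # reap_filters, called from transcode
}

def unwind(fp):
    """Follow saved frame pointers to rebuild the caller chain,
    the way perf's --call-graph fp mode does."""
    stack = []
    while fp:
        saved_fp, caller = memory[fp]
        if caller is not None:
            stack.append(caller)    # record who called this frame
        fp = saved_fp               # hop to the caller's frame
    return stack                    # innermost caller first

print(unwind(0x7d00))  # ['transcode', 'main']
```

This also shows why -fno-omit-frame-pointer matters: if the compiler reuses the frame pointer register, the saved_fp links simply aren’t there to follow.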
If your program is short-running, I recommend using dwarf since it gives the most information. If it is long-running, you can use fp or lbr. If the collected information doesn’t look good, you can revert to dwarf with a lower sampling rate.
The second important option passed to perf is -F, the sampling rate: the number of samples taken per second. The default is 4000, but sometimes you will need to change it. For short-running programs, you will want to increase this number to get more precise information. For long-running programs, you will want to decrease it, since the profile database can grow huge!
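A quick back-of-the-envelope calculation shows why the rate matters. The helper below is a sketch with illustrative numbers, not measurements:

```python
def expected_samples(self_time_ms, rate_hz):
    """Expected number of samples landing in a function whose
    cumulative self time is self_time_ms, sampling at rate_hz."""
    return self_time_ms / 1000.0 * rate_hz

# At 4000 Hz, a function with only 0.5 ms of cumulative self time
# is expected to be hit ~2 times: too few samples to trust.
print(expected_samples(0.5, 4000))       # 2.0

# The same function at 20000 Hz gets ~10 samples.
print(expected_samples(0.5, 20000))      # 10.0

# Conversely, a 10-minute run at 4000 Hz produces ~2.4 million
# samples, which is why long runs want a lower rate.
print(expected_samples(600_000, 4000))   # 2400000.0
```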
Line numbers
If you have large functions, you will want to see how much time your program spends on individual lines, not only in whole functions. You can do this if you collected your runtime stacks using the dwarf option.
To achieve this, record the program as already explained. Then, instead of calling perf script alone, call it like this:
$ perf script -F +srcline > out.txt
$ perf-addlines.pl out.txt > out.perf
$ speedscope out.perf
The command perf script -F +srcline adds line numbers to the output file containing the stack traces. Unfortunately, speedscope cannot consume this raw format; you need to convert it to a format speedscope understands. For this, use the perf-addlines.pl script, which you can download here.
Final Words
My impression of speedscope.app is great and I use it all the time! If you are looking at an unfamiliar code base, it will help you quickly visualize and understand it. If you want to see what is slowing down your program, the application offers that as well. It is very lightweight and simple to use.
Of course, the quality of the information will depend on the profiler, so it is important to configure the profiler properly. Bad profiler data is more difficult to spot than missing data. If the profiler data looks illogical, or something is missing or not where you would expect it to be, I recommend increasing the sampling frequency or trying another of perf’s call stack collection methods.
- Profilers are tools that run in the background alongside your program, collecting function call stacks many times per second. Developers can use this information later to better understand what the program is doing and find possible performance problems. There are many profilers available, some [↩]
- In order to better illustrate speedscope’s capabilities, we compiled ffmpeg without multithreading support [↩]
- This is the name of this switch on Clang and GCC, but other compilers have a switch with similar semantics. [↩]
- Sometimes the stack is large because of stack-allocated data; rewriting those parts to use heap-allocated data helped us collect a good profile using this method. [↩]
> What is missing is inter-thread dependencies, a state where one thread is blocked and waiting for another thread to complete.
Hi, thanks for the post. If you like in-browser perf profile visualisers and you like speedscope, you should check out Firefox Profiler too. It runs in-browser (in any browser, not just Firefox) and you can import “perf script” format, and it has a great multi-track view to see multiple threads at a time. Enjoy!
Speedscope was good for me because you can fire it up simply from the command line, but Firefox Profiler is great too! One thing I do miss, however, is the profiler which prints statistics on the line level, not only on the function level. Perf command line can do that too, but the output is awful 🙁
On Linux, there is also hotspot. There is no good profiler for Windows though 🙁
I think Speedscope is a kind of tool like [chrome://tracing/]
Hi Johnny,
I really enjoyed reading your blog post about Speedscope. I’ve been using it for a few weeks now and it’s been a great help in identifying performance bottlenecks in my code.
I have one question: is there a way to export the data from Speedscope so that I can import it into another tool for further analysis?
The data you get as output from perf script can be used elsewhere. Do you have any specific tool in mind?
i generally find tracy-profiler: https://github.com/wolfpld/tracy to be much better than others. allows me to insert minor ‘instrumentation’ in the code, and then perform, an online + offline analysis of program execution profile. very highly recommended.
I skimmed the profiler documentation. This seems to be a more serious profiler with instrumentation intended for game development primarily. You need to add some instrumentation code, which is fine. I think these two profilers target different needs.
If you are familiar with this compiler and are interested in writing an introductory post to it, let me know. We could publish it on this blog, or you can publish it elsewhere and I would just add link to it.
Hi, How to integrate this tool for a C Based IOT Project without any RTOS. Is there a guide to follow on how to integrate this. Thanks in advance,
If you are thinking about bare metal devices, the main issue would be to produce the trace stacks. Do you know how to produce the trace stacks on your system?