We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
For a performance engineer, measuring function runtime alone is not enough. Time is the most useful statistic for finding out which function is the bottleneck, but it is mostly useless when it comes to understanding why the function is slow.
That is where hardware performance counters come into play. Hardware performance counters are special counters in the CPU that measure all sorts of things, e.g. the number of data cache misses, instruction cache misses, executed instructions, cycles, branch mispredictions, etc. Using them effectively is the key to understanding why a segment of code is slow.
Unfortunately, hardware performance counters are not easy to use. A raw counter value doesn’t tell you much unless it is paired with other counters. Also, not all CPUs support all counters. Luckily, this is where LIKWID comes into play.
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
Welcome to LIKWID
LIKWID is an open-source performance monitoring and benchmarking suite for Linux that abstracts away some of the differences between CPU manufacturers. LIKWID consists of many tools, but in this post we focus on likwid-perfctr, a tool used to read hardware performance counters. It supports many CPU types: Intel, AMD, ARM and IBM. The full list of supported CPUs is available here.
LIKWID is easy to install from the Linux repositories. If your CPU happens to be unsupported, file a support request on the LIKWID GitHub like I did here. In my case, it took the maintainers one day to add support for my CPU.
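On common distributions, installation should look something like the following (the package name likwid is an assumption based on the usual distro packaging; check your package manager if it differs):

```shell
# Debian/Ubuntu
sudo apt install likwid

# Fedora
sudo dnf install likwid
```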
After the installation, a tool called likwid-perfctr should be available on your system. This is the tool we will use to get easily readable, human-friendly information from the hardware performance counters.
If you are interested in the hardware performance counters for the whole program, the process is simple. Here is an example of the command you can use:
$ likwid-perfctr -C 0 -g MEM git status
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
CPU type: Intel Kabylake processor
CPU clock: 2.11 GHz
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+
| Event | Counter | HWThread 0 |
+-----------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 16678566 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 13957793 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 35989800 |
| DRAM_READS | MBOX0C1 | 663153 |
| DRAM_WRITES | MBOX0C2 | 122973 |
+-----------------------+---------+------------+
+-----------------------------------+------------+
| Metric | HWThread 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 0.0247 |
| Runtime unhalted [s] | 0.0066 |
| Clock [MHz] | 818.1673 |
| CPI | 0.8369 |
| Memory load bandwidth [MBytes/s] | 1721.2553 |
| Memory load data volume [GBytes] | 0.0424 |
| Memory evict bandwidth [MBytes/s] | 319.1842 |
| Memory evict data volume [GBytes] | 0.0079 |
| Memory bandwidth [MBytes/s] | 2040.4395 |
| Memory data volume [GBytes] | 0.0503 |
+-----------------------------------+------------+
We ran the git status command through likwid-perfctr and chose to read the counter group related to memory, indicated by -g MEM. For likwid-perfctr to work, the program needs to be “pinned” to a CPU core. This is done with the option -C 0, which pins git to core zero.
From the output produced by likwid-perfctr, we can see that our program was reading from memory at a rate of 2040 MB/s and that it transferred a total of 50 MB of data to and from memory.
LIKWID hardware counter groups
The counters in likwid-perfctr are sorted into performance groups. In our previous example, we read the performance group MEM. You can see the list of all performance groups with likwid-perfctr -a:
$ likwid-perfctr -a
    Group name       Description
--------------------------------------------------------------------------------
    FALSE_SHARE      False sharing
    L2               L2 cache bandwidth in MBytes/s
    L3CACHE          L3 cache miss rate/ratio
    TLB_DATA         L2 data TLB miss rate/ratio
    UOPS             UOPs execution info
    TLB_INSTR        L1 Instruction TLB miss rate/ratio
    CYCLE_ACTIVITY   Cycle Activities
    FLOPS_AVX        Packed AVX MFLOP/s
    TMA              Top down cycle allocation
    MEM              L3 cache bandwidth in MBytes/s
    FLOPS_DP         Double Precision MFLOP/s
    DATA             Load to store ratio
    ICACHE           Instruction cache miss rate/ratio
    CLOCK            Power and Energy consumption
    CYCLE_STALLS     Cycle Activities (Stalls)
    ENERGY           Power and Energy consumption
    L2CACHE          L2 cache miss rate/ratio
    FLOPS_SP         Single Precision MFLOP/s
    UOPS_EXEC        UOPs execution
    L3               L3 cache bandwidth in MBytes/s
    RECOVERY         Recovery duration
    UOPS_RETIRE      UOPs retirement
    MEM_SP           L3 cache bandwidth in MBytes/s
    DIVIDE           Divide unit information
    BRANCH           Branch prediction miss rate/ratio
    MEM_DP           Overview of arithmetic and main memory performance
    UOPS_ISSUE       UOPs issueing
The names of the groups are self-explanatory. The available groups differ between CPU models, and as far as I can tell, Intel is much better supported than AMD.
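To see exactly which events and derived metrics a group contains on your machine, likwid-perfctr can print the group’s help text with the -H switch:

```shell
# Print the definition of the MEM group: its events, metrics and formulas
likwid-perfctr -H -g MEM
```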
Marker API
The information collected as described above is useful, but you can get similar information with perf stat or Intel’s VTune. However, one distinguishing feature of LIKWID is its Marker API, which allows you to make the same measurements on a segment of code instead of the whole program.
Here is the short example used to demonstrate the Marker API:
#define LIKWID_PERFMON
#include <likwid.h>

float sum(std::vector<float>& arr, int repeat_count) {
    float result = 0.0;
    for (int k = 0; k < repeat_count; k++) {
        LIKWID_MARKER_START("Compute");
        for (int i = 0; i < arr.size(); i++) {
            result += arr[i];
        }
        LIKWID_MARKER_STOP("Compute");
    }
    return result;
}

int main(int argc, char** argv) {
    ...
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_THREADINIT;

    double res = sum(test_array, 16);

    ...
    LIKWID_MARKER_CLOSE;
}
We need to include the header likwid.h (line 2), but before including it, we need to define a macro called LIKWID_PERFMON. Without it, calls to LIKWID functions resolve to empty macros. We can define the macro either as in this example (line 1), or by passing the option -DLIKWID_PERFMON to the compiler. LIKWID will only work if this macro is defined.
Next, we need to initialize LIKWID in the main function (lines 18-19). When the program exits, we need to clean up the resources (line 24).
We surround the region of code we want to measure with the LIKWID_MARKER_START and LIKWID_MARKER_STOP markers (lines 7 and 11). Every time our program reaches LIKWID_MARKER_START, it records the state of the hardware counters; when it reaches LIKWID_MARKER_STOP, it records the state again. The difference between the start and stop values is what interests us.
You need to provide the region name as a parameter to both LIKWID_MARKER_START and LIKWID_MARKER_STOP. The name must not contain spaces, and it is later used in the report. In our case, the name is Compute.
When you link your program, you need to pass -llikwid to the linker to resolve the LIKWID symbols.
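Putting the flags together, building the example might look like this (the file names here are hypothetical; use your own):

```shell
# -DLIKWID_PERFMON enables the markers, -llikwid links the LIKWID library
g++ -O2 -DLIKWID_PERFMON likwid-example.cpp -o likwid-example -llikwid
```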
Running your program with the Marker API
Running your program with the Marker API is very similar to running it without it. The only difference is that you need to pass the -m switch on the command line to collect the data. Here is the command line and the output of our program:
$ likwid-perfctr -m -C 0 -g MEM ./likwid-example
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
CPU type: Intel Kabylake processor
CPU clock: 2.11 GHz
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
Region Compute, Group 1: MEM
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.563254 |
| call count | 16 |
+-------------------+------------+
+-----------------------+---------+------------+
| Event | Counter | HWThread 0 |
+-----------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 1275133000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 1460581000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 1173268000 |
| DRAM_READS | MBOX0C1 | 141401800 |
| DRAM_WRITES | MBOX0C2 | 1036858 |
+-----------------------+---------+------------+
+-----------------------------------+------------+
| Metric | HWThread 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 0.5633 |
| Runtime unhalted [s] | 0.6914 |
| Clock [MHz] | 2629.6957 |
| CPI | 1.1454 |
| Memory load bandwidth [MBytes/s] | 16066.8573 |
| Memory load data volume [GBytes] | 9.0497 |
| Memory evict bandwidth [MBytes/s] | 117.8136 |
| Memory evict data volume [GBytes] | 0.0664 |
| Memory bandwidth [MBytes/s] | 16184.6708 |
| Memory data volume [GBytes] | 9.1161 |
+-----------------------------------+------------+
On line 9 we see that the name of the region is Compute, as given in LIKWID_MARKER_START. We can also see the total runtime of our region, RDTSC Runtime [s]: 0.563254, and the number of invocations, call count: 16. The number of invocations corresponds to the number of times we entered the region.
As far as the metrics are concerned, our program transferred about 9.1 GB of data from memory at a speed of 16.1 GB/s. Let’s check if the data match: the size of our test array is 128 million elements, and each element is 4 bytes in size. This means we transfer 512 MB of data in each compute session. With 16 iterations, we expect to transfer about 8 GB in total; we measured 9.1 GB, which is reasonably close.
A word of caution
There are a few things that I think are important to know when using LIKWID.
LIKWID has a certain overhead which is typically small, but it can grow if you enter and exit regions many times. In that case, the accumulated overhead can become large enough to skew the measurement results, so don’t use LIKWID to measure very short sequences of code.
The second warning is not related to LIKWID but to hardware performance counters in general. The counters vary a lot between different CPU types, and they can sometimes show misleading numbers, so take all information produced by LIKWID with a grain of salt. My recommendation is not to rely on absolute values, like we did here when we measured memory data volume. Instead, measure the original version, make a small modification, measure the modified version, and compare the results.
Final Words
LIKWID, through its tool likwid-perfctr, is a really nice and simple way to get information from the hardware performance counters. In this post we covered only the most basic scenario, which should nevertheless help you get started quickly. We didn’t cover measurements on multithreaded applications; for this and other advanced topics, we refer you to the LIKWID documentation.
LIKWID comes with other tools as well. You can find a very comprehensive introduction to LIKWID in a blog post written by Pramod Kumbhar. LIKWID documentation also contains much useful information.