Repository Summary
Description | |
Checkout URI | https://github.com/nobleo/ros2_performance.git |
VCS Type | git |
VCS Version | master |
Last Updated | 2023-01-27 |
Dev Status | UNKNOWN |
Released | UNRELEASED |
Tags | No category tags. |
Packages
Name | Version |
---|---|
ros2_performance | 1.0.0 |
README
ros2_performance
Package goal
Highlight the CPU overhead introduced by using ROS 2 versus Fast RTPS directly. Provide data from profiling tools to pinpoint where most of the CPU overhead comes from. Possibly start a discussion on the ROS 2 middleware design decisions.
Package summary
Tests were performed on the Dashing Diademata release of ROS 2. Results may vary for different releases. Future releases may have resolved the issues discussed below.
This package provides several publisher/subscriber setups that generate the same traffic. ros, rosonenode and rtps each create 20 topics, where each topic has 1 publisher and 10 subscribers. noros and nopub are included for comparison, to give an initial idea of the overhead generated by different aspects of the ROS 2 stack.
Binary | publishers | subscribers | ROS | ROS nodes | ROS timers | DDS participants |
---|---|---|---|---|---|---|
ros | 20 | 200 | yes | 10 | 10 | 10 |
rosonenode | 20 | 200 | yes | 1 | 1 | 1 |
nopub | 0 | 0 | yes | 10 | 10 | 10 |
rtps | 20 | 200 | no | 0 | 0 | 1 |
noros | 20* | 200* | no | 0 | 0 | 0 |
*C++ implementation without network publishing/subscribing.
Running all examples in isolated docker containers gives the following result:
Recreating the issue using Docker
It is possible to git clone this repository, build the workspace using colcon build and inspect the CPU usage of each binary individually with top or a similar program. It is, however, much easier to give each binary its own container (make sure to separate their networks or give each one a unique ROS_DOMAIN_ID) and measure the usage of each container. If you don’t have Docker and Docker Compose installed, first follow the installation guides: https://docs.docker.com/install/ https://docs.docker.com/compose/install/ .
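For the non-Docker route, the unique ROS_DOMAIN_ID requirement could be handled with a small helper like the one below. This is only a sketch: the function name `list_domain_ids` and the ID numbering are illustrative choices, and it assumes that only the binaries marked "ROS: yes" in the table above read ROS_DOMAIN_ID.

```shell
# Binaries from the table above that use ROS 2 (assumption: rtps and noros
# do not read ROS_DOMAIN_ID, so they are left out).
ROS_BINARIES="ros rosonenode nopub"

# Hypothetical helper: print one "binary domain_id" pair per line.
# The IDs 1, 2, 3 are arbitrary; they only need to differ from each other.
list_domain_ids() {
  id=1
  for binary in $ROS_BINARIES; do
    echo "$binary $id"
    id=$((id + 1))
  done
}

list_domain_ids

# After colcon build and sourcing the ROS 2 setup file, one binary could then
# be started in its own terminal with its own ID, for example:
#   ROS_DOMAIN_ID=1 build/ros2_performance/ros
```

Each binary started this way can then be inspected individually with top.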
- Clone this repository
git clone https://github.com/scgroot/ros2_performance.git
- cd into the folder
cd ros2_performance
- Build the Docker image (requires Docker to be installed as described in the links above; this will take a while)
docker build -t ros2_performance:dashing .
You can name the image differently by replacing “ros2_performance:dashing” with something else, but you will then have to change the docker-compose.yml file accordingly.
- Run a compose file
cd compose_files
Here you have two options; each folder contains a docker-compose.yml. The “all” folder will start all the binaries listed in the table above. The “pubs_subs” folder will only start “rtps”, “ros” and “rosonenode”.
cd all
or
cd pubs_subs
then
docker-compose up
- Inspect the CPU usage. In a new terminal (Ctrl+Alt+T), run:
docker stats
It is possible to specify the format of the output to show more or less information, for example:
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.PIDs}}"
This will only display the container name, the CPU percentage, the memory usage and the number of PIDs. You should now see a terminal similar to the image above.
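To rank the containers by CPU usage instead of watching the live view, a one-off snapshot can be piped through a small filter. The sketch below is an illustration, not part of the repository: the helper name `sort_by_cpu` is made up, and the sample input uses the approximate CPU percentages reported in this README rather than real `docker stats` output.

```shell
# Hypothetical helper: read "name cpu%" lines on stdin and print them
# sorted by CPU percentage, highest first.
sort_by_cpu() {
  tr -d '%' | sort -k2,2nr
}

# With the containers running, a single snapshot could be ranked like this:
#   docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}" | sort_by_cpu

# Sample input so the filter can be tried without Docker:
printf 'rtps 6%%\nrosonenode 21%%\nros 63%%\n' | sort_by_cpu
```

A space is used as the separator here because container names do not contain spaces, which keeps the sort key simple.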
Perf results
CPU profiling methods
There are multiple options for inspecting which functions use the most CPU. We investigated CPU usage with valgrind http://www.valgrind.org/ (callgrind) and perf https://perf.wiki.kernel.org/index.php/Tutorial. The data recorded by callgrind can be inspected with kcachegrind http://kcachegrind.sourceforge.net/html/Home.html. The data recorded by perf can be visualized either by generating SVG files with https://github.com/brendangregg/FlameGraph or in a GUI with the app Hotspot https://github.com/KDAB/hotspot, which integrates the same flame graph view. How to use Hotspot is explained in the app's README.md.
Both tools gave essentially the same results, so we picked the most readable output to share our findings. In our opinion the FlameGraphs generated from perf’s data are more readable than kcachegrind’s callee maps and call graphs.
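As a rough sketch of the FlameGraph route, the helper below turns a recorded perf data file into an SVG using the `stackcollapse-perf.pl` and `flamegraph.pl` scripts from the FlameGraph repository. The function name `make_flamegraph` and the `FLAMEGRAPH_DIR` default are assumptions for illustration; adjust the path to wherever the FlameGraph repository is cloned.

```shell
# Assumption: FLAMEGRAPH_DIR points at a local clone of
# https://github.com/brendangregg/FlameGraph (the default is hypothetical).
FLAMEGRAPH_DIR="${FLAMEGRAPH_DIR:-$HOME/FlameGraph}"

# Hypothetical helper: make_flamegraph <perf.data> <output.svg>
make_flamegraph() {
  perf_data="$1"
  output_svg="$2"
  # Refuse to run if the recording does not exist.
  if [ ! -f "$perf_data" ]; then
    echo "no perf data at: $perf_data" >&2
    return 1
  fi
  # Dump the samples, fold the stacks, and render the SVG.
  perf script -i "$perf_data" \
    | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" \
    | "$FLAMEGRAPH_DIR/flamegraph.pl" > "$output_svg"
}

# Example call once a recording exists in the current directory:
#   make_flamegraph perf.data ros.svg
```

The resulting SVG can be opened in a web browser, where individual frames can be clicked to zoom in.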
Measurements performed on this package using perf
We ran each binary for 30 seconds and recorded the number of CPU cycles and the time spent in each function using:
source /opt/ros/dashing/setup.bash
cd ros2_performance
colcon build
perf record --call-graph dwarf build/ros2_performance/<binary_name> & sleep 30; killall perf
This was done for all three binaries that have publishers and subscribers (rtps, ros, rosonenode). The total number of cycles is of course influenced by how busy the CPU is at the time. The CPU usage % of all containers is measured at the same time, so the multiplicate between containers (the ratio of one container's value to another's) remains fairly stable even though the absolute numbers fluctuate due to background processes. Since comparing percentages of percentages gets confusing, we prefer to directly compare the number of CPU cycles spent in a function. To achieve this, we measured the number of CPU cycles over 30 seconds multiple times for each binary and averaged the values. We then calculated the multiplicates between the averaged CPU cycle counts.
Binary | CPU cycles total | CPU usage % | CPU cycle multiplicate | CPU usage % multiplicate |
---|---|---|---|---|
rtps | ~3.6e9 | ~6 | base | base |
rosonenode | ~1.3e10 | ~21 | ~3.5 | ~3.5 |
ros | ~3.8e10 | ~63 | ~10.6 | ~10.5 |
Because the multiplicates for CPU cycles and for CPU usage % are so close together, we assume that the average number of CPU cycles spent in a function can be compared directly between binaries. We are aware that this is a rather crude approach that does not entirely filter out the noise of background processes, but we are certain the observations and conclusions based on this research remain the same. We are not trying to measure accurately whether ROS 2 uses 10.5 or 10.6 times more CPU on our system; we are investigating WHY and WHERE ROS 2 is using more CPU.
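The multiplicates in the table above are plain ratios against the rtps baseline, which can be checked with a one-line helper. This is a sketch using the rounded values from the table; the function name `multiplicate` is an illustrative choice. Because the inputs are rounded, the rosonenode cycle ratio comes out slightly above the ~3.5 listed in the table, which was presumably computed from unrounded averages.

```shell
# Hypothetical helper: multiplicate <value> <baseline>
# Prints value / baseline rounded to one decimal place.
multiplicate() {
  awk -v value="$1" -v base="$2" 'BEGIN { printf "%.1f\n", value / base }'
}

# CPU cycles relative to rtps (rounded table values):
multiplicate 1.3e10 3.6e9   # rosonenode
multiplicate 3.8e10 3.6e9   # ros

# CPU usage % relative to rtps:
multiplicate 21 6           # rosonenode
multiplicate 63 6           # ros
```

The two columns agree to within rounding, which is the observation the direct cycle comparison relies on.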
The following table shows the number of CPU cycles spent in the two main classes of each binary:
File truncated at 100 lines see the full file