Optimizing Rust code with Flamegraph and DHAT - a practical example with Dust DDS
Summary
In this article, we describe in detail the process we went through in the latest execution speed optimization of Dust DDS, a Rust-based implementation of the Data Distribution Service (DDS) standard. Utilizing benchmarks and tools like Flamegraph and Valgrind’s Dynamic Heap Analysis Tool (DHAT), we identified areas to improve the message send and receive cycle times of the middleware. Our optimizations included reducing memory allocation overhead in UDP message reception, improving serialization and deserialization efficiency, and removing the usage of async traits to avoid excessive memory allocations. These changes collectively resulted in performance gains of up to 50%.
The practical steps and insights shared in this article highlight the importance of combining benchmark-driven development with robust profiling tools to uncover and address performance bottlenecks. By systematically applying these techniques, we hope to provide inspiration for other developers aiming to optimize their own Rust applications.
Introduction
Data Distribution Service (DDS) is a middleware standard for dependable, real-time, scalable data exchanges using a publish-subscribe communication pattern. As a foundational technology for applications in robotics, aerospace and automotive industries, performance is a crucial aspect of any DDS implementation. Dust DDS, our implementation of the standard, is built using Rust, a systems programming language renowned for its performance and safety. However, despite Rust’s inherent efficiency and optimization capabilities, there comes a point where detailed software analysis and engineering trade-offs are necessary to achieve maximum performance under the conditions most relevant to the system’s use.
In this article, we describe how we used the Flamegraph and DHAT profiling tools during a recent execution speed optimization session for Dust DDS. These tools are instrumental in identifying performance bottlenecks and optimizing memory usage, respectively. By sharing our approach and insights with a concrete example, we aim to provide a practical guide that can serve as inspiration for others trying to enhance the performance of their own Rust libraries and applications.
Benchmark and performance targets
The first step of any performance optimization task is to measure what we aim to improve. Fortunately, Rust provides cargo bench, which makes it relatively simple to create benchmark tests.
It is important that the benchmarks accurately represent a real use-case of the system, so as not to get trapped in micro-optimizations which might have no impact on the overall goal. For Dust DDS the main task is to send and receive data. Therefore, our benchmark measures the time between publishing a sample on the writer and receiving it on a subscriber. This is done both with a small sample which can be sent in a single data message (named best_effort_write_and_receive) and a bigger sample which needs to be split into multiple data fragments (named best_effort_write_and_receive_frag). You can find the full source code of the benchmarks in our repository.
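As an illustration, the following is a minimal Criterion skeleton for this kind of round-trip benchmark. Note that publish_and_wait_for_sample is a hypothetical stand-in for the actual Dust DDS write/read cycle; the real benchmark code is in our repository.

use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical stand-in for the real Dust DDS round trip: write a sample
// on the data writer and block until it is received on the data reader.
fn publish_and_wait_for_sample() {
    // ... writer.write(&sample) / reader wait-and-take ...
}

fn best_effort_write_and_receive(c: &mut Criterion) {
    // Criterion runs the closure many times and reports timing statistics,
    // so the measured value is the full send/receive cycle time.
    c.bench_function("best_effort_write_and_receive", |b| {
        b.iter(publish_and_wait_for_sample)
    });
}

criterion_group!(benches, best_effort_write_and_receive);
criterion_main!(benches);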
For this article we will always use the measurement results obtained by executing the tests on our CI system. This is done to make sure that the runs are comparable with each other and not affected by conditions on a local computer.
The other important thing about code optimization is to have a performance target in mind. Optimization could go on forever, so it is important to establish an (ambitious but realistic) goal beforehand to avoid getting trapped in that infinite loop. For example, for this article, our aim is to bring our best_effort_write_and_receive performance closer to 100 µs, which seems to be what other DDS implementations achieve.
Profiling tools
After creating our benchmarks and determining baseline performance measurements, the next step is to utilize profiling tools to guide our optimization efforts. Even for experienced programmers, it can be challenging to identify where the largest performance gains can be achieved through simple code analysis. Profiling tools provide valuable insights to help in this process.
There are many profiling tools available, but for this article we will be using Flamegraph and Valgrind's Dynamic Heap Analysis Tool (DHAT). Even though we typically develop on Windows, most of these tools are easier to install and use in a Linux environment. All the results presented in this article were obtained by running the profilers on Windows Subsystem for Linux (WSL) with an Ubuntu distribution.
Optimizing the performance of Dust DDS
With everything in place, we can get started by getting a flamegraph for the execution of the benchmark code on our main branch (commit). We do this by running cargo flamegraph --bench benchmark -- --bench best_effort_write_and_receive to execute only the benchmarks we are interested in. If everything goes well we get a flamegraph.svg file with the results, looking like the figure below.
The flamegraph shows the call stack on the vertical axis and the total function duration on the horizontal axis. Wider boxes represent functions that take more time, indicating potential areas for optimization. However, it is good to be aware that the graph is not ordered by time, so a box can be wide either because a function takes long to execute or because it is called many times. For our usage it is also good to be aware that not only our own functions are shown, but also the Cargo and Criterion methods used for running the benchmark and creating the results. Since nearly all of the Dust DDS code runs in the Tokio runtime, there is an easy way to identify our methods.
Optimization 1: Memory allocation on message reception
Zooming into the flamegraph, we can see that one of the largest boxes corresponds to the read_message method inside the domain_participant_factory_actor, with most of the time being spent on memory allocation. The image below shows the flamegraph output for read_message.
Here is how this function looks:
pub async fn read_message(socket: &mut tokio::net::UdpSocket) -> DdsResult<RtpsMessageRead> {
    // A fresh buffer sized for the largest possible UDP datagram is
    // allocated on every call
    let mut buf = vec![0; 65507];
    let (bytes, _) = socket.recv_from(&mut buf).await?;
    buf.truncate(bytes);
    if bytes > 0 {
        // The buffer is moved into an Arc so it can be shared without copying
        Ok(RtpsMessageRead::new(Arc::from(buf.into_boxed_slice()))?)
    } else {
        Err(DdsError::NoData)
    }
}
To better understand the memory allocations, we can run DHAT with the command valgrind --tool=dhat ./target/release/deps/benchmark-72af074f981138fa --bench -- best_effort_write_and_receive, which runs the benchmark executable. Be aware that this is a fairly slow process, so it might take a while to get results. At the end a dhat.out file is generated, which can be opened with a special viewer in the browser.
Analyzing the output, we see that this method is responsible for the second largest usage of memory and is at the top position by the low-access metric, meaning the contents of the buffer are rarely read. The image below shows the DHAT output.
In this version of Dust DDS, we have implemented the message reception following a zero-copy principle, i.e. we allocate the memory where the UDP datagram is stored in an Arc, and that memory remains in the program until no part of it is needed anymore. To make this work in every scenario, we allocate 65507 bytes, the maximum size possible for a datagram coming from the socket. The analysis indicates that this allocation is very expensive while the data is rarely used; therefore, an optimization would be to allocate the reception buffer once and reuse it. The trade-off in this case is that the part of the data representing the received samples will have to be cloned when it needs to be stored in the data readers.
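To make the trade-off concrete, here is a minimal sketch of the reuse-a-buffer pattern; the names and types are illustrative, not the exact code from the PR:

struct MessageReceiver {
    // Allocated once and reused for every incoming datagram
    buf: Box<[u8; 65507]>,
}

impl MessageReceiver {
    fn new() -> Self {
        Self { buf: Box::new([0; 65507]) }
    }

    async fn read_message(&mut self, socket: &tokio::net::UdpSocket) -> std::io::Result<Vec<u8>> {
        let (bytes, _) = socket.recv_from(&mut self.buf[..]).await?;
        // Copy out only the bytes that were actually received; a data reader
        // that needs to keep the sample clones this slice instead of holding
        // on to the full 65507-byte allocation.
        Ok(self.buf[..bytes].to_vec())
    }
}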
You can see the full code change on this PR. After implementing this modification and running the benchmark again, we see that this change reduces the execution time by about 7% for both the small data and fragmented data cases.
Optimization 2: Dedicated serialize/deserialize for byte arrays and vectors
Re-analyzing the flamegraph after our previous optimization, the read_message method is no longer visible among the large blocks of the flamegraph. The two largest blocks at this stage correspond to the RtpsMessageWrite::new method and the deserialize_u8 method of the CdrDeserializer. These are involved in the serialization and deserialization of RTPS messages, respectively, so we can try to understand each of them independently.
Looking in further detail at the flamegraph of RtpsMessageWrite::new, most of the time is spent in a memset operation, with the rest spent on a memory allocation call. The body of the function is as follows:
impl RtpsMessageWrite {
    pub fn new(header: &RtpsMessageHeader, submessages: &[Box<dyn Submessage + Send>]) -> Self {
        // Zero-initialized stack buffer; each write advances the slice
        let mut buffer = [0; BUFFER_SIZE];
        let mut slice = buffer.as_mut_slice();
        header.write_into_bytes(&mut slice);
        for submessage in submessages {
            submessage.write_into_bytes(&mut slice);
        }
        // The number of bytes written is the buffer size minus what is left
        let len = BUFFER_SIZE - slice.len();
        Self {
            data: Arc::from(&buffer[..len]),
        }
    }
}
The idea when we first implemented this function was that keeping the buffer on the stack and allocating only the necessary memory for the Arc should be cheaper than doing heap reallocations while serializing. However, what we can see from the flamegraph is that initializing the entire array to 0 consumes a significant amount of time in memset. Furthermore, when creating the Arc we have to copy the buffer into its new heap allocation. Based on this analysis, we concluded that serializing directly into a heap allocation would be more performant. The method code has been modified to the following:
impl RtpsMessageWrite {
    pub fn new(header: &RtpsMessageHeader, submessages: &[Box<dyn Submessage + Send>]) -> Self {
        // The cursor grows the Vec on demand, so nothing is zero-initialized
        let buffer = Vec::new();
        let mut cursor = Cursor::new(buffer);
        header.write_into_bytes(&mut cursor);
        for submessage in submessages {
            submessage.write_submessage_into_bytes(&mut cursor);
        }
        Self {
            data: Arc::from(cursor.into_inner().into_boxed_slice()),
        }
    }
}
The serialization has also been modified to use a cursor that can reallocate memory as needed. You can see the full change on this PR. Based on the benchmark, we get another 7% improvement in execution time.
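As a small illustration of why this works, a Cursor over a Vec<u8> implements std::io::Write and grows its backing vector on demand, so only the bytes actually written are ever allocated:

use std::io::{Cursor, Write};

fn main() {
    // Start with an empty, unallocated Vec; no up-front zeroing is needed
    let mut cursor = Cursor::new(Vec::new());
    cursor.write_all(b"RTPS").unwrap();
    // Only the four bytes that were written ended up in the buffer
    assert_eq!(cursor.into_inner(), b"RTPS");
}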
For the deserialization, we see on the flamegraph that there is a significant number of calls deserializing a u8 value. Digging a little deeper in the code, we see that a Vec<u8> is deserialized in a for loop just like any other Vec<T>. However, since our underlying data representation already uses bytes, we can use optimized functions for serializing and deserializing vectors and arrays of u8. This mostly involves using the specialized function for those cases, so the changes are mostly in the proc macro; you can see all the details on this PR. The biggest impact of this change is on the best_effort_write_and_receive_frag benchmark, which sees a 40% execution time improvement.
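The idea can be illustrated with a simplified sketch; the real change lives in the CDR deserializer and the proc macro, so these function names are illustrative only:

use std::io::{Cursor, Read};

// Generic path: every element goes through the deserializer machinery,
// one call per byte.
fn deserialize_vec_generic(cursor: &mut Cursor<&[u8]>, len: usize) -> std::io::Result<Vec<u8>> {
    let mut v = Vec::with_capacity(len);
    for _ in 0..len {
        let mut byte = [0u8; 1];
        cursor.read_exact(&mut byte)?;
        v.push(byte[0]);
    }
    Ok(v)
}

// Specialized path: the wire format is already a run of bytes, so the
// whole vector can be copied out in a single read.
fn deserialize_vec_u8(cursor: &mut Cursor<&[u8]>, len: usize) -> std::io::Result<Vec<u8>> {
    let mut v = vec![0u8; len];
    cursor.read_exact(&mut v)?;
    Ok(v)
}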
Optimization 3: Make handling of actor messages sync
Now that we have handled two more large blocks of the flamegraph, it is time to run the benchmark again. After this run, the largest block is the GenericHandlerDyn<A>::handle method. This is a fairly simple method which boxes the future (since Rust doesn't yet support opaque return types such as impl Future in trait objects) and calls the corresponding handle method for processing the message. This method is a core component of the Dust DDS actor system and is called for every message sent to an actor. The method itself looks like this:
struct ReplyMail<M>
where
    M: Mail,
{
    mail: M,
    reply_sender: tokio::sync::oneshot::Sender<M::Result>,
}

impl<A, M> GenericHandlerDyn<A> for ReplyMail<M>
where
    A: MailHandler<M> + Send,
    M: Mail + Send,
    <M as Mail>::Result: Send,
{
    fn handle<'a, 'b>(
        self: Box<Self>,
        actor: &'b mut A,
    ) -> Pin<Box<dyn Future<Output = ()> + Send + 'b>>
    where
        Self: 'b,
        'a: 'b,
    {
        // Consuming the Box moves its contents, costing a copy and a free
        let this = *self;
        // Every message handled allocates a fresh boxed future here
        Box::pin(async {
            let result = <A as MailHandler<M>>::handle(actor, this.mail).await;
            this.reply_sender.send(result).ok();
        })
    }
}
The detailed flamegraph for it is shown in the Figure below.
Analyzing this measurement, it seems that making the actor handle methods async was not the best software architecture choice regarding performance, since allocating arbitrarily large futures on the hot path can quickly get expensive. Based on this, we decided to modify the actor architecture so that the handle method no longer returns a future. The remainder of the time spent in the handle function comes from freeing and copying memory. This originates from consuming the Box<Self> in the line let this = *self;. Since it is not possible to partially destructure this object, we have instead opted to store the fields as Option so that we can take the values and leave None in their place. The handle method after the restructuring looks like this:
struct ReplyMail<M>
where
    M: Mail,
{
    mail: Option<M>,
    reply_sender: Option<tokio::sync::oneshot::Sender<M::Result>>,
}

impl<A, M> GenericHandler<A> for ReplyMail<M>
where
    A: MailHandler<M> + Send,
    M: Mail + Send,
    <M as Mail>::Result: Send,
{
    fn handle(&mut self, actor: &mut A) {
        // Option::take moves the values out and leaves None behind, so the
        // object never has to be consumed and no future is allocated
        let result = <A as MailHandler<M>>::handle(
            actor,
            self.mail.take().expect("Must have a message"),
        );
        self.reply_sender
            .take()
            .expect("Must have a sender")
            .send(result)
            .ok();
    }
}
You can see the complete code modification on this PR and this PR.
The benchmark results show a further improvement for the best_effort_write_and_receive test, which brings the performance close enough to 100 µs. After this optimization, only relatively small blocks represent the code of Dust DDS, and the major blocks all concern Tokio runtime internals.
Conclusion
In this article, we explored the process of performance optimization for Dust DDS, a Rust implementation of the Data Distribution Service (DDS) standard. Starting with benchmarks that represent real-world use cases, we established a performance target to guide our optimization efforts. By employing profiling tools such as Flamegraph and Valgrind’s Dynamic Heap Analysis Tool (DHAT), we identified critical bottlenecks and memory inefficiencies within our system.
Our first optimization focused on reducing memory allocation overhead in the read_message function by implementing a reusable reception buffer. Next, we addressed serialization and deserialization inefficiencies by modifying the RtpsMessageWrite::new method and optimizing byte array handling. Finally, we tackled the GenericHandlerDyn<A>::handle method, a core component of our actor system. By making actor message handling synchronous and minimizing unnecessary memory allocations, we achieved a 50% performance gain for best_effort_write_and_receive_frag and a 25% performance gain for best_effort_write_and_receive, bringing our benchmark time close to the 100 microseconds target. The figure below gives a comparison between the latest version of Dust DDS (in blue) and the starting version (in red).
The profiling and optimization techniques discussed here not only demonstrate the importance of detailed software analysis but also provide a practical roadmap for others seeking to optimize their Rust applications.