Five Orders of Magnitude Improvement in GNU Radio Stream-Tag Propagation Performance

Recently at GNU Radio Conference 2014 I presented a performance benchmark comparing the maximum throughput and expected latency of several simple burst-processing graphs.   These compared functionally identical graphs: a message-passing design using PDUs to carry data vectors around, and a stream-tag design using normal GNU Radio stream buffers with burst-demarcating tags to pass data vectors around.   The initial performance results were a landslide in favor of PDUs, so much so that it was clear something was wrong; tagged-stream blocks should perform better than that.  To address this, Eric Blossom, Doug Geiger, and I set about profiling the root cause and updating the GNU Radio runtime to mitigate the performance issue.

Original Profiling Results

Upon benchmarking the original tag propagation code, we can see that the prune_tags method completely dominates the run time.

TSB Profiling Deque Impl

Diving into the code for the prune_tags method, we see the following, which amounts to an O(n^2) operation: a linear search through the deque that erases matching tags and then restarts the scan from the beginning.

 std::deque<tag_t> buffer::d_item_tags;

 void
 buffer::prune_tags(uint64_t max_time)
 {
   std::deque<tag_t>::iterator itr = d_item_tags.begin();
   uint64_t item_time;
   while(itr != d_item_tags.end()) {
     item_time = (*itr).offset;
     if(item_time + d_max_reader_delay + bufsize() < max_time) {
       d_item_tags.erase(itr);
       itr = d_item_tags.begin();
     }
     else {
       itr++;
     }
   }
 }

We updated the deque to an STL std::multimap implementation, as shown below, which should give us something more like O(log n) access time instead.

 std::multimap<uint64_t,tag_t> buffer::d_item_tags;

 void
 buffer::prune_tags(uint64_t max_time)
 {
   std::multimap<uint64_t,tag_t>::iterator end_itr = d_item_tags.lower_bound(max_time);
   std::multimap<uint64_t,tag_t>::iterator begin_itr = d_item_tags.begin();
   d_item_tags.erase(begin_itr, end_itr);
 }

Profiling After Updates

After updating the tag storage class to a std::multimap, keyed on the uint64_t offset with the gr_tag_t as the value, we can update all of our core tag functions to access the data store using the more efficient RB-tree lookups.   Re-running the same benchmark on the updated code, we observe the following: a much more reasonable split of processing time than we started with.

TSB Profiling Multimap Impl

Final Results

The final benchmark performance measurements are shown below for the following three cases:

  • TSB – Original Stream-Tag Deque Implementation
  • TSB2 – Updated Stream-Tag Multimap Implementation
  • PDU – Message Port Implementation


Graph Throughput
Graph Latency

PDU latency is probably dominated by message-port queue depth, which applies back-pressure at a fixed queue depth in these experiments.   Through these data-structure updates, our TSB throughput has increased by roughly two orders of magnitude while our TSB latency has decreased by almost three orders of magnitude!   This is a major improvement in the unthrottled TSB performance case tested here!
