Five Orders of Magnitude Improvement in GNU Radio Stream-Tag Propagation Performance

Recently at GNU Radio Conference 2014 I presented a performance benchmark (https://github.com/osh/gr-chunky) comparing the maximum throughput and expected latency of several simple burst-processing graphs.   These compared functionally identical graphs built two ways: a message-passing design using PDUs to carry data vectors around, and a stream-tag design using normal GNU Radio stream buffers with burst-demarcating tags to pass data vectors around.   The initial results were a landslide in favor of PDUs, so lopsided that it was clear something was wrong; tagged stream item blocks should perform better than that.  To address this, Eric Blossom, Doug Geiger, and I set about profiling the root cause and updating the GNU Radio runtime to mitigate the performance issue.

Original Profiling Results

Upon benchmarking the original tag propagation code, we can see that the prune_tags method completely dominates the run time.

TSB Profiling Deque Impl

Diving into the code for the prune_tags method, we see the following, which amounts to an O(n^2) linear search-and-erase over the deque: each erase restarts the scan from the front of the container.

std::deque<tag_t> buffer::d_item_tags;

void
buffer::prune_tags(uint64_t max_time)
{
  std::deque<tag_t>::iterator itr = d_item_tags.begin();
  uint64_t item_time;

  while(itr != d_item_tags.end()) {
    item_time = (*itr).offset;
    if(item_time + d_max_reader_delay + bufsize() < max_time) {
      d_item_tags.erase(itr);
      itr = d_item_tags.begin();
    }
    else {
      itr++;
    }
  }
}
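
To see why this pattern hurts, here is a small self-contained demonstration that is not GNU Radio code: the fake_tag struct and the tag counts are made up, and the deque is deliberately ordered so that the stale tags sit behind the kept ones, which is a worst case for the erase-and-restart loop since every erase pays for a fresh walk over all of the tags being kept.

#include <cstdint>
#include <deque>
#include <iostream>

// Stand-in for gr::tag_t; only the offset matters for pruning.
struct fake_tag { uint64_t offset; };

// Mimics the original prune loop: after every erase the scan restarts
// from the front of the deque.
static void prune_restart(std::deque<fake_tag> &tags, uint64_t max_time)
{
  std::deque<fake_tag>::iterator itr = tags.begin();
  while(itr != tags.end()) {
    if(itr->offset < max_time) {
      tags.erase(itr);
      itr = tags.begin();
    }
    else {
      ++itr;
    }
  }
}

int main()
{
  std::deque<fake_tag> tags;
  // 20k tags that will be kept, followed by 20k stale tags to prune.
  for(uint64_t i = 0; i < 20000; i++)
    tags.push_back(fake_tag{100000 + i});
  for(uint64_t i = 0; i < 20000; i++)
    tags.push_back(fake_tag{i});

  prune_restart(tags, 50000);   // each of the 20k erases rescans ~20k kept tags
  std::cout << tags.size() << " tags remain" << std::endl;
  return 0;
}

Swapping the two fill loops so the stale tags sit at the front makes the same code finish almost instantly, which illustrates how data-dependent this pattern is.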

We update the deque to an STL multimap implementation, keyed on the tag offset, as shown below, which should give us something closer to O(log n) access time instead.

std::multimap<uint64_t, tag_t> buffer::d_item_tags;

void
buffer::prune_tags(uint64_t max_time)
{
  std::multimap<uint64_t, tag_t>::iterator end_itr = d_item_tags.lower_bound(max_time);
  std::multimap<uint64_t, tag_t>::iterator begin_itr = d_item_tags.begin();
  d_item_tags.erase(begin_itr, end_itr);
}
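
The keyed layout helps the other core tag functions as well. Below is a hedged sketch, not the exact committed runtime code, of how the insert and range-query paths can use the multimap; locking is omitted and the method placement and signatures are simplified (in the real runtime the range query lives on the buffer reader side).

void
buffer::add_item_tag(const tag_t &tag)
{
  // keyed insert: O(log n)
  d_item_tags.insert(std::make_pair(tag.offset, tag));
}

void
buffer::get_tags_in_range(std::vector<tag_t> &v,
                          uint64_t abs_start, uint64_t abs_end)
{
  // jump directly to the first candidate tag rather than scanning the
  // whole container from the front
  v.clear();
  std::multimap<uint64_t, tag_t>::iterator itr = d_item_tags.lower_bound(abs_start);
  std::multimap<uint64_t, tag_t>::iterator end_itr = d_item_tags.lower_bound(abs_end);
  for(; itr != end_itr; ++itr)
    v.push_back(itr->second);
}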

Profiling After Updates

After updating the tag storage class to a std::multimap keyed on the uint64_t offset, with the tag_t as the value, we can update all of our core tag functions to access the data store through more efficient red-black-tree lookups.   Re-running the same benchmark on the updated code, we observe the following: a much more reasonable split of processing time than we started with.

TSB Profiling Multimap Impl

Final Results

The final benchmark results are shown below for the following three cases.

  • TSB – Original Stream-Tag Deque Implementation
  • TSB2 – Updated Stream-Tag Multimap Implementation
  • PDU – Message Port Implementation

Graph Throughput

Graph Latency

PDU latency in these experiments is probably dominated by message-port queue depth, which produces back-pressure once a specific queue depth is reached.   Through these data-structure updates, TSB throughput has increased by roughly two orders of magnitude while TSB latency has decreased by almost three orders of magnitude, a major improvement in the unthrottled TSB case tested here!

GNU Radio Conference 2014

Thank you to everyone who organized, attended, and supported this incredible, growing conference this year in Washington, DC.   I enjoyed meeting many new faces along with familiar ones, was impressed with the quality of the content this year, and was glad to see such a great reception and lively discussion of GNU Radio’s architectural design patterns in the FASS working group following my talk. I look forward to working with all of you to ensure GNU Radio continues to innovate and adapt to emerging demands, remaining the powerful free and open-source signal processing ecosystem of choice for years to come.

GRCon14 Picture by John Malsbury