Learning to Communicate with Unsupervised Channel Autoencoders

Our radio physical layers are actually pretty simplistic and boring in the world right now: PSK and QAM are well-defined expert representations of information for transiting a wireless channel.  Systems using OFDM and SC-FDMA are a bit more involved, but use some of the same constructs underneath with a bit of sub-carrier shuffling.   Forward error correction (FEC), equalization, randomization, and a number of other functions are generally bolted on as separate, independent blocks and transforms, each compensating for the performance properties or assumptions of the other layers, in order to form an effective end-to-end system.

Enter machine learning … rethink all the things …

We’ve just pre-pubbed a paper to arXiv on learning entire communications systems using unsupervised reconstruction learning (autoencoders).   We seek to reconstruct transmitted information bits at a receiver while introducing channel impairments in the hidden layers of the network to simulate a wireless channel.   By doing this we force the learned representations in the encoder and decoder to adapt jointly to optimize reconstruction performance of the information bits (we refer to this as channel regularization).  The high-level design looks something like this:

[Figure: high-level channel autoencoder design]
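
For intuition, a minimal sketch of the idea in Keras might look something like the snippet below.  This is not the architecture from the paper; the frame sizes, layer widths, noise level, and optimizer are just illustrative assumptions (a real system would also want a power constraint on the transmit representation).

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.noise import GaussianNoise

k = 8    # information bits per frame (illustrative)
n = 16   # real-valued channel uses per frame (illustrative)

bits_in = Input(shape=(k,))
H = Dense(64, activation='relu')(bits_in)
tx = Dense(n, activation='linear')(H)          # learned transmit representation (encoder output)
rx = GaussianNoise(0.3)(tx)                    # AWGN "channel" applied during training (channel regularization)
H = Dense(64, activation='relu')(rx)
bits_out = Dense(k, activation='sigmoid')(H)   # reconstructed bit estimates (decoder output)

cae = Model(bits_in, bits_out)
cae.compile(loss='binary_crossentropy', optimizer='adam')

X = np.random.randint(0, 2, (50000, k))        # random bit frames as training data
cae.fit(X, X, batch_size=256, nb_epoch=10)     # nb_epoch -> epochs on newer Keras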

We evaluate a number of different autoencoder network structures and also consider keeping the CNN layer constrained to a relatively low number of filters to emulate the relatively low number of symbols typically used in communications systems (not necessarily optimal, but it helps with intuition).  The structure of our DNN-CNN network candidate looks something like this:

[Figure: DNN-CNN channel autoencoder network structure]

Once we learn a transmit/receive representation in the autoencoder we can evaluate its performance across a range of channel conditions.  Traditional wireless channel performance measures such as BER vs SNR and spectral efficiency can be easily compared to legacy expert modulation techniques as shown below.

[Figure: BER vs SNR performance of the learned representations compared with legacy modulations]
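
For reference, the legacy baseline curves in that kind of plot are easy to reproduce.  Below is a quick Monte Carlo sketch of Gray-coded QPSK bit error rate over an AWGN channel, the sort of expert baseline the learned system gets compared against; the bit count and SNR sweep are arbitrary choices here.

import numpy as np

def qpsk_ber_awgn(ebno_db, n_bits=200000):
    # Gray-coded QPSK: 2 bits/symbol, unit-energy symbols (Es = 1, Eb = 0.5)
    bits = np.random.randint(0, 2, n_bits)
    sym = ((1.0 - 2.0*bits[0::2]) + 1j*(1.0 - 2.0*bits[1::2])) / np.sqrt(2.0)
    ebno = 10.0 ** (ebno_db / 10.0)
    n0 = 0.5 / ebno                           # N0 = Eb / (Eb/N0)
    noise = np.sqrt(n0/2.0) * (np.random.randn(sym.size) + 1j*np.random.randn(sym.size))
    r = sym + noise
    bhat = np.empty_like(bits)
    bhat[0::2] = (r.real < 0).astype(bits.dtype)
    bhat[1::2] = (r.imag < 0).astype(bits.dtype)
    return np.mean(bits != bhat)

for ebno_db in range(0, 12, 2):
    print((ebno_db, qpsk_ber_awgn(ebno_db)))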

We discuss a handful of other issues, including how to start jointly learning synchronization methods on the front of the decoder using radio transformer networks, and how to start simulating channel effects beyond simple additive Gaussian noise.   I’m pretty excited about the future of this form of unsupervised communications system learning; there’s a ton of work to do to make it work better over the air and under harsh channel conditions.   Hoping to see what others do with this, and to finalize a conference version for submission soon.

 

Check out the paper at: https://arxiv.org/abs/1608.06409

MNIST Generative Adversarial Model in Keras

Some of the generative work done in the past year or two using generative adversarial networks (GANs) has been pretty exciting and demonstrated some very impressive results.  The general idea is that you train two models: one (G) to generate some sort of output example given random noise as input, and one (A) to discern generated examples from real examples.  Then, by training A to be an effective discriminator, we can stack G and A to form our GAN, freeze the weights in the adversarial part of the network, and train the generative network weights to push random noisy inputs towards the “real” example class output of the adversarial half.

[Figure: High Level GAN Architecture]

Building this style of network in the latest versions of Keras is actually quite straightforward.  I’ve wanted to try this out on a number of things, so I put together a relatively simple version using the classic MNIST dataset and a GAN approach to generating random handwritten digits.

Before going further I should mention all of this code is available on github here.

Generative Model

We set up a relatively straightforward generative model in Keras using the functional API, taking 100 random inputs and eventually mapping them down to a [1,28,28] pixel image to match the MNIST data shape.  We begin by generating a dense 14×14 set of values, then run through a handful of convolutional filters of varying sizes and numbers of channels, and ultimately compile with an Adam optimizer and binary cross-entropy (although we really only use the generator model in the forward direction, and don’t train directly on this model itself).  We use a sigmoid on the output layer to help saturate pixels into 0 or 1 states rather than a range of grays in between, and use batch normalization to help accelerate training and ensure that a wide range of activations are used within each layer.

# Build Generative model ...
# (Keras 1.x-era API, matching the rest of this post)
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Activation, Reshape
from keras.layers.convolutional import Convolution2D, UpSampling2D
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam

opt = Adam(lr=1e-4)  # generator/GAN optimizer; the learning rate is a reasonable default, not gospel

nch = 200
g_input = Input(shape=[100])
H = Dense(nch*14*14, init='glorot_normal')(g_input)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Reshape( [nch, 14, 14] )(H)
H = UpSampling2D(size=(2, 2))(H)
H = Convolution2D(nch/2, 3, 3, border_mode='same', init='glorot_uniform')(H)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Convolution2D(nch/4, 3, 3, border_mode='same', init='glorot_uniform')(H)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Convolution2D(1, 1, 1, border_mode='same', init='glorot_uniform')(H)
g_V = Activation('sigmoid')(H)
generator = Model(g_input, g_V)
generator.compile(loss='binary_crossentropy', optimizer=opt)
generator.summary()

We now have a network which could in theory take in 100 random inputs and output digits, although the current weights are all random and this clearly isn’t happening just yet.

[Figure: Sad images from an untrained generator]

Adversarial Model

We build an adversarial discriminator network to take in [1,28,28] image vectors and decide if they are real or fake by using several convolutional layers, a dense layer, lots of dropout, and a two element softmax output layer encoding: [0,1] = fake, and [1,0] = real.  This is a relatively simple network, but the goal here is largely to get something that works passably and trains relatively quickly for experimentation.

# Build Discriminative model ...
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers.advanced_activations import LeakyReLU
from keras.layers.convolutional import Convolution2D
from keras.models import Model
from keras.optimizers import Adam

shp = (1, 28, 28)     # MNIST image shape used throughout
dropout_rate = 0.25   # "lots of dropout"; the exact value is a tunable choice
dopt = Adam(lr=1e-3)  # discriminator optimizer; the learning rate is a reasonable default, not gospel

d_input = Input(shape=shp)
# no activation on the conv layers themselves so the LeakyReLU layers below actually apply
H = Convolution2D(256, 5, 5, subsample=(2, 2), border_mode='same')(d_input)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
H = Convolution2D(512, 5, 5, subsample=(2, 2), border_mode='same')(H)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
H = Flatten()(H)
H = Dense(256)(H)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
d_V = Dense(2, activation='softmax')(H)
discriminator = Model(d_input, d_V)
discriminator.compile(loss='categorical_crossentropy', optimizer=dopt)
discriminator.summary()

We pre-train the discriminative model by generating a handful of random images using the untrained generative model, concatenating them with an equal number of real images of digits, labeling them appropriately, and then fitting until we reach a relatively stable loss value, which takes about one epoch over 20,000 examples.  This is an important step which should not be skipped — pre-training accelerates the GAN massively and I was not able to achieve convergence without it (possibly due to impatience).
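
A minimal sketch of that pre-training step is below.  X_train is assumed to be the usual MNIST training tensor shaped [N,1,28,28] and scaled to [0,1]; the other names come from the models above.

import numpy as np

ntrain = 10000
idx = np.random.randint(0, X_train.shape[0], ntrain)
real = X_train[idx]                                   # real digit images
noise = np.random.uniform(0, 1, size=[ntrain, 100])
fake = generator.predict(noise)                       # images from the (still untrained) generator

X = np.concatenate((real, fake))
y = np.zeros([2*ntrain, 2])
y[:ntrain, 0] = 1                                     # [1,0] = real
y[ntrain:, 1] = 1                                     # [0,1] = fake

discriminator.fit(X, y, nb_epoch=1, batch_size=128)   # nb_epoch -> epochs on newer Keras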

Generative Adversarial Model

Now that we have both the generative and adversarial models, we can combine them to make a GAN quite easily in Keras.  Using the functional API, we can simply re-use the same network objects we have already instantiated and they will conveniently maintain the same shared weights with the previously compiled models.  Since we want to freeze the weights in the adversarial half of the network during back-propagation of the joint model, we first run through and set the keras trainable flag to False for each element in this part of the network.  For now, this seems to need to be applied at the primitive layer level rather than on the high level network so we introduce a simple function to do this.

# Freeze weights in the discriminator for stacked training
def make_trainable(net, val):
    net.trainable = val
    for l in net.layers:
        l.trainable = val
make_trainable(discriminator, False)

# Build stacked GAN model
gan_input = Input(shape=[100])
H = generator(gan_input)
gan_V = discriminator(H)
GAN = Model(gan_input, gan_V)
GAN.compile(loss='categorical_crossentropy', optimizer=opt)
GAN.summary()

At this point, we now have a randomly initialized generator, a (poorly) trained discriminator, and a GAN which can be trained across the stacked model of both networks.  The core of the training routine for a GAN looks something like this (a minimal code sketch follows the list):

  1. Generate images using G and random noise (forward pass only).
  2. Perform a Batch update of weights in A given generated images, real images, and labels.
  3. Perform a Batch update of weights in G given noise and forced “real” labels in the full GAN.
  4. Repeat…
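
In code, a minimal version of that loop might look like the following; the batch size and step count are arbitrary choices, and the returned d_loss / g_loss values are what get logged for the loss plots below.

import numpy as np

batch_size = 32
nb_steps = 5000                                    # however many updates you want to run
for step in range(nb_steps):
    # 1. generate a batch of images with G (forward pass only)
    noise = np.random.uniform(0, 1, size=[batch_size, 100])
    fake = generator.predict(noise)

    # 2. batch update of A on real + generated images
    idx = np.random.randint(0, X_train.shape[0], batch_size)
    X = np.concatenate((X_train[idx], fake))
    y = np.zeros([2*batch_size, 2])
    y[:batch_size, 0] = 1                          # [1,0] = real
    y[batch_size:, 1] = 1                          # [0,1] = fake
    make_trainable(discriminator, True)
    d_loss = discriminator.train_on_batch(X, y)

    # 3. batch update of G through the stacked GAN with forced "real" labels
    noise = np.random.uniform(0, 1, size=[batch_size, 100])
    y2 = np.zeros([batch_size, 2])
    y2[:, 0] = 1                                   # pretend the generated images are real
    make_trainable(discriminator, False)
    g_loss = GAN.train_on_batch(noise, y2)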

Running this process for a number of epochs, we can plot the generator and discriminator losses over time to get our GAN loss plots during training.

[Figure: GAN Training Loss]

And finally, we can plot some samples from the trained generative model, which look reasonably like the original MNIST digits, along with some examples from the original dataset for comparison.

[Figure: GAN Generated Random Digits]
[Figure: Example Digits from the Real MNIST Set]

https://github.com/osh/KerasGAN

Reducing 1D Convolution to a Single (Big) Matrix Multiplication

This is perhaps the 3rd time I’ve needed this recipe, and it doesn’t seem to be readily available on Google.  Theano and TensorFlow provide convolution primitives for 1D and 2D, but (correct me if I’m wrong) I think they are generally constrained such that the filter taps you are convolving must be parameters, and not additional tensor values within a larger tensor expression.   This is unfortunate and annoying for certain operations, and my workaround is to implement my own convolution as a matrix multiplication based on a properly indexed version of the input and tap tensors within an operation.

Anyway, hopefully this snippet will be useful to someone else some day –

The idea here is that we can use a Toeplitz matrix to generate a large 2D matrix (H) whose entries are simply indexes into a 1D vector of taps (h).   Multiplying our input (x) by the 2D matrix (H) then gives us our convolution output (y).   It’s fairly simple but somewhat tedious to set up; an example implementation is shown below for reference.
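
In matrix form (writing out a length-3 filter for brevity), the full convolution is just

$$ y = Hx, \qquad H = \begin{bmatrix} h_0 & 0 & 0 & \cdots & 0 \\ h_1 & h_0 & 0 & \cdots & 0 \\ h_2 & h_1 & h_0 & \cdots & 0 \\ 0 & h_2 & h_1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & h_2 \end{bmatrix} $$

with one shifted copy of h per output sample.  A ‘same’-mode output keeps only len(x) of those rows, offset to match scipy’s centering convention, which is what the [1:len(x)+1] slice below does for a length-4 filter.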

#!/usr/bin/env python
import numpy as np
from scipy import linalg
from scipy import signal
x = np.array([0,0,1,0,0,2,0,0,0]) # 9
h = np.array([0,1,2,0]) # 4
y = signal.convolve(x, h, mode='same')
print "x", x
print "h", h
print "y(conv):", y
# set up the toeplitz matrix
padding = np.zeros(len(x)-1, h.dtype)
first_col = np.r_[h, padding]
first_row = np.r_[h[0], padding]
H = linalg.toeplitz(first_col, first_row)[1:len(x)+1,:]
print "shape", H.shape, x.shape
y = np.sum(np.multiply(x,H), 1)
print "y(mult):", y
print "**********************"
x = np.array([0,0,1,0,0,2,0,0,0]) # nsamp
x = np.tile(x,[10,1]) # n_ex x n_samp
h = np.array([0,1,2,0]) # n_tap
h = np.tile(h,[10,1]) # n_ex x n_tap
y = np.zeros([x.shape[0], x.shape[1]])
for i in range(0,x.shape[0]):
    y[i,:] = signal.convolve(x[i,:], h[i,:], mode='same')
print "x", x
print "h", h
print "y(conv):", y
# set up the toeplitz matrix
H = np.zeros([ x.shape[0], x.shape[1], x.shape[1] ]) # n_ex x n_samp x n_samp
for i in range(0,x.shape[0]):
    padding = np.zeros(x.shape[1]-1, h.dtype) #
    first_col = np.r_[h[i,:], padding] #
    first_row = np.r_[h[i,0], padding] #
    H[i,:,:] = linalg.toeplitz(first_col, first_row)[1:x.shape[1]+1,:]
print "H shape", H.shape
print H[0,:,:]
x = x.reshape([x.shape[0], 1, x.shape[1]]) # n_ex x 1 x n_samp; broadcasts against H in the multiply below
y = np.sum(np.multiply(x,H), 2)
print "y(mult):", y
print "**********************"
# finally, print the (sliced) Toeplitz indexing matrix for a distinct-valued h so the shifted structure is easy to see
n = 9 # input length for this demo
h = np.array([0,1,2,3,4,5,6,7,8], dtype='int32')
padding = np.zeros(n-1, h.dtype)
first_col = np.r_[h, padding]
first_row = np.r_[h[0], padding]
H = linalg.toeplitz(first_col, first_row)[1:n+1,:]
print H

 

KeRLym: A Deep Reinforcement Learning Toolbox in Keras

Reinforcement learning coupled with deep learning based function approximation has been an exciting area over the past couple of years.  The appeal of learning methods which can effectively learn to search an action/reward environment and derive a good policy based on experience and random exploration is quite significant for a wide range of applications.  Vlad Mnih’s original DeepMind Deep-Q Networks (DQN) paper demonstrating raw-pixel based learning on Atari games was an awesome demonstration of what was possible, and there have been tons of improvements and other interesting applications by others since then.

Since then, who hasn’t wanted to play around with RL on the handful of Atari games and their own domain specific automation tasks?   DeepMind released their code for this experiment along with the Nature paper, but it was frustratingly in Lua/Torch and, as the paper stated, takes quite a long time (~30 days?) to learn Atari games to a high level of skill.   Since then I’ve become quite fond of working with Keras, Theano, and TensorFlow on a range of ML problems — the workflow and simplicity of python/numpy/tensor algorithm definition and CUDA cross-compilation is just too attractive and productive for me to want to work in anything else right now, so naturally I wanted to leverage these same tools in the RL space.  I started looking into DQN and other RL algorithm implementations available and found a handful of helpful examples, but no particularly featureful, fast, or satisfying projects designed to be easily applied to your own environments, which got me thinking about standardizing an interface to environments so that learners could be easily applied to a wide class of problems.  Shortly after this thought occurred to me, OpenAI published their Gym software and online environment scoreboard — pretty much solving this problem and providing a wide range of environmental learning tasks already integrated into a relatively simple reinforcement learning environment API.   This was great; I started playing with DQN implementations leveraging Keras on top of Gym, and KeRLym (Keras+RL+Gym) was the result.
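
For anyone who hasn’t used it, the Gym interface that kerlym’s agents are written against boils down to a small loop.  A minimal sketch with random actions (the environment name is just an example) looks like this:

import gym

env = gym.make("Breakout-v0")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()         # a real agent picks actions from its policy instead
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)                            # per-episode score, the curve we try to push upward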

The initial results from kerlym were relatively frustrating: DQN tuning is hard and implementing the algorithms is error prone.  The Atari simulator isn’t the fastest, and it takes quite a while to sequentially play enough games to generate a significant amount of experience.  There has been a good bit of work recently on asynchronous methods for RL, running lots of agents in parallel which each run their own episodes while sharing model parameters and gradients.  Corey Lynch published an awesome implementation of async-rl using Keras and Gym-based Atari games which I spent a good bit of time playing with.  The result was that I refactored kerlym significantly to leverage a lot of the async-dqn techniques demonstrated there.

With the new asynchronous DQN implementation, frame-diff’ing, and the Atari frame pre-processor Andrej Karpathy used recently in his blog post about RL, I finally had a somewhat effective learner that I could set loose on Atari games and watch a gradual improvement of total reward (score per game) take form over the course of several hours.   Below is an example of ~64k episodes of Breakout running on kerlym with diagnostic plots enabled to monitor training performance.

[Figure: kerlym training diagnostics over ~64k episodes of Breakout]

At this point I finally have some confidence in the correctness of this agent implementation, but there are still countless hyper-parameters which can be tuned and which significantly affect performance.

There are a couple of directions I hope to go at this point:

  • Implementing additional agents to compare training performance: I love the speedup of asynchronous/concurrent training, and I’m impatient for multi-day RL tests, so I would really love to add working asynchronous Policy Gradient (PG), TRPO, and A3C agents which can be easily interchanged and tested.
  • Exploring applications of these learning agents to new domains: What other tasks can we readily train and execute using the learning models we have at this point?  Being an applied person, I kind of want to throw DQNs at every task under the sun and see what works; the goal of kerlym is largely to make this easy to do.  I’ve started building out-of-tree gym environments for various tasks, such as the tone search task described here, and it’s exciting to think of the possibilities of applying this to a number of radio domain tasks.

For now, it’s hard to stop watching kerlym play Breakout and Pong over and over, slowly improving.

https://github.com/osh/kerlym

Learning to Synchronize with Attention Models

Synchronization is often one of the most involved tasks to get right when building, testing, and deploying a radio system.  In this work, we look at treating synchronization as a learned attention model in a deep neural network to provide a canonical form signal for classification.  We use the same discriminative network as in prior work and obtain slightly better classification performance.  We introduce a handful of new layers into Keras to build a domain specific set of radio transforms, supplementing those used in imagery and described in this paper on Spatial Transformer Networks.

[Figure: radio transformer network architecture]

Classification is perhaps not the most interesting task for which to apply an attention model for synchronization.  Due to the extremely low SNR of much of the data-set, good synchronization is hard to achieve on short data samples with either learned or expert synchronization metrics, and many of the learned discriminative features seem to be relatively robust to synchronization error.  We plan to revisit this attention model in future work, potentially for other sorts of tasks for which it may be more beneficial.  Regardless, plotting a color-coded distribution over the density of constellation points before and after the transform on the QPSK subset of the data-set, we can definitely see some qualitative improvements in orderly signal structure.

[Figure: constellation point density before and after the learned transform (QPSK subset)]

Check out the paper on arXiv for more details!

Unsupervised Radio Signal Representation Learning

We’ve just posted a brief new arXiv article (https://arxiv.org/abs/1604.07078) on learning to represent modulated radio signals using unsupervised learning.  We employ a small autoencoder network with convolutional and fully connected layers to fit a sparse signal representation with no expert knowledge or supervision.  Mean squared error reconstruction distance and regularization are used during training.

[Figure: autoencoder network structure]

An example of a noisy test input, its compressed representation, and its reconstruction is shown below for a QPSK signal; additional details are available in the arXiv paper!  We achieve a 16x compression in information density (2x88x4 -> 1x44), and 128x in storage space (2x88x32 -> 1x44)!  We’re looking forward to doing many more things with these ideas!

[Figure: noisy QPSK test example, its compressed representation, and its reconstruction]

As a side note, since manually drawing hundreds of neural network connection lines in diagramming tools is really not fun, I’ve posted a small tool called NNPlot on github which attempts to make generating high level conceptual neural network diagrams much easier.  Hopefully someone else will find it of use some day; the network diagram above is the first example in it.

Dynamic GNU Radio Channel Model Enhancements

In an attempt to test modem performance deterministically through dropout conditions and partial selective fades, we added the fading model and selective fading model to GNU Radio a few years ago.   Recently Bastian Bloessl pointed out that the auto-correlation properties of these channel responses were degrading over time and did a great write-up on it here.

Flat Fader Corrections

After looking into the issue, we now have stable auto-correlation properties that no longer degrade with the phase accumulation and resulting loss of floating point precision that occurred after very long runs of the original channel model.   Here we see the autocorrelation at the beginning of a run, and 500 MSamples into a run, both following the analytically expected ACF closely with the patch introduced here.   Soon to be squashed and merged in a cleaner fashion.

[Figures: measured vs. analytic autocorrelation at the start of a run and 500 MSamples into a run]
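
For anyone wanting to reproduce this kind of check, a small sketch of the comparison is below; taps would be a recorded sequence of complex fader gains, and the normalized Doppler fD here is just an assumed value.

import numpy as np
from scipy.special import j0

def empirical_acf(taps, nlags):
    taps = taps - np.mean(taps)
    acf = np.array([np.mean(taps[lag:] * np.conj(taps[:len(taps)-lag])) for lag in range(nlags)])
    return np.real(acf / acf[0])

fD = 1e-4                                # Doppler rate in cycles/sample (assumed)
lags = np.arange(2000)
analytic = j0(2.0 * np.pi * fD * lags)   # Clarke/Jakes flat-fading ACF, J0(2*pi*fD*tau)
# measured = empirical_acf(taps, 2000)   # compare this curve against the analytic one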

Selective Fading Model 2

The selective fading model in GNU Radio takes N flat fading models at fixed fractional delays, measured in samples, to define a power delay profile (PDP).  For instance, delays of [0,1,1.5] and amplitudes of [1,1,0.5] would introduce three flat fading components to the PDP at 0 samples delay, 1 sample delay, and 1.5 samples delay, with magnitudes of 1, 1, and 0.5 respectively.   This is a standard way to form a frequency-selective fading channel out of a small number of flat fading components.    However, it is a rather contrived fading channel because the PDP components sit at fixed delays which don’t change during the simulation!   In the real world, we are moving around, reflectors are moving around, and direct and indirect path lengths change over time, so the delays corresponding to these paths shift earlier or later in time.

In an attempt to simulate this effect, we’ve introduced the selective fading model 2, which adds a delay_std and delay_maxdev parameter to each PDP component.   The delay_std defines the standard deviation of a Gaussian random walk in delay per sample, measured in samples, while the delay_maxdev defines a maximum distance in time to deviate from the initial delay value.   Experimentally, this significantly helps to reduce repetitious behavior and creates a more realistic-seeming fading environment for some scenarios.
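
A small numpy sketch of the behavior those two parameters describe is below; clipping is just one simple way to enforce the maximum deviation, and the actual block may bound the walk differently.

import numpy as np

delays = np.array([0.0, 1.0, 1.5])   # initial PDP component delays, in samples
delay_std = 1e-4                     # random-walk step std-dev per output sample, in samples
delay_maxdev = 0.5                   # max allowed deviation from each initial delay, in samples

start = delays.copy()
history = []
for _ in range(100000):              # one step per output sample
    delays = delays + np.random.normal(0.0, delay_std, size=delays.shape)
    delays = np.clip(delays, start - delay_maxdev, start + delay_maxdev)
    history.append(delays.copy())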

Using gr-fosphor, we can see a brief excerpt of white noise sent through a fading channel using selective model 2 below.

[Figure: gr-fosphor view of white noise passed through selective fading model 2]

Improved Channel Diagnostics and Visualization

A useful step in the validation of this or any other channel model is inspecting the impulse response of the channel.   To enable this, we add a message output port to the selective_model2 block which passes the complex channel taps forward at the end of every work function.   For now we simply plot these complex vector messages so that we can visually see the effect of the channel in the time domain while observing its effect on the spectrogram.   These could of course also be used to cheat in an equalizer or other channel estimation algorithm and use channel state information (CSI) that would otherwise not be available in a real system.   This could be very useful for validation or performance measurement of such algorithms in the future.
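
As a sketch of how such tap messages could be consumed from Python, something like the block below would work; note that the port name and message format here are assumptions (check the block for the actual port name, and if the taps arrive as a PDU, take pmt.cdr() of the message before converting it).

import numpy as np
import pmt
from gnuradio import gr

class tap_probe(gr.basic_block):
    def __init__(self):
        gr.basic_block.__init__(self, name="tap_probe", in_sig=None, out_sig=None)
        self.message_port_register_in(pmt.intern("taps"))
        self.set_msg_handler(pmt.intern("taps"), self.handle_taps)

    def handle_taps(self, msg):
        taps = np.array(pmt.c32vector_elements(msg), dtype=np.complex64)
        print(np.abs(taps))           # or hand the taps to a plot sink / equalizer under test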

The graph implementing this simulation and a still from it are shown below …

[Figure: GNU Radio flowgraph implementing the fading simulation]

[Figure: channel tap impulse response and spectrogram displays]

Finally, running the simulation, we put together a short video clip to show flat white noise through a Rayleigh/NLOS channel simulation.

[Video: white noise through a Rayleigh/NLOS channel simulation]

Convolutional Radio Modulation Recognition Networks

In an arXiv pre-publication report out today, Johnathan Corgan and I study the adaptation of convolutional neural networks to the task of modulation recognition in wireless systems.   We use a relatively simple two-layer convolutional network followed by two dense layers, a much smaller network than required for tasks such as ImageNet/ILSVRC.

[Figure: convolutional modulation recognition network structure]

We demonstrate that blind time domain feature learning can perform extremely well at the task of modulation classification, achieving a very high accuracy rate on both clean and noisy data sets.

[Figure: confusion matrix for the convolutional network classifier]

As we compare the classifier performance across a wide range of signal to noise ratios, we demonstrate that it outperforms a number of more traditional expert classifiers using zero-delay cumulant features by a large margin.

[Figure: classification accuracy vs. SNR for each classifier]

While this is preliminary work, we think the results are exciting and that many additional promising results will come from the marriage of software radio and deep learning fields.

For much more detail on these results, please see our paper!  http://arxiv.org/abs/1602.04105

3D Printing a USRP B200 Mini Case

2016-02-10

[edit] Download or Order this model here

USRPs are incredibly handy devices; they let us play all over the spectrum with the signal processing algorithms and software of the day.  The USRP B210 was an awesome step in practicality, requiring only USB3 for I/O and power and minimizing the number of cables required to haul around.   However, its size and chunky case options have been a source of frustration.   It’s a nice thing to always keep with you, but when packing bags and conserving space, it just can’t always make the cut.

Ettus Research recently changed all that with the USRP B200 Mini, a very compact version of the B200 which takes up virtually no space, but frustratingly doesn’t ship with a case to protect it from abuse!   Carrying around padded electrostatic wrap bags isn’t particularly appealing or protective, so I set about putting together a functional case for the device that would at least protect it from physical abuse.

The top and bottom case renderings of the resulting design are shown below, about the size of a stack of business cards. As long as a GPS-DO isn’t needed, this is now pretty much the perfect compact carrying companion for GNU Radio.

[Figures: bottom and top renderings of the case design]

The first print, on a pretty low end 3D printer, is shown below.   After a few tweaks, fitment around the SMA plugs is very tight, the board fits snugly into the case otherwise, and a bit of space was added around the USB port to allow various sized plugs to clear it.

[Photo: first print of the case]

For scale, we show it here next to a full size B210 + case.   While much of the design of the underlying board is the same, the size reduction, the more tightly fitted case, and the resulting hauling size of this device are pretty amazing!

[Photo: the B200 Mini case next to a full size B210 and case]

The fit isn’t completely perfect, and it could use a little bit more clearance in a few spots, but it shouldn’t be putting too much tension on any overly fragile areas, and it seems like it could take quite a bit of beating.   We’ll see how long this one survives!

For anyone interested in having one of these, the STL Case Models have been made available for purchase, or download on shapeways at https://www.shapeways.com/shops/osh

A bit more eye candy below …

[Photo: the micro-shibu]

[Photos: additional views of the case]

Note: I would suggest using something like a #4-40 thread and 3/8″ length screw size for securing this; see links below.

Black #4-40, 3/8″ Machine Screws   Black #4-40 Hex Nut

 

GNU Radio TensorFlow Blocks

TensorFlow is a powerful python-numpy expression compiler which supports concurrent GPP and GPU offload of large algorithms.  It has been used largely in the machine learning community, but it has implications for the rapid and efficient implementation of numerous other algorithms in software.   For GNU Radio, it matches up wonderfully with GNU Radio’s python blocks, which pass signal processing data around as numpy ndarrays that can be passed directly to and from TensorFlow compiled functions.   This is very similar to what I did with gr-theano, but with the caveat that TensorFlow has native complex64 support without any additional patching!  This makes it a great candidate for dropping in highly computationally complex blocks for prototyping, and for leveraging highly concurrent GPUs when there is gross data parallelism that can easily be exploited by the compiler.

A quick example of dropping TensorFlow into a python block might look something like this

import numpy
import tensorflow
from gnuradio import gr

class add(gr.sync_block):
    x = tensorflow.placeholder("complex64")
    y = tensorflow.placeholder("complex64")
    def __init__(self):
        gr.sync_block.__init__(self,
            name="tf_add",
            in_sig=[numpy.complex64, numpy.complex64],
            out_sig=[numpy.complex64])
        self.sess = tensorflow.Session()
        self.op = tensorflow.add(self.x, self.y)
    def work(self, input_items, output_items):
        rv = self.sess.run([self.op], feed_dict={self.x: input_items[0], self.y: input_items[1]})
        output_items[0][:] = rv[0]
        return len(rv[0])

We simply define self.op as an algorithmic expression we want to compute at run time, and TensorFlow will compile the kernel down to the GPP or GPU depending on available resources, and handle all of the data I/O behind the scenes after we simply pass ndarrays in and out of the work function.

[Figure: GNU Radio flowgraph using the TensorFlow add block]

Dropping this block into a new gr-tf out of tree module, we can rapidly plug it into a working GNU Radio flowgraph stream! Clearly there are algorithms which make a lot more sense to offload than “add_cc”.  Things like streaming CAF or MTI computations with lots of concurrent integration come to mind and would be pretty trivial to add.  For now this is just a proof of concept, but it seems like a great way to prototype such things in the future!
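
As a quick usage sketch, wiring the block into a flowgraph looks like any other python block; this assumes the add class above is importable, and the sources, rates, and head length are just placeholders.

from gnuradio import gr, blocks, analog

class tf_demo(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self)
        src0 = analog.sig_source_c(32000, analog.GR_COS_WAVEFORM, 1000, 1.0, 0)
        src1 = analog.sig_source_c(32000, analog.GR_COS_WAVEFORM, 2000, 1.0, 0)
        summer = add()                                  # the TensorFlow-backed block defined above
        head = blocks.head(gr.sizeof_gr_complex, 4096)
        self.sink = blocks.vector_sink_c()
        self.connect(src0, (summer, 0))
        self.connect(src1, (summer, 1))
        self.connect(summer, head, self.sink)

tb = tf_demo()
tb.run()
print(len(tb.sink.data()))                              # should be 4096 summed samples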

The module is available on github @ https://github.com/osh/gr-tf/