# An Introduction to Deep Learning for the Physical Layer

#### Timothy J. O’Shea, Jakob Hoydis

We present and discuss several novel applications of deep learning for the physical layer. By interpreting a communications system as an autoencoder, we develop a fundamental new way to think about communications system design as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process. We show how this idea can be extended to networks of multiple transmitters and receivers and present the concept of radio transformer networks as a means to incorporate expert domain knowledge in the machine learning model. Lastly, we demonstrate the application of convolutional neural networks on raw IQ samples for modulation classification which achieves competitive accuracy with respect to traditional schemes relying on expert features. The paper is concluded with a discussion of open challenges and areas for future investigation.

https://arxiv.org/abs/1702.00832

# Bit Error Rate Testing in GNU Radio

When testing modems, it's often a good idea to make sure the bit error rate (BER) of your receiver lines up with what you would expect from theory. To this end, GNU Radio has long needed a handful of blocks which make this easy. Test equipment often has built-in pseudo-random bit sequence (PRBS) modes which can produce long, known strings of whitened bits for this sort of testing, but we’ve not had handy blocks to do this in a nice way without manually using the lfsr block, xor block, and something to count bit errors.

Today I added prbs_source_b and prbs_sink_b to the gr-mapper OOT module, which provide ready-made blocks for this purpose. An example flowgraph called “prbs_test.grc” has been provided in gr-mapper, which runs a QPSK loopback test of these BER calculation blocks. For the moment it's just printing statistics to the screen and averaging them linearly from startup to the current time. At some point these blocks could output async messages if they needed to be incorporated into a larger suite or some downstream logic, and if we wanted a recent rolling BER rather than an absolute BER over the entire run, we could implement some kind of IIR-based averaging in the update. Regardless, these blocks aren’t super exciting, but they are perhaps useful tools that others can use in modem verification! Screenshot below.

These same blocks should work equally well over the air, or with other modulations, so long as your framing/sync keeps them properly aligned!
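The difference between the linear (start-to-now) average and an IIR-based rolling average mentioned above can be sketched in a few lines of Python. This is only an illustration and not the gr-mapper code: the class name `BerAverager` and the smoothing factor `alpha` are assumptions made up for the example.

```python
import numpy as np

class BerAverager:
    """Track a cumulative BER and an IIR-smoothed 'recent' BER (hypothetical helper)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha        # IIR smoothing factor (assumed), 0 < alpha <= 1
        self.total_bits = 0
        self.total_errors = 0
        self.recent_ber = None    # unset until the first block arrives

    @property
    def cumulative_ber(self):
        # Linear average over everything seen since startup
        return self.total_errors / self.total_bits if self.total_bits else 0.0

    def update(self, tx_bits, rx_bits):
        tx_bits = np.asarray(tx_bits, dtype=np.uint8)
        rx_bits = np.asarray(rx_bits, dtype=np.uint8)
        errors = int(np.count_nonzero(tx_bits ^ rx_bits))  # XOR marks bit errors
        self.total_bits += tx_bits.size
        self.total_errors += errors
        block_ber = errors / tx_bits.size
        if self.recent_ber is None:
            self.recent_ber = block_ber
        else:
            # Single-pole IIR: older blocks decay geometrically
            self.recent_ber = (1 - self.alpha) * self.recent_ber + self.alpha * block_ber
        return self.cumulative_ber, self.recent_ber
```

Calling `update()` once per block of compared bits returns both numbers; a larger `alpha` makes the rolling value react faster to a change in channel conditions, at the cost of more variance.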

# Learning to Communicate with Unsupervised Channel Autoencoders

Our radio physical layers are actually pretty simplistic and boring in the world right now: PSK and QAM are well-defined expert representations of information for transit over a wireless channel. Systems using OFDM and SC-FDMA are a bit more involved, but use some of the same constructs underneath with a bit of sub-carrier shuffling. Forward error correction (FEC), equalization, randomization, and a number of other functions are generally bolted onto this as separate and independent blocks and transforms, each making up for the performance properties or assumptions of the other layers in order to form an effective end-to-end system.

Enter machine learning … rethink all the things …

We’ve just pre-pubbed a paper to arXiv focusing on trying to learn entire communications systems using unsupervised reconstruction learning (autoencoders).   We seek to reconstruct transmitted information bits at a receiver while introducing channel impairments in the hidden layer of the network to simulate a wireless channel.   By doing this we force learned representations in the encoder and decoder to adapt jointly to optimize for reconstruction performance of the information bits (we refer to this as channel regularization).  The high level design looks something like this:

We evaluate a number of different autoencoder network structures and also consider keeping the CNN layer constrained to a relatively low number of filters, to emulate the relatively low number of symbols typically used in communications systems (this is not necessarily optimal, but it helps with intuition). The structure of our DNN-CNN network candidate looks something like this:

Once we learn a transmit/receive representation in the autoencoder we can evaluate its performance across a range of channel conditions.  Traditional wireless channel performance measures such as BER vs SNR and spectral efficiency can be easily compared to legacy expert modulation techniques as shown below.
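As a point of reference for such comparisons, here is a minimal sketch of how a BER vs SNR baseline for one legacy expert modulation could be produced. It assumes Gray-mapped QPSK over an AWGN channel with SNR expressed as Eb/N0; the function names are made up for this example and nothing here is taken from the paper's code.

```python
import math
import numpy as np

def qpsk_ber_theory(ebn0_db):
    """Textbook QPSK/BPSK bit error rate over AWGN: 0.5 * erfc(sqrt(Eb/N0))."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))

def qpsk_ber_sim(ebn0_db, n_bits=200_000, seed=0):
    """Monte Carlo BER estimate for Gray-mapped QPSK (n_bits must be even)."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=n_bits)
    # One bit on I, one bit on Q, unit symbol energy (Es = 1, so Eb = 1/2)
    sym = ((1 - 2 * bits[0::2]) + 1j * (1 - 2 * bits[1::2])) / np.sqrt(2)
    n0 = 0.5 / 10 ** (ebn0_db / 10)      # N0 = Eb / (Eb/N0)
    sigma = np.sqrt(n0 / 2)              # noise std per real dimension
    noise = sigma * (rng.standard_normal(sym.size)
                     + 1j * rng.standard_normal(sym.size))
    rx = sym + noise
    # Hard decisions: a negative component decodes to bit 1
    rx_bits = np.empty(n_bits, dtype=int)
    rx_bits[0::2] = rx.real < 0
    rx_bits[1::2] = rx.imag < 0
    return float(np.mean(rx_bits != bits))
```

Sweeping `ebn0_db` over a range and plotting both functions on a log axis gives the kind of curve a learned representation would be compared against.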

We discuss a handful of other issues, including how to start jointly learning synchronization methods on the front of the decoder using radio transformer networks and how to start simulating channel effects beyond simple additive Gaussian noise. I’m pretty excited about the future of this form of unsupervised communications system learning, and there’s a ton of work to do to make it work way better over the air and amongst harsh channel conditions. Hoping to see what others do with this, and to finalize a conference version of it for submission soon.

Check out the paper at: https://arxiv.org/abs/1608.06409

# MNIST Generative Adversarial Model in Keras

Some of the generative work done in the past year or two using generative adversarial networks (GANs) has been pretty exciting and demonstrated some very impressive results.  The general idea is that you train two models, one (G) to generate some sort of output example given random noise as input, and one (A) to discern generated model examples from real examples.  Then, by training A to be an effective discriminator, we can stack G and A to form our GAN, freeze the weights in the adversarial part of the network, and train the generative network weights to push random noisy inputs towards the “real” example class output of the adversarial half.

Building this style of network in the latest versions of Keras is actually quite straightforward. I’ve wanted to try this out on a number of things, so I put together a relatively simple version using the classic MNIST dataset, taking a GAN approach to generating random handwritten digits.

Before going further I should mention that all of this code is available on GitHub at https://github.com/osh/KerasGAN.

## Generative Model

We set up a relatively straightforward generative model in Keras using the functional API, taking 100 random inputs and eventually mapping them to a [1,28,28] pixel output to match the MNIST data shape. We begin by generating a dense 14×14 set of values, and then run through a handful of filters of varying sizes and numbers of channels, and ultimately train using an Adam optimizer for binary cross-entropy (although we really only use the generator model in the forward direction; we don’t train directly on this model itself). We use a sigmoid on the output layer to help saturate pixels into 0 or 1 states rather than a range of grays in between, and use batch normalization to help accelerate training and ensure that a wide range of activations are used within each layer.

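```python
# Keras 1.x-era import paths for the layers used below
from keras.layers import Input
from keras.layers.core import Dense, Activation, Reshape
from keras.layers.convolutional import Convolution2D, UpSampling2D
from keras.layers.normalization import BatchNormalization
from keras.models import Model

# Build Generative model ...
# (opt is the Adam optimizer instance mentioned above, defined elsewhere)
nch = 200
g_input = Input(shape=[100])
# Dense layer producing nch feature maps of 14x14 values
H = Dense(nch*14*14, init='glorot_normal')(g_input)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Reshape( [nch, 14, 14] )(H)
# Upsample 14x14 -> 28x28 to match the MNIST image size
H = UpSampling2D(size=(2, 2))(H)
# Integer division keeps the filter counts integral on Python 3 as well
H = Convolution2D(nch//2, 3, 3, border_mode='same', init='glorot_uniform')(H)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Convolution2D(nch//4, 3, 3, border_mode='same', init='glorot_uniform')(H)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
# 1x1 convolution down to a single output channel, then sigmoid pixels
H = Convolution2D(1, 1, 1, border_mode='same', init='glorot_uniform')(H)
g_V = Activation('sigmoid')(H)
generator = Model(g_input,g_V)
generator.compile(loss='binary_crossentropy', optimizer=opt)
generator.summary()
```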

We now have a network which could in theory take in 100 random inputs and output digits, although the current weights are all random and this clearly isn’t happening just yet.

## Adversarial Model

We build an adversarial discriminator network to take in [1,28,28] image vectors and decide if they are real or fake by using several convolutional layers, a dense layer, lots of dropout, and a two element softmax output layer encoding: [0,1] = fake, and [1,0] = real.  This is a relatively simple network, but the goal here is largely to get something that works passably and trains relatively quickly for experimentation.

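```python
# Keras 1.x-era import paths for the layers used below
from keras.layers import Input
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Convolution2D
from keras.layers.advanced_activations import LeakyReLU
from keras.models import Model

# Build Discriminative model ...
# shp is the input image shape ([1, 28, 28] per the text); dropout_rate and
# dopt (the discriminator's optimizer) are defined elsewhere
d_input = Input(shape=shp)
# Note: the conv layers already apply a ReLU, so the LeakyReLU that follows
# each of them only ever sees non-negative inputs
H = Convolution2D(256, 5, 5, subsample=(2, 2), border_mode = 'same', activation='relu')(d_input)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
H = Convolution2D(512, 5, 5, subsample=(2, 2), border_mode = 'same', activation='relu')(H)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
H = Flatten()(H)
H = Dense(256)(H)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
# Two-element softmax output: [0,1] = fake, [1,0] = real
d_V = Dense(2,activation='softmax')(H)
discriminator = Model(d_input,d_V)
discriminator.compile(loss='categorical_crossentropy', optimizer=dopt)
discriminator.summary()
```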

We pre-train the discriminative model by generating a handful of random images using the untrained generative model, concatenating them with an equal number of real images of digits, labeling them appropriately, and then fitting until we reach a relatively stable loss value which takes 1 epoch over 20,000 examples.  This is an important step which should not be skipped — pre-training accelerates the GAN massively and I was not able to achieve convergence without it (possibly due to impatience).
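The data preparation for this pre-training step can be sketched with plain NumPy. The helper name `make_discriminator_batch` is made up for this example; in the Keras setting the `fake_images` argument would come from something like `generator.predict(noise)`, and the label encoding follows the [1,0] = real, [0,1] = fake convention described above.

```python
import numpy as np

def make_discriminator_batch(real_images, fake_images, rng=None):
    """Concatenate real and generated images and build one-hot labels.

    Labels follow the convention in the text: [1, 0] = real, [0, 1] = fake.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_real = len(real_images)
    X = np.concatenate([real_images, fake_images], axis=0)
    y = np.zeros((X.shape[0], 2), dtype=np.float32)
    y[:n_real, 0] = 1.0   # real examples
    y[n_real:, 1] = 1.0   # generated examples
    # Shuffle so real and fake examples are mixed within each mini-batch
    order = rng.permutation(X.shape[0])
    return X[order], y[order]
```

The resulting `X, y` pair is what would be handed to the discriminator's fit call for the single pre-training epoch.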

## Generative Adversarial Model

Now that we have both the generative and adversarial models, we can combine them to make a GAN quite easily in Keras.  Using the functional API, we can simply re-use the same network objects we have already instantiated and they will conveniently maintain the same shared weights with the previously compiled models.  Since we want to freeze the weights in the adversarial half of the network during back-propagation of the joint model, we first run through and set the keras trainable flag to False for each element in this part of the network.  For now, this seems to need to be applied at the primitive layer level rather than on the high level network so we introduce a simple function to do this.

```python
# Freeze weights in the discriminator for stacked training
def make_trainable(net, val):
    net.trainable = val
    for l in net.layers:
        l.trainable = val
make_trainable(discriminator, False)

# Build stacked GAN model
gan_input = Input(shape=[100])
H = generator(gan_input)
gan_V = discriminator(H)
GAN = Model(gan_input, gan_V)
GAN.compile(loss='categorical_crossentropy', optimizer=opt)
GAN.summary()
```

At this point, we now have a randomly initialized generator, a (poorly) trained discriminator, and a GAN which can be trained across the stacked model of both networks. The core of the training routine for a GAN looks something like this.

1. Generate images using G and random noise (forward pass only).
2. Perform a Batch update of weights in A given generated images, real images, and labels.
3. Perform a Batch update of weights in G given noise and forced “real” labels in the full GAN.
4. Repeat…
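The steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the exact loop from the repository: the function name `train_gan_step` is made up, the models are only assumed to expose Keras-style `predict` and `train_on_batch` methods, and the uniform noise distribution is an assumption.

```python
import numpy as np

def train_gan_step(generator, discriminator, gan, real_images,
                   noise_dim=100, rng=None):
    """One pass through steps 1-3 (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    n = real_images.shape[0]

    # 1. Generate images using G and random noise (forward pass only)
    noise = rng.uniform(0.0, 1.0, size=(n, noise_dim))  # distribution assumed
    fake_images = generator.predict(noise)

    # 2. Batch update of A on real + generated images with true labels
    X = np.concatenate([real_images, fake_images], axis=0)
    y = np.zeros((2 * n, 2))
    y[:n, 0] = 1.0   # [1, 0] = real
    y[n:, 1] = 1.0   # [0, 1] = fake
    d_loss = discriminator.train_on_batch(X, y)

    # 3. Batch update of G through the stacked GAN with forced "real" labels
    noise = rng.uniform(0.0, 1.0, size=(n, noise_dim))
    y_forced_real = np.zeros((n, 2))
    y_forced_real[:, 0] = 1.0
    g_loss = gan.train_on_batch(noise, y_forced_real)

    return d_loss, g_loss
```

Step 4 is then just calling this in a loop over mini-batches of real images and appending the two returned losses to lists for plotting.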

Running this process for a number of epochs, we can plot the loss of the GAN and Adversarial loss functions over time to get our GAN loss plots during training.

And finally, we can plot some samples from the trained generative model which look relatively like the original MNIST digits, and some examples from the original dataset for comparison.

https://github.com/osh/KerasGAN

# Reducing 1D Convolution to a Single (Big) Matrix Multiplication

This is perhaps the third time I’ve needed this recipe and it doesn’t seem to be readily available on Google. Theano and TensorFlow provide convolution primitives for 1D and 2D, but (correct me if I’m wrong) I think they are generally constrained such that the filter taps you are convolving must be parameters, and not additional tensor values in a big tensor application. This is unfortunate, and annoying for certain operations, and my workaround is to implement my own convolution as a matrix multiplication based on a properly indexed version of the input and tap tensors within an operation.

Anyway, hopefully this snippet will be useful to someone else some day –

The idea here is that we can use a Toeplitz matrix to generate a large 2D matrix (H) whose entries are indexes into a 1D input of taps (h). Multiplying our input (x) by the 2D matrix (H) then gives us our convolution output (y). It's fairly simple but somewhat tedious to set up, so an example implementation is shown below for reference.

```python
#!/usr/bin/env python
import numpy as np
from scipy import linalg
from scipy import signal

# --- Single example: reference result from scipy ---
x = np.array([0,0,1,0,0,2,0,0,0]) # 9 samples
h = np.array([0,1,2,0]) # 4 taps
y = signal.convolve(x, h, mode='same')
print("x", x)
print("h", h)
print("y(conv):", y)

# set up the toeplitz matrix
padding = np.zeros(len(x)-1, h.dtype)
first_col = np.r_[h, padding]
first_row = np.r_[h[0], padding]
# keep the rows that line up with mode='same' for this 4-tap filter
H = linalg.toeplitz(first_col, first_row)[1:len(x)+1,:]
print("shape", H.shape, x.shape)
# row-wise multiply and sum, i.e. the matrix-vector product of H and x
y = np.sum(np.multiply(x,H), 1)
print("y(mult):", y)
print("**********************")

# --- Batched example: one filter per input example ---
x = np.array([0,0,1,0,0,2,0,0,0]) # n_samp
x = np.tile(x,[10,1]) # n_ex x n_samp
h = np.array([0,1,2,0]) # n_taps
h = np.tile(h,[10,1]) # n_ex x n_taps
y = np.zeros([x.shape[0], x.shape[1]])
for i in range(0,x.shape[0]):
    y[i,:] = signal.convolve(x[i,:], h[i,:], mode='same')
print("x", x)
print("h", h)
print("y(conv):", y)

# set up the toeplitz matrix
H = np.zeros([ x.shape[0], x.shape[1], x.shape[1] ]) # n_ex x n_samp x n_samp
for i in range(0,x.shape[0]):
    padding = np.zeros(x.shape[1]-1, h.dtype)
    first_col = np.r_[h[i,:], padding]
    first_row = np.r_[h[i,0], padding]
    H[i,:,:] = linalg.toeplitz(first_col, first_row)[1:x.shape[1]+1,:]
print("H shape", H.shape)
print(H[0,:,:])
# repeat each input along a new axis so it lines up with the rows of H
x = x.reshape([x.shape[0], 1, x.shape[1]])
x = np.tile(x, [1,x.shape[1],1])
y = np.sum(np.multiply(x,H), 2)
print("y(mult):", y)
print("**********************")

# --- Index layout: use 0..8 as "taps" to see which tap lands where ---
# (x is 3-D at this point, so len(x) is n_ex = 10 and H comes out 10 x 10)
h = np.array([0,1,2,3,4,5,6,7,8], dtype='int32')
padding = np.zeros(len(x)-1, h.dtype)
first_col = np.r_[h, padding]
first_row = np.r_[h[0], padding]
H = linalg.toeplitz(first_col, first_row)[1:len(x)+1,:]
print(H)
```
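For reuse, the same recipe can be wrapped in a small function and expressed as an actual matrix product. This is a sketch rather than part of the original snippet: the names `conv_matrix_same` and `batched_conv_same` are made up here, the row offset `(len(h) - 1) // 2` generalizes the hard-coded `1` above to other filter lengths, and the input is assumed to be at least as long as the filter.

```python
import numpy as np
from scipy import linalg, signal

def conv_matrix_same(h, n):
    """Return the n x n matrix H with H @ x == convolve(x, h, mode='same').

    Assumes len(x) == n >= len(h).
    """
    h = np.asarray(h)
    first_col = np.r_[h, np.zeros(n - 1, h.dtype)]
    first_row = np.r_[h[0], np.zeros(n - 1, h.dtype)]
    full = linalg.toeplitz(first_col, first_row)   # (n + len(h) - 1) x n
    start = (len(h) - 1) // 2                      # centering used by mode='same'
    return full[start:start + n, :]

def batched_conv_same(x, h):
    """Convolve each row of x (n_ex x n_samp) with the matching row of h."""
    H = np.stack([conv_matrix_same(hi, x.shape[1]) for hi in h])
    # One batched matrix-vector product: y[b, i] = sum_j H[b, i, j] * x[b, j]
    return np.einsum('bij,bj->bi', H, x)
```

In a tensor framework the Toeplitz indexing would be built once as an integer index matrix and applied to the tap tensor with a gather, so that only the final multiplication runs on the per-example values.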

# KeRLym: A Deep Reinforcement Learning Toolbox in Keras

Reinforcement learning coupled with deep learning based function approximation has been an exciting area over the past couple of years. The appeal of learning methods which can effectively learn to search an action/reward environment and derive a good policy based on experience and random exploration is quite significant for a wide range of applications. Vlad Mnih’s original DeepMind Deep-Q Networks (DQN) paper demonstrating raw-pixel based learning on Atari games was an awesome demonstration of what was possible, and there have been tons of improvements and other interesting applications by others since then.

Since then, who hasn’t wanted to play around with RL on the handful of Atari games and their own domain-specific automation tasks? DeepMind released their code for this experiment along with the Nature paper, but it was frustratingly in Lua/Torch and, as the paper stated, takes quite long (~30 days?) to learn Atari games to a high level of skill. Since then I’ve become quite fond of working with Keras, Theano, and TensorFlow on a range of ML problems. The workflow and simplicity of python/numpy/tensor algorithm definition and CUDA cross-compilation is just too attractive and productive for me to want to work in anything else right now, so naturally I wanted to leverage these same tools in the RL space. I started looking into the DQN and other RL algorithm implementations available and found a handful of helpful examples, but no particularly featureful, fast, or satisfying projects which were designed to be easily applied to your own environments. That got me thinking about standardizing an interface to environments so that learners could be easily applied to a wide class of problems. Shortly after this thought occurred to me, OpenAI published their Gym software and online environment scoreboard, pretty much solving this problem and providing a wide range of environmental learning tasks already integrated into a relatively simple reinforcement learning environment API. This was great: I started playing with DQN implementations leveraging Keras on top of Gym, and KeRLym (Keras+RL+Gym) was the result.

The initial results from kerlym were relatively frustrating: DQN tuning is hard and implementing the algorithms is error-prone. The Atari simulator isn’t the fastest, and it takes quite a while to sequentially play enough games to generate a significant amount of experience. There’s been a good bit of work recently on asynchronous methods for RL, running lots of agents in parallel which each run their own episodes and share model parameters and gradients. Corey Lynch published an awesome implementation of async-rl using Keras and Gym-based Atari games, which I spent a good bit of time playing with. As a result, I refactored kerlym significantly to leverage a lot of the async-dqn techniques demonstrated there.

With the new asynchronous DQN implementation and frame-diff’ing (an Atari frame pre-processor Andrej Karpathy used recently in his blog post about RL), I finally had a somewhat effective learner that I could set loose on Atari games and see a gradual improvement of total reward (score per game) take form over the course of several hours. Below is an example of ~64k episodes of Breakout running on kerlym with diagnostic plots enabled to monitor training performance.

At this point I finally have some confidence in the correctness of this agent implementation, but there are still countless hyper-parameters which can be tuned and which significantly affect performance.

There are a couple of directions I hope to go at this point:

• Implementing additional agents to compare training performance: I love the speedup of asynchronous/concurrent training, and I’m impatient for multi-day RL tests, so I would really love to add working asynchronous Policy Gradient (PG), TRPO, and A3C agents which can be easily interchanged and tested.
• Exploring applications of these learning agents to new domains: What other tasks can we readily train and execute using the learning models we have at this point? Being an applied person, I kind of want to throw DQNs at every task under the sun and see what works; the goal of kerlym is largely to make this easy to do. I’ve started building out-of-tree gym environments for various tasks, such as the tone search task described here, and it's exciting to think of the possibilities of applying this to a number of radio domain tasks.

For now, it's hard to stop watching kerlym play Breakout and Pong over and over, slowly improving.

https://github.com/osh/kerlym

# Learning to Synchronize with Attention Models

Synchronization is often one of the most involved tasks to get right when building, testing, and deploying a radio system.  In this work, we look at treating synchronization as a learned attention model in a deep neural network to provide a canonical form signal for classification.  We use the same discriminative network as used in prior work and obtain slightly better classification performance.  We introduce a handful of new layers into Keras to build a domain specific set of radio transforms to supplement those used in imagery and described in this paper on Spatial Transformer Networks.

Classification is perhaps not the most interesting task to apply an attention model for synchronization. Due to the extremely low SNR of much of the data-set, good synchronization is hard to achieve on short data samples with learned or expert synchronization metrics, and many of the learned discriminative features seem to be relatively robust to synchronization error. We plan to revisit this attention model in future work, potentially for other sorts of tasks for which it may be more beneficial. Regardless, plotting a color-coded distribution over the density of constellation points before and after the transform on the QPSK subset of the data-set, we can definitely see some qualitative improvements in orderly signal structure.

Check out the paper on arXiv for more details!

# Unsupervised Radio Signal Representation Learning

We’ve just posted a brief new arXiv article (https://arxiv.org/abs/1604.07078) on learning to represent modulated radio signals using unsupervised learning.  We employ a small autoencoder network with convolutional and fully connected layers to fit a sparse signal representation with no expert knowledge or supervision.  Mean squared error reconstruction distance and regularization are used during training.

One noisy test set example, its compressed representation, and its reconstruction are shown below for a QPSK signal; additional details are available in the arXiv paper! We achieve a 16x compression in information density (2x88x4 -> 1x44), and 128x in storage space (2x88x32 -> 1x44)! We’re looking forward to doing many more things with these ideas!
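The two ratios follow directly from the sizes in parentheses, if one reads the last factor of each input size as bits per value (4 and 32) and counts the 1x44 code as 44 single-bit values. That reading is an assumption made here to reproduce the arithmetic; see the paper for the exact definitions.

```latex
\frac{2 \times 88 \times 4}{1 \times 44} = \frac{704}{44} = 16,
\qquad
\frac{2 \times 88 \times 32}{1 \times 44} = \frac{5632}{44} = 128
```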

As a side note, since manually drawing hundreds of neural network connection lines in diagramming tools is really not fun, I’ve posted a small tool called NNPlot on GitHub which attempts to make generating high-level conceptual neural network diagrams much easier. Hopefully someone else will find this of use some day. The network diagram above is the first example in it.

# Dynamic GNU Radio Channel Model Enhancements

In an attempt to test modem performance deterministically through dropout conditions and partial selective fades, we added the fading model and selective fading model to GNU Radio a few years ago. Recently Bastian Bloessl pointed out that the auto-correlation properties of these channel responses were degrading over time and did a great write-up on it here.

## Flat Fader Corrections

After looking into the issue, we now have stable auto-correlation properties which no longer degrade with the phase accumulation and the large, non-dense floating point representation that occurred after very long runs of the original channel model. Here we see that the autocorrelation at the beginning of a run and 500 MSamples into a run both follow the analytically expected ACF closely with the patch introduced here. It will soon be squashed and merged in a cleaner fashion.

## Selective Fading Model 2

The selective fading model in GNU Radio takes N flat fading models at fixed fractional delays, measured in samples, to define a power delay profile (PDP). For instance, delays of [0,1,1.5] and amplitudes of [1,1,0.5] would introduce three flat fading components to a PDP at 0 samples delay, 1 sample delay, and 1.5 samples delay, with magnitudes of 1, 1, and 0.5 respectively. This is a standard way to form a frequency-selective fading channel out of a small number of flat fading components. However, this is a rather contrived fading channel because the PDP components are at fixed delays in time which don’t change during the simulation! In the real world, we are moving around, reflectors are moving around, direct and indirect path lengths change over time, and so the delays corresponding to these paths shift earlier or later in time.
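To make the PDP example concrete, here is a minimal sketch of how such a set of delayed paths could be collapsed into one channel impulse response. It is not the GNU Radio block's code: the function name `pdp_to_taps`, the truncated sinc interpolation for the fractional delay, and the fixed unit `path_gains` (which would be the time-varying outputs of the flat faders) are all assumptions for illustration.

```python
import numpy as np

def pdp_to_taps(delays, mags, path_gains=None, n_taps=8):
    """Combine delayed paths into FIR channel taps (hypothetical helper).

    delays     : path delays in samples (may be fractional)
    mags       : per-path magnitudes from the PDP
    path_gains : complex gain of each flat fader at this instant
                 (defaults to 1 for every path, i.e. no fading)
    """
    delays = np.asarray(delays, dtype=float)
    mags = np.asarray(mags, dtype=float)
    if path_gains is None:
        path_gains = np.ones(len(delays), dtype=complex)
    n = np.arange(n_taps)
    taps = np.zeros(n_taps, dtype=complex)
    for d, a, g in zip(delays, mags, path_gains):
        # A path delayed by d samples is a sinc centered at d; for integer d
        # this reduces to a single non-zero tap
        taps += a * g * np.sinc(n - d)
    return taps

taps = pdp_to_taps([0, 1, 1.5], [1, 1, 0.5])
```

With the example PDP, the two integer-delay paths each land on a single tap, while the 1.5-sample path spreads its energy over the neighboring taps.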

In an attempt to simulate this effect, we’ve introduced the selective fading model 2, which adds a delay_std and a delay_maxdev parameter to each PDP component. The delay_std defines the standard deviation of a Gaussian random walk in time per sample, measured in samples, while the delay_maxdev defines the maximum distance in time to deviate from the initial delay value. Experimentally, this significantly helps to reduce repetitious behavior and creates a more realistic-seeming fading environment for some scenarios.
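The behavior of these two parameters can be sketched as a bounded random walk. The function below is illustrative only: the name `delay_random_walk` is made up, and clipping at the bounds is just one plausible way to enforce the maximum deviation, which may differ from what the block does internally.

```python
import numpy as np

def delay_random_walk(delay0, delay_std, delay_maxdev, n_samples, seed=None):
    """Per-sample delay trajectory for one PDP component (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(0.0, delay_std, size=n_samples)  # Gaussian step per sample
    lo, hi = delay0 - delay_maxdev, delay0 + delay_maxdev
    delays = np.empty(n_samples)
    d = delay0
    for i in range(n_samples):
        # Take one step, then keep the delay within delay_maxdev of its start
        d = min(max(d + steps[i], lo), hi)
        delays[i] = d
    return delays

# Example: a path starting at 1.5 samples that wanders by at most 0.05 samples
trajectory = delay_random_walk(1.5, delay_std=0.001, delay_maxdev=0.05,
                               n_samples=10000, seed=1)
```

Each component of the PDP would get its own independent trajectory, so the relative spacing of the paths drifts over the course of a run.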

Using gr-fosphor, we can see a brief excerpt of white noise sent through a fading channel using selective model 2 below.

## Improved Channel Diagnostics and Visualization

A useful step in the validation of this or any other channel model is that of inspecting the impulse response of the channel.   To enable this we add a message output port to the selective_model2 block which passes the complex channel taps at the end of every work function forward.   For now we can simply plot these complex vector messages so that we can visually see the effect of the channel on the time domain while observing the effect on the spectrogram.   These could of course also be used to cheat in an equalizer or other channel estimation algorithm and use channel state information, CSI, that would otherwise not be available in a real system.   This could be very useful for validation or performance measurement of such algorithms in the future.

The graph implementing this simulation and a still from it are shown below …

Finally, running the simulation, we put together a short video clip to show flat white noise through a Rayleigh/NLOS channel simulation.


# Convolutional Radio Modulation Recognition Networks

In an arXiv pre-publication report out today, Johnathan Corgan and I study the adaptation of convolutional neural networks to the task of modulation recognition in wireless systems. We use a relatively simple two-layer convolutional network followed by two dense layers, a much smaller network than required for tasks such as ImageNet/ILSVRC.

We demonstrate that blind time domain feature learning can perform extremely well at the task of modulation classification, achieving a very high accuracy rate on both clean and noisy data sets.

As we compare the classifier performance across a wide range of signal to noise ratios, we demonstrate that it outperforms a number of more traditional expert classifiers using zero-delay cumulant features by a large margin.
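For readers unfamiliar with the expert baseline, zero-delay cumulant features are simple statistics of the complex baseband samples. The sketch below computes the commonly used second- and fourth-order ones with NumPy; the function name `cumulant_features` is made up, and the exact feature set and classifier used in the paper's comparison are not specified here, so treat this as a generic illustration that assumes a zero-mean input.

```python
import numpy as np

def cumulant_features(x):
    """Zero-delay 2nd/4th-order cumulants of a zero-mean complex signal."""
    x = np.asarray(x, dtype=complex)
    x = x / np.sqrt(np.mean(np.abs(x) ** 2))   # normalize to unit power
    # Mixed moments M_pq = E[x^(p-q) * conj(x)^q]
    m20 = np.mean(x ** 2)
    m21 = np.mean(np.abs(x) ** 2)
    m40 = np.mean(x ** 4)
    m41 = np.mean(x ** 3 * np.conj(x))
    m42 = np.mean(np.abs(x) ** 4)
    return {
        "C20": m20,
        "C21": m21,
        "C40": m40 - 3 * m20 ** 2,
        "C41": m41 - 3 * m20 * m21,
        "C42": m42 - np.abs(m20) ** 2 - 2 * m21 ** 2,
    }
```

Different constellations give different values in the noise-free case (BPSK and QPSK differ in C40 and C42, for example), which is what lets a conventional classifier separate them, and the estimates get noisier as SNR drops and the observation window shrinks.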

While this is preliminary work, we think the results are exciting and that many additional promising results will come from the marriage of software radio and deep learning fields.

For much more detail on these results, please see our paper!  http://arxiv.org/abs/1602.04105