Learning to Communicate with Unsupervised Channel Autoencoders

Our radio physical layers are actually pretty simple and boring right now: PSK and QAM are well-defined expert representations of information used to transit a wireless channel.  Systems using OFDM and SC-FDMA are a bit more involved, but use many of the same constructs underneath with a bit of shuffling of sub-carriers.  Forward error correction (FEC), equalization, randomization, and a number of other functions are generally bolted on as separate, independent blocks and transforms, each compensating for the performance properties or assumptions of the other layers, in order to form an effective end-to-end system.

Enter machine learning … rethink all the things …

We’ve just pre-pubbed a paper to arXiv focusing on trying to learn entire communications systems using unsupervised reconstruction learning (autoencoders).   We seek to reconstruct transmitted information bits at a receiver while introducing channel impairments in the hidden layer of the network to simulate a wireless channel.   By doing this we force learned representations in the encoder and decoder to adapt jointly to optimize for reconstruction performance of the information bits (we refer to this as channel regularization).  The high level design looks something like this:

[Figure: high-level channel autoencoder system design]
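
For intuition, here is a minimal sketch of this kind of channel autoencoder in Keras.  This is not the exact architecture from the paper; the block size, number of channel uses, noise level, and dense layer widths are all placeholder choices for illustration.

# Minimal channel autoencoder sketch: dense encoder -> AWGN "channel" -> dense decoder.
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, GaussianNoise

k = 4   # information bits per block (placeholder)
n = 8   # real-valued channel uses per block (placeholder)

bits_in = Input(shape=(k,))
enc = Dense(2*n, activation='relu')(bits_in)
tx = Dense(n, activation='linear')(enc)         # learned transmit representation
rx = GaussianNoise(0.1)(tx)                     # channel regularization: additive Gaussian noise
dec = Dense(2*n, activation='relu')(rx)
bits_out = Dense(k, activation='sigmoid')(dec)  # reconstructed bit estimates

cae = Model(bits_in, bits_out)
cae.compile(loss='binary_crossentropy', optimizer='adam')

# train to reconstruct random bit vectors through the noisy channel layer
X = np.random.randint(0, 2, (50000, k))
cae.fit(X, X, nb_epoch=10, batch_size=256, verbose=0)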

We evaluate a number of different autoencoder network structures, and also consider constraining the CNN layer to a relatively low number of filters to emulate the small number of symbols typically used in communications systems (this is not necessarily optimal, but it helps with intuition).  The structure of our DNN-CNN network candidate looks something like this:

[Figure: DNN-CNN channel autoencoder network structure]

Once we learn a transmit/receive representation in the autoencoder, we can evaluate its performance across a range of channel conditions.  Traditional wireless channel performance measures such as BER vs SNR and spectral efficiency can then be compared easily against those of legacy expert modulation techniques, as shown below.

[Figure: BER vs SNR and spectral efficiency compared against expert modulations]
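
As a rough illustration of how such a comparison can be made, here is a sketch of a BER-vs-SNR sweep for a learned encoder/decoder pair.  The separate `encoder` and `decoder` models and the unit-power normalization are assumptions for illustration, not the exact evaluation procedure from the paper.

# Estimate bit error rate of a learned encoder/decoder over an AWGN channel at a given SNR.
import numpy as np

def estimate_ber(encoder, decoder, k, snr_db, n_blocks=100000):
    bits = np.random.randint(0, 2, (n_blocks, k))
    tx = encoder.predict(bits)                          # learned channel symbols (assumed unit power)
    sigma = np.sqrt(10.0 ** (-snr_db / 10.0))           # noise std dev for the requested SNR
    rx = tx + sigma * np.random.randn(*tx.shape)        # AWGN channel
    bits_hat = (decoder.predict(rx) > 0.5).astype(int)  # hard decisions on reconstructed bits
    return np.mean(bits_hat != bits)

# e.g. sweep SNR and print a BER curve:
# for snr in range(-4, 12, 2):
#     print(snr, estimate_ber(encoder, decoder, 4, snr))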

We discuss a handful of other issues, including how to start jointly learning synchronization on the front of the decoder using radio transformer networks, and how to start simulating channel effects beyond simple additive Gaussian noise.  I’m pretty excited about the future of this form of unsupervised communications system learning; there’s a ton of work to do to make it perform well over the air and in harsh channel conditions.  I’m hoping to see what others do with this, and to finalize a conference version for submission soon.

 

Check out the paper at: https://arxiv.org/abs/1608.06409

MNIST Generative Adversarial Model in Keras

Some of the generative work done in the past year or two using generative adversarial networks (GANs) has been pretty exciting and demonstrated some very impressive results.  The general idea is that you train two models, one (G) to generate some sort of output example given random noise as input, and one (A) to discern generated model examples from real examples.  Then, by training A to be an effective discriminator, we can stack G and A to form our GAN, freeze the weights in the adversarial part of the network, and train the generative network weights to push random noisy inputs towards the “real” example class output of the adversarial half.

[Figure: High Level GAN Architecture]

Building this style of network in recent versions of Keras is actually quite straightforward.  I’ve wanted to try this out on a number of things, so I put together a relatively simple version using the classic MNIST dataset, taking a GAN approach to generating random handwritten digits.

Before going further I should mention all of this code is available on github here.

Generative Model

We set up a relatively straightforward generative model in Keras using the functional API, taking 100 random inputs and eventually mapping them down to a [1,28,28] pixel image to match the MNIST data shape.  We begin by generating a dense 14×14 set of values, then run through a handful of convolutional filters of varying sizes and numbers of channels, and compile with an Adam optimizer and binary cross-entropy (although we really only use the generator model in the forward direction, we don’t train directly on this model itself).  We use a sigmoid on the output layer to help saturate pixels into 0 or 1 states rather than a range of grays in between, and use batch normalization to help accelerate training and ensure that a wide range of activations are used within each layer.

# Build Generative model ...
from keras.models import Model
from keras.layers import Input, Dense, Activation, Reshape, BatchNormalization, Convolution2D, UpSampling2D
from keras.optimizers import Adam

opt = Adam(lr=1e-4)  # optimizer shared by the generator and stacked GAN (learning rate is an example value)

nch = 200
g_input = Input(shape=[100])
# project the 100-dim noise vector up to nch feature maps of size 14x14
H = Dense(nch*14*14, init='glorot_normal')(g_input)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Reshape( [nch, 14, 14] )(H)
# upsample to 28x28 and refine through convolutions of decreasing depth
H = UpSampling2D(size=(2, 2))(H)
H = Convolution2D(nch//2, 3, 3, border_mode='same', init='glorot_uniform')(H)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
H = Convolution2D(nch//4, 3, 3, border_mode='same', init='glorot_uniform')(H)
H = BatchNormalization(mode=2)(H)
H = Activation('relu')(H)
# collapse to a single-channel 28x28 image with pixel values squashed into [0,1]
H = Convolution2D(1, 1, 1, border_mode='same', init='glorot_uniform')(H)
g_V = Activation('sigmoid')(H)
generator = Model(g_input, g_V)
generator.compile(loss='binary_crossentropy', optimizer=opt)
generator.summary()

We now have a network which could in theory take in 100 random inputs and output digits, although the current weights are all random and this clearly isn’t happening just yet.

[Figure: Sad images from an untrained generator]

Adversarial Model

We build an adversarial discriminator network that takes in [1,28,28] image vectors and decides whether they are real or fake, using several convolutional layers, a dense layer, lots of dropout, and a two-element softmax output layer encoding: [0,1] = fake, and [1,0] = real.  This is a relatively simple network, but the goal here is largely to get something that works passably and trains relatively quickly for experimentation.

# Build Discriminative model ...
from keras.layers import Dropout, Flatten, LeakyReLU

shp = (1, 28, 28)       # MNIST image shape, channels first (assumed to match the training data)
dropout_rate = 0.25     # dropout probability (example value)
dopt = Adam(lr=1e-5)    # discriminator optimizer (learning rate is an example value)

d_input = Input(shape=shp)
# strided convolutions downsample the image while increasing feature depth;
# LeakyReLU provides the nonlinearity (an inline relu here would defeat it)
H = Convolution2D(256, 5, 5, subsample=(2, 2), border_mode='same')(d_input)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
H = Convolution2D(512, 5, 5, subsample=(2, 2), border_mode='same')(H)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
H = Flatten()(H)
H = Dense(256)(H)
H = LeakyReLU(0.2)(H)
H = Dropout(dropout_rate)(H)
# two-class softmax output: [1,0] = real, [0,1] = fake
d_V = Dense(2, activation='softmax')(H)
discriminator = Model(d_input, d_V)
discriminator.compile(loss='categorical_crossentropy', optimizer=dopt)
discriminator.summary()

We pre-train the discriminative model by generating a handful of random images using the untrained generative model, concatenating them with an equal number of real digit images, labeling them appropriately, and then fitting until we reach a relatively stable loss value, which takes one epoch over 20,000 examples.  This is an important step which should not be skipped: pre-training accelerates the GAN massively, and I was not able to achieve convergence without it (possibly due to impatience).
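
A sketch of that pre-training step, assuming the real MNIST images live in an `X_train` array of shape [n, 1, 28, 28] and using 10,000 real plus 10,000 generated examples:

# Pre-train the discriminator on a mix of real digits and untrained-generator output.
import numpy as np

ntrain = 10000
idx = np.random.randint(0, X_train.shape[0], size=ntrain)
real_images = X_train[idx, :, :, :]
noise = np.random.uniform(0, 1, size=[ntrain, 100])
fake_images = generator.predict(noise)

X = np.concatenate((real_images, fake_images))
y = np.zeros([2*ntrain, 2])
y[:ntrain, 0] = 1    # [1,0] = real
y[ntrain:, 1] = 1    # [0,1] = fake

discriminator.fit(X, y, nb_epoch=1, batch_size=128)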

Generative Adversarial Model

Now that we have both the generative and adversarial models, we can combine them to make a GAN quite easily in Keras.  Using the functional API, we can simply re-use the same network objects we have already instantiated, and they will conveniently share weights with the previously compiled models.  Since we want to freeze the weights in the adversarial half of the network during back-propagation through the joint model, we first run through and set the Keras trainable flag to False for each element in that part of the network.  For now, this seems to need to be applied at the individual layer level rather than on the high-level model, so we introduce a simple function to do this.

# Freeze weights in the discriminator for stacked training
def make_trainable(net, val):
    net.trainable = val
    for l in net.layers:
       l.trainable = val
make_trainable(discriminator, False)

# Build stacked GAN model
gan_input = Input(shape=[100])
H = generator(gan_input)
gan_V = discriminator(H)
GAN = Model(gan_input, gan_V)
GAN.compile(loss='categorical_crossentropy', optimizer=opt)
GAN.summary()

At this point, we have a randomly initialized generator, a (poorly) trained discriminator, and a GAN which can be trained across the stacked model of both networks.  The core of the training routine for a GAN looks something like the following (a condensed code sketch follows the list).

  1. Generate images using G and random noise (forward pass only).
  2. Perform a batch update of weights in A given generated images, real images, and labels.
  3. Perform a batch update of weights in G given noise and forced “real” labels in the full GAN.
  4. Repeat…
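
A condensed sketch of that loop using the models defined above (the batch size and number of steps are placeholder values, and `X_train` again holds the real MNIST images):

# One possible implementation of the alternating GAN training loop sketched above.
import numpy as np

batch_size, nb_steps = 32, 6000
for step in range(nb_steps):
    # 1. generate a batch of fake images from random noise (forward pass only)
    noise = np.random.uniform(0, 1, size=[batch_size, 100])
    fakes = generator.predict(noise)
    reals = X_train[np.random.randint(0, X_train.shape[0], size=batch_size)]

    # 2. batch update of the discriminator on labeled real/fake images
    X = np.concatenate((reals, fakes))
    y = np.zeros([2*batch_size, 2])
    y[:batch_size, 0] = 1    # [1,0] = real
    y[batch_size:, 1] = 1    # [0,1] = fake
    make_trainable(discriminator, True)
    d_loss = discriminator.train_on_batch(X, y)

    # 3. batch update of the generator through the stacked GAN with forced "real" labels
    noise = np.random.uniform(0, 1, size=[batch_size, 100])
    y_fool = np.zeros([batch_size, 2])
    y_fool[:, 0] = 1         # claim the generated images are real so G learns to fool A
    make_trainable(discriminator, False)
    g_loss = GAN.train_on_batch(noise, y_fool)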

Running this process for a number of epochs, we can plot the stacked GAN loss and the discriminator loss over time to monitor how training progresses.

[Figure: GAN Training Loss]

And finally, we can plot some samples from the trained generative model, which look quite similar to the original MNIST digits, along with some example digits from the real dataset for comparison.

[Figure: GAN Generated Random Digits]
[Figure: Example Digits from the Real MNIST Set]

https://github.com/osh/KerasGAN

Reducing 1D Convolution to a Single (Big) Matrix Multiplication

This is perhaps the 3rd time I’ve needed this recipe and it doesn’t seem to be readily available through a quick search.  Theano and TensorFlow provide convolution primitives for 1D and 2D, but (correct me if I’m wrong) I believe they are generally constrained such that the filter taps you are convolving must be fixed parameters of the op, not additional tensor values flowing through the graph.  This is unfortunate and annoying for certain operations, and my workaround is to implement the convolution as a matrix multiplication against a properly indexed version of the input and tap tensors within the operation.

Anyway, hopefully this snippet will be useful to someone else someday.

The idea here is simply that we can use a Toeplitz matrix to build a large 2D matrix (H) whose rows are shifted copies of the 1D tap vector (h).  Multiplying our input (x) by this H matrix then gives us the convolution output (y).  It’s fairly simple but somewhat tedious to set up; an example implementation is shown below for reference.

#!/usr/bin/env python
import numpy as np
from scipy import linalg
from scipy import signal

# --- single example ---
x = np.array([0,0,1,0,0,2,0,0,0]) # 9 samples
h = np.array([0,1,2,0]) # 4 taps
y = signal.convolve(x, h, mode='same')
print("x", x)
print("h", h)
print("y(conv):", y)

# set up the toeplitz matrix: each row of H is a shifted copy of h,
# sliced so that H.dot(x) matches 'same'-mode convolution
padding = np.zeros(len(x)-1, h.dtype)
first_col = np.r_[h, padding]
first_row = np.r_[h[0], padding]
H = linalg.toeplitz(first_col, first_row)[1:len(x)+1,:]
print("shape", H.shape, x.shape)
y = np.sum(np.multiply(x, H), 1)   # equivalent to H.dot(x)
print("y(mult):", y)
print("**********************")

# --- batched version: one independent filter per example ---
x = np.array([0,0,1,0,0,2,0,0,0]) # n_samp
x = np.tile(x, [10,1]) # n_ex x n_samp
h = np.array([0,1,2,0]) # n_tap
h = np.tile(h, [10,1]) # n_ex x n_tap
y = np.zeros([x.shape[0], x.shape[1]])
for i in range(x.shape[0]):
    y[i,:] = signal.convolve(x[i,:], h[i,:], mode='same')
print("x", x)
print("h", h)
print("y(conv):", y)

# set up one toeplitz matrix per example
H = np.zeros([x.shape[0], x.shape[1], x.shape[1]]) # n_ex x n_samp x n_samp
for i in range(x.shape[0]):
    padding = np.zeros(x.shape[1]-1, h.dtype)
    first_col = np.r_[h[i,:], padding]
    first_row = np.r_[h[i,0], padding]
    H[i,:,:] = linalg.toeplitz(first_col, first_row)[1:x.shape[1]+1,:]
print("H shape", H.shape)
print(H[0,:,:])

# broadcast x against the rows of each H[i] and sum: y[i] = H[i].dot(x[i])
x = x.reshape([x.shape[0], 1, x.shape[1]]) # n_ex x 1 x n_samp broadcasts against n_ex x n_samp x n_samp
y = np.sum(np.multiply(x, H), 2)
print("y(mult):", y)
print("**********************")

# visualize the toeplitz structure itself for a longer tap vector
h = np.array([0,1,2,3,4,5,6,7,8], dtype='int32')
padding = np.zeros(len(h)-1, h.dtype)
first_col = np.r_[h, padding]
first_row = np.r_[h[0], padding]
H = linalg.toeplitz(first_col, first_row)[1:len(h)+1,:]
print(H)
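
The per-example multiply-and-sum above is really just a batched matrix multiplication, which is exactly the form that maps onto a single matmul-style op (e.g. a batched dot) inside a Theano or TensorFlow graph.  A small self-contained numpy check of that equivalence:

import numpy as np
from scipy import linalg, signal

xb = np.tile(np.array([0.,0,1,0,0,2,0,0,0]), [10,1])    # n_ex x n_samp
hb = np.tile(np.array([0.,1,2,0]), [10,1])              # n_ex x n_tap
Hb = np.zeros([xb.shape[0], xb.shape[1], xb.shape[1]])  # n_ex x n_samp x n_samp
for i in range(xb.shape[0]):
    pad = np.zeros(xb.shape[1]-1)
    Hb[i] = linalg.toeplitz(np.r_[hb[i], pad], np.r_[hb[i,0], pad])[1:xb.shape[1]+1, :]

y_batched = np.einsum('ijk,ik->ij', Hb, xb)             # y[i] = Hb[i].dot(xb[i]) for every example at once
y_ref = np.array([signal.convolve(xb[i], hb[i], mode='same') for i in range(xb.shape[0])])
print(np.allclose(y_batched, y_ref))                    # True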

 

KeRLym: A Deep Reinforcement Learning Toolbox in Keras

Reinforcement learning coupled with deep learning based function approximation has been an exciting area over the past couple of years.  The appeal of learning methods which can effectively search an action/reward environment and derive a good policy from experience and random exploration is quite significant for a wide range of applications.  Volodymyr Mnih’s original DeepMind Deep Q-Network (DQN) paper, demonstrating raw-pixel based learning on Atari games, was an awesome demonstration of what was possible, and there have been tons of improvements and other interesting applications by others since then.

Since then, who hasn’t wanted to play around with RL on the handful of Atari games and their own domain-specific automation tasks?   DeepMind released their code for this experiment along with the Nature paper, but it was frustratingly in Lua/Torch, and as the paper stated, it takes quite a long time (~30 days?) to learn Atari games to a high level of skill.   Since then I’ve become quite fond of working with Keras, Theano, and TensorFlow on a range of ML problems; the workflow and simplicity of python/numpy tensor algorithm definition and CUDA cross-compilation is just too attractive and productive for me to want to work in anything else right now, so naturally I wanted to leverage these same tools in the RL space.  I started looking into the DQN and other RL algorithm implementations available and found a handful of helpful examples, but no particularly featureful, fast, or satisfying projects designed to be easily applied to your own environments, which got me thinking about standardizing an interface to environments so that learners could be easily applied to a wide class of problems.  Shortly after this thought occurred to me, OpenAI published their Gym software and online environment scoreboard, pretty much solving this problem and providing a wide range of learning tasks already integrated into a relatively simple reinforcement learning environment API.   This was great; I started playing with DQN implementations leveraging Keras on top of Gym, and KeRLym (Keras+RL+Gym) was the result.
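
For reference, the Gym environment API that kerlym builds its agents around is just a reset/step loop; a minimal random-action sketch looks like this (the environment name here is just an example):

import gym

env = gym.make("Breakout-v0")              # any Gym environment id works here
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()     # a learned policy would choose the action instead
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)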

The initial results from kerlym were relatively frustrating: DQN tuning is hard and implementing the algorithms is error prone.  The Atari simulator isn’t the fastest, and it takes quite a while to sequentially play enough games to generate a significant amount of experience.  There has also been a good bit of recent work on asynchronous methods for RL, running lots of agents in parallel which each run their own episodes and share model parameters and gradients.  Corey Lynch published an awesome implementation of async-rl using Keras and Gym-based Atari games which I spent a good bit of time playing with.  As a result, I refactored kerlym significantly to leverage a lot of the async-DQN techniques demonstrated there.

With the new asynchronous DQN implementation, frame differencing, and an Atari frame pre-processor along the lines of what Andrej Karpathy used recently in his blog post about RL, I finally had a somewhat effective learner that I could set loose on Atari games and watch a gradual improvement of total reward (score per game) take form over the course of several hours.   Below is an example of ~64k episodes of Breakout running in kerlym with diagnostic plots enabled to monitor training performance.

[Figure: kerlym diagnostic plots over ~64k episodes of Breakout]
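
For reference, the frame pre-processing mentioned above is roughly of this form (a simplified sketch, not kerlym’s exact code): convert to grayscale, downsample, and difference consecutive frames so that motion stands out.

import numpy as np

def preprocess(frame, prev_frame):
    # frame, prev_frame: raw RGB screens of shape [height, width, 3]
    gray = frame.mean(axis=2)                  # collapse RGB to grayscale
    prev_gray = prev_frame.mean(axis=2)
    small = gray[::2, ::2] / 255.0             # downsample by 2 and scale to [0,1]
    prev_small = prev_gray[::2, ::2] / 255.0
    return small - prev_small                  # frame difference highlights moving objects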

At this point I finally have some confidence in the correctness of this agent implementation, but there are still countless hyper-parameters which can be tuned and which significantly affect performance.

There are a couple of directions I hope to go at this point:

  • Implementing additional agents to compare training performance: I love the speedup of asynchronous/concurrent training and have little patience for multi-day RL tests, so I would really love to add working asynchronous Policy Gradient (PG), TRPO, and A3C agents which can be easily interchanged and tested.
  • Exploring applications of these learning agents to new domains: What other tasks can we readily train and execute using the learning models we have at this point?  Being an applied person, I kind of want to throw DQNs at every task under the sun and see what works; the goal of kerlym is largely to make this easy to do.  I’ve started building out-of-tree Gym environments for various tasks, such as the tone search task described here, and it’s exciting to think of the possibilities of applying this to a number of radio domain tasks.

For now, it’s hard to stop watching kerlym play Breakout and Pong over and over, slowly improving.

https://github.com/osh/kerlym