# Autoencoders

## Learning to Communicate: Hands-on Coding

**Authors**: Sebastian Cammerer, Sebastian Dörner, Adriano Pastore

All beginnings are difficult. We have often been asked *how* to get started with deep learning for communications; not in terms of deep learning theory, but how to actually train a first neural network for information transmission in practice. Thus, this article aims at lowering the barriers to entry by providing (and explaining) the necessary Tensorflow code to run and train an autoencoder for information transmission from scratch.

Parts of this article are based on our experience from organizing the *6th IRACON Training School on Deep and Machine Learning Techniques for (Beyond) 5G Wireless Communication Systems* and, in particular, the feedback from the incorporated machine learning challenge. It turned out that this was the first practical coding experience with deep learning for many students. Based on their feedback (and our own conviction), we believe that a practical perspective on deep learning is beneficial for any deep learning (for communications) researcher.

The outline of this article is as follows:

1) “Learning to Communicate” – System Model & Background

2) Choosing the Right Environment and Toolchain

3) Tensorflow Implementation

4) Useful Tips and Common Pitfalls

The Tensorflow code examples are also provided online:

**1.) “Learning to Communicate” – System Model & Background**

The fundamental problem of communication is that of *“reproducing at one point either exactly or approximately a message selected at another point”* [1] or, in other words, reliably transmitting a message from a source to a destination over a channel by the use of a transmitter and a receiver. Therefore, we consider a communication system consisting of a transmitter, a channel, and a receiver.

The transmitter wants to convey a message $s$ out of $M$ possible messages over the channel to the receiver. To do so, it is allowed to transmit $n$ complex baseband symbols, i.e., a vector $\mathbf{x} \in \mathbb{C}^n$ with a power constraint, e.g., $\|\mathbf{x}\|^2 \le n$. At the receiver side, a noisy and possibly distorted version $\mathbf{y} \in \mathbb{C}^n$ of $\mathbf{x}$ can be observed. Now, the task of the receiver is to produce the estimate $\hat{s}$ of the original message $s$. As explained in [2,3], the communication system described above can be interpreted as an autoencoder [4, Ch. 14]. An autoencoder describes a deep neural network (NN) that is trained to reconstruct the input at the output and, as the information must pass each layer, the network needs to find a robust representation of the input message at every layer.

While *undercomplete* autoencoders (i.e., whose hidden layers have fewer neurons than the input/output) have traditionally been studied for extracting hidden features and learning a robust compressed representation of the input, in the case of communication, we consider *overcomplete* autoencoders. Their purpose is to add (on the transmitter side) and remove (on the receiver side) redundancy to the input message representation in a way that is matched to the channel (noise layer). Such an autoencoder will thus *learn* an efficient error-correcting code and its corresponding decoding algorithm.

The trainable transmitter part of the autoencoder consists of an embedding followed by a feedforward dense NN. Its $2n$-dimensional output is cast to an $n$-dimensional complex-valued vector by considering one half as the real part and the other half as the imaginary part. Finally, a normalization layer ensures that the power constraint on the output is met.
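The cast from real to complex values and the normalization can be illustrated with a few lines of numpy. This is only a sketch under assumptions: the helper name `to_complex_normalized` is hypothetical, and scaling each vector to unit average power per complex symbol is just one possible way to satisfy a power constraint.

```python
import numpy as np

def to_complex_normalized(tx_out):
    """Cast a real vector of even length to complex symbols with unit average power.

    Hypothetical helper for illustration; not part of the article's Tensorflow code.
    """
    tx_out = np.asarray(tx_out, dtype=np.float64)
    n = tx_out.shape[-1] // 2
    # First half becomes the real part, second half the imaginary part
    x = tx_out[..., :n] + 1j * tx_out[..., n:]
    # Scale so that the average power per complex symbol equals 1
    power = np.mean(np.abs(x) ** 2, axis=-1, keepdims=True)
    return x / np.sqrt(power)

x = to_complex_normalized([3.0, 0.0, 0.0, 4.0])  # 4 real values -> 2 complex symbols
print(x)
print(np.sum(np.abs(x) ** 2))  # total energy equals the number of symbols (up to rounding)
```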

The channel can be implemented as a set of layers with probabilistic and deterministic behavior, e.g., for an additive white Gaussian noise (AWGN) channel, Gaussian noise with fixed or random noise power per complex symbol is added.

The receiver consists of a transformation from complex to real values (by concatenating real and imaginary parts of the channel output), followed by a feedforward NN whose last layer has a “softmax” activation (see [4]). Its output $\mathbf{p}$ is a probability vector that assigns a probability to each of the possible messages. Finally, the estimate $\hat{s}$ of the transmitted message $s$ is selected as the index of the largest element of $\mathbf{p}$.

The resulting autoencoder can be trained end-to-end using stochastic gradient descent (SGD).

Since we have phrased the communication problem as a classification task, it is natural to use the cross-entropy loss function $L = -\log(p_s)$, where $p_s$ denotes the $s$th element of the receiver's output probability vector $\mathbf{p}$. As we deal with an autoencoder where the output should equal the input during training, we have a fixed number of different training labels. For further details we refer to [2,3].
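To make the loss and the decision rule concrete, here is a small numpy sketch. The logit values and the message index are made up for illustration, and the `softmax` helper is written out by hand rather than taken from the Tensorflow code used later.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum improves numerical stability without changing the result
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 0.5, -1.0, 0.1])  # hypothetical receiver output for 4 messages
s = 0                                     # index of the transmitted message

p = softmax(logits)        # probability vector over all messages
loss = -np.log(p[s])       # cross-entropy for the true message s
s_hat = int(np.argmax(p))  # hard decision: most likely message

print(p, loss, s_hat)
```

The loss is small when the receiver assigns a high probability to the transmitted message and grows without bound as that probability approaches zero.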

**2.) Choosing the Right Environment and Toolchain**

Choosing the right toolchain is an important decision in deep learning. As most state-of-the-art libraries provide Python support, we stick to Python (and its numerical libraries such as *numpy*). Further, we use Tensorflow [5] as the deep learning framework and access it via the browser-based IDE Jupyter [6]. The main advantage is that the setup can run on any server (ideally with graphics processing unit (GPU) support) while the client just requires a standard web browser (and no high-performance computing power).

Generally speaking, no GPU is needed for training (everything runs with a CPU and enough RAM). However, the main advantage of having access to a GPU is the support for accelerated training which speeds up the training by (at least) an order of magnitude.

Further, Google provides a Jupyter environment free of charge (including GPU support for several hours of training) called *“Google Colaboratory”* [7], which is sufficient for our basic experiments. However, the provided code examples (Jupyter notebooks) can be exported and executed in any Jupyter environment running on your own server. To keep the story short: for this tutorial, all software you need is a working web-browser.

**3.) Tensorflow Implementation**

So, enough with the theoretical stuff, let’s get our hands dirty!

At first we need to import the required libraries: *tensorflow*, to perform computing graph based training of the NN; *numpy*, for basic computations and to feed the NN; *matplotlib*, to plot our results.

```
# The code in this article uses the Tensorflow 1.x graph API (placeholders and sessions)
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
```

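Then we need to define the main parameters of the autoencoder: the number of information bits per message $k$, the number of channel uses per message $n$, and the resulting number of messages $M = 2^k$.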

```
k = 8 # Number of information bits per message, i.e., M=2**k
n = 8 # Number of real channel uses per message
M = 2**k # Number of messages
```

Now we start building our model in Tensorflow. For this simple example, we define all variables in the Tensorflow default graph. We begin with one of the most basic parameters in deep learning (DL), the `batch_size`. This defines the number of samples within a mini-batch that is used for the stochastic gradient computation. In our case we want this to be a flexible parameter to allow adjustments during training. Therefore, and given the fact that we can generate as many samples as we want on the fly, we define the `batch_size` as a feedable scalar integer.

```
batch_size = tf.placeholder(tf.int32,shape=[])
```

Now we can create the messages we want to transmit in this batch. They are simply drawn from a random uniform distribution.

```
s = tf.random.uniform(shape=[batch_size],minval=0,maxval=M,dtype=tf.int32)
```

To efficiently feed them to the first dense NN layer of the transmitter part, we transform them to so-called one-hot vectors.

```
s_one_hot = tf.one_hot(s,depth=M)
```

This tensor now holds `batch_size` vectors of length $M$, where only one entry (at position $s$) is set to 1.0 while all other entries are 0.0. This representation is commonly used for classification tasks.

Let’s define the transmitter part. Two dense layers are already enough to perform a transformation from messages to real-valued channel uses. This is basically a simple lookup-table transformation that could also be implemented by a single matrix of trainable weights, but for simplicity we use default Tensorflow/Keras dense layers in the following. The first dense transmitter layer shall be “relu” activated [4] and can have any number of neurons (called *units*); we chose $M$ because the samples of the input `s_one_hot` are also of length $M$. The second dense layer is required to have $n$ units, which form the output of the transmitter, and shall not have any activation function, since we want the transmitter to be able to output any real-valued numbers (to also allow negative values).

```
tx = tf.keras.layers.Dense(units=M,activation="relu")(s_one_hot)
tx = tf.keras.layers.Dense(units=n,activation=None)(tx)
```
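The lookup-table remark can be checked with a short numpy sketch: multiplying a one-hot vector with a weight matrix simply selects one row of that matrix. The sizes and the random matrix `W` below are arbitrary stand-ins for illustration and are not the weights of the Tensorflow model.

```python
import numpy as np

M, n = 4, 2                      # small hypothetical sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(M, n))      # stand-in for a trainable weight matrix

s = np.array([2, 0, 3])          # batch of message indices
s_one_hot = np.eye(M)[s]         # one-hot vectors, shape (3, M)

via_matmul = s_one_hot @ W       # what a linear dense layer without bias computes
via_lookup = W[s]                # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

Each message index is thus mapped to its own trainable vector of channel symbols, which is exactly what a lookup table does.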

To prevent the transmitter from learning unnecessarily large outputs and becoming numerically unstable, we normalize the average power of all transmitter outputs in the mini-batch to equal 1.0.

```
x = tx / tf.sqrt(tf.reduce_mean(tf.square(tx)))
```

Now `x` is the output of our transmitter; next comes the channel. We chose a basic additive white Gaussian noise (AWGN) channel that simply adds scaled, normally distributed (real-valued) noise on top of `x`. To be able to adaptively change the noise power and, thereby, the signal-to-noise ratio (SNR), we implement the noise standard deviation as a feedable Tensorflow placeholder.

```
noise_std = tf.placeholder(dtype=tf.float32,shape=[])
```

Then we simply draw a noise tensor (i.e., a vector per sample in the mini-batch) of the same shape as `x` from a normal distribution with the standard deviation given by the placeholder.

```
noise = tf.random.normal(shape=tf.shape(x),stddev=noise_std)
```

Now, we simply add this random noise tensor on top of `x` to get `y`, which holds the received signals after transmission.

```
y = x + noise
```

With the channel being the penalty layer of our autoencoder, we now need a receiver part that produces a reproduction `s_hat` given `y`. This receiver part consists of a first dense layer that can have an arbitrary number of neurons and is required to have a non-linear activation. Based on experience we chose $M$ neurons and “relu” activations.

```
rx = tf.keras.layers.Dense(units=M, activation="relu")(y)
```

Depending on the complexity of the channel model, we could now add several of those layers to our model to increase the complexity and capabilities of the (deep) neural network. But for the simple AWGN channel, one input and one output layer are enough at the receiver. The dense output layer is required to have $M$ units, since we want to produce an estimate of the probability of each possible message, and the output shall be so-called “logits”, which means that no activation function is used.

```
s_hat = tf.keras.layers.Dense(units=M, activation=None)(rx)
```

Now that the autoencoder is fully described, we can feed in messages and get the corresponding estimates as an output. What is still missing is a loss function that calculates the current performance of the model by comparing the input `s` with the output `s_hat`. We use a default cross-entropy loss function that inherently activates the logits with “softmax” and accepts sparse labels.

```
cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=s,logits=s_hat)
```

We also calculate the average message (or block) error rate of the mini-batch by hard-deciding the receiver’s output on the element with the highest probability (argmax).

```
correct_predictions = tf.equal(tf.argmax(tf.nn.softmax(s_hat),axis=1,output_type=tf.int32),s)
accuracy = tf.reduce_mean(tf.cast(correct_predictions,dtype=tf.float32))
bler = 1.0 - accuracy
```

Finally, we need to define an optimizer algorithm that updates the weights of our autoencoder according to the current loss and the gradient of the batch. We chose the Adam optimizer to minimize our loss function and use a placeholder as learning rate to be able to adjust this hyperparameter during training.

```
lr = tf.placeholder(dtype=tf.float32,shape=[])
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(cross_entropy)
```

Now that the Tensorflow graph is defined, we need to create a Tensorflow session that can run the graph.

```
sess = tf.Session()
```

After creating the session all trainable variables need to be initialized. Since the used Tensorflow/Keras layers already define functions to create their initial weights, we simply need to run the global variables initializer and all weights are ready to go.

```
sess.run(tf.global_variables_initializer())
```

Before we start with the training, we need to formulate an SNR definition, so that we can easily train at a desired SNR point. This function simply calculates the noise standard deviation for a given SNR (while signal power is normalized to 1.0).

```
def EbNo2Sigma(ebnodb):
    ebno = 10**(ebnodb/10)
    bits_per_complex_symbol = k/(n/2)
    return 1.0/np.sqrt(bits_per_complex_symbol*ebno)
```
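As a quick sanity check, the function can be evaluated at the SNR that is used for training. The standalone sketch below repeats the definition with the values k = 8 and n = 8 from the parameter definition and compares the result with the empirical standard deviation of numpy noise samples; the sample count and the seed are arbitrary choices for illustration.

```python
import numpy as np

k, n = 8, 8  # values from the parameter definition of the autoencoder

def EbNo2Sigma(ebnodb):
    ebno = 10 ** (ebnodb / 10)
    bits_per_complex_symbol = k / (n / 2)
    return 1.0 / np.sqrt(bits_per_complex_symbol * ebno)

sigma = EbNo2Sigma(7.0)
print(sigma)  # approximately 0.316

# Empirical check: real-valued Gaussian noise drawn with this standard deviation
rng = np.random.default_rng(0)
noise = rng.normal(scale=sigma, size=1_000_000)
print(np.std(noise))  # close to sigma
```

With 2 bits per complex symbol and 7 dB, the noise variance per real dimension is therefore roughly 0.1, compared with a normalized signal power of 1.0.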

Now we can begin with the training. We start with 1,000 iterations of running `train_op` with a small `batch_size` of only 100 messages and a learning rate of 0.001. After this first training epoch we reduce the learning rate to 0.0001 and run another epoch with 10,000 iterations. For the last training epoch, we raise the batch size to 1,000 and run another 10,000 iterations. During all training epochs we set the SNR to 7.0 dB, as we found that training the autoencoder at a block-error rate (BLER) of around 0.01 leads to a fast generalization.

```
for i in range(1000):
    sess.run(train_op, feed_dict={batch_size: 100, noise_std: EbNo2Sigma(7.0), lr: 0.001})

for i in range(10000):
    sess.run(train_op, feed_dict={batch_size: 100, noise_std: EbNo2Sigma(7.0), lr: 0.0001})

for i in range(10000):
    sess.run(train_op, feed_dict={batch_size: 1000, noise_std: EbNo2Sigma(7.0), lr: 0.0001})
```

So, let’s check the performance of the autoencoder by plotting its BLER over a range of SNR values. To do so, we need to run a Monte Carlo simulation to get an accurate BLER for each SNR point. In this example we simulate the BLER from 0 to 14 dB by running 10 mini-batches of 100,000 messages for each SNR point.

```
snr_range = np.linspace(0,14,15)
monte_carlo_bler = np.zeros((len(snr_range),))
for i in range(len(snr_range)):
    for j in range(10):
        monte_carlo_bler[i] += sess.run(bler, feed_dict={batch_size: 100000, noise_std: EbNo2Sigma(snr_range[i]), lr: 0.0})
monte_carlo_bler = monte_carlo_bler / 10
```
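How accurate such an estimate is depends on how many block errors are actually observed. The sketch below assumes independent block errors (so that the error count is binomially distributed) and computes the expected number of errors and the relative standard error of the BLER estimate for the 10 × 100,000 messages per SNR point; the function name is hypothetical.

```python
import numpy as np

def bler_estimate_quality(true_bler, num_messages):
    """Expected error count and relative standard error of a Monte Carlo BLER estimate.

    Assumes independent block errors, i.e., a binomially distributed error count.
    """
    expected_errors = true_bler * num_messages
    rel_std_error = np.sqrt((1.0 - true_bler) / (num_messages * true_bler))
    return expected_errors, rel_std_error

num_messages = 10 * 100_000  # mini-batches times messages per mini-batch
for p in [1e-2, 1e-4, 1e-5]:
    errors, rel = bler_estimate_quality(p, num_messages)
    print(f"BLER {p:.0e}: ~{errors:.0f} expected errors, relative std. error {rel:.2f}")
```

Under this assumption, only about ten errors are expected at a BLER of 1e-5, so the estimate becomes noticeably noisier at high SNR; more mini-batches would be needed there for a smooth curve.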

Finally, we plot the BLER vs SNR using matplotlib.

```
plt.figure(figsize=(10,8))
plt.plot(snr_range, monte_carlo_bler, linewidth=2.0)
plt.legend(['Autoencoder'], prop={'size': 16}, loc='upper right');
plt.yscale('log')
plt.xlabel('EbNo (dB)', fontsize=18)
plt.ylabel('Block-error rate', fontsize=18)
plt.grid(True)
plt.ylim([1e-5,1])
```

The results should look like this plot:

One could now compare this performance with other modulation schemes. In the linked Colaboratory example we also provide the uncoded BLER of the quadrature phase shift keying (QPSK) modulation scheme.

As can be seen in this figure, the autoencoder’s BLER is lower than that of QPSK over the whole SNR range. If you are interested in the origins of this gain, you are welcome to have a look at [2] and [3].
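For reference, the BLER of uncoded QPSK over the AWGN channel can also be computed in closed form. The sketch below is an assumption about one way to obtain such a baseline and not necessarily how the linked notebook computes it: with Gray-mapped QPSK the bit-error probability is $Q(\sqrt{2E_b/N_0})$, the bit decisions are independent, and a block of $k = 8$ bits is correct only if all of its bits are correct.

```python
import math

def qpsk_bler(ebno_db, k=8):
    """Analytical block-error rate of uncoded QPSK for blocks of k bits over AWGN."""
    ebno = 10 ** (ebno_db / 10)
    # Bit-error probability Q(sqrt(2*Eb/N0)), written with the complementary error function
    bit_error = 0.5 * math.erfc(math.sqrt(ebno))
    # A block is received correctly only if all k bits are correct
    return 1.0 - (1.0 - bit_error) ** k

for ebno_db in [0, 4, 8]:
    print(ebno_db, qpsk_bler(ebno_db))
```

The returned values can be plotted into the same figure as the autoencoder curve, since both use $E_b/N_0$ in dB on the horizontal axis.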

You can find this simple autoencoder code example at: Simple AE Colab Notebook

And a more advanced autoencoder model with several auxiliary functions at: Advanced AE Colab Notebook

**4.) Useful Tips and Common Pitfalls**

1) Keep in mind that efficient data-feeding can speed up the training process a lot. In many scenarios in communications, we have access to a channel model to draw samples (e.g., the AWGN channel). In such cases, it is more convenient to embed the channel model into the Tensorflow graph (as stochastic layer). This has the advantage of an unlimited amount of training samples (as every noise realization is different and not limited to the size of the training set). Further, if the channel model is part of the Tensorflow graph, no data-feeding (i.e., copying data from CPU memory or even hard-drive to the GPU memory) during training is required. This sounds like a minor issue, but turns out to be one of the major bottlenecks in training performance especially if the NNs are relatively small (which is typically the case in communications when compared to computer vision or other popular DL-driven domains). Thus, whenever possible we highly recommend generating the training data on-the-fly directly during the training iteration.

2) Avoid overfitting: the noise/channel is implicitly taken into account during training and, thus, the model is “matched” to the specific underlying noise statistics. Even if training and test set are clearly separated, the NN may implicitly learn the noise statistics (e.g., fixed channel taps in a multipath channel model) of the model which was used to draw the samples. Sometimes this is a desired effect, but sometimes it prevents a fair comparison with *classical* baselines (as they are more universal without implicit knowledge of these parameters) due to the NN's degraded generalization performance. Thus, one should always keep in mind that an NN does extremely well in capturing the underlying statistics of the training data. In some (special) cases, even the noise generator itself could be accidentally learned (cf. [8]).

3) *float32* vs. *float64*; take care that the (typical) default numerical precision in deep learning is 32-bit floating point. This simply comes from the fact that the hardware of most (consumer) GPUs is optimized for float32 and in most computer vision applications, this is sufficient. Although the hardware supports float64 (double precision), it causes a significant speed degradation (for consumer cards >10x) and, thus, should be used carefully only if really needed. However, keep in mind that float32 may cause some inaccuracy (e.g., when compared to Matlab) which may be important in the field of communications (e.g., for fiber optical channel models). In other cases, however, reducing the numerical precision to 16-bit can speed up your computations a lot while the precision can still be sufficient for your specific task.

4) Activate GPU (or even TPU) runtime in Colaboratory (Runtime->Change runtime type); otherwise Tensorflow may just use the CPU for training.
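The precision issue from tip 3 can be made concrete with numpy. This is a minimal sketch that only illustrates the limited resolution of float32; the chosen increment is arbitrary.

```python
import numpy as np

# Machine epsilon: relative spacing between neighbouring representable numbers
print(np.finfo(np.float32).eps)  # about 1.2e-07
print(np.finfo(np.float64).eps)  # about 2.2e-16

# A small increment is lost in float32 but preserved in float64
small = 1e-8
lost_in_float32 = np.float32(1.0) + np.float32(small) == np.float32(1.0)
lost_in_float64 = np.float64(1.0) + np.float64(small) == np.float64(1.0)
print(lost_in_float32, lost_in_float64)
```

Contributions that are more than about seven decimal digits below the dominant term vanish in float32, which is worth keeping in mind when a channel model accumulates many small terms.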

**Conclusion**

In this blog article, we provided you with the basic knowledge to train a neural network in a state-of-the-art software environment.

So, what remains to be done?

Change to your target channel, find the best hyperparameters and let the system learn to communicate for your desired channel; or just use the notebook as basis for your own deep learning projects.

References

[1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. Journal, vol. 27, pp. 379–423, 623–656, 1948.

[2] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563-575, Oct. 2017.

[3] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep Learning-based Communication Over the Air,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 132–143, Feb. 2018.

[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[5] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016. [Online]. Available: http://tensorflow.org/

[6] Project Jupyter https://jupyter.org/

[7] Google Colaboratory https://colab.research.google.com/

[8] T. A. Eriksson, H. Bülow, and A. Leven, “Applying neural networks in optical communication systems: possible pitfalls,” IEEE Photonics Technology Letters, vol. 29, no. 23, pp. 2091-2094, Sept. 2017.