Learning to Communicate: Hands-on Coding

Authors: Sebastian Cammerer, Sebastian Dörner, Adriano Pastore

All beginnings are difficult – we have often been asked how to get started with deep learning for communications; not in terms of deep learning theory, but how to practically train a first neural network for information transmission. This article therefore aims at lowering the barrier to entry by providing (and explaining) the necessary Tensorflow code to build and train an autoencoder for information transmission from scratch.

Parts of this article are based on our experience from organizing the 6th IRACON Training School on Deep and Machine Learning Techniques for (Beyond) 5G Wireless Communication Systems and, in particular, on the feedback from the machine learning challenge that was part of it. It turned out that this was the first practical coding experience with deep learning for many students. Based on their feedback (and our own conviction), we believe that a practical perspective on deep learning is beneficial for any deep learning (for communications) researcher.

The outline of this article is as follows:

1) “Learning to Communicate” – System Model & Background

2) Choosing the Right Environment and Toolchain

3) Tensorflow Implementation

4) Useful Tips and Common Pitfalls

The Tensorflow code examples are also provided online (see the Colab notebook links in Section 3).

1.) “Learning to Communicate” – System Model & Background

The fundamental problem of communication is that of “reproducing at one point either exactly or approximately a message selected at another point” [1] or, in other words, reliably transmitting a message from a source to a destination over a channel by the use of a transmitter and a receiver. Therefore, we consider a communication system consisting of a transmitter, a channel, and a receiver.

The transmitter wants to convey a message s \in \{1,2,\dots,M\} over the channel to the receiver. To do so, it is allowed to transmit n complex baseband symbols, i.e., a vector \mathbf{x} \in \mathbb{C}^n subject to a power constraint, e.g., \left\| \mathbf{x} \right\|^2 \leq n. At the receiver side, a noisy and possibly distorted version \mathbf{y} \in \mathbb{C}^n of \mathbf{x} is observed. The task of the receiver is then to produce an estimate \hat{s} of the original message s. As explained in [2,3], the communication system described above can be interpreted as an autoencoder [4, Ch. 14]. An autoencoder is a deep neural network (NN) that is trained to reconstruct its input at the output and, as the information must pass through each layer, the network needs to find a robust representation of the input message at every layer.

While undercomplete autoencoders (i.e., whose hidden layers have fewer neurons than the input/output) have traditionally been studied for extracting hidden features and learning a robust compressed representation of the input, in the case of communication, we consider overcomplete autoencoders. Their purpose is to add (on the transmitter side) and remove (on the receiver side) redundancy to the input message representation in a way that is matched to the channel (noise layer). Such an autoencoder will thus learn an efficient error-correcting code and its corresponding decoding algorithm.

The trainable transmitter part of the autoencoder consists of an embedding followed by a feedforward dense NN. Its 2n-dimensional output is cast to an n-dimensional complex-valued vector by considering one half as the real part and the other half as the imaginary part. Finally, a normalization layer ensures that the power constraint on the output \mathbf{x} is met.
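As a minimal sketch of this real-to-complex mapping and power normalization (an illustration only; the training code in Section 3 uses real-valued channel uses for simplicity, and the function name is ours), assuming tx is the 2n-dimensional real output of the dense layers:

import tensorflow as tf

# Sketch: map a [batch_size, 2n] real tensor to n power-normalized complex symbols.
def to_normalized_complex(tx, n):
    x = tf.complex(tx[:, :n], tx[:, n:])              # first half: real part, second half: imaginary part
    power = tf.reduce_mean(tf.square(tf.abs(x)))      # average power per complex symbol
    return x / tf.cast(tf.sqrt(power), tf.complex64)  # normalize average power to 1.0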

The channel can be implemented as a set of layers with probabilistic and deterministic behavior, e.g., for an additive white Gaussian noise (AWGN) channel, Gaussian noise with fixed or random noise power \sigma^2 per complex symbol is added.
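For the complex-valued case described above, a corresponding AWGN layer could be sketched as follows (assumptions: x is the complex transmit tensor and sigma2 is a given noise power per complex symbol, split equally over real and imaginary parts):

half_std = tf.sqrt(sigma2 / 2.0)
noise = tf.complex(tf.random.normal(tf.shape(tf.real(x)), stddev=half_std),
                   tf.random.normal(tf.shape(tf.real(x)), stddev=half_std))
y = x + noise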

The receiver consists of a transformation from complex to real values (by concatenating real and imaginary parts of the channel output), followed by a feedforward NN whose last layer has a “softmax” activation (see [4]). Its output \mathbf{b} \in (0,1)^M is a probability vector that assigns a probability to each of the possible messages. Finally, the estimate \hat{s} of s is selected as the index of the largest element of \mathbf{b}.

The resulting autoencoder can be trained end-to-end using stochastic gradient descent (SGD).

Since we have phrased the communication problem as a classification task, it is natural to use the cross-entropy loss function L_\text{loss} = -\log (\mathbf{b}_s), where \mathbf{b}_s denotes the sth element of \mathbf{b}. As we deal with an autoencoder where the output should equal the input during training, we have a fixed number of M different training labels. For further details we refer to [2,3].
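As a small numerical illustration of this loss (toy values, M = 4):

import numpy as np

b = np.array([0.1, 0.2, 0.6, 0.1])  # softmax output of the receiver
s = 2                               # transmitted message index
loss = -np.log(b[s])                # ~0.51; approaches 0 as b[s] -> 1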

 

2.) Choosing the Right Environment and Toolchain

Having the right toolchain is an important decision in deep learning. As most state-of-the-art libraries provide Python support, we stick to Python (and its numerical libraries such as numpy). Further, we use Tensorflow [5] as deep learning framework and access it via the browser-based IDE Jupyter [6]. The main advantage is that the setup can run on any server (ideally with graphics processing unit (GPU) support) while the client only requires a standard web browser (and no high-performance computing power).

Generally speaking, no GPU is needed for training (everything runs on a CPU with enough RAM). However, a GPU accelerates training by (at least) an order of magnitude.

Further, Google provides a Jupyter environment free of charge (including GPU support for several hours of training) called “Google Colaboratory” [7], which is sufficient for our basic experiments. However, the provided code examples (Jupyter notebooks) can be exported and executed in any Jupyter environment running on your own server. To keep the story short: for this tutorial, all the software you need is a working web browser.

3.) Tensorflow Implementation

So, enough with the theoretical stuff, let’s get our hands dirty!

At first we need to import the required libraries: tensorflow, to perform computation-graph-based training of the NN; numpy, for basic computations and to feed the NN; matplotlib, to plot our results.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Then we need to define the main parameters of the autoencoder, k and n.

k = 8 # Number of information bits per message, i.e., M=2**k
n = 8 # Number of real channel uses per message
M = 2**k # Number of messages

Now we start building our model in Tensorflow. For this simple example, we define all variables in the Tensorflow default graph. We begin with one of the most basic parameters in deep learning (DL), the batch_size. It defines the number of samples within a mini-batch that is used for the stochastic gradient computation. In our case we want this to be a flexible parameter to allow adjustments during training. Therefore, and given the fact that we can generate as many samples as we want on the fly, we define the batch_size as a feedable scalar integer.

batch_size = tf.placeholder(tf.int32,shape=[])

Now we can create the messages we want to transmit in this batch. They are simply drawn from a random uniform distribution.

s = tf.random.uniform(shape=[batch_size],minval=0,maxval=M,dtype=tf.int32)

And to efficiently feed them to the first dense NN layer of the transmitter part, we transform them into so-called one-hot vectors.

s_one_hot = tf.one_hot(s,depth=M)

This tensor now holds batch_size vectors of length M, where only one entry (at position s) is set to 1.0 while all other entries are 0.0; this representation is commonly used for classification tasks.

Let’s define the transmitter part. Two dense layers are already enough to perform a transformation from messages to real-valued channel uses. This is basically a simple lookup-table transformation that could also be implemented by a single matrix of trainable weights, but for simplicity we use default Tensorflow/Keras dense layers in the following. The first dense transmitter layer shall be “relu” activated [4] and can have any number of neurons (called units); we choose M because the samples of the input s_one_hot are also of length M. The second dense layer is required to have n units, which form the output of the transmitter, and shall not have any activation function, since we want the transmitter to be able to output any real-valued numbers (including negative values).

tx = tf.keras.layers.Dense(units=M,activation="relu")(s_one_hot)
tx = tf.keras.layers.Dense(units=n,activation=None)(tx)

To prevent the transmitter from learning unnecessarily large outputs and becoming numerically unstable, we normalize the average power of all transmitter outputs in the mini-batch to equal 1.0.

x = tx / tf.sqrt(tf.reduce_mean(tf.square(tx)))

Now x is the output of our transmitter; next comes the channel. We choose a basic additive white Gaussian noise (AWGN) channel that simply adds scaled, normally distributed (real-valued) noise on top of x. But to be able to adaptively change the noise power and, thereby, the signal-to-noise ratio (SNR), we implement the noise standard deviation as a feedable Tensorflow placeholder.

noise_std = tf.placeholder(dtype=tf.float32,shape=[])

Then we simply draw a noise tensor (i.e., a vector per sample in the mini-batch) of the same shape as x from a normal distribution with the standard deviation given by the placeholder.

noise = tf.random.normal(shape=tf.shape(x),stddev=noise_std)

Now, we simply add this random noise tensor on top of x to get y, which represents the received messages after transmission.

y = x + noise

With the channel being the noise layer of our autoencoder, we now need a receiver part that produces a reconstruction s_hat given y. The receiver part consists of a first dense layer that can have an arbitrary number of neurons and is required to have a non-linear activation. Based on experience we choose M neurons and “relu” activations.

rx = tf.keras.layers.Dense(units=M, activation="relu")(y)

Depending on the complexity of the channel model, we could now add several of those layers to our model to increase the complexity and capabilities of the (deep) neural network. But for the simple AWGN channel, one input and one output layer are enough at the receiver. The dense output layer is required to have M units, since we want to produce an estimate of the probability of each possible message, and its outputs shall be so-called “logits”, which means that no activation function is used.

s_hat = tf.keras.layers.Dense(units=M, activation=None)(rx)

Now that the autoencoder is fully described, we can feed in messages and get the corresponding estimates as output. What is still missing is a loss function that quantifies the current performance of the model by comparing the input s with the output s_hat. We use a default cross-entropy loss function that inherently activates the logits with “softmax” and accepts sparse labels.

cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=s,logits=s_hat)

We also calculate the average message (or block) error rate of the mini-batch by hard-deciding the receiver’s output on the element with the highest probability (argmax).

correct_predictions = tf.equal(tf.argmax(tf.nn.softmax(s_hat),axis=1,output_type=tf.int32),s)
accuracy = tf.reduce_mean(tf.cast(correct_predictions,dtype=tf.float32))
bler = 1.0 - accuracy

 

Finally, we need to define an optimizer algorithm that updates the weights of our autoencoder according to the current loss and the gradient of the batch. We chose the Adam optimizer to minimize our loss function and use a placeholder as learning rate to be able to adjust this hyperparameter during training.

lr = tf.placeholder(dtype=tf.float32,shape=[])
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(cross_entropy)

 

Now that the Tensorflow graph is defined, we need to create a Tensorflow session that can run the graph.

sess = tf.Session()

After creating the session all trainable variables need to be initialized. Since the used Tensorflow/Keras layers already define functions to create their initial weights, we simply need to run the global variables initializer and all weights are ready to go.

sess.run(tf.global_variables_initializer())

Before we start with the training, we need to formulate an SNR definition, so that we can easily train at a desired SNR point. This function simply calculates the noise standard deviation for a given SNR (while signal power is normalized to 1.0).

def EbNo2Sigma(ebnodb):
    # Convert Eb/No from dB to linear scale
    ebno = 10**(ebnodb/10)
    # n real channel uses correspond to n/2 complex symbols
    bits_per_complex_symbol = k/(n/2)
    # Noise standard deviation (signal power normalized to 1.0)
    return 1.0/np.sqrt(bits_per_complex_symbol*ebno)

Now we can begin with the training. We start with 1,000 iterations of running the train_op operation with a small batch_size of only 100 messages and a learning rate of 0.001. After this first training epoch we reduce the learning rate to 0.0001 and run another epoch with 10,000 iterations. For the last training epoch, we raise the batch size to 1,000 and run another 10,000 iterations. During all training epochs we set the SNR to 7.0 dB, as we found that training the autoencoder at a block error rate (BLER) of around 0.01 leads to fast generalization.

for i in range(1000):
    sess.run(train_op, feed_dict={batch_size: 100, noise_std: EbNo2Sigma(7.0), lr: 0.001})

for i in range(10000):
    sess.run(train_op, feed_dict={batch_size: 100, noise_std: EbNo2Sigma(7.0), lr: 0.0001})

for i in range(10000):
    sess.run(train_op, feed_dict={batch_size: 1000, noise_std: EbNo2Sigma(7.0), lr: 0.0001})

So, let’s check the performance of the autoencoder by plotting its BLER over a range of SNR values. To this end, we run a Monte Carlo simulation to get an accurate BLER for each SNR point. In this example we simulate the BLER from 0 to 14 dB by running 10 mini-batches of 100,000 messages for each SNR point.

snr_range = np.linspace(0,14,15)
monte_carlo_bler = np.zeros((len(snr_range),))
for i in range(len(snr_range)):
    for j in range(10):
        monte_carlo_bler[i] += sess.run(bler, feed_dict={batch_size: 100000, noise_std: EbNo2Sigma(snr_range[i]), lr: 0.0})
monte_carlo_bler = monte_carlo_bler / 10

Finally, we plot the BLER vs SNR using matplotlib.

plt.figure(figsize=(10,8))
plt.plot(snr_range, monte_carlo_bler, linewidth=2.0)
plt.legend(['Autoencoder'], prop={'size': 16}, loc='upper right');
plt.yscale('log')
plt.xlabel('EbNo (dB)', fontsize=18)
plt.ylabel('Block-error rate', fontsize=18)
plt.grid(True)
plt.ylim([1e-5,1])

The result should be a BLER vs. SNR curve for the trained autoencoder, like the one shown in the linked Colab notebook.

One could now compare this performance with other modulation schemes. In the linked Colaboratory example we also provide the uncoded BLER of the quadrature phase shift keying (QPSK) modulation scheme.
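If you want a quick analytical reference without opening the notebook, one possible way to compute an uncoded QPSK baseline is sketched below (assumptions: Gray-mapped QPSK over AWGN with independent bit errors; this is not necessarily how the notebook computes it):

from scipy.special import erfc

def qpsk_bler(ebnodb, k):
    ber = 0.5 * erfc(np.sqrt(10**(ebnodb / 10)))  # bit error probability of Gray-mapped QPSK
    return 1.0 - (1.0 - ber)**k                   # a block of k bits is in error if any bit is wrong

qpsk_ref = [qpsk_bler(ebnodb, k) for ebnodb in snr_range]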

As this comparison shows, the autoencoder’s BLER is lower than that of uncoded QPSK over the whole SNR range. If you are interested in the origins of this gain, you are welcome to have a look at [2] and [3].

You can find this simple autoencoder code example at: Simple AE Colab Notebook

And a more advanced autoencoder model with several auxiliary functions at: Advanced AE Colab Notebook

 

4.) Useful Tips and Common Pitfalls

1) Keep in mind that efficient data-feeding can speed up the training process a lot. In many scenarios in communications, we have access to a channel model to draw samples (e.g., the AWGN channel). In such cases, it is more convenient to embed the channel model into the Tensorflow graph (as a stochastic layer). This has the advantage of providing an unlimited amount of training samples (as every noise realization is different and not limited to the size of a training set). Further, if the channel model is part of the Tensorflow graph, no data-feeding (i.e., copying data from CPU memory or even the hard drive to GPU memory) is required during training. This sounds like a minor issue, but turns out to be one of the major bottlenecks in training performance, especially if the NNs are relatively small (which is typically the case in communications when compared to computer vision or other popular DL-driven domains). Thus, whenever possible we highly recommend generating the training data on-the-fly directly during the training iteration.

2) Avoid overfitting: the noise/channel is implicitly taken into account during training, so the model becomes “matched” to the specific underlying noise statistics. Even if training and test sets are clearly separated, the NN may implicitly learn the noise statistics (e.g., fixed channel taps in a multipath channel model) of the model which was used to draw the samples. Sometimes this is a desired effect, but sometimes it prevents a fair comparison with classical baselines (which are more universal and have no implicit knowledge of these parameters) due to the NN’s degraded generalization performance. Thus, one should always keep in mind that an NN is extremely good at capturing the underlying statistics of the training data. In some (special) cases, even the noise generator itself could accidentally be learned (cf. [8]).

3) float32 vs. float64: keep in mind that the (typical) default numerical precision in deep learning is 32-bit floating point. This simply comes from the fact that the hardware of most (consumer) GPUs is optimized for float32, and in most computer vision applications this is sufficient. Although the hardware supports float64 (double precision), it causes a significant slowdown (>10x for consumer cards) and, thus, should only be used if really needed. However, keep in mind that float32 may cause some inaccuracy (e.g., when compared to Matlab) which may matter in the field of communications (e.g., for fiber optical channel models). In other cases, reducing the numerical precision to 16-bit can speed up your computations a lot while the precision may still be sufficient for your specific task. A short snippet for switching the precision is sketched after this list.

4) Activate the GPU (or even TPU) runtime in Colaboratory (Runtime -> Change runtime type); otherwise Tensorflow may just use the CPU for training. A quick check is shown below.
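Relating to tips 3 and 4, here is a short sketch of how to switch the default Keras precision and how to check whether Tensorflow actually sees a GPU (an empty string means CPU only); both are standard Tensorflow calls, but treat them as a sketch rather than part of the notebooks above:

tf.keras.backend.set_floatx('float64')  # switch default Keras layer precision (tip 3); use with care
print(tf.test.gpu_device_name())        # e.g. '/device:GPU:0' if a GPU is visible (tip 4)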

 

Conclusion

In this blog article, we provided you with the basic knowledge to train a neural network in a state-of-the-art software environment.

So, what remains to be done?

Change to your target channel, find the best hyperparameters and let the system learn to communicate over your desired channel; or just use the notebook as a basis for your own deep learning projects.


References

[1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. Journal, vol. 27, pp. 379–423, 623–656, 1948.

[2] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563-575, Oct. 2017.

[3] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep Learning-based Communication Over the Air,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 132–143, Feb. 2018.

[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[5] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016. [Online]. Available: http://tensorflow.org/

[6] Project Jupyter https://jupyter.org/

[7] Google Colaboratory https://colab.research.google.com/

[8] T. A. Eriksson, H. Bülow, and A. Leven, “Applying neural networks in optical communication systems: possible pitfalls,” IEEE Photonics Technology Letters, vol. 29, no. 23, pp. 2091-2094, Sept. 2017.

 

 

Datasets for supervised machine learning

Author: Carolina Fortuna

By definition, machine learning methods rely on data for training purposes. In particular, supervised machine learning algorithms need labelled data. In this post we provide pointers to repositories and tools where relevant datasets can be found as well as tips on how to generate and publish your dataset.

Repositories and search tools

This site maintains a list of publicly available datasets that can be used for solving various telecommunications problems using machine learning. In the machine learning community, a well-known source of data (and other resources) is Kaggle. In the wireless community, a well-known source of datasets, albeit not specifically generated for machine learning purposes, is CRAWDAD. More recently, Google provides a new tool dedicated to dataset search. While the lists on this ETI are collected and published by the officers of this action, Google’s tool seems to mostly index public data from .gov sites, but it can be extended to include other datasets as well. To be included in Google’s dataset search tool, the page where the data is published should include structured data.

Labelled data generation

There are three ways of generating labelled training data for your machine learning task: synthetic labelled data generation, manual labelling and (semi-)automatic data generation. 

 

Synthetic labelled data generation

Synthetic data can be easily labelled because the transmissions are controlled by scripts written by humans, and the reception can thus be aligned in frequency, time and location to the transmissions. The process for generating synthetic datasets is depicted in the figure below and consists of five steps. In the first step, the wireless networking experiment or simulation is defined, implemented and executed. While the experiment is executed, relevant data is collected in the data collection step. In the third step, the transmission and reception data are aligned, and the relevant transmission parameters are used to generate labels for the received data. The resulting data is then split in two according to some pre-defined proportion: one part is used for subsequently training a machine learning model, while the other part is used for evaluating the trained model.
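The final split step is common to all three approaches described in this post; a minimal sketch (assuming samples and labels are aligned numpy arrays, with function and argument names chosen for illustration) could look like this:

import numpy as np

def train_eval_split(samples, labels, train_fraction=0.8, seed=0):
    idx = np.random.RandomState(seed).permutation(len(samples))  # shuffle before splitting
    cut = int(train_fraction * len(samples))
    train_idx, eval_idx = idx[:cut], idx[cut:]
    return (samples[train_idx], labels[train_idx]), (samples[eval_idx], labels[eval_idx])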

Manual labelled data generation

When collecting data from a real environment that we cannot completely control, collecting labels can sometimes be challenging. In such cases, methods for manually labelling the collected data are employed. The labels generated in such situations are also called ground truth, because the assumption is that the human who creates them is a domain expert and is likely to be correct. In other words, it is difficult to create better labels using any other method.

The process for generating manually annotated ground truth datasets is depicted in the figure below and also consists of five steps. In the first step, the data is collected by observing relevant network behavior such as the spectrum. As the amount of data can be overwhelming for a human to understand and process, a data selection process has to be in place to select the data that will be presented to the user for labelling; this is the second step in the figure. Subsequently, in the third step, a manual label insertion system has to be developed to support humans in labelling, for instance, a system that visualizes the instances selected in the second step in an appropriate manner and enables inserting graphic or textual labels that are then recorded in the appropriate format. As in the case of synthetic data generation and labelling, the resulting data is then split in two according to some pre-defined proportion: one part is used for subsequently training a machine learning model, while the other part is used for evaluating the trained model.

(Semi-)automatic labelled data generation

The third approach refers to automatically (or semi-automatically) generating labelled data for transmissions that are real but over which we have no control. This approach requires relevant network data collection in the first step, the same as in the manual label generation case. The second step requires implementing a labelling logic, which can range from simple hand-coded rules for generating the labels to complex algorithmic approaches including heuristics or machine learning; it is actually possible to use machine learning to generate labelled data for machine learning tasks. The labelling logic then automatically generates labelled data. To verify the quality of such data, an evaluation methodology, usually against a manually labelled subset of the same data, is appropriate, as depicted in the third step. The last two steps of the process are the same as for the previously described synthetic and manual label generation. A toy example of a hand-coded labelling rule is sketched below.
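As an illustration of the simplest kind of labelling logic, a hypothetical hand-coded rule for spectrum data could label a measurement as occupied whenever its received power exceeds a threshold (the function name and threshold value are made up for illustration):

import numpy as np

def label_occupancy(power_dbm, threshold_dbm=-90.0):
    # Hypothetical rule: mark a measurement as "occupied" if its power exceeds the threshold
    return np.where(power_dbm > threshold_dbm, "occupied", "idle")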