Datasets for supervised machine learning

Author: Carolina Fortuna

By definition, machine learning methods rely on data for training purposes. In particular, supervised machine learning algorithms need labelled data. In this post we provide pointers to repositories and tools where relevant datasets can be found as well as tips on how to generate and publish your dataset.

Repositories and search tools

This site maintains a list of publicly available datasets that can be used for solving various telecommunications problems using machine learning. In the machine learning community, a well-known source of data (and other resources) is Kaggle. In the wireless community, a well-known source of datasets, albeit not specifically generated for machine learning purposes, is CRAWDAD. More recently, Google has provided a tool dedicated to dataset search. While the lists on this ETI are collected and published by the officers of this action, Google’s tool seems to mostly index public data from .gov sites but can be extended to include other datasets. To be included in Google’s dataset search tool, the page where the data is published should include structured data.
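
As a rough illustration of what such structured data can look like, the sketch below builds a schema.org Dataset description as JSON-LD and wraps it in a script tag for the page that publishes the data. The dataset name, description, URLs and licence are hypothetical placeholders, and the set of fields shown is an assumption about what a minimal description might contain, not a complete or authoritative list.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url):
    """Build a schema.org Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": download_url,
            }
        ],
    }
    return json.dumps(record, indent=2)


# All values below are hypothetical placeholders.
markup = dataset_jsonld(
    name="Example spectrum measurements",
    description="Labelled received-power samples from a test deployment.",
    url="https://example.org/datasets/spectrum",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/spectrum/data.csv",
)

# The JSON-LD is embedded in the HTML of the page where the data is published.
html_snippet = '<script type="application/ld+json">\n' + markup + "\n</script>"
print(html_snippet)
```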

Labelled data generation

There are three ways of generating labelled training data for your machine learning task: synthetic labelled data generation, manual labelling and (semi-)automatic data generation.

Synthetic labelled data generation

Synthetic data can be easily labelled because the transmissions are controlled by scripts written by humans, so the reception can be aligned in frequency, time and location with the transmissions. The process for generating synthetic datasets is depicted in the figure below and consists of five steps. In the first step, the wireless networking experiment or simulation is defined, implemented and executed. While the experiment is running, relevant data is collected in the data collection step. In the third step, the transmission and reception data are aligned and the relevant transmission parameters are used to generate labels for the received data. The resulting data is then split in two according to some pre-defined proportion. One part is subsequently used for training a machine learning model, while the other part is used for evaluating the trained model.
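
The alignment, labelling and splitting steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the record fields (start, end, time, frequency_mhz, technology, rssi), the "idle" label for samples with no matching transmission, and the 80/20 split are all hypothetical choices, and alignment in location is left out for brevity.

```python
import random


def label_receptions(transmissions, receptions):
    """Label each reception with the transmission active at its time and frequency."""
    labelled = []
    for rx in receptions:
        label = "idle"  # assumed label when no scripted transmission matches
        for tx in transmissions:
            same_frequency = tx["frequency_mhz"] == rx["frequency_mhz"]
            in_window = tx["start"] <= rx["time"] < tx["end"]
            if same_frequency and in_window:
                label = tx["technology"]
                break
        labelled.append({**rx, "label": label})
    return labelled


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and evaluation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical log of the scripted transmissions (known by construction).
transmissions = [
    {"start": 0.0, "end": 1.0, "frequency_mhz": 2412, "technology": "wifi"},
    {"start": 2.0, "end": 3.0, "frequency_mhz": 2412, "technology": "zigbee"},
]

# Hypothetical samples recorded at the receiver.
receptions = [
    {"time": 0.5, "frequency_mhz": 2412, "rssi": -40.0},
    {"time": 1.5, "frequency_mhz": 2412, "rssi": -95.0},
    {"time": 2.5, "frequency_mhz": 2412, "rssi": -60.0},
    {"time": 2.5, "frequency_mhz": 2437, "rssi": -93.0},
    {"time": 0.9, "frequency_mhz": 2412, "rssi": -42.0},
]

labelled = label_receptions(transmissions, receptions)
train, test = split_dataset(labelled, train_fraction=0.8)
print([sample["label"] for sample in labelled])
print(len(train), len(test))
```

Fixing the shuffle seed keeps the split reproducible, which helps when the dataset is later published together with results.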

Manual labelled data generation

When collecting data from a real environment that we cannot completely control, obtaining labels can sometimes be challenging. In such cases, methods for manually labelling the collected data are employed. The labels generated in such situations are also called ground truth because the assumption is that the human who creates them is a domain expert and is likely to be correct. In other words, it is difficult to create better labels using any other method.

The process for generating manually annotated ground truth datasets is depicted in the figure below and also consists of five steps. In the first step, the data is collected by observing relevant network behavior such as spectrum. As the amount of data can be overwhelming for a human user to understand and process, a data selection process has to be in place to select the data that will be presented to the user for labelling. This is represented as the second step in the figure. Subsequently, in the third step, a manual label insertion system has to be developed to support humans in labelling. An example is a system that shows the instances selected in Step 2 in an appropriate visual manner and enables inserting graphic or textual labels that are then recorded in the appropriate format. As in the case of synthetic data generation and labelling, the resulting data is then split in two according to some pre-defined proportion. One part is subsequently used for training a machine learning model, while the other part is used for evaluating the trained model.
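
The selection and label insertion steps can be sketched as below. This is only an illustration: random sampling is one assumed selection strategy among many, the sample fields and the CSV layout are hypothetical, and the annotator is replaced by a stand-in function so that the example runs without a user interface.

```python
import csv
import io
import random


def select_for_labelling(samples, budget, seed=0):
    """Step 2: pick a random subset small enough for a human to label."""
    rng = random.Random(seed)
    return rng.sample(samples, min(budget, len(samples)))


def record_labels(selected, ask_label, out_file):
    """Step 3: present each selected instance and store the answer as a CSV row."""
    writer = csv.writer(out_file)
    writer.writerow(["sample_id", "label"])
    for sample in selected:
        writer.writerow([sample["id"], ask_label(sample)])


# Hypothetical collected data: 20 received-power measurements.
samples = [{"id": i, "power_dbm": -100 + 5 * i} for i in range(20)]
selected = select_for_labelling(samples, budget=5)


def stand_in_annotator(sample):
    """Placeholder for the human expert; a real system would show a plot and wait for input."""
    return "occupied" if sample["power_dbm"] > -70 else "free"


buffer = io.StringIO()  # a real system would write to a file instead
record_labels(selected, stand_in_annotator, buffer)
print(buffer.getvalue())
```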

(Semi-)automatic labelled data generation

The third approach refers to automatically or semi-automatically generating labelled data for transmissions that are real but over which we have no control. Such an approach requires relevant network data collection in the first step, the same as in the manual label generation case. The second step requires implementing a labelling logic. The labelling logic can range from simple hand-coded rules on how to generate the labels to complex algorithmic approaches including heuristics or machine learning. It is actually possible to use machine learning for generating labelled data for machine learning tasks. The labelling logic then automatically generates labelled data. To verify the quality of such data, an evaluation methodology, usually against a manually labelled subset of the same data, is appropriate, as depicted in the third step. The last two steps of the process are the same as for the previously described synthetic and manual label generation.
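
A minimal sketch of the simplest case, a hand-coded rule checked against a manually labelled subset, could look as follows. The power threshold, the field names and the label values are assumptions made for illustration, and plain agreement is used as the quality measure only because it is the simplest one to show.

```python
def rule_based_label(sample, threshold_dbm=-85.0):
    """Hand-coded labelling rule: received power above the threshold means 'occupied'."""
    return "occupied" if sample["power_dbm"] > threshold_dbm else "free"


def agreement(auto_labels, manual_labels):
    """Fraction of manually labelled samples on which the automatic label agrees."""
    shared = [key for key in manual_labels if key in auto_labels]
    if not shared:
        raise ValueError("no overlap between automatic and manual labels")
    matches = sum(auto_labels[key] == manual_labels[key] for key in shared)
    return matches / len(shared)


# Hypothetical collected data.
samples = [
    {"id": 0, "power_dbm": -95.0},
    {"id": 1, "power_dbm": -60.0},
    {"id": 2, "power_dbm": -84.0},
    {"id": 3, "power_dbm": -90.0},
]

# Step 2: the labelling logic labels every sample automatically.
auto_labels = {sample["id"]: rule_based_label(sample) for sample in samples}

# Step 3: compare against a manually labelled subset of the same data.
manual_labels = {0: "free", 1: "occupied", 2: "free"}
score = agreement(auto_labels, manual_labels)
print(f"agreement with manual labels: {score:.2f}")
```

A low agreement score would suggest revisiting the rule (here, the threshold) before the automatically labelled data is used for training.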