This is the data format used, for example, by the datasets at http://manikvarma.org/downloads/XC/XMLRepository.html. It supports multiple labels per example, and encodes both features and labels in a sparse format. Files in this format can be read using the io::read_xmc_dataset function.

Specification

Files start with a header line that contains three positive integers: the number of instances (i.e. the number of instance lines that follow), the number of features, and the number of labels. The header is followed by one line per instance. Each instance line starts with a comma-separated list of label ids, followed by a space-separated list of sparse features. Each sparse feature consists of an integer feature index and a real-valued feature value, separated by a colon. The comma and the colon must follow the preceding number immediately (i.e. without whitespace), but may themselves be followed by whitespace. An empty label list has to be indicated by starting the line with a whitespace character. Both spaces and tabs are recognized as whitespace.

Empty lines are ignored. Consequently, if you have an example without labels where all features are zero, you have to specify one of the features with an explicit zero. We also skip any line whose first character is a #. Placing a # at any other location is an error.

An example file could look like this:

3 10 5
# 3 instances, 10 features {0, ..., 9} and 5 labels {0, ..., 4}
2     4:1.0     5:-0.5
1, 4  2:1.0e-4
0,1   0:0.5     3:2.2
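
The file-level rules of the specification (header line, skipped lines, placement of #) can be sketched in a few lines of C++. This is only an illustration, not the code in xmc.cpp: the names Header, is_skipped, parse_header and read_instance_lines are hypothetical, the sketch assumes that a line consisting only of whitespace counts as empty, and it leaves the parsing of the individual instance lines aside.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical holder for the three header integers.
struct Header {
    long instances = 0;
    long features = 0;
    long labels = 0;
};

// Lines the format ignores: empty lines and lines whose first character is '#'.
// Assumption: a line that contains only whitespace is treated as empty.
bool is_skipped(const std::string& line) {
    if (!line.empty() && line.front() == '#') return true;
    return line.find_first_not_of(" \t\r") == std::string::npos;
}

// The header must consist of three positive integers.
Header parse_header(const std::string& line) {
    std::istringstream fields(line);
    Header header;
    if (!(fields >> header.instances >> header.features >> header.labels) ||
        header.instances <= 0 || header.features <= 0 || header.labels <= 0) {
        throw std::runtime_error("header must contain three positive integers");
    }
    return header;
}

// Returns the raw instance lines, with comments and empty lines removed.
std::vector<std::string> read_instance_lines(std::istream& input) {
    std::string line;
    if (!std::getline(input, line)) throw std::runtime_error("missing header line");
    const Header header = parse_header(line);
    assert(header.instances > 0);  // guaranteed by parse_header

    std::vector<std::string> instance_lines;
    while (std::getline(input, line)) {
        if (is_skipped(line)) continue;
        if (line.find('#') != std::string::npos) {
            throw std::runtime_error("'#' is only allowed as the first character of a line");
        }
        instance_lines.push_back(line);
    }
    if (instance_lines.size() != static_cast<std::size_t>(header.instances)) {
        throw std::runtime_error("number of instance lines does not match the header");
    }
    return instance_lines;
}
```

Applied to the example file, read_instance_lines would return the three instance lines and drop the comment line.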

We also support reading files in which features and labels are indexed starting from one. In that case, set IndexMode to io::IndexMode::ONE_BASED.

Details

The functions for reading xmc data are defined in xmc.cpp. In broad terms, the reading works as follows: First, one quick pass is performed over the entire dataset, in which we count the number of occurrences of the : character. Since every sparse feature contains exactly one colon, this allows us to pre-allocate the buffers for the sparse feature matrix immediately at the correct size, so that all insert operations will be O(1). The second pass then does the actual number parsing. For this second pass, I expect no disk read overhead, since the data should still be cached in RAM, but I have not verified this. In any case, reading a data file of about 1.5GB from a fast SSD takes less than 15 seconds, so this is not a bottleneck.
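
The two-pass idea can be illustrated with a small, self-contained sketch. It is not the code from xmc.cpp: the function names and the FeatureEntry alias are made up for this example, the input is assumed to be already in memory and to contain no : inside comment lines, and error handling is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation of one sparse entry: (feature index, value).
using FeatureEntry = std::pair<long, double>;

// Pass 1: every sparse feature contributes exactly one ':'.
std::size_t count_feature_entries(const std::string& body) {
    return static_cast<std::size_t>(std::count(body.begin(), body.end(), ':'));
}

// Pass 2 (simplified): parse all index:value pairs into pre-sized storage.
std::vector<FeatureEntry> collect_feature_entries(const std::string& body) {
    std::vector<FeatureEntry> entries;
    entries.reserve(count_feature_entries(body));
    const std::size_t reserved = entries.capacity();

    for (std::size_t colon = body.find(':'); colon != std::string::npos;
         colon = body.find(':', colon + 1)) {
        // Walk back to the first digit of the index that precedes the colon.
        std::size_t start = colon;
        while (start > 0 && std::isdigit(static_cast<unsigned char>(body[start - 1]))) {
            --start;
        }
        const long index = std::strtol(body.c_str() + start, nullptr, 10);
        const double value = std::strtod(body.c_str() + colon + 1, nullptr);
        entries.emplace_back(index, value);
    }

    // Because of the reserve call, no reallocation happened while inserting.
    assert(entries.capacity() == reserved);
    return entries;
}
```

For the three instance lines of the example file, the first pass would count five colons, so storage for exactly five entries is reserved before any number is parsed.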

To support both 0-based and 1-based indexing, the internal reading method is templated over an integer parameter IndexOffset, which is either zero or one. In that way, we get optimized code for the default setting (IndexOffset = 0), but can still easily support 1-based indexing.
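
A minimal sketch of this pattern is given below. Only the template parameter name IndexOffset and the enumerator ONE_BASED are taken from this document; the local IndexMode enum (including ZERO_BASED), parse_index and parse_index_with_mode are hypothetical stand-ins, not the library's actual declarations.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>

// Hypothetical stand-in for io::IndexMode; ZERO_BASED is an assumed name.
enum class IndexMode { ZERO_BASED, ONE_BASED };

// The offset is a compile-time constant, so for IndexOffset == 0 the
// subtraction below can be removed entirely by the compiler.
template <long IndexOffset>
long parse_index(const char* text, long upper_bound) {
    static_assert(IndexOffset == 0 || IndexOffset == 1, "IndexOffset must be 0 or 1");
    assert(text != nullptr);

    char* end = nullptr;
    const long raw = std::strtol(text, &end, 10);
    if (end == text) throw std::runtime_error("expected an integer index");

    const long index = raw - IndexOffset;  // convert to 0-based
    if (index < 0 || index >= upper_bound) {
        throw std::out_of_range("index outside of the range given in the header");
    }
    return index;
}

// The runtime setting is translated into a template argument in one place.
long parse_index_with_mode(const char* text, long upper_bound, IndexMode mode) {
    switch (mode) {
        case IndexMode::ZERO_BASED: return parse_index<0>(text, upper_bound);
        case IndexMode::ONE_BASED:  return parse_index<1>(text, upper_bound);
    }
    throw std::logic_error("unknown IndexMode");
}
```

In a real reader the switch would presumably happen once per file rather than once per index, so that the inner parsing loop contains no runtime branch on the index mode.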

Why do we allow for spaces after the separators, but not before? The reason is simply that the conversion functions from the standard library skip leading whitespace automatically, so disallowing these spaces would in fact be extra work on our side. On the other hand, by not allowing spaces directly after the numbers, we can immediately check if the label list has ended (space) or will continue (,), without needing to look ahead.
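
The following sketch illustrates this reasoning with std::strtol, which skips leading whitespace and reports where the number ended. Whether xmc.cpp uses std::strtol or a different conversion function is not stated here, so this is an assumption; read_label and AfterLabel are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>

// What follows a label id, judged from the single character after the number.
enum class AfterLabel { MORE_LABELS, END_OF_LABELS, INVALID };

// Reads one label id at cursor and advances cursor past it.
AfterLabel read_label(const char*& cursor, long& label) {
    assert(cursor != nullptr);
    char* end = nullptr;
    // Whitespace after a previous comma is skipped by strtol itself.
    label = std::strtol(cursor, &end, 10);
    if (end == cursor) return AfterLabel::INVALID;  // no digits found
    cursor = end;

    if (*cursor == ',') {  // the list continues
        ++cursor;
        return AfterLabel::MORE_LABELS;
    }
    if (*cursor == ' ' || *cursor == '\t' || *cursor == '\0') {
        return AfterLabel::END_OF_LABELS;  // the list has ended
    }
    return AfterLabel::INVALID;
}
```

For the input "1, 4  2:1.0e-4", two calls yield the labels 1 and 4. For "1 ,4", the first call already reports the end of the label list, which is why whitespace before a comma cannot be accepted without extra lookahead.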