DiSMEC++
Classes | |
struct | XMCHeader |
Collects the data from the header of an XMC file (see XMC data format). | |
Functions | |
XMCHeader | parse_xmc_header (const std::string &content) |
Parses the header (number of examples, features, labels) of an XMC dataset file. | |
std::vector< long > | count_features_per_example (std::istream &source, std::size_t num_examples=100'000) |
Extracts number of nonzero features for each instance. | |
template<class F > | |
const char * | parse_labels (const char *line, F &&callback) |
Parses the label part of an XMC dataset line. | |
template<long IndexOffset> | |
void | read_into_buffers (std::istream &source, SparseFeatures &feature_buffer, std::vector< std::vector< long >> &label_buffer) |
Iterates over the lines in source and puts the corresponding features and labels into the given buffers. | |
std::ostream & | write_label_list (std::ostream &stream, const std::vector< int > &labels) |
std::vector<long> anonymous_namespace{xmc.cpp}::count_features_per_example | ( | std::istream & | source, |
std::size_t | num_examples = 100'000 |
||
) |
Extracts number of nonzero features for each instance.
This iterates over the lines in source and extracts the number of nonzero features for each line. You can optionally supply the expected number of examples, which is used to reserve memory in the counter buffer. Completely empty lines are ignored, as are lines that start with # (see XMC data format). This function does not validate that the data is given in the correct format. It only counts the occurrences of the colon character (:), which in correctly formatted lines equals the number of nonzero features.
source | The stream from which to read. Should not contain the header. |
num_examples | Number of examples to expect. This is used to reserve space in the result vector. Optional, but if not given may result in additional allocations being performed and/or too much memory being used. |
…, then we can also reserve the label vector.
Definition at line 75 of file xmc.cpp.
Referenced by dismec::io::read_xmc_dataset(), and TEST_CASE().
const char* anonymous_namespace{xmc.cpp}::parse_labels | ( | const char * | line, |
F && | callback | ||
) |
Parses the label part of an XMC dataset line.
Returns a pointer to where the label part ends and feature parsing should start. Each label is parsed as an integer (possibly with leading spaces), followed by either a comma, indicating that more labels follow, or a whitespace character, indicating that this was the last label. If the first character is whitespace, this is interpreted as the absence of labels. This function expects that comments and empty lines have already been skipped.
line | Pointer to a null-terminated string. |
callback | A function that takes a single parameter of type long, which will be called for each label that is encountered. |
Throws if number parsing fails, or the format is not as expected.
Returns a pointer into line from which the feature parsing should start.
Definition at line 114 of file xmc.cpp.
References dismec::io::parse_long(), and THROW_ERROR.
Referenced by read_into_buffers(), and TEST_CASE().
XMCHeader anonymous_namespace{xmc.cpp}::parse_xmc_header | ( | const std::string & | content | ) |
Parses the header (number of examples, features, labels) of an XMC dataset file.
Parses the given line as the header of an XMC dataset. The header is expected to consist of three whitespace-separated positive integers in the order #examples #features #labels.
Throws if any of the parsed numbers is non-positive, or if the parsing itself fails.
Definition at line 31 of file xmc.cpp.
References dismec::io::parse_header(), and THROW_ERROR.
Referenced by dismec::io::read_xmc_dataset(), and TEST_CASE().
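The header check can be sketched like this. HeaderSketch and parse_header_sketch are hypothetical names, the field names are assumptions (the members of XMCHeader are not listed here), and std::runtime_error stands in for THROW_ERROR. The sketch does not reject trailing content after the third number.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical counterpart of XMCHeader; the field names are assumptions.
struct HeaderSketch {
    long num_examples;
    long num_features;
    long num_labels;
};

// Hypothetical stand-in for parse_xmc_header; not the code from xmc.cpp.
HeaderSketch parse_header_sketch(const std::string& content) {
    std::istringstream stream(content);
    HeaderSketch header{};
    // Read three whitespace-separated integers: #examples #features #labels.
    if (!(stream >> header.num_examples >> header.num_features >> header.num_labels)) {
        throw std::runtime_error("could not parse header: " + content);
    }
    // All three entries have to be strictly positive.
    if (header.num_examples <= 0 || header.num_features <= 0 || header.num_labels <= 0) {
        throw std::runtime_error("header entries must be positive: " + content);
    }
    return header;
}
```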
void anonymous_namespace{xmc.cpp}::read_into_buffers | ( | std::istream & | source, |
SparseFeatures & | feature_buffer, | ||
std::vector< std::vector< long >> & | label_buffer | ||
) |
Iterates over the lines in source and puts the corresponding features and labels into the given buffers.
These are expected to be pre-allocated with the correct size. This means that feature_buffer has to be an empty sparse matrix of dimensions num_examples x num_features, and label_buffer should be a vector (of empty vectors) of size num_labels. To speed up reading, it is advisable to reserve the appropriate amount of space in the buffers, though this is not technically necessary.
IndexOffset | The template parameter IndexOffset is used to switch between 0-based and 1-based indexing. Internally, we always use 0-based indexing, so if IndexOffset != 0 we subtract the offset from the indices that are read from the file. |
source | istream from which the lines are read. If the file has a header, this has to be skipped before calling read_into_buffers . |
feature_buffer | Reference to an empty sparse matrix where rows correspond to examples and columns correspond to features. |
label_buffer | Vector of vectors, where the inner vectors will list the indices of the examples in which the label (as given by the outer index) is present. The outer vector has to be of size num_labels. |
Throws if a feature, label, or example index is out of bounds.
Definition at line 164 of file xmc.cpp.
References parse_labels(), dismec::io::parse_sparse_vector_from_text(), dismec::ssize(), and THROW_ERROR.
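To make the layout of label_buffer and the role of IndexOffset concrete, here is a small sketch of the label side only. The helper record_labels is hypothetical and does not appear in xmc.cpp, the feature matrix is left out entirely, and std::out_of_range stands in for THROW_ERROR.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: stores the labels of one example in a label-major
// buffer, converting file indices to the 0-based indices used internally.
template<long IndexOffset>
void record_labels(std::size_t example,
                   const std::vector<long>& raw_labels,
                   std::vector<std::vector<long>>& label_buffer) {
    for (long raw : raw_labels) {
        long label = raw - IndexOffset;  // subtract the offset read from file
        if (label < 0 || label >= static_cast<long>(label_buffer.size())) {
            throw std::out_of_range("label index out of bounds");
        }
        // The outer index is the label, the inner list holds example indices.
        label_buffer[static_cast<std::size_t>(label)].push_back(
            static_cast<long>(example));
    }
}
```

With IndexOffset = 1, a file label of 1 lands in label_buffer[0]. With IndexOffset = 0, the file indices are used unchanged.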
std::ostream& anonymous_namespace{xmc.cpp}::write_label_list | ( | std::ostream & | stream, |
const std::vector< int > & | labels | ||
) |
Definition at line 276 of file xmc.cpp.
References dismec::ssize().
Referenced by dismec::io::save_xmc_dataset().