DiSMEC++
dismec::io Namespace Reference

Namespaces

 detail
 
 model
 namespace for all model-related io functions.
 
 prediction
 

Classes

struct  MatrixHeader
 Collects the rows and columns parsed from a plain-text matrix file. More...
 
struct  LoLBinarySparse
 Binary Sparse Matrix in List-of-Lists format. More...
 
struct  NpyHeaderData
 Contains the data of the header of a npy file with an array that has at most 2 dimensions. More...
 

Enumerations

enum class  IndexMode { ZERO_BASED , ONE_BASED }
 Enum to decide whether indices in an xmc file are starting from 0 or from 1. More...
 

Functions

long parse_long (const char *string, const char **out)
 
template<class F >
void parse_sparse_vector_from_text (const char *feature_part, F &&callback)
 Parses sparse features given in index:value text format. More...
 
std::ostream & write_vector_as_text (std::ostream &stream, const Eigen::Ref< const DenseRealVector > &data)
 Writes the given vector as space-separated human-readable numbers. More...
 
std::istream & read_vector_from_text (std::istream &stream, Eigen::Ref< DenseRealVector > data)
 Reads the given vector as space-separated human-readable numbers. More...
 
template<class T >
void binary_dump (std::streambuf &target, const T *begin, const T *end)
 
template<class T >
void binary_load (std::streambuf &target, T *begin, T *end)
 
MatrixHeader parse_header (const std::string &content)
 
LoLBinarySparse read_binary_matrix_as_lol (std::istream &source)
 
 REGISTER_DTYPE (float, "<f4")
 
 REGISTER_DTYPE (double, "<f8")
 
 REGISTER_DTYPE (std::int32_t, "<i4")
 
 REGISTER_DTYPE (std::int64_t, "<i8")
 
 REGISTER_DTYPE (std::uint32_t, "<u4")
 
 REGISTER_DTYPE (std::uint64_t, "<u8")
 
bool is_npy (std::istream &target)
 Check whether the stream is a npy file. More...
 
void write_npy_header (std::streambuf &target, std::string_view description)
 Writes the header for a npy file. More...
 
std::string make_npy_description (std::string_view dtype_desc, bool column_major, std::size_t size)
 Creates a string with the data description dictionary for (1 dimensional) arrays. More...
 
std::string make_npy_description (std::string_view dtype_desc, bool column_major, std::size_t rows, std::size_t cols)
 Creates a string with the data description dictionary for matrices. More...
 
NpyHeaderData parse_npy_header (std::streambuf &source)
 Parses the header of the npy file given by source. More...
 
template<class S >
const char * data_type_string ()
 
template<class Derived >
std::string make_npy_description (const Eigen::DenseBase< Derived > &matrix)
 Generates the npy description string based on an Eigen matrix. More...
 
types::DenseRowMajor< real_t > load_matrix_from_npy (std::istream &source)
 Loads a matrix from a numpy array. More...
 
types::DenseRowMajor< real_t > load_matrix_from_npy (const std::string &path)
 
void save_matrix_to_npy (std::ostream &source, const types::DenseRowMajor< real_t > &)
 Saves a matrix to a numpy array. More...
 
void save_matrix_to_npy (const std::string &path, const types::DenseRowMajor< real_t > &)
 
MultiLabelData read_slice_dataset (std::istream &features, std::istream &labels)
 Reads a dataset given in slice format. More...
 
MultiLabelData read_slice_dataset (const std::filesystem::path &features, const std::filesystem::path &labels)
 
MultiLabelData read_xmc_dataset (const std::filesystem::path &source, IndexMode mode=IndexMode::ZERO_BASED)
 Reads a dataset given in the extreme multilabel classification format. More...
 
MultiLabelData read_xmc_dataset (std::istream &source, std::string_view name, IndexMode mode=IndexMode::ZERO_BASED)
 Reads a dataset given in the extreme multilabel classification format. More...
 
void save_xmc_dataset (std::ostream &target, const MultiLabelData &data)
 Saves the given dataset in XMC format. More...
 
void save_xmc_dataset (const std::filesystem::path &target, const MultiLabelData &data, int precision=4)
 

Detailed Description

io namespace. TODO: convert this code to use the faster <charconv> methods once gcc implements them for floats.

Enumeration Type Documentation

◆ IndexMode

enum class dismec::io::IndexMode

Enum to decide whether indices in an xmc file are starting from 0 or from 1.

Enumerator
ZERO_BASED 

labels and feature indices are 0, 1, ..., num - 1

ONE_BASED 

labels and feature indices are 1, 2, ..., num

Definition at line 67 of file xmc.h.
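
The difference between the two modes can be illustrated with a small conversion helper. This is only a sketch: the enum is redeclared here so the example is self-contained, and `to_zero_based` is a hypothetical helper, not a function of the library.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for dismec::io::IndexMode, redeclared to keep the sketch self-contained.
enum class IndexMode { ZERO_BASED, ONE_BASED };

// Hypothetical helper: maps an index as written in the file to a zero-based
// index and rejects values outside the range allowed by the mode.
inline long to_zero_based(long raw_index, long num, IndexMode mode) {
    assert(num >= 0);  // precondition: the number of labels/features is not negative
    const long shifted = (mode == IndexMode::ONE_BASED) ? raw_index - 1 : raw_index;
    if (shifted < 0 || shifted >= num) {
        throw std::out_of_range("index outside the range allowed by the index mode");
    }
    return shifted;
}
```

With `num = 3`, the raw index 3 is valid only in `ONE_BASED` mode, and the raw index 0 is valid only in `ZERO_BASED` mode.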

Function Documentation

◆ binary_dump()

template<class T >
void dismec::io::binary_dump ( std::streambuf &  target,
const T *  begin,
const T *  end 
)

◆ binary_load()

template<class T >
void dismec::io::binary_load ( std::streambuf &  target,
T *  begin,
T *  end 
)

◆ data_type_string()

template<class S >
const char* dismec::io::data_type_string ( )

Given data type S, this returns the string representation used by numpy. For common data types, these are instantiated in io/numpy.cpp.
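
One plausible way to provide such per-type strings is a declared-only primary template plus one explicit specialization per registered type. The sketch below assumes this pattern; `data_type_string_sketch`, `REGISTER_DTYPE_SKETCH`, and `is_little_endian` are hypothetical names, and the actual macro in io/numpy.cpp may be implemented differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The primary template is only declared, so using an unregistered type
// fails at link time instead of silently returning a wrong string.
template<class S>
const char* data_type_string_sketch();

// Hypothetical stand-in for REGISTER_DTYPE: one explicit specialization per type.
#define REGISTER_DTYPE_SKETCH(TYPE, STRING) \
    template<> inline const char* data_type_string_sketch<TYPE>() { return STRING; }

REGISTER_DTYPE_SKETCH(float, "<f4")
REGISTER_DTYPE_SKETCH(double, "<f8")
REGISTER_DTYPE_SKETCH(std::int32_t, "<i4")
REGISTER_DTYPE_SKETCH(std::int64_t, "<i8")
REGISTER_DTYPE_SKETCH(std::uint32_t, "<u4")
REGISTER_DTYPE_SKETCH(std::uint64_t, "<u8")

// In NumPy dtype strings, a leading '<' marks little-endian byte order.
inline bool is_little_endian(const char* tag) {
    assert(tag != nullptr);
    return std::strlen(tag) > 0 && tag[0] == '<';
}
```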

◆ is_npy()

bool dismec::io::is_npy ( std::istream &  target)

Check whether the stream is a npy file.

This peeks at the next 6 bytes of target and checks whether they form the npy magic string. In any case, the read pointer is set back to the original position.

Definition at line 22 of file numpy.cpp.

References anonymous_namespace{numpy.cpp}::MAGIC, anonymous_namespace{numpy.cpp}::MAGIC_SIZE, and THROW_ERROR.

Referenced by anonymous_namespace{slice.cpp}::load_features(), and TrainingProgram::make_config().
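
The peek-and-restore behaviour described for is_npy() can be sketched as follows. This is an illustration under assumptions, not the code from numpy.cpp: `is_npy_sketch`, `MAGIC_SKETCH`, and `MAGIC_SIZE_SKETCH` are hypothetical names, and the magic string is the one from the NumPy file format (byte 0x93 followed by "NUMPY").

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying the function out
#include <string>

// The npy magic string: byte 0x93 followed by "NUMPY" (6 bytes in total).
constexpr char MAGIC_SKETCH[] = {'\x93', 'N', 'U', 'M', 'P', 'Y'};
constexpr std::size_t MAGIC_SIZE_SKETCH = sizeof(MAGIC_SKETCH);

inline bool is_npy_sketch(std::istream& target) {
    const std::istream::pos_type start = target.tellg();
    char buffer[MAGIC_SIZE_SKETCH] = {};
    target.read(buffer, static_cast<std::streamsize>(MAGIC_SIZE_SKETCH));
    const bool complete =
        target.gcount() == static_cast<std::streamsize>(MAGIC_SIZE_SKETCH);
    // A short read sets failbit and eofbit; clear them so that seekg works.
    target.clear();
    target.seekg(start);
    assert(target.tellg() == start);  // postcondition: read position restored
    return complete && std::memcmp(buffer, MAGIC_SKETCH, MAGIC_SIZE_SKETCH) == 0;
}
```

Because the read position is restored in both cases, a caller can fall back to a text parser on the same stream when the check returns false.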

◆ load_matrix_from_npy() [1/2]

Eigen::Matrix< real_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor > dismec::io::load_matrix_from_npy ( const std::string &  path)

Definition at line 346 of file numpy.cpp.

References load_matrix_from_npy(), and THROW_ERROR.

◆ load_matrix_from_npy() [2/2]

Eigen::Matrix< real_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor > dismec::io::load_matrix_from_npy ( std::istream &  source)

◆ make_npy_description() [1/3]

template<class Derived >
std::string dismec::io::make_npy_description ( const Eigen::DenseBase< Derived > &  matrix)

Generates the npy description string based on an Eigen matrix.

Template Parameters
Derived: The derived type of the Eigen matrix.
Parameters
matrix: Const reference to the Eigen matrix.
Returns
A string for the description dict of the matrix.

Definition at line 85 of file numpy.h.

References make_npy_description().

◆ make_npy_description() [2/3]

std::string dismec::io::make_npy_description ( std::string_view  dtype_desc,
bool  column_major,
std::size_t  rows,
std::size_t  cols 
)

Creates a string with the data description dictionary for matrices.

Parameters
dtype_desc: Description string for the data element.
column_major: Whether the format is column_major or row_major.
rows: The number of rows in the matrix.
cols: The number of columns in the matrix.
Returns
A string containing a literal python dictionary.

Definition at line 52 of file numpy.cpp.
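
As an illustration of what such a description dictionary looks like, the following sketch assembles one by hand. The key names `descr`, `fortran_order`, and `shape` are those of the NumPy file format; the function name `make_npy_description_sketch` is hypothetical, and the exact quoting and spacing produced by the library are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>

// Builds a Python dict literal describing a 2D array. For a 1D array the
// shape entry would be a one-element tuple such as (5,) instead.
inline std::string make_npy_description_sketch(std::string_view dtype_desc,
                                               bool column_major,
                                               std::size_t rows, std::size_t cols) {
    assert(!dtype_desc.empty());  // precondition: a dtype string such as "<f4"
    std::ostringstream out;
    out << "{'descr': '" << dtype_desc << "', "
        << "'fortran_order': " << (column_major ? "True" : "False") << ", "
        << "'shape': (" << rows << ", " << cols << ")}";
    return out.str();
}
```

For a row-major 3 x 4 float matrix this yields `{'descr': '<f4', 'fortran_order': False, 'shape': (3, 4)}`.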

◆ make_npy_description() [3/3]

std::string dismec::io::make_npy_description ( std::string_view  dtype_desc,
bool  column_major,
std::size_t  size 
)

Creates a string with the data description dictionary for (1 dimensional) arrays.

Parameters
dtype_desc: Description string for the data element.
column_major: Whether the format is column_major or row_major. Probably not relevant for 1D arrays.
size: The number of elements in the array.
Returns
A string containing a literal python dictionary.

Definition at line 48 of file numpy.cpp.

Referenced by make_npy_description(), dismec::io::model::save_dense_weights_npy(), anonymous_namespace{numpy.cpp}::save_matrix_to_npy_imp(), and TEST_CASE().

◆ parse_header()

io::MatrixHeader dismec::io::parse_header ( const std::string &  content)

Given a string containing a matrix header, parses it into rows and columns. The input string should contain exactly two positive integers, otherwise an exception will be thrown.

Definition at line 49 of file common.cpp.

References THROW_ERROR.

Referenced by anonymous_namespace{slice.cpp}::load_features(), anonymous_namespace{xmc.cpp}::parse_xmc_header(), read_binary_matrix_as_lol(), and TEST_CASE().
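
The "exactly two positive integers" contract can be sketched with a string stream. This is a minimal illustration, not the code from common.cpp: `MatrixHeaderSketch`, its field names, and `parse_header_sketch` are hypothetical, and the real function reports errors through THROW_ERROR rather than by throwing std::runtime_error directly.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for MatrixHeader; the real field names may differ.
struct MatrixHeaderSketch {
    long NumRows;
    long NumCols;
};

inline MatrixHeaderSketch parse_header_sketch(const std::string& content) {
    std::istringstream stream(content);
    long rows = 0;
    long cols = 0;
    if (!(stream >> rows >> cols)) {
        throw std::runtime_error("header must contain two integers");
    }
    if (rows <= 0 || cols <= 0) {
        throw std::runtime_error("header entries must be positive");
    }
    // Anything except trailing whitespace after the second number is an error.
    stream >> std::ws;
    if (!stream.eof()) {
        throw std::runtime_error("unexpected trailing content in header");
    }
    assert(rows > 0 && cols > 0);  // postcondition: both dimensions are positive
    return {rows, cols};
}
```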

◆ parse_long()

long dismec::io::parse_long ( const char *  string,
const char **  out 
)
inline

Parses an integer using std::strtol. In contrast to the std function, the output parameter is const here, and we enforce base 10.

Definition at line 34 of file common.h.

Referenced by anonymous_namespace{numpy.cpp}::parse_description(), anonymous_namespace{xmc.cpp}::parse_labels(), and parse_sparse_vector_from_text().
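
A wrapper with the described properties could look like the sketch below. The name `parse_long_sketch` is hypothetical and the body is an assumption based on the description above, not a copy of common.h.

```cpp
#include <cassert>
#include <cstdlib>

// std::strtol expects a `char**` end pointer even though the input is const.
// The wrapper hides this and fixes the base to 10.
inline long parse_long_sketch(const char* string, const char** out) {
    assert(string != nullptr && out != nullptr);
    char* end = nullptr;
    const long result = std::strtol(string, &end, 10);
    *out = end;  // points to the first character that was not consumed
    return result;
}
```

As with std::strtol, the caller can detect a failed parse by checking whether `*out` still equals `string` after the call.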

◆ parse_npy_header()

io::NpyHeaderData dismec::io::parse_npy_header ( std::streambuf &  source)

Parses the header of the npy file given by source.

After calling this function, the read pointer of source will be positioned such that subsequent reads access the data portion of the npy file.

Exceptions
std::runtime_error: If the magic bytes don't match, the version is unknown, or any other parsing error occurs.

Definition at line 280 of file numpy.cpp.

References anonymous_namespace{numpy.cpp}::MAGIC, anonymous_namespace{numpy.cpp}::MAGIC_SIZE, anonymous_namespace{numpy.cpp}::parse_description(), anonymous_namespace{numpy.cpp}::read_header_length(), and THROW_ERROR.

Referenced by dismec::io::model::load_dense_weights_npy(), anonymous_namespace{numpy.cpp}::load_matrix_from_npy_imp(), and TrainingProgram::make_config().

◆ parse_sparse_vector_from_text()

template<class F >
void dismec::io::parse_sparse_vector_from_text ( const char *  feature_part,
F &&  callback 
)

Parses sparse features given in index:value text format.

The callback is called with index and value of each feature. The features are expected to be integers immediately followed by a colon :, followed by a floating point number (see e.g. XMC data format).

Parameters
feature_part: Pointer to the part of the line where the features start, e.g. the return value of parse_labels. Has to be \0 terminated.
callback: A function that takes two parameters, the first of type long which is the feature index, and the second of type double which is the feature value.
Exceptions
If number parsing fails, or the format is not as expected.

Definition at line 52 of file common.h.

References parse_long(), dismec::io::detail::print_char(), and THROW_ERROR.

Referenced by dismec::io::model::load_sparse_weights_txt(), read_binary_matrix_as_lol(), anonymous_namespace{xmc.cpp}::read_into_buffers(), and TEST_CASE().
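
The callback-based parsing of index:value pairs can be sketched as below. This is an illustrative reimplementation under assumptions: `parse_sparse_vector_sketch` is a hypothetical name, it calls std::strtol and std::strtod directly, and it throws std::runtime_error where the library uses THROW_ERROR.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <utility>  // std::pair, useful for collecting the parsed entries
#include <vector>

template<class F>
void parse_sparse_vector_sketch(const char* feature_part, F&& callback) {
    assert(feature_part != nullptr);  // precondition: \0-terminated string
    const char* position = feature_part;
    while (true) {
        // Skip whitespace between entries and stop at the terminating \0.
        while (std::isspace(static_cast<unsigned char>(*position))) ++position;
        if (*position == '\0') return;

        char* end = nullptr;
        const long index = std::strtol(position, &end, 10);
        if (end == position) throw std::runtime_error("could not parse feature index");
        if (*end != ':') throw std::runtime_error("expected ':' directly after the index");

        position = end + 1;
        const double value = std::strtod(position, &end);
        if (end == position) throw std::runtime_error("could not parse feature value");
        position = end;

        callback(index, value);
    }
}
```

A typical callback appends each pair to a container, for example `[&](long i, double v) { entries.emplace_back(i, v); }` with `entries` being a `std::vector<std::pair<long, double>>`.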

◆ read_binary_matrix_as_lol()

io::LoLBinarySparse dismec::io::read_binary_matrix_as_lol ( std::istream &  source)

Reads a sparse binary matrix file in the format index:1.0 as a list-of-lists of the non-zero entries. The first line of the file should be the shape of the matrix.

Definition at line 76 of file common.cpp.

References parse_header(), parse_sparse_vector_from_text(), and THROW_ERROR.

Referenced by read_slice_dataset(), and TrainingProgram::run().
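
To make the file layout concrete, the sketch below reads a shape line followed by one line of index:1.0 entries per row. It is an assumption-laden illustration: `LoLSketch`, its field names, and `read_binary_lol_sketch` are hypothetical, and the checks on the entry value, the zero-based column range, and the row count are choices of this sketch that the library may handle differently.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for LoLBinarySparse.
struct LoLSketch {
    long NumRows = 0;
    long NumCols = 0;
    std::vector<std::vector<long>> NonZeros;  // column indices, one list per row
};

inline LoLSketch read_binary_lol_sketch(std::istream& source) {
    std::string line;
    if (!std::getline(source, line)) throw std::runtime_error("missing header line");
    std::istringstream header(line);
    LoLSketch result;
    if (!(header >> result.NumRows >> result.NumCols) ||
        result.NumRows <= 0 || result.NumCols <= 0) {
        throw std::runtime_error("invalid header");
    }

    while (std::getline(source, line)) {
        std::istringstream entries(line);
        std::vector<long> row;
        long index = 0;
        char colon = 0;
        double value = 0.0;
        while (entries >> index) {
            if (!(entries >> colon >> value) || colon != ':') {
                throw std::runtime_error("expected index:value");
            }
            if (value != 1.0) throw std::runtime_error("entries must have value 1");
            if (index < 0 || index >= result.NumCols) {
                throw std::runtime_error("column index out of range");
            }
            row.push_back(index);
        }
        // The inner loop must have stopped at the end of the line, not at garbage.
        if (!entries.eof()) throw std::runtime_error("malformed line");
        result.NonZeros.push_back(std::move(row));
    }
    if (static_cast<long>(result.NonZeros.size()) != result.NumRows) {
        throw std::runtime_error("number of rows does not match the header");
    }
    assert(static_cast<long>(result.NonZeros.size()) == result.NumRows);
    return result;
}
```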

◆ read_slice_dataset() [1/2]

dismec::MultiLabelData dismec::io::read_slice_dataset ( const std::filesystem::path &  features,
const std::filesystem::path &  labels 
)

Definition at line 52 of file slice.cpp.

References read_slice_dataset().

◆ read_slice_dataset() [2/2]

dismec::MultiLabelData dismec::io::read_slice_dataset ( std::istream &  features,
std::istream &  labels 
)

Reads a dataset given in slice format.

For a description of the data format, see Slice data format.

Parameters
features: An input stream from which the feature data is read.
labels: An input stream from which the labels will be read.
Returns
The parsed multi-label dataset.
Exceptions
std::runtime_error: If the parser encounters an error in the data format.

Definition at line 36 of file slice.cpp.

References anonymous_namespace{slice.cpp}::load_features(), read_binary_matrix_as_lol(), and THROW_ERROR.

Referenced by anonymous_namespace{py_data.cpp}::load_slice(), read_slice_dataset(), and TEST_CASE().

◆ read_vector_from_text()

std::istream & dismec::io::read_vector_from_text ( std::istream &  stream,
Eigen::Ref< DenseRealVector >  data 
)

Reads the given vector as space-separated human-readable numbers.

This function expects that data is already of the correct size, and tries to read as many items as this specifies.

Returns
For convenience, this function returns a reference to the stream.

Definition at line 37 of file common.cpp.

References THROW_ERROR.

Referenced by dismec::io::model::load_dense_weights_txt(), anonymous_namespace{slice.cpp}::load_features(), TrainingProgram::make_config(), and TEST_CASE().

◆ read_xmc_dataset() [1/2]

dismec::MultiLabelData dismec::io::read_xmc_dataset ( const std::filesystem::path &  source,
IndexMode  mode = IndexMode::ZERO_BASED 
)

Reads a dataset given in the extreme multilabel classification format.

For a description of the data format, see XMC data format.

Parameters
source: Path to the file which we want to load, or an input stream.
mode: Whether indices are assumed to start from 0 (the default) or 1.
Returns
The parsed multi-label dataset.
Exceptions
std::runtime_error: If the file cannot be opened, or if the parser encounters an error in the data format.

Definition at line 216 of file xmc.cpp.

Referenced by anonymous_namespace{py_data.cpp}::load_xmc(), main(), TrainingProgram::run(), and TEST_CASE().

◆ read_xmc_dataset() [2/2]

dismec::MultiLabelData dismec::io::read_xmc_dataset ( std::istream &  source,
std::string_view  name,
IndexMode  mode = IndexMode::ZERO_BASED 
)

Reads a dataset given in the extreme multilabel classification format.

For a description of the data format, see XMC data format.

Parameters
source: An input stream from which the data is read. Since the reader does two passes over the data, this needs to be resettable to the beginning.
name: What name to display in the logging status updates.
mode: Whether indices are assumed to start from 0 (the default) or 1.
Returns
The parsed multi-label dataset.
Exceptions
std::runtime_error: If the parser encounters an error in the data format.

Definition at line 225 of file xmc.cpp.

References anonymous_namespace{xmc.cpp}::count_features_per_example(), anonymous_namespace{xmc.cpp}::parse_xmc_header(), dismec::ssize(), and THROW_EXCEPTION.
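
The requirement that the stream be resettable follows from the two-pass pattern, which can be sketched generically as below. This is not the XMC reader: `two_pass_read_sketch` is a hypothetical function in which counting lines stands in for the per-example counting done in the first pass.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

inline std::vector<std::string> two_pass_read_sketch(std::istream& source) {
    const std::istream::pos_type start = source.tellg();

    // First pass: only count, so that storage can be allocated up front.
    std::size_t num_lines = 0;
    std::string line;
    while (std::getline(source, line)) ++num_lines;

    // Reaching the end sets eofbit and failbit; clear them before seeking back.
    // This is the step that fails on streams that cannot be rewound.
    source.clear();
    source.seekg(start);

    // Second pass: read the actual content into the pre-allocated storage.
    std::vector<std::string> lines;
    lines.reserve(num_lines);
    while (std::getline(source, line)) lines.push_back(line);

    assert(lines.size() == num_lines);  // both passes must see the same data
    return lines;
}
```

File streams and string streams support this rewind; a pipe or standard input generally does not, which is why such sources cannot be passed directly.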

◆ REGISTER_DTYPE() [1/6]

dismec::io::REGISTER_DTYPE ( double  )

◆ REGISTER_DTYPE() [2/6]

dismec::io::REGISTER_DTYPE ( float  )

◆ REGISTER_DTYPE() [3/6]

dismec::io::REGISTER_DTYPE ( std::int32_t  )

◆ REGISTER_DTYPE() [4/6]

dismec::io::REGISTER_DTYPE ( std::int64_t  )

◆ REGISTER_DTYPE() [5/6]

dismec::io::REGISTER_DTYPE ( std::uint32_t  )

◆ REGISTER_DTYPE() [6/6]

dismec::io::REGISTER_DTYPE ( std::uint64_t  )

◆ save_matrix_to_npy() [1/2]

void dismec::io::save_matrix_to_npy ( const std::string &  path,
const types::DenseRowMajor< real_t > &   
)

◆ save_matrix_to_npy() [2/2]

void dismec::io::save_matrix_to_npy ( std::ostream &  source,
const types::DenseRowMajor< real_t > &   
)

Saves a matrix to a numpy array.

Referenced by TEST_CASE().

◆ save_xmc_dataset() [1/2]

void dismec::io::save_xmc_dataset ( const std::filesystem::path &  target,
const MultiLabelData &  data,
int  precision = 4 
)

Definition at line 323 of file xmc.cpp.

References dismec::confusion_matrix_detail::precision(), and save_xmc_dataset().

◆ save_xmc_dataset() [2/2]

void dismec::io::save_xmc_dataset ( std::ostream &  target,
const MultiLabelData &  data 
)

Saves the given dataset in XMC format.

Parameters
data: The dataset to be saved. Only supports datasets with sparse features.
target: The output stream where we will put the data.
Todo:
insert proper checks that data is sparse

TODO handle this in a CSR format instead of LoL

Definition at line 294 of file xmc.cpp.

References dismec::DatasetBase::get_features(), dismec::MultiLabelData::get_label_instances(), dismec::DatasetBase::num_examples(), dismec::DatasetBase::num_features(), dismec::MultiLabelData::num_labels(), dismec::opaque_int_type< Tag, T >::to_index(), and anonymous_namespace{xmc.cpp}::write_label_list().

Referenced by main(), TrainingProgram::run(), anonymous_namespace{py_data.cpp}::save_xmc(), save_xmc_dataset(), and TEST_CASE().

◆ write_npy_header()

void dismec::io::write_npy_header ( std::streambuf &  target,
std::string_view  description 
)

Writes the header for a npy file.

This writes the npy header to target, with the data description already provided. Specifically, the function writes the magic bytes and version number, pads description to achieve 64 bit alignment of data, and writes the header length, description, and padding to target. The header is terminated with a newline character.

Parameters
target: The stream buffer to which the data is written. Note: If this is a file stream, it should be in binary mode!
description: The description of the data, as a string-formatted python dictionary.

Definition at line 32 of file numpy.cpp.

References anonymous_namespace{numpy.cpp}::MAGIC, anonymous_namespace{numpy.cpp}::NPY_PADDING, and THROW_ERROR.

Referenced by dismec::io::model::save_dense_weights_npy(), anonymous_namespace{numpy.cpp}::save_matrix_to_npy_imp(), and TEST_CASE().
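
The sequence of magic bytes, version, header length, description, padding, and newline can be sketched as follows. Several details are assumptions of this sketch rather than facts about numpy.cpp: it writes format version 1.0 with a two-byte little-endian header length, and it pads the complete header to a multiple of 64 bytes as in the NumPy file format, which may differ from the library's NPY_PADDING. The names `write_npy_header_sketch` and `ALIGNMENT_SKETCH` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <streambuf>
#include <string>
#include <string_view>

// Assumption: total header size is a multiple of 64 bytes.
constexpr std::size_t ALIGNMENT_SKETCH = 64;

inline void write_npy_header_sketch(std::streambuf& target, std::string_view description) {
    // magic (6 bytes) + version (2 bytes) + header length field (2 bytes)
    constexpr std::size_t prefix = 6 + 2 + 2;
    // +1 accounts for the terminating newline
    const std::size_t unpadded = prefix + description.size() + 1;
    const std::size_t padding =
        (ALIGNMENT_SKETCH - unpadded % ALIGNMENT_SKETCH) % ALIGNMENT_SKETCH;
    const std::size_t header_length = description.size() + padding + 1;
    if (header_length > 0xFFFF) {
        throw std::runtime_error("header too long for a two-byte length field");
    }

    const char magic_and_version[] = {'\x93', 'N', 'U', 'M', 'P', 'Y', '\x01', '\x00'};
    target.sputn(magic_and_version, sizeof(magic_and_version));
    // header length as little-endian 16 bit integer
    target.sputc(static_cast<char>(header_length & 0xFF));
    target.sputc(static_cast<char>((header_length >> 8) & 0xFF));
    target.sputn(description.data(), static_cast<std::streamsize>(description.size()));
    for (std::size_t i = 0; i < padding; ++i) target.sputc(' ');
    target.sputc('\n');

    assert((prefix + header_length) % ALIGNMENT_SKETCH == 0);  // data starts aligned
}
```

Writing into a `std::stringbuf` is a convenient way to inspect the result; for a file, the buffer of a stream opened with `std::ios::binary` would be passed instead.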

◆ write_vector_as_text()

std::ostream & dismec::io::write_vector_as_text ( std::ostream &  stream,
const Eigen::Ref< const DenseRealVector > &  data 
)

Writes the given vector as space-separated human-readable numbers.

This function does not check if the writing was successful.

Returns
For convenience, this function returns a reference to the stream.

Definition at line 21 of file common.cpp.

Referenced by dismec::io::model::save_dense_weights_txt(), and TEST_CASE().