DiSMEC++
|
Namespaces | |
detail | |
model | |
namespace for all model-related io functions. | |
prediction | |
Classes | |
struct | MatrixHeader |
Collects the rows and columns parsed from a plain-text matrix file. More... | |
struct | LoLBinarySparse |
Binary Sparse Matrix in List-of-Lists format. More... | |
struct | NpyHeaderData |
Contains the data of the header of a npy file with an array that has at most 2 dimensions. More... | |
Enumerations | |
enum class | IndexMode { ZERO_BASED , ONE_BASED } |
Enum to decide whether indices in an xmc file are starting from 0 or from 1. More... | |
Functions | |
long | parse_long (const char *string, const char **out) |
template<class F > | |
void | parse_sparse_vector_from_text (const char *feature_part, F &&callback) |
parses sparse features given in index:value text format. More... | |
std::ostream & | write_vector_as_text (std::ostream &stream, const Eigen::Ref< const DenseRealVector > &data) |
Writes the given vector as space-separated human-readable numbers. More... | |
std::istream & | read_vector_from_text (std::istream &stream, Eigen::Ref< DenseRealVector > data) |
Reads the given vector as space-separated human-readable numbers. More... | |
template<class T > | |
void | binary_dump (std::streambuf &target, const T *begin, const T *end) |
template<class T > | |
void | binary_load (std::streambuf &target, T *begin, T *end) |
MatrixHeader | parse_header (const std::string &content) |
LoLBinarySparse | read_binary_matrix_as_lol (std::istream &source) |
REGISTER_DTYPE (float, "<f4") | |
REGISTER_DTYPE (double, "<f8") | |
REGISTER_DTYPE (std::int32_t, "<i4") | |
REGISTER_DTYPE (std::int64_t, "<i8") | |
REGISTER_DTYPE (std::uint32_t, "<u4") | |
REGISTER_DTYPE (std::uint64_t, "<u8") | |
bool | is_npy (std::istream &target) |
Check whether the stream is a npy file. More... | |
void | write_npy_header (std::streambuf &target, std::string_view description) |
Writes the header for a npy file. More... | |
std::string | make_npy_description (std::string_view dtype_desc, bool column_major, std::size_t size) |
Creates a string with the data description dictionary for (1 dimensional) arrays. More... | |
std::string | make_npy_description (std::string_view dtype_desc, bool column_major, std::size_t rows, std::size_t cols) |
Creates a string with the data description dictionary for matrices. More... | |
NpyHeaderData | parse_npy_header (std::streambuf &source) |
Parses the header of the npy file given by source . More... | |
template<class S > | |
const char * | data_type_string () |
template<class Derived > | |
std::string | make_npy_description (const Eigen::DenseBase< Derived > &matrix) |
Generates the npy description string based on an Eigen matrix. More... | |
types::DenseRowMajor< real_t > | load_matrix_from_npy (std::istream &source) |
Loads a matrix from a numpy array. More... | |
types::DenseRowMajor< real_t > | load_matrix_from_npy (const std::string &path) |
void | save_matrix_to_npy (std::ostream &source, const types::DenseRowMajor< real_t > &) |
Saves a matrix to a numpy array. More... | |
void | save_matrix_to_npy (const std::string &path, const types::DenseRowMajor< real_t > &) |
MultiLabelData | read_slice_dataset (std::istream &features, std::istream &labels) |
reads a dataset given in slice format. More... | |
MultiLabelData | read_slice_dataset (const std::filesystem::path &features, const std::filesystem::path &labels) |
MultiLabelData | read_xmc_dataset (const std::filesystem::path &source, IndexMode mode=IndexMode::ZERO_BASED) |
Reads a dataset given in the extreme multilabel classification format. More... | |
MultiLabelData | read_xmc_dataset (std::istream &source, std::string_view name, IndexMode mode=IndexMode::ZERO_BASED) |
reads a dataset given in the extreme multilabel classification format. More... | |
void | save_xmc_dataset (std::ostream &target, const MultiLabelData &data) |
Saves the given dataset in XMC format. More... | |
void | save_xmc_dataset (const std::filesystem::path &target, const MultiLabelData &data, int precision=4) |
io namespace.
TODO: Convert this code to use the faster <charconv> methods once GCC implements them for floats.
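enum class dismec::io::IndexMode |
strong |
Enum to decide whether indices in an xmc file start from 0 or from 1.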
void dismec::io::binary_dump | ( | std::streambuf & | target, |
const T * | begin, | ||
const T * | end | ||
) |
Definition at line 110 of file common.h.
References THROW_ERROR.
Referenced by dismec::io::model::save_dense_weights_npy(), anonymous_namespace{numpy.cpp}::save_matrix_to_npy_imp(), and TEST_CASE().
void dismec::io::binary_load | ( | std::streambuf & | target, |
T * | begin, | ||
T * | end | ||
) |
Definition at line 120 of file common.h.
References THROW_ERROR.
Referenced by dismec::io::model::load_dense_weights_npy(), anonymous_namespace{numpy.cpp}::load_matrix_from_npy_imp(), TrainingProgram::make_config(), and TEST_CASE().
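The signatures of binary_dump and binary_load suggest a raw byte copy of the range [begin, end) to or from a stream buffer. The following is a minimal sketch of that idea, not the library's implementation: the names dump_raw and load_raw are hypothetical, and the use of std::runtime_error stands in for whatever THROW_ERROR raises.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <streambuf>
#include <type_traits>

// Hypothetical stand-in for binary_dump: writes the raw bytes of [begin, end).
template <class T>
void dump_raw(std::streambuf& target, const T* begin, const T* end) {
    static_assert(std::is_trivially_copyable_v<T>, "raw byte copies need trivially copyable data");
    assert(begin <= end);
    const std::streamsize bytes =
        static_cast<std::streamsize>(end - begin) * static_cast<std::streamsize>(sizeof(T));
    // sputn returns the number of characters actually written.
    if (target.sputn(reinterpret_cast<const char*>(begin), bytes) != bytes) {
        throw std::runtime_error("could not write all bytes");
    }
}

// Hypothetical stand-in for binary_load: fills [begin, end) with raw bytes.
template <class T>
void load_raw(std::streambuf& source, T* begin, T* end) {
    static_assert(std::is_trivially_copyable_v<T>, "raw byte copies need trivially copyable data");
    assert(begin <= end);
    const std::streamsize bytes =
        static_cast<std::streamsize>(end - begin) * static_cast<std::streamsize>(sizeof(T));
    // sgetn returns the number of characters actually read.
    if (source.sgetn(reinterpret_cast<char*>(begin), bytes) != bytes) {
        throw std::runtime_error("could not read all bytes");
    }
}
```

A round trip through a std::stringbuf restores the original values exactly, because no text conversion takes place. The bytes are written in host byte order, so such a dump is only portable between machines of the same endianness.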
const char* dismec::io::data_type_string | ( | ) |
Given data type S, this returns the string representation used by numpy. For common data types, these are instantiated in io/numpy.cpp.
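One way to map a C++ type to its numpy dtype string is a function template that is only declared, plus one explicit specialization per supported type. The sketch below illustrates this pattern under assumptions: dtype_string and SKETCH_REGISTER_DTYPE are hypothetical names, and the real REGISTER_DTYPE macro may be defined differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Declared but not defined: using an unregistered type fails at link time.
template <class S>
const char* dtype_string();

// Hypothetical macro that generates one explicit specialization per type.
#define SKETCH_REGISTER_DTYPE(TYPE, STRING) \
    template <>                             \
    inline const char* dtype_string<TYPE>() { return STRING; }

// "<" means little-endian, the letter is the kind, the digit the byte width.
SKETCH_REGISTER_DTYPE(float, "<f4")
SKETCH_REGISTER_DTYPE(double, "<f8")
SKETCH_REGISTER_DTYPE(std::int32_t, "<i4")
SKETCH_REGISTER_DTYPE(std::int64_t, "<i8")
SKETCH_REGISTER_DTYPE(std::uint32_t, "<u4")
SKETCH_REGISTER_DTYPE(std::uint64_t, "<u8")
```

The strings listed here are the ones shown for REGISTER_DTYPE in the function overview. They hard-code little-endian byte order, so the sketch assumes a little-endian host.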
bool dismec::io::is_npy | ( | std::istream & | target | ) |
Check whether the stream is a npy file.
This peeks at the next 6 bytes of target and checks whether they form the npy magic string. In any case, the read pointer is set back to the original position.
Definition at line 22 of file numpy.cpp.
References anonymous_namespace{numpy.cpp}::MAGIC, anonymous_namespace{numpy.cpp}::MAGIC_SIZE, and THROW_ERROR.
Referenced by anonymous_namespace{slice.cpp}::load_features(), and TrainingProgram::make_config().
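The peek-and-restore behaviour described above can be sketched as follows. The function name looks_like_npy is hypothetical, and the sketch reports a short stream as "not npy" instead of raising an error, which may differ from what is_npy does.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <sstream>

// Hypothetical stand-in for is_npy: checks for the 6-byte npy magic string
// "\x93NUMPY" and puts the read position back where it was.
bool looks_like_npy(std::istream& in) {
    static constexpr char magic[] = "\x93NUMPY";
    char buffer[6] = {};
    const std::istream::pos_type start = in.tellg();
    in.read(buffer, 6);
    const bool complete = in.gcount() == 6;
    in.clear();       // a short read sets eofbit and failbit
    in.seekg(start);  // restore the original read position
    return complete && std::memcmp(buffer, magic, 6) == 0;
}
```

Because the position is restored in every case, a caller can probe a stream and then hand it unchanged to either the npy reader or a text reader.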
Eigen::Matrix< real_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor > dismec::io::load_matrix_from_npy | ( | const std::string & | path | ) |
Definition at line 346 of file numpy.cpp.
References load_matrix_from_npy(), and THROW_ERROR.
Eigen::Matrix< real_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor > dismec::io::load_matrix_from_npy | ( | std::istream & | source | ) |
Loads a matrix from a numpy array.
Definition at line 342 of file numpy.cpp.
Referenced by dismec::init::create_numpy_initializer(), anonymous_namespace{slice.cpp}::load_features(), load_matrix_from_npy(), TrainingProgram::run(), and TEST_CASE().
std::string dismec::io::make_npy_description | ( | const Eigen::DenseBase< Derived > & | matrix | ) |
Generates the npy description string based on an Eigen matrix.
Derived | The derived type of the Eigen matrix |
matrix | Const reference to the eigen matrix. |
Definition at line 85 of file numpy.h.
References make_npy_description().
std::string dismec::io::make_npy_description | ( | std::string_view | dtype_desc, |
bool | column_major, | ||
std::size_t | rows, | ||
std::size_t | cols | ||
) |
Creates a string with the data description dictionary for matrices.
dtype_desc | Description string for the data element. |
column_major | Whether the format is column_major or row_major. |
rows | The number of rows in the matrix. |
cols | The number of columns in the matrix. |
std::string dismec::io::make_npy_description | ( | std::string_view | dtype_desc, |
bool | column_major, | ||
std::size_t | size | ||
) |
Creates a string with the data description dictionary for (1 dimensional) arrays.
dtype_desc | Description string for the data element. |
column_major | Whether the format is column_major or row_major. This does not affect the memory layout of a 1D array. |
size | The number of elements in the array. |
Definition at line 48 of file numpy.cpp.
Referenced by make_npy_description(), dismec::io::model::save_dense_weights_npy(), anonymous_namespace{numpy.cpp}::save_matrix_to_npy_imp(), and TEST_CASE().
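The data description in an npy file is a Python dictionary literal with the keys descr, fortran_order, and shape. The sketch below builds such a string for a matrix. The name describe_matrix is hypothetical, and the exact spacing DiSMEC++ emits is an assumption; the layout shown follows what NumPy itself writes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the matrix overload of make_npy_description.
std::string describe_matrix(std::string_view dtype_desc, bool column_major,
                            std::size_t rows, std::size_t cols) {
    std::string out = "{'descr': '";
    out += dtype_desc;
    out += "', 'fortran_order': ";
    out += column_major ? "True" : "False";  // Python boolean literals
    out += ", 'shape': (" + std::to_string(rows) + ", " + std::to_string(cols) + "), }";
    return out;
}
```

For a 1D array the shape would be a one-element Python tuple such as (5,), which needs the trailing comma to be parsed as a tuple.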
io::MatrixHeader dismec::io::parse_header | ( | const std::string & | content | ) |
Given a string containing a matrix header, parses it into rows and columns. The input string should contain exactly two positive integers, otherwise an exception will be thrown.
Definition at line 49 of file common.cpp.
References THROW_ERROR.
Referenced by anonymous_namespace{slice.cpp}::load_features(), anonymous_namespace{xmc.cpp}::parse_xmc_header(), read_binary_matrix_as_lol(), and TEST_CASE().
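A header check of this kind (exactly two positive integers, nothing else) can be sketched as below. HeaderSketch and parse_header_sketch are hypothetical names, and std::runtime_error is an assumption about the exception type.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical counterpart of MatrixHeader.
struct HeaderSketch {
    long rows;
    long cols;
};

// Parses "rows cols" and rejects missing, non-positive, or extra entries.
HeaderSketch parse_header_sketch(const std::string& content) {
    std::istringstream in(content);
    long rows = 0;
    long cols = 0;
    if (!(in >> rows >> cols)) {
        throw std::runtime_error("expected two integers in the header");
    }
    if (rows <= 0 || cols <= 0) {
        throw std::runtime_error("rows and columns must be positive");
    }
    std::string extra;
    if (in >> extra) {
        throw std::runtime_error("unexpected trailing content in the header");
    }
    return {rows, cols};
}
```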
long dismec::io::parse_long | ( | const char * | string, |
const char ** | out | ||
) |
inline |
Parses an integer using std::strtol. In contrast to the std function, the output parameter is const here, and we enforce base 10.
Definition at line 34 of file common.h.
Referenced by anonymous_namespace{numpy.cpp}::parse_description(), anonymous_namespace{xmc.cpp}::parse_labels(), and parse_sparse_vector_from_text().
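A wrapper with this contract can be sketched in a few lines. The name parse_long_sketch is hypothetical; the only standard call used is std::strtol, whose end pointer is a non-const char**.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for parse_long: base-10 strtol with a const end pointer.
inline long parse_long_sketch(const char* string, const char** out) {
    assert(string != nullptr && out != nullptr);
    char* end = nullptr;
    const long value = std::strtol(string, &end, 10);
    *out = end;  // points to the first character that was not consumed
    return value;
}
```

If no digits could be parsed, std::strtol returns 0 and sets the end pointer to the start of the input, so callers can detect failure by comparing *out with string.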
io::NpyHeaderData dismec::io::parse_npy_header | ( | std::streambuf & | source | ) |
Parses the header of the npy file given by source.
After calling this function, the read pointer of source will be positioned such that subsequent reads access the data portion of the npy file.
std::runtime_error | If the magic bytes don't match, the version is unknown, or any other parsing error occurs. |
Definition at line 280 of file numpy.cpp.
References anonymous_namespace{numpy.cpp}::MAGIC, anonymous_namespace{numpy.cpp}::MAGIC_SIZE, anonymous_namespace{numpy.cpp}::parse_description(), anonymous_namespace{numpy.cpp}::read_header_length(), and THROW_ERROR.
Referenced by dismec::io::model::load_dense_weights_npy(), anonymous_namespace{numpy.cpp}::load_matrix_from_npy_imp(), and TrainingProgram::make_config().
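The first steps of header parsing (magic bytes, version, header length) follow directly from the npy file format. The sketch below returns the raw description string and leaves the buffer positioned at the data. The name read_npy_description is hypothetical, and parsing the dictionary itself, which parse_npy_header also performs, is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <stdexcept>
#include <streambuf>
#include <string>

// Hypothetical partial stand-in for parse_npy_header.
std::string read_npy_description(std::streambuf& source) {
    char head[8];  // 6 magic bytes, then major and minor version
    if (source.sgetn(head, 8) != 8 || std::memcmp(head, "\x93NUMPY", 6) != 0) {
        throw std::runtime_error("magic bytes do not match");
    }
    const int major = static_cast<unsigned char>(head[6]);
    if (major < 1 || major > 3) {
        throw std::runtime_error("unknown npy version");
    }
    // Version 1 stores the header length in 2 bytes, versions 2 and 3 in 4,
    // little-endian in both cases.
    const int length_bytes = (major == 1) ? 2 : 4;
    std::size_t header_len = 0;
    for (int i = 0; i < length_bytes; ++i) {
        const auto byte = source.sbumpc();
        if (byte == std::char_traits<char>::eof()) {
            throw std::runtime_error("truncated header");
        }
        header_len |= static_cast<std::size_t>(static_cast<unsigned char>(byte)) << (8 * i);
    }
    std::string description(header_len, '\0');
    if (source.sgetn(description.data(), static_cast<std::streamsize>(header_len)) !=
        static_cast<std::streamsize>(header_len)) {
        throw std::runtime_error("truncated header");
    }
    return description;  // the next read from source yields array data
}
```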
void dismec::io::parse_sparse_vector_from_text | ( | const char * | feature_part, |
F && | callback | ||
) |
parses sparse features given in index:value text format.
The callback is called with the index and value of each feature. Each feature is expected to be an integer index immediately followed by a colon :, followed by a floating point number (see e.g. XMC data format).
feature_part | Pointer to the part of the line where the features start, e.g. the return value of parse_labels . Has to be \0 terminated. |
callback | A function that takes two parameters, the first of type long which is the feature index, and the second of type double which is the feature value. |
Throws | If number parsing fails, or the format is not as expected. |
Definition at line 52 of file common.h.
References parse_long(), dismec::io::detail::print_char(), and THROW_ERROR.
Referenced by dismec::io::model::load_sparse_weights_txt(), read_binary_matrix_as_lol(), anonymous_namespace{xmc.cpp}::read_into_buffers(), and TEST_CASE().
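The index:value loop with a callback can be sketched as follows. parse_pairs and collect_pairs are hypothetical names, the sketch uses std::strtol and std::strtod directly, and the exact whitespace rules of the real parser are an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for parse_sparse_vector_from_text.
template <class F>
void parse_pairs(const char* text, F&& callback) {
    const char* pos = text;
    while (true) {
        while (std::isspace(static_cast<unsigned char>(*pos))) ++pos;
        if (*pos == '\0') break;  // input has to be \0 terminated
        char* end = nullptr;
        const long index = std::strtol(pos, &end, 10);
        if (end == pos || *end != ':') {
            throw std::runtime_error("expected an integer index followed by ':'");
        }
        pos = end + 1;  // skip the colon
        const double value = std::strtod(pos, &end);
        if (end == pos) {
            throw std::runtime_error("expected a number after ':'");
        }
        callback(index, value);
        pos = end;
    }
}

// Usage: collect all features of one line into a vector.
std::vector<std::pair<long, double>> collect_pairs(const char* text) {
    std::vector<std::pair<long, double>> result;
    parse_pairs(text, [&](long index, double value) { result.emplace_back(index, value); });
    return result;
}
```

Passing a callback instead of returning a container lets the caller decide where the entries go, for example directly into a sparse matrix builder, without an intermediate allocation.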
io::LoLBinarySparse dismec::io::read_binary_matrix_as_lol | ( | std::istream & | source | ) |
Reads a sparse binary matrix file in the format index:1.0 as a list of lists of the non-zero entries. The first line of the file should be the shape of the matrix.
Definition at line 76 of file common.cpp.
References parse_header(), parse_sparse_vector_from_text(), and THROW_ERROR.
Referenced by read_slice_dataset(), and TrainingProgram::run().
dismec::MultiLabelData dismec::io::read_slice_dataset | ( | const std::filesystem::path & | features, |
const std::filesystem::path & | labels | ||
) |
Definition at line 52 of file slice.cpp.
References read_slice_dataset().
dismec::MultiLabelData dismec::io::read_slice_dataset | ( | std::istream & | features, |
std::istream & | labels | ||
) |
reads a dataset given in slice format.
For a description of the data format, see Slice data format
features | An input stream from which the feature data is read. |
labels | An input stream from which the labels will be read |
std::runtime_error | if the parser encounters an error in the data format. |
Definition at line 36 of file slice.cpp.
References anonymous_namespace{slice.cpp}::load_features(), read_binary_matrix_as_lol(), and THROW_ERROR.
Referenced by anonymous_namespace{py_data.cpp}::load_slice(), read_slice_dataset(), and TEST_CASE().
std::istream & dismec::io::read_vector_from_text | ( | std::istream & | stream, |
Eigen::Ref< DenseRealVector > | data | ||
) |
Reads the given vector as space-separated human-readable numbers.
This function expects that data is already of the correct size, and tries to read as many items as this specifies.
Definition at line 37 of file common.cpp.
References THROW_ERROR.
Referenced by dismec::io::model::load_dense_weights_txt(), anonymous_namespace{slice.cpp}::load_features(), TrainingProgram::make_config(), and TEST_CASE().
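The "read exactly as many items as the target holds" contract can be sketched with a std::vector<double> standing in for the Eigen vector. The name read_values is hypothetical, and throwing std::runtime_error on a short read is an assumption.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for read_vector_from_text; data must be pre-sized.
std::istream& read_values(std::istream& stream, std::vector<double>& data) {
    for (double& value : data) {
        if (!(stream >> value)) {
            throw std::runtime_error("could not read the expected number of values");
        }
    }
    return stream;
}
```

Since only data.size() numbers are consumed, anything that follows them stays in the stream for the next reader.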
dismec::MultiLabelData dismec::io::read_xmc_dataset | ( | const std::filesystem::path & | source, |
IndexMode | mode = IndexMode::ZERO_BASED |
||
) |
Reads a dataset given in the extreme multilabel classification format.
For a description of the data format, see XMC data format
source | Path to the file which we want to load, or an input stream. |
mode | Whether indices are assumed to start from 0 (the default) or 1. |
std::runtime_error | if the file cannot be opened, or if the parser encounters an error in the data format. |
Definition at line 216 of file xmc.cpp.
Referenced by anonymous_namespace{py_data.cpp}::load_xmc(), main(), TrainingProgram::run(), and TEST_CASE().
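The effect of the mode parameter can be illustrated with a small normalization helper. This is a sketch under assumptions: IndexModeSketch and to_zero_based are hypothetical, and the real reader may apply or validate the offset differently.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical mirror of IndexMode.
enum class IndexModeSketch { ZERO_BASED, ONE_BASED };

// Converts an index as written in the file to a zero-based index.
long to_zero_based(long raw_index, IndexModeSketch mode) {
    const long index = (mode == IndexModeSketch::ONE_BASED) ? raw_index - 1 : raw_index;
    if (index < 0) {
        throw std::runtime_error("index is below the first valid position");
    }
    return index;
}
```

Reading a one-based file in ZERO_BASED mode would shift every index up by one, so the mode has to match the convention used by the tool that wrote the file.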
dismec::MultiLabelData dismec::io::read_xmc_dataset | ( | std::istream & | source, |
std::string_view | name, | ||
IndexMode | mode = IndexMode::ZERO_BASED |
||
) |
reads a dataset given in the extreme multilabel classification format.
For a description of the data format, see XMC data format
source | An input stream from which the data is read. Since the reader does two passes over the data, this needs to be resettable to the beginning. |
name | What name to display in the logging status updates. |
mode | Whether indices are assumed to start from 0 (the default) or 1. |
std::runtime_error | if the parser encounters an error in the data format. |
Definition at line 225 of file xmc.cpp.
References anonymous_namespace{xmc.cpp}::count_features_per_example(), anonymous_namespace{xmc.cpp}::parse_xmc_header(), dismec::ssize(), and THROW_EXCEPTION.
dismec::io::REGISTER_DTYPE | ( | double | ) |
dismec::io::REGISTER_DTYPE | ( | float | ) |
dismec::io::REGISTER_DTYPE | ( | std::int32_t | ) |
dismec::io::REGISTER_DTYPE | ( | std::int64_t | ) |
dismec::io::REGISTER_DTYPE | ( | std::uint32_t | ) |
dismec::io::REGISTER_DTYPE | ( | std::uint64_t | ) |
void dismec::io::save_matrix_to_npy | ( | const std::string & | path, |
const types::DenseRowMajor< real_t > & | |||
) |
void dismec::io::save_matrix_to_npy | ( | std::ostream & | source, |
const types::DenseRowMajor< real_t > & | |||
) |
Saves a matrix to a numpy array.
Referenced by TEST_CASE().
void dismec::io::save_xmc_dataset | ( | const std::filesystem::path & | target, |
const MultiLabelData & | data, | ||
int | precision = 4 |
||
) |
Definition at line 323 of file xmc.cpp.
References dismec::confusion_matrix_detail::precision(), and save_xmc_dataset().
void dismec::io::save_xmc_dataset | ( | std::ostream & | target, |
const MultiLabelData & | data | ||
) |
Saves the given dataset in XMC format.
data | The dataset to be saved. Only supports datasets with sparse features. |
target | The output stream where we will put the data. |
TODO handle this in a CSR format instead of LoL
Definition at line 294 of file xmc.cpp.
References dismec::DatasetBase::get_features(), dismec::MultiLabelData::get_label_instances(), dismec::DatasetBase::num_examples(), dismec::DatasetBase::num_features(), dismec::MultiLabelData::num_labels(), dismec::opaque_int_type< Tag, T >::to_index(), and anonymous_namespace{xmc.cpp}::write_label_list().
Referenced by main(), TrainingProgram::run(), anonymous_namespace{py_data.cpp}::save_xmc(), save_xmc_dataset(), and TEST_CASE().
void dismec::io::write_npy_header | ( | std::streambuf & | target, |
std::string_view | description | ||
) |
Writes the header for a npy file.
This writes the npy header to target, with the data description already provided. This means that this function writes the magic bytes and version number, pads description to achieve 64 bit alignment of data, and writes the header length, description, and padding to target. The header is terminated with a newline character.
target | The stream buffer to which the data is written. Note: If this is a file stream, it should be in binary mode! |
description | The description of the data, as a string-formatted python dictionary. |
Definition at line 32 of file numpy.cpp.
References anonymous_namespace{numpy.cpp}::MAGIC, anonymous_namespace{numpy.cpp}::NPY_PADDING, and THROW_ERROR.
Referenced by dismec::io::model::save_dense_weights_npy(), anonymous_namespace{numpy.cpp}::save_matrix_to_npy_imp(), and TEST_CASE().
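The header layout described above can be sketched for npy format version 1.0. The name write_header_sketch is hypothetical. The sketch pads the total header to a multiple of 64 bytes, as current NumPy does; the padding unit DiSMEC++ uses (NPY_PADDING) is not shown on this page, so that value is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <streambuf>
#include <string>
#include <string_view>

// Hypothetical stand-in for write_npy_header (format version 1.0).
void write_header_sketch(std::streambuf& target, std::string_view description) {
    constexpr std::size_t prefix = 6 + 2 + 2;  // magic, version, length field
    constexpr std::size_t alignment = 64;      // assumed padding unit
    std::size_t header_len = description.size() + 1;  // +1 for the newline
    const std::size_t padding = (alignment - (prefix + header_len) % alignment) % alignment;
    header_len += padding;
    if (header_len > 0xFFFF) {
        throw std::runtime_error("description too long for npy version 1.0");
    }
    std::string out = "\x93NUMPY";
    out.push_back('\x01');  // major version
    out.push_back('\x00');  // minor version
    out.push_back(static_cast<char>(header_len & 0xFF));         // length, low byte
    out.push_back(static_cast<char>((header_len >> 8) & 0xFF));  // length, high byte
    out += description;
    out.append(padding, ' ');
    out.push_back('\n');
    const auto size = static_cast<std::streamsize>(out.size());
    if (target.sputn(out.data(), size) != size) {
        throw std::runtime_error("could not write the npy header");
    }
}
```

The array data can be written to the same stream buffer directly afterwards. As noted for target, a file stream has to be opened in binary mode, otherwise the newline or the length bytes may be translated on some platforms.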
std::ostream & dismec::io::write_vector_as_text | ( | std::ostream & | stream, |
const Eigen::Ref< const DenseRealVector > & | data | ||
) |
Writes the given vector as space-separated human-readable numbers.
This function does not check if the writing was successful.
Definition at line 21 of file common.cpp.
Referenced by dismec::io::model::save_dense_weights_txt(), and TEST_CASE().
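Because the function does not check whether writing succeeded, the caller has to inspect the stream state afterwards. The sketch below shows a space-separated writer with a std::vector<double> standing in for the Eigen vector. The name write_values is hypothetical, and the exact separator placement and number formatting of the real function are assumptions.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for write_vector_as_text; performs no error checks.
std::ostream& write_values(std::ostream& stream, const std::vector<double>& data) {
    bool first = true;
    for (const double value : data) {
        if (!first) stream << ' ';
        stream << value;  // uses the stream's current precision and flags
        first = false;
    }
    return stream;
}
```

A caller that needs to detect failures can test the returned stream, for example `if (!write_values(file, data)) { /* handle the error */ }`.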