DiSMEC++
Model data format

Models are saved in multiple files. One file contains metadata, whereas the weights are stored in separate files. We support multiple formats for storing the weights, but the metadata file always has the same structure.

Metadata File

The metadata is saved as JSON. It contains the following keys:

  • "num-features": Number of features, i.e. the size of a single weight vector
  • "num-labels": Number of labels, i.e. the number of weight vectors.
  • "date": Contains the data and time when the file was created.
  • "weights": Contains info on where the weights are stored. This is an array of dicts, where each entry corresponds to one weights file. Each weights file stores a contiguous subset (as seen over labels) of the weights. Each entry into the vector has the keys "first", which is the index of the first label in the file, "count" which is the number of weight vectors, "file" which is the file name relative to the metadata file and "weight-format", which specifies the format in which the weights are saved.

There are several advantages to allowing the weights to be distributed over multiple files. For one, it allows partial saves, e.g. if one wants to do checkpointing. While this could be achieved by simply appending for the human-readable text formats, it may require rewriting the entire file in other settings, e.g. when using compressed files. Secondly, in a distributed setting with a shared, networked file system, we can reduce the amount of data transfer. In distributed training, each worker can save its own weight files, and the main program only needs to be notified that it should update the metadata file. In distributed prediction, each worker only needs to load the weights for the labels it is responsible for, and does not have to parse the entire weight file only to discard most of the weights.

I am planning to add the following additional metadata:

  • "version": Specifies which version of the library was used when creating the file.
  • "training": A dict which contains information about the training process.
  • "custom": Guaranteed to never be written by the library, so it is safe for others to use.

Dense Text Format

Writes all the weights as space-separated numbers. Each line in the file corresponds to a single weight vector. Note that this means that the rows in the text file correspond to the columns in the weight matrix. This is to make it easier to read only a subset of weight vectors from the text file.

Writing is implemented in io::model::save_dense_weights_txt(), and reading in io::model::load_dense_weights_txt().

The advantage of this format is that it is human readable and very portable. However, it is inefficient both in terms of storage and in terms of read/write performance. This can be mitigated to some degree by adjusting the precision, i.e. the number of digits written.
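As a sketch of how such a file can be produced and parsed (this is illustrative Python, not the library's C++ implementation, and the function names are made up):

```python
def save_dense_txt(rows, precision=6):
    """Render weight vectors as lines of space-separated numbers.

    Each line of the output corresponds to one weight vector, i.e. one
    column of the weight matrix. `precision` is the number of significant
    digits written, which trades accuracy against file size.
    """
    return "\n".join(
        " ".join("%.*g" % (precision, w) for w in row) for row in rows
    )

def load_dense_txt(text):
    """Parse the dense text format back into a list of weight vectors."""
    return [
        [float(tok) for tok in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
```

Because every weight vector occupies exactly one line, a reader that only needs labels k..k+n can skip the first k lines without parsing them.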

Sparse Text Format

Writes all the weights exceeding a given threshold in a sparse format. Each row corresponds to one weight vector, and consists of index:value pairs separated by whitespace. Here, index is the 0-based position in the weight vector and value its corresponding value.

This format is human readable and portable, and may be much more space efficient than the dense text format. Storage requirements can be adjusted by setting the precision with which the nonzero weights are written, and by setting the threshold below which weights are culled.

Writing is implemented in io::model::save_as_sparse_weights_txt(), which can also save dense models by culling weights below a specified threshold. Reading is implemented through io::model::load_sparse_weights_txt().
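The row format can be sketched as follows (illustrative Python with made-up function names, not the library's C++ implementation):

```python
def format_sparse_row(weights, threshold=0.0, precision=6):
    """Render one weight vector as 0-based 'index:value' pairs.

    Weights whose magnitude does not exceed `threshold` are culled, so a
    dense vector can be saved in sparse form with an explicit cutoff.
    """
    return " ".join(
        "%d:%.*g" % (i, precision, w)
        for i, w in enumerate(weights)
        if abs(w) > threshold
    )

def parse_sparse_row(line, num_features):
    """Expand one sparse text line back into a dense weight vector."""
    dense = [0.0] * num_features
    for pair in line.split():
        index, value = pair.split(":")
        dense[int(index)] = float(value)
    return dense
```

Note that culled weights are read back as exact zeros, so the round trip is lossy by design.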

Dense Numpy Format

Writes the weights as a matrix to a .npy file. The data is written in row-major format to allow loading a subset of the labels by reading contiguous parts of the file. Since the output is binary, we operate directly on a stream-buffer here.

This format is more space efficient than the text format, and also has much lower computational overhead, since it does not require any number parsing or formatting. As a rough estimate, for eurlex (~20M weights), saving as (dense) text takes about 8.8 seconds; as npy it takes only about 200 ms. The file size decreases from 230 MB to 80 MB. Similar (though slightly smaller) speedups occur when loading the model.

Writing is implemented in io::model::save_dense_weights_npy(), and reading in io::model::load_dense_weights_npy().
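The .npy layout is simple enough to sketch by hand. The following illustrative Python (not the library's code; function names are made up) writes a version 1.0 .npy file and reads back a contiguous range of rows by seeking past the preceding ones, which is exactly what makes loading a label subset cheap in the row-major layout:

```python
import ast
import struct

def save_dense_npy(path, rows):
    """Write equal-length rows of float64 as a row-major v1.0 .npy file."""
    n, d = len(rows), len(rows[0])
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d, %d), }" % (n, d)
    # Pad with spaces so that magic + version + length field + header is a
    # multiple of 64 bytes, and terminate the header with a newline.
    pad = 64 - (10 + len(header) + 1) % 64
    header += " " * pad + "\n"
    with open(path, "wb") as f:
        f.write(b"\x93NUMPY" + bytes([1, 0]))       # magic string, version 1.0
        f.write(struct.pack("<H", len(header)))     # header length, little-endian uint16
        f.write(header.encode("ascii"))
        for row in rows:
            f.write(struct.pack("<%dd" % d, *row))  # row-major float64 payload

def load_rows_npy(path, first, count):
    """Read only rows [first, first + count) without parsing the rest."""
    with open(path, "rb") as f:
        assert f.read(6) == b"\x93NUMPY"
        f.read(2)                                   # version bytes, ignored here
        (hlen,) = struct.unpack("<H", f.read(2))
        meta = ast.literal_eval(f.read(hlen).decode("ascii"))
        n, d = meta["shape"]
        assert meta["descr"] == "<f8" and not meta["fortran_order"]
        f.seek(first * d * 8, 1)                    # skip the rows before `first`
        flat = struct.unpack("<%dd" % (count * d), f.read(count * d * 8))
        return [list(flat[i * d:(i + 1) * d]) for i in range(count)]
```

Reading a subset of labels thus costs a single seek plus one contiguous read, independent of how many weight vectors precede it.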