skmultilearn package

Multi-label classification module for Python

Scikit-multilearn-ng is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.

Submodules

skmultilearn.dataset module

skmultilearn.dataset.available_data_sets()

Lists available data sets and their variants

Returns

dict[(set_name, variant_name)] -> [md5, file_name]

available data sets and their variants, keyed by (set_name, variant_name); each value holds the md5 hash and the file name on the server
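
To illustrate the return structure described above, the following sketch filters a catalog of that shape by set name. The entries in `sample_catalog`, including the file names and md5 placeholders, are made up for the example and do not reflect the real server listing; `variants_of` is a hypothetical helper, not part of skmultilearn.

```python
def variants_of(catalog, set_name):
    """Return the sorted variant names available for one data set."""
    return sorted(variant for (name, variant) in catalog if name == set_name)


# Hypothetical catalog shaped like the dict returned by available_data_sets():
# (set_name, variant_name) -> [md5, file_name]. All values are placeholders.
sample_catalog = {
    ("emotions", "train"): ["<md5-placeholder>", "emotions-train.bz2"],
    ("emotions", "test"): ["<md5-placeholder>", "emotions-test.bz2"],
    ("scene", "undivided"): ["<md5-placeholder>", "scene-undivided.bz2"],
}

emotions_variants = variants_of(sample_catalog, "emotions")
print(emotions_variants)
```

With the library installed, the same helper should work on the dict returned by available_data_sets(), since only the key structure is used.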

skmultilearn.dataset.clear_data_home(data_home=None)

Delete all the content of the data home cache.

Parameters

data_home: str (default is None)

the path to the directory in which scikit-multilearn data sets should be stored.

skmultilearn.dataset.download_dataset(set_name, variant, data_home=None)

Downloads a data set

Parameters

set_name: str

name of set from available_data_sets()

variant: str

variant of the data set from available_data_sets()

data_home: str (default is None)

custom base folder for data; if None, the default is used

Returns

str

path to the downloaded data set file on disk

skmultilearn.dataset.get_data_home(data_home=None, subdirectory='')

Return the path of the scikit-multilearn data dir.

This folder is used by some large dataset loaders to avoid downloading the data several times.

By default the data_home is set to a folder named 'scikit_ml_learn_data' in the user home folder.

Alternatively, it can be set by the 'SCIKIT_ML_LEARN_DATA' environment variable or programmatically by giving an explicit folder path. The '~' symbol is expanded to the user home folder.

If the folder does not already exist, it is automatically created.

Parameters

data_home: str (default is None)

the path to the directory in which scikit-multilearn data sets should be stored; if None, the path is resolved as described above

subdirectory: str (default is '')

subdirectory to append to the returned path, under data_home if it is passed or under the default location otherwise

Returns

str

the path to the data home
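
The lookup order described above can be sketched with the standard library alone. This is an illustration of the documented behaviour, not the library's own code; `resolve_data_home` is a hypothetical name, and the usage writes to a temporary directory so that the real home folder is left untouched.

```python
import os
import tempfile


def resolve_data_home(data_home=None, subdirectory=""):
    """Sketch of the documented lookup order (not the library's code)."""
    if data_home is None:
        # Environment variable first, then a folder in the user home.
        data_home = os.environ.get(
            "SCIKIT_ML_LEARN_DATA", os.path.join("~", "scikit_ml_learn_data")
        )
    # Expand '~' to the user home folder.
    path = os.path.expanduser(data_home)
    if subdirectory:
        path = os.path.join(path, subdirectory)
    # Create the folder if it does not exist yet.
    os.makedirs(path, exist_ok=True)
    return path


# Usage: an explicit base folder under a temporary directory.
base = tempfile.mkdtemp()
cache_dir = resolve_data_home(os.path.join(base, "cache"), subdirectory="arff")
print(cache_dir)
```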

skmultilearn.dataset.load_dataset(set_name, variant, data_home=None)

Loads a selected variant of the given data set

Parameters

set_name: str

name of set from available_data_sets()

variant: str

variant of the data set

data_home: str (default is None)

custom base folder for data; if None, the default is used

Returns

dict

the loaded multilabel data set variant in the scikit-multilearn format, see data_sets

skmultilearn.dataset.load_dataset_dump(filename)

Loads a compressed data set dump

Parameters

filename: str

path to the dump file; if it does not end with .bz2, the .bz2 extension is appended.

Returns

X: array_like, numpy.matrix or scipy.sparse matrix, shape=(n_samples, n_features)

input feature matrix

y: array_like, numpy.matrix or scipy.sparse matrix of {0, 1}, shape=(n_samples, n_labels)

binary indicator matrix with label assignments

names of attributes: List[str]

list of attribute names for X columns

names of labels: List[str]

list of label names for y columns
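
The file name rule above (append .bz2 when it is missing) can be tried out with a small round trip. The sketch assumes, purely for illustration, that a dump is a pickled payload inside a bz2 file; the payload keys and the helpers `with_bz2_suffix`, `dump_compressed` and `load_compressed` are hypothetical, and the real on-disk layout used by skmultilearn may differ.

```python
import bz2
import os
import pickle
import tempfile


def with_bz2_suffix(filename):
    """Append '.bz2' unless the name already ends with it."""
    return filename if filename.endswith(".bz2") else filename + ".bz2"


def dump_compressed(payload, filename):
    """Pickle `payload` into a bz2-compressed file and return the final path."""
    path = with_bz2_suffix(filename)
    with bz2.open(path, "wb") as handle:
        pickle.dump(payload, handle)
    return path


def load_compressed(filename):
    """Read back a payload written by dump_compressed."""
    with bz2.open(with_bz2_suffix(filename), "rb") as handle:
        return pickle.load(handle)


# Toy payload with assumed keys; not the library's actual dump schema.
payload = {
    "X": [[0.1, 0.9]],
    "y": [[1, 0, 1]],
    "feature_names": ["f0", "f1"],
    "label_names": ["a", "b", "c"],
}
target = os.path.join(tempfile.mkdtemp(), "toy_dump")  # no extension given
written = dump_compressed(payload, target)
restored = load_compressed(target)
```

Passing the name with or without the extension resolves to the same file, which is the convenience the parameter description refers to.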

skmultilearn.dataset.load_from_arff(filename, label_count, label_location='end', input_feature_type='float', encode_nominal=True, load_sparse=False, return_attribute_definitions=False)

Loads an ARFF file into sparse feature and label matrices

Parameters

filename: str

path to ARFF file

label_count: integer

number of labels in the ARFF file

label_location: str {“start”, “end”} (default is “end”)

whether the ARFF file contains labels at the beginning of the attributes list (“start”, MEKA format) or at the end (“end”, MULAN format)

input_feature_type: numpy.type as string (default is “float”)

the desired type of the contents of the returned ‘X’ array-likes, default ‘float’; should be a numpy type, see http://docs.scipy.org/doc/numpy/user/basics.types.html

encode_nominal: bool (default is True)

whether to convert categorical data into numeric factors; required for some scikit classifiers that can’t handle non-numeric input features.

load_sparse: boolean (default is False)

whether to read the ARFF file as a sparse file format; liac-arff breaks if sparse reading is enabled for non-sparse ARFFs.

return_attribute_definitions: boolean (default is False)

whether to return the definitions for each attribute in the dataset

Returns

X: scipy.sparse.lil_matrix of input_feature_type, shape=(n_samples, n_features)

input feature matrix

y: scipy.sparse.lil_matrix of {0, 1}, shape=(n_samples, n_labels)

binary indicator matrix with label assignments

names of attributes: List[str]

list of attribute names from ARFF file
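
The effect of label_count and label_location can be sketched on a single data row. The helper below is hypothetical and works on a plain Python list rather than a parsed ARFF file; it only shows which columns end up as features and which as labels.

```python
def split_features_labels(row, label_count, label_location="end"):
    """Split one data row into (features, labels)."""
    if label_location == "start":
        # MEKA layout: the first label_count columns are labels.
        return row[label_count:], row[:label_count]
    if label_location == "end":
        # MULAN layout: the last label_count columns are labels.
        split = len(row) - label_count
        return row[:split], row[split:]
    raise ValueError("label_location must be 'start' or 'end'")


mulan_row = [0.5, 0.7, 0.2, 1, 0]  # three features, then two labels
meka_row = [1, 0, 0.5, 0.7, 0.2]   # two labels, then three features

mulan_split = split_features_labels(mulan_row, 2, "end")
meka_split = split_features_labels(meka_row, 2, "start")
```

Both rows split into the same feature and label lists, which is why the wrong label_location silently mixes features and labels instead of raising an error.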

skmultilearn.dataset.save_dataset_dump(input_space, labels, feature_names, label_names, filename=None)

Saves a compressed data set dump

Parameters

input_space: array-like of array-likes

Input space array-like of input feature vectors

labels: array-like of binary label vectors

Array-like of labels assigned to each input vector, as a binary indicator vector (i.e. if 5th position has value 1 then the input vector has label no. 5)

feature_names: array-like, optional

names of features

label_names: array-like, optional

names of labels

filename: str, optional

path to the dump file; if it does not end with .bz2, the .bz2 extension is appended.
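
The binary indicator encoding expected for labels can be built from per-sample label sets as follows. This is a minimal sketch; `to_indicator_rows` and the label names are hypothetical and only illustrate the layout, with one row per sample and one column per label.

```python
def to_indicator_rows(label_sets, label_names):
    """Encode each sample's label set as a 0/1 vector over label_names."""
    position = {name: index for index, name in enumerate(label_names)}
    rows = []
    for labels in label_sets:
        row = [0] * len(label_names)
        for label in labels:
            row[position[label]] = 1  # 1 marks an assigned label
        rows.append(row)
    return rows


label_names = ["happy", "sad", "calm"]
label_sets = [{"happy", "calm"}, set(), {"sad"}]
indicator_rows = to_indicator_rows(label_sets, label_names)
```

Here the first sample carries the labels in columns 0 and 2, the second carries none, and the third carries only the label in column 1.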

skmultilearn.dataset.save_to_arff(X, y, label_location='end', save_sparse=True, filename=None)

Method for dumping data to ARFF files

Parameters

X: array_like, numpy.matrix or scipy.sparse matrix, shape=(n_samples, n_features)

input feature matrix

y: array_like, numpy.matrix or scipy.sparse matrix of {0, 1}, shape=(n_samples, n_labels)

binary indicator matrix with label assignments

label_location: string {“start”, “end”} (default is “end”)

whether the ARFF file will contain labels at the beginning of the attributes list (“start”, MEKA format) or at the end (“end”, MULAN format)

save_sparse: boolean (default is True)

whether to save in ARFF’s sparse dictionary-like format instead of listing all zeroes in the file; useful in multi-label classification, where label matrices are mostly zeroes.

filename: str or None

path to the ARFF file; if None, the ARFF representation is returned as a string

Returns

str or None

the ARFF dump string, if filename is None
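
To show what the sparse format saves, the sketch below renders one dense row in ARFF's sparse notation, where a row is written as brace-enclosed "index value" pairs with 0-based attribute indices and zero entries omitted. `to_sparse_arff_row` is a hypothetical helper written for this illustration; it is not the serializer that save_to_arff uses.

```python
def to_sparse_arff_row(values):
    """Render a dense row in ARFF sparse notation: {index value, ...}."""
    pairs = [
        f"{index} {value}"
        for index, value in enumerate(values)
        if value != 0  # zero entries are simply left out
    ]
    return "{" + ", ".join(pairs) + "}"


dense_row = [0, 0, 3, 0, 1, 0, 0, 1]
sparse_row = to_sparse_arff_row(dense_row)
print(sparse_row)
```

A row that is entirely zero collapses to an empty pair of braces, so the saving grows with the number of unassigned labels.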

skmultilearn.utils module

skmultilearn.utils.get_matrix_in_format(original_matrix, matrix_format)

Converts matrix to format

Parameters

original_matrix: np.matrix or scipy matrix or np.array of np.arrays

matrix to convert

matrix_format: string

name of the target scipy sparse format, such as 'csr' or 'lil'

Returns

matrix: scipy matrix

matrix in given format

skmultilearn.utils.matrix_creation_function_for_format(sparse_format)

skmultilearn.utils.measure_per_label(measure, y_true, y_predicted)

Return per label results of a scikit-learn compatible quality measure

Parameters

measure: callable

scikit-compatible quality measure function

y_true: sparse matrix

ground truth

y_predicted: sparse matrix

the predicted result

Returns

List[int or float]

score from a given measure depending on what the measure returns
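
The idea of a per-label evaluation can be sketched without any dependencies: the measure is applied to each label column separately and the results are collected in label order. The sketch uses nested lists instead of sparse matrices, and both `per_label_scores` and `accuracy` are hypothetical stand-ins rather than the library's implementation.

```python
def per_label_scores(measure, y_true, y_predicted):
    """Apply `measure` to each label column of two equally shaped row lists."""
    true_columns = zip(*y_true)            # transpose rows into columns
    predicted_columns = zip(*y_predicted)
    return [
        measure(list(true_col), list(pred_col))
        for true_col, pred_col in zip(true_columns, predicted_columns)
    ]


def accuracy(true_column, predicted_column):
    """Fraction of positions where the two columns agree."""
    hits = sum(1 for t, p in zip(true_column, predicted_column) if t == p)
    return hits / len(true_column)


y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]]
scores = per_label_scores(accuracy, y_true, y_pred)
print(scores)
```

Per-label scores make it visible when an acceptable aggregate result hides one label that is predicted poorly.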


Cite us

If you use scikit-multilearn-ng in your research and publish it, please consider citing scikit-multilearn:

@ARTICLE{2017arXiv170201460S,
    author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
    title = "{A scikit-based Python environment for performing multi-label classification}",
    journal = {ArXiv e-prints},
    archivePrefix = "arXiv",
    eprint = {1702.01460},
    primaryClass = "cs.LG",
    keywords = {Computer Science - Learning, Computer Science - Mathematical Software},
    year = 2017,
    month = feb,
}