scikit-multilearn-ng
In this section you will learn the basic concepts behind multi-label classification.
TLDR: Multi-label data is data where each sample can belong to multiple classes at once. A movie, for example, can belong to several genres, such as "action" and "comedy"; a news article can cover several topics, such as "politics" and "sports".
scikit-multilearn-ng expects on input:

- X to be a matrix of shape (n_samples, n_features),
- y to be a matrix of shape (n_samples, n_labels).

Let's load up a data set to see this in practice:
from skmultilearn.dataset import load_dataset
X, y, _, _ = load_dataset('emotions', 'train')
emotions:train - exists, not redownloading
X, y
(<391x72 sparse matrix of type '<class 'numpy.float64'>' with 28059 stored elements in List of Lists format>, <391x6 sparse matrix of type '<class 'numpy.int64'>' with 709 stored elements in List of Lists format>)
We can see that in the case of the emotions data the values are n_samples=391, n_features=72, and n_labels=6.
By "matrix" scikit-multilearn-ng understands a data structure that follows the A[i,j] element-access scheme. Sparse matrices should be used instead of dense ones, especially for the output space: scikit-multilearn-ng internally converts dense representations to the sparse representations best suited to a given classification procedure, and it also outputs predictions as sparse matrices.

X can store any type of data a given classification method can handle, but nominal encoding is always helpful. Nominal encoding is enabled by default when loading data with the skmultilearn.dataset.Dataset.load_arff_to_numpy helper, which also returns sparse representations of X and y loaded from an ARFF data file.
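If you are loading your own ARFF files, the skmultilearn.dataset.load_from_arff helper can be used the same way; here is a minimal sketch, in which the file path and label count are placeholders for your own data:
from skmultilearn.dataset import load_from_arff
# 'path/to/data.arff' and label_count=6 are placeholders for your own data
X_arff, y_arff = load_from_arff(
    'path/to/data.arff',
    label_count=6,          # number of label columns in the file
    label_location='end',   # labels stored as the last attributes (Mulan convention)
    load_sparse=False       # set to True if the ARFF file uses sparse format
)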
y is expected to be a binary integer indicator matrix of shape (n_samples, n_labels). In the binary indicator matrix each element A[i,j] should be 1 if label j is assigned to the i-th object, and 0 otherwise.
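As a minimal illustration of the indicator format, here is a hand-built toy matrix for 3 samples and 4 labels, constructed with scipy in the same sparse format used above:
import numpy as np
from scipy.sparse import lil_matrix
# toy indicator matrix: 3 samples, 4 labels
y_toy = lil_matrix((3, 4), dtype=np.int64)
y_toy[0, 0] = 1  # label 0 assigned to sample 0
y_toy[0, 2] = 1  # label 2 also assigned to sample 0
y_toy[2, 3] = 1  # label 3 assigned to sample 2
y_toy.toarray()
# array([[1, 0, 1, 0],
#        [0, 0, 0, 0],
#        [0, 0, 0, 1]])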
We highly recommend storing every multi-label output space in sparse matrices; scikit-multilearn-ng classifiers operate internally only on sparse binary label indicator matrices, and this is also the format of predicted label assignments. Sparse representation is the default because it is very rare for a real-world output space y to be dense: usually, the number of labels assigned per instance is just a small portion of all labels. The average percentage of labels assigned per object is called label density, and in established data sets it tends to be small (see http://mulan.sourceforge.net/datasets-mlc.html).
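We can check this on the emotions training data loaded above; the sketch below computes label cardinality (average number of labels per instance) and label density:
# label cardinality: average number of labels assigned per instance
cardinality = y.sum() / y.shape[0]
# label density: cardinality normalised by the number of labels
density = cardinality / y.shape[1]
cardinality, density
# ≈ (1.81, 0.30): 709 assignments over 391 samples and 6 labels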
Multi-label classification is a type of classification where the goal is to assign multiple labels to each sample: multiple genres to a movie, multiple topics to a news article. One family of approaches transforms the multi-label problem into one or more single-label problems; this category is called "problem transformation" and includes methods such as Binary Relevance (one classifier per label), Classifier Chains, and Label Powerset. Another family, "algorithm adaptation", modifies a single-label algorithm to predict multiple labels at once; examples include MLkNN, MLARAM, and MLTSVM.
Here is a basic train/test split example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
X_train.shape, X_test.shape
((195, 72), (196, 72))
In the case of multi-label classification, we need to split the data into training and testing sets. For many reasons, traditional single-label approaches to stratifying data fail to provide balanced data set divisions, which prevents classifiers from generalizing; one should therefore use a multi-label stratification approach.
We will use the skmultilearn.model_selection.iterative_stratification module to split the data into training and testing sets.
from skmultilearn.model_selection import iterative_train_test_split
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.5)
X_train.shape, X_test.shape
((191, 72), (200, 72))
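To see the effect of stratification, we can compare how often each label occurs in the two halves; with an iterative stratified split the per-label counts should be roughly proportional:
# per-label positive counts in each half of the stratified split
y_train.sum(axis=0), y_test.sum(axis=0)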
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
# initialize Binary Relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(
    classifier=GaussianNB(),
    require_dense=[True, True]
)
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
predictions
<200x6 sparse matrix of type '<class 'numpy.int64'>' with 487 stored elements in Compressed Sparse Column format>
Now let's evaluate the performance of the Binary Relevance method on the emotions data set. Note that for multi-label data accuracy_score computes subset accuracy: a sample counts as correct only if its entire label set is predicted exactly, which is why this score is typically much lower than the per-label metrics.
from sklearn.metrics import accuracy_score, f1_score, recall_score
f1_score(y_test, predictions, average='weighted'), accuracy_score(y_test, predictions), recall_score(y_test, predictions, average='weighted')
(0.6470091169076729, 0.195, 0.7605633802816901)
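Another commonly reported multi-label metric is the Hamming loss, the fraction of individual label assignments that are wrong; it can be computed directly with scikit-learn:
from sklearn.metrics import hamming_loss
# fraction of label assignments that differ between truth and prediction
hamming_loss(y_test, predictions)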
Let's see how the MLkNN method performs on the emotions data set.
from skmultilearn.adapt import MLkNN
# initialize MLkNN multi-label classifier, which is a multi-label adaptation of the k-nearest neighbour algorithm,
# with k=20, which is the number of neighbours of each input instance to take into account
classifier = MLkNN(k=20)
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
f1_score(y_test, predictions, average='weighted'), accuracy_score(y_test, predictions), recall_score(y_test, predictions, average='weighted')
(0.4039511287404211, 0.115, 0.3492957746478873)
Let's see how to find the best parameters for the MLkNN method on the emotions data set using GridSearchCV. Note that this can be computationally expensive.
from sklearn.model_selection import GridSearchCV
parameters = {'k': range(1,20)}
score = 'f1_weighted'
classifier = GridSearchCV(MLkNN(), parameters, scoring=score)
classifier.fit(X, y)
classifier.best_params_, classifier.best_score_
({'k': 5}, 0.5018561128889709)
Here we can see that k=5 is the best parameter for the MLkNN method on the emotions dataset, with a weighted F1 score of 0.501. We can conclude that Binary Relevance performs better than MLkNN (based on the F1 score) in this case.
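MLkNN also exposes a smoothing parameter s that can be tuned in the same grid; the value range below is an illustrative choice, not a recommendation, and the search becomes correspondingly more expensive:
parameters = {'k': range(1, 20), 's': [0.5, 0.7, 1.0]}
classifier = GridSearchCV(MLkNN(), parameters, scoring='f1_weighted')
classifier.fit(X, y)
classifier.best_params_, classifier.best_score_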
from skmultilearn.problem_transform import ClassifierChain
# initialize Classifier Chain multi-label classifier
# with a gaussian naive bayes base classifier
classifier = ClassifierChain(
    classifier=GaussianNB(),
    require_dense=[True, True]
)
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
f1_score(y_test, predictions, average='weighted'), accuracy_score(y_test, predictions), recall_score(y_test, predictions, average='weighted')
(0.6484772423333367, 0.24, 0.7352112676056338)
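For completeness, the third problem transformation method mentioned earlier, Label Powerset, follows the same fit/predict pattern; here is a minimal sketch with the same base classifier:
from skmultilearn.problem_transform import LabelPowerset
# Label Powerset treats every distinct label combination as one class
classifier = LabelPowerset(
    classifier=GaussianNB(),
    require_dense=[True, True]
)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
f1_score(y_test, predictions, average='weighted'), accuracy_score(y_test, predictions), recall_score(y_test, predictions, average='weighted')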