skmultilearn.cluster package

The skmultilearn.cluster module gathers label space clustering methods.

Name: Description

FixedLabelSpaceClusterer: Return a predefined fixed clustering, usually driven by expert knowledge

MatrixLabelSpaceClusterer: Cluster the label space using a scikit-compatible matrix-based clusterer

GraphToolLabelGraphClusterer: Fits a Stochastic Block Model to the Label Graph and infers the communities

StochasticBlockModel: A Stochastic Blockmodel class

IGraphLabelGraphClusterer: Clusters label space using igraph community detection

RandomLabelSpaceClusterer: Randomly divides label space into equally-sized clusters

NetworkXLabelGraphClusterer: Cluster label space with NetworkX community detection

class skmultilearn.cluster.FixedLabelSpaceClusterer(clusters=None)

Bases: LabelSpaceClustererBase

Return a fixed label space partition

This clusterer takes a predefined fixed clustering of the label space and returns it in fit_predict as the label space division. This is useful for employing expert knowledge about label space division or partitions in ensemble classifiers such as: LabelSpacePartitioningClassifier or MajorityVotingClassifier.

Parameters

clusters: array of arrays of int

the provided partition of the label space, in the form of a numpy array of numpy arrays of label indexes, one inner array per partition, e.g. [[0,1],[2,3]]

An example use of the fixed clusterer with a label partitioning classifier to train random forests for a set of subproblems defined by expert knowledge:

from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import FixedLabelSpaceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier

classifier = LabelSpacePartitioningClassifier(
    classifier = LabelPowerset(
        classifier=RandomForestClassifier(n_estimators=100),
        require_dense = [False, True]
    ),
    require_dense = [True, True],
    clusterer = FixedLabelSpaceClusterer(clusters=[[1,2,3], [0,4]])
)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

fit_predict(X, y)

Returns the provided label space division

Parameters

X: None

currently unused, left for scikit compatibility

y: scipy.sparse

label space of shape (n_samples, n_labels)

Returns

array of arrays of label indexes (numpy.ndarray)

label space division, each sublist represents labels that are in that community
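
When a partition is written by hand it is easy to omit or repeat a label. The following sketch, which is not part of scikit-multilearn, checks that a hand-written division covers every label index exactly once before it is passed as clusters. The helper name check_partition is hypothetical, and the check assumes a non-overlapping partition is wanted.

```python
def check_partition(clusters, n_labels):
    """Return True if every label index 0..n_labels-1 appears in exactly one cluster."""
    seen = [label for cluster in clusters for label in cluster]
    return sorted(seen) == list(range(n_labels))


# the division used in the example above: 5 labels in two groups
expert_clusters = [[1, 2, 3], [0, 4]]
is_valid = check_partition(expert_clusters, n_labels=5)

# a division that forgets label 4 and repeats label 1
broken_clusters = [[1, 2, 3], [0, 1]]
is_broken_valid = check_partition(broken_clusters, n_labels=5)
```

Running such a check before training avoids silently dropping a label from every subproblem.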

class skmultilearn.cluster.IGraphLabelGraphClusterer(graph_builder, method)

Bases: LabelGraphClustererBase

Clusters the label space using igraph community detection methods

This clusterer constructs an igraph representation of the Label Graph generated by the graph builder and detects communities in it using community detection methods from the igraph library. Detected communities are converted to a label space clustering. The approach is described in the paper on data-driven label space division cited in the References below.

Parameters

graph_builder: a GraphBuilderBase inherited transformer

the graph builder to provide the adjacency matrix and weight map for the underlying graph

method: string

the community detection method to use. This clusterer supports the following community detection methods:

Method name string: Description

fastgreedy: Detecting communities with largest modularity using incremental greedy search

infomap: Detecting communities through compression of information flow simulated via random walks

label_propagation: Detecting communities from colorings via multiple label propagation on the graph

leading_eigenvector: Detecting communities with largest modularity through adjacency matrix eigenvectors

multilevel: Recursive community detection with step-by-step modularity maximization

walktrap: Finding communities by trapping many random walks

Attributes

graph_: igraph.Graph

the igraph Graph object containing the graph representation of graph builder’s adjacency matrix and weights

weights_: { 'weight': list of values in edge order of graph edges }

edge weights stored in a format recognizable by the igraph module

Note

This clusterer is GPL-licenced and will taint your code with GPL restrictions.

References

If you use this clusterer please cite the igraph paper and the clustering paper:

@Article{igraph,
    title = {The igraph software package for complex network research},
    author = {Gabor Csardi and Tamas Nepusz},
    journal = {InterJournal},
    volume = {Complex Systems},
    pages = {1695},
    year = {2006},
    url = {http://igraph.org},
}

@Article{datadriven,
    author = {Szymański, Piotr and Kajdanowicz, Tomasz and Kersting, Kristian},
    title = {How Is a Data-Driven Approach Better than Random Choice in
    Label Space Division for Multi-Label Classification?},
    journal = {Entropy},
    volume = {18},
    year = {2016},
    number = {8},
    article_number = {282},
    url = {http://www.mdpi.com/1099-4300/18/8/282},
    issn = {1099-4300},
    doi = {10.3390/e18080282}
}

Examples

Example code for using this clusterer with a classifier looks like this:

from sklearn.ensemble import RandomForestClassifier
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.cluster import IGraphLabelGraphClusterer, LabelCooccurrenceGraphBuilder
from skmultilearn.ensemble import LabelSpacePartitioningClassifier

# construct base forest classifier
base_classifier = RandomForestClassifier(n_estimators=1000)

# construct a graph builder that will include
# label relations weighted by how many times they
# co-occurred in the data, without self-edges
graph_builder = LabelCooccurrenceGraphBuilder(
    weighted = True,
    include_self_edges = False
)

# setup problem transformation approach with sparse matrices for random forest
problem_transform_classifier = LabelPowerset(classifier=base_classifier,
    require_dense=[False, False])

# setup the clusterer to use, we selected the fast greedy modularity-maximization approach
clusterer = IGraphLabelGraphClusterer(graph_builder=graph_builder, method='fastgreedy')

# setup the ensemble metaclassifier
classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

For more use cases see the label relations exploration guide.

fit_predict(X, y)

Performs clustering on y and returns list of label lists

Builds a label graph using the provided graph builder’s transform method on y and then detects communities using the selected method.

Sets self.weights_ and self.graph_.

Parameters

X: None

currently unused, left for scikit compatibility

y: scipy.sparse

label space of shape (n_samples, n_labels)

Returns

array of arrays of label indexes (numpy.ndarray)

label space division, each sublist represents labels that are in that community
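
The last step of fit_predict, turning detected communities into a label space division, can be illustrated without igraph. The sketch below assumes the detection result is available as a membership list holding one community id per label. The function name membership_to_division is hypothetical and this is not the library's own conversion code.

```python
def membership_to_division(membership):
    """Group label indexes by the community id assigned to each label."""
    communities = {}
    for label_index, community_id in enumerate(membership):
        communities.setdefault(community_id, []).append(label_index)
    # order communities by their id so the output is deterministic
    return [communities[cid] for cid in sorted(communities)]


# hypothetical detection result for 5 labels: labels 0, 1 and 4 share community 0
membership = [0, 0, 1, 1, 0]
division = membership_to_division(membership)
# division == [[0, 1, 4], [2, 3]]
```

Each inner list then plays the role of one sublist in the returned label space division.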

class skmultilearn.cluster.LabelCooccurrenceGraphBuilder(weighted=None, include_self_edges=None, normalize_self_edges=None)

Bases: GraphBuilderBase

Graph builder that constructs a Label Graph from label co-occurrence in the output matrix.

This graph builder constructs a Label Graph based on the output matrix where two label nodes are connected when at least one sample is labeled with both of them. If the graph is weighted, the weight of an edge between two label nodes is the number of samples labeled with these two labels. Self-edge weights contain the number of samples with a given label.

Parameters

weighted: bool

decide whether to generate a weighted or unweighted graph.

include_self_edges: bool

decide whether to include self-edge i.e. label 1 - label 1 in co-occurrence graph

normalize_self_edges: bool

if including self edges, divide the (i, i) edge by 2.0, requires include_self_edges=True

References

If you use this graph builder please cite the clustering paper:

@Article{datadriven,
    author = {Szymański, Piotr and Kajdanowicz, Tomasz and Kersting, Kristian},
    title = {How Is a Data-Driven Approach Better than Random Choice in
    Label Space Division for Multi-Label Classification?},
    journal = {Entropy},
    volume = {18},
    year = {2016},
    number = {8},
    article_number = {282},
    url = {http://www.mdpi.com/1099-4300/18/8/282},
    issn = {1099-4300},
    doi = {10.3390/e18080282}
}

Examples

A full example of building a modularity-based label space division based on the Label Co-occurrence Graph and classifying with a separate classifier chain per subspace.

from skmultilearn.cluster import LabelCooccurrenceGraphBuilder, NetworkXLabelGraphClusterer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import GaussianNB

graph_builder = LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False, normalize_self_edges=False)
clusterer = NetworkXLabelGraphClusterer(graph_builder, method='louvain')
classifier = LabelSpacePartitioningClassifier(
    classifier = ClassifierChain(classifier=GaussianNB()),
    clusterer = clusterer
)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)

For more use cases see the label relations exploration guide.

transform(y)

Generate adjacency matrix from label matrix

This function generates a weighted or unweighted co-occurrence Label Graph adjacency matrix in dictionary of keys format based on input binary label vectors.

Parameters

y: numpy.ndarray or scipy.sparse

dense or sparse binary matrix with shape (n_samples, n_labels)

Returns

Dict[(int, int), float]

weight map with tuples of label indexes as keys and, as values, the number of samples in which the two labels co-occurred
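
To make the shape of the weight map concrete, here is a minimal pure-Python sketch of co-occurrence counting as described above. It is not the library's implementation: the function name cooccurrence_weight_map is hypothetical, the input is simplified to lists of assigned label indexes per sample, ordering each key with the smaller index first is an assumption, and normalize_self_edges is left out.

```python
from itertools import combinations


def cooccurrence_weight_map(label_rows, weighted=True, include_self_edges=False):
    """Build {(i, j): weight} from rows listing the labels assigned to each sample."""
    edge_map = {}
    for row in label_rows:
        labels = sorted(set(row))
        # every unordered pair of labels assigned to the same sample
        pairs = list(combinations(labels, 2))
        if include_self_edges:
            pairs += [(label, label) for label in labels]
        for pair in pairs:
            if pair not in edge_map:
                edge_map[pair] = 1.0
            elif weighted:
                # weighted graphs count every co-occurrence
                edge_map[pair] += 1.0
    return edge_map


# three samples, given as the label indexes set to 1 in each row of y
label_rows = [[0, 1], [0, 1, 2], [2]]
weights = cooccurrence_weight_map(label_rows)
# weights == {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}
```

With weighted=False every existing edge keeps weight 1.0, which matches the unweighted graph described in the class documentation.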

class skmultilearn.cluster.MatrixLabelSpaceClusterer(clusterer=None, pass_input_space=False)

Bases: LabelSpaceClustererBase

Cluster the label space using a scikit-compatible matrix-based clusterer

Parameters

clusterer: sklearn.base.ClusterMixin

a clonable instance of a scikit-compatible clusterer, will be automatically put under self.clusterer.

pass_input_space: bool (default is False)

whether to take X into consideration upon clustering, use only if you know that the clusterer can handle two parameters for clustering, will be automatically put under self.pass_input_space.

Example code for using this clusterer looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.cluster import MatrixLabelSpaceClusterer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier

# construct base forest classifier
base_classifier = RandomForestClassifier(n_estimators=1030)

# setup problem transformation approach with sparse matrices for random forest
problem_transform_classifier = LabelPowerset(classifier=base_classifier,
    require_dense=[False, False])

# setup the clusterer
clusterer = MatrixLabelSpaceClusterer(clusterer=KMeans(n_clusters=3))

# setup the ensemble metaclassifier
classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

fit_predict(X, y)

Clusters the output space

The clusterer’s fit_predict method is executed on either X and y.T vectors (if self.pass_input_space is true) or just y.T to detect clusters of labels.

The transposition of label space is used to align with the format expected by scikit-learn classifiers, i.e. we cluster labels with label assignment vectors as samples.

Returns

array of arrays of label indexes (numpy.ndarray)

label space division, each sublist represents labels that are in that community
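
The role of the transposition can be shown on a small dense example. The sketch below is an illustration only: it uses plain lists instead of sparse matrices, and group_identical_rows is a hypothetical stand-in for a real scikit-compatible clusterer such as KMeans.

```python
def transpose(matrix):
    """Turn an (n_samples, n_labels) matrix into (n_labels, n_samples)."""
    return [list(column) for column in zip(*matrix)]


def group_identical_rows(rows):
    """Stand-in clusterer: rows with identical content share a cluster id."""
    ids = {}
    return [ids.setdefault(tuple(row), len(ids)) for row in rows]


# 4 samples, 3 labels; labels 0 and 2 are always assigned together
y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
label_vectors = transpose(y)  # one row per label, one column per sample
cluster_ids = group_identical_rows(label_vectors)

# convert cluster ids back into a label space division
division = [
    [label for label, cid in enumerate(cluster_ids) if cid == wanted]
    for wanted in sorted(set(cluster_ids))
]
# division == [[0, 2], [1]]
```

After transposing, each label is a "sample" described by its assignment vector, which is why a clusterer designed for samples ends up grouping labels.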

class skmultilearn.cluster.RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap)

Bases: LabelSpaceClustererBase

Randomly divides the label space into equally-sized clusters

This method divides the label space by drawing without replacement a desired number of equally sized subsets of label space, in a partitioning or overlapping scheme.

Parameters

cluster_size: int

desired size of a single cluster, will be automatically put under self.cluster_size.

cluster_count: int

number of clusters to divide into, will be automatically put under self.cluster_count.

allow_overlap: bool

whether to allow overlapping clusters or not, will be automatically put under self.allow_overlap.

Examples

The following code performs random label space partitioning.

from skmultilearn.cluster import RandomLabelSpaceClusterer

# assume X,y contain the data, example y contains 5 labels
cluster_count = 2
cluster_size = y.shape[1]//cluster_count # == 2
clr = RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap=False)
clr.fit_predict(X,y)
# Result:
# array([list([0, 4]), list([2, 3]), list([1])], dtype=object)

Note that leftover labels that do not fit into the cluster_size x cluster_count clusters are appended to an additional last cluster of size at most cluster_size - 1.
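
The partitioning scheme, including the extra cluster for leftover labels, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the function name random_partition and its seed parameter are hypothetical, and it assumes n_labels >= cluster_size * cluster_count.

```python
import random


def random_partition(n_labels, cluster_size, cluster_count, seed=None):
    """Draw cluster_count disjoint clusters of cluster_size labels; leftovers form a final cluster."""
    rng = random.Random(seed)
    free_labels = list(range(n_labels))
    # shuffling once and slicing is equivalent to drawing without replacement
    rng.shuffle(free_labels)
    clusters = [
        free_labels[i * cluster_size:(i + 1) * cluster_size]
        for i in range(cluster_count)
    ]
    leftovers = free_labels[cluster_size * cluster_count:]
    if leftovers:
        clusters.append(leftovers)
    return clusters


# 5 labels, two clusters of size 2, one leftover label in a third cluster
clusters = random_partition(n_labels=5, cluster_size=2, cluster_count=2, seed=0)
sizes = [len(c) for c in clusters]
# sizes == [2, 2, 1]
```

Which labels land in which cluster depends on the seed; only the cluster sizes and the fact that every label appears exactly once are fixed.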

You can also use this class to get a random division of the label space, even with multiple overlaps:

from skmultilearn.cluster import RandomLabelSpaceClusterer

cluster_size = 3
cluster_count = 5
clr = RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap=True)
clr.fit_predict(X,y)

# Result
# array([[2, 1, 3],
#        [3, 0, 4],
#        [2, 3, 1],
#        [2, 3, 4],
#        [3, 4, 0],
#        [3, 0, 2]])

Note that you will never get the same label subset twice.

fit_predict(X, y)

Cluster the output space

Parameters

X: None

currently unused, left for scikit compatibility

y: scipy.sparse

label space of shape (n_samples, n_labels)

Returns

array of arrays of label indexes (numpy.ndarray)

label space division, each sublist represents labels that are in that community


Cite us

If you use scikit-multilearn-ng in your research and publish it, please consider citing scikit-multilearn:

@ARTICLE{2017arXiv170201460S,
    author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
    title = "{A scikit-based Python environment for performing multi-label classification}",
    journal = {ArXiv e-prints},
    archivePrefix = "arXiv",
    eprint = {1702.01460},
    primaryClass = "cs.LG",
    keywords = {Computer Science - Learning, Computer Science - Mathematical Software},
    year = 2017,
    month = feb,
}