skmultilearn.cluster package¶

The skmultilearn.cluster module gathers label space clustering methods.

Name	Description
`FixedLabelSpaceClusterer`	Return a predefined fixed clustering, usually driven by expert knowledge
`MatrixLabelSpaceClusterer`	Cluster the label space using a scikit-compatible matrix-based clusterer
`GraphToolLabelGraphClusterer`	Fits a Stochastic Block Model to the Label Graph and infers the communities
`StochasticBlockModel`	A Stochastic Blockmodel class
`IGraphLabelGraphClusterer`	Clusters label space using igraph community detection
`RandomLabelSpaceClusterer`	Randomly divides label space into equally-sized clusters
`NetworkXLabelGraphClusterer`	Cluster label space with NetworkX community detection

class skmultilearn.cluster.FixedLabelSpaceClusterer(clusters=None)¶

Bases: LabelSpaceClustererBase

Return a fixed label space partition

This clusterer takes a predefined fixed clustering of the label space and returns it in fit_predict as the label space division. This is useful for employing expert knowledge about label space division or partitions in ensemble classifiers such as: LabelSpacePartitioningClassifier or MajorityVotingClassifier.

Parameters¶

clusters: array of arrays of int: provided partition of the label space in the form of a numpy array of numpy arrays of label indexes, one per partition, e.g. [[0,1],[2,3]]

An example use of the fixed clusterer with a label partitioning classifier to train random forests for a set of subproblems defined by expert knowledge:

from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import FixedLabelSpaceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier

classifier = LabelSpacePartitioningClassifier(
    classifier = LabelPowerset(
        classifier=RandomForestClassifier(n_estimators=100),
        require_dense = [False, True]
    ),
    require_dense = [True, True],
    clusterer = FixedLabelSpaceClusterer(clusters=[[1,2,3], [0,4]])
)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

fit_predict(X, y)¶

Returns the provided label space division

Parameters¶

X: None: currently unused, left for scikit compatibility
y: scipy.sparse: label space of shape (n_samples, n_labels)

Returns¶

array of arrays of label indexes (numpy.ndarray): label space division, each sublist represents labels that are in that community

class skmultilearn.cluster.IGraphLabelGraphClusterer(graph_builder, method)¶

Bases: LabelGraphClustererBase

Clusters the label space using igraph community detection methods

This clusterer constructs an igraph representation of the Label Graph generated by the graph builder and detects communities in it using community detection methods from the igraph library. Detected communities are converted to a label space clustering. The approach is described in the paper on data-driven label space division cited under References.

Parameters¶

graph_builder: a GraphBuilderBase inherited transformer

the graph builder to provide the adjacency matrix and weight map for the underlying graph

method: string

the community detection method to use, this clusterer supports the following community detection methods:

Method name string	Description
fastgreedy	Detecting communities with largest modularity using incremental greedy search
infomap	Detecting communities through information flow compressing simulated via random walks
label_propagation	Detecting communities from colorings via multiple label propagation on the graph
leading_eigenvector	Detecting communities with largest modularity through adjacency matrix eigenvectors
multilevel	Recursive community detection with largest modularity step by step maximization
walktrap	Finding communities by trapping many random walks

Attributes¶

graph_: igraph.Graph: the igraph Graph object containing the graph representation of the graph builder’s adjacency matrix and weights
weights_: { 'weight': list of values in edge order of graph edges }: edge weights stored in a format recognizable by the igraph module

Note

This clusterer is GPL-licensed and will taint your code with GPL restrictions.

References¶

If you use this clusterer please cite the igraph paper and the clustering paper:

@Article{igraph,
    title = {The igraph software package for complex network research},
    author = {Gabor Csardi and Tamas Nepusz},
    journal = {InterJournal},
    volume = {Complex Systems},
    pages = {1695},
    year = {2006},
    url = {http://igraph.org},
}

@Article{datadriven,
    author = {Szymański, Piotr and Kajdanowicz, Tomasz and Kersting, Kristian},
    title = {How Is a Data-Driven Approach Better than Random Choice in
    Label Space Division for Multi-Label Classification?},
    journal = {Entropy},
    volume = {18},
    year = {2016},
    number = {8},
    article_number = {282},
    url = {http://www.mdpi.com/1099-4300/18/8/282},
    issn = {1099-4300},
    doi = {10.3390/e18080282}
}

Examples¶

Example code for using this clusterer with a classifier looks like this:

from sklearn.ensemble import RandomForestClassifier
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.cluster import IGraphLabelGraphClusterer, LabelCooccurrenceGraphBuilder
from skmultilearn.ensemble import LabelSpacePartitioningClassifier

# construct base forest classifier
base_classifier = RandomForestClassifier(n_estimators=1000)

# construct a graph builder that will include
# label relations weighted by how many times they
# co-occurred in the data, without self-edges
graph_builder = LabelCooccurrenceGraphBuilder(
    weighted = True,
    include_self_edges = False
)

# setup problem transformation approach with sparse matrices for random forest
problem_transform_classifier = LabelPowerset(classifier=base_classifier,
    require_dense=[False, False])

# setup the clusterer to use, we selected the fast greedy modularity-maximization approach
clusterer = IGraphLabelGraphClusterer(graph_builder=graph_builder, method='fastgreedy')

# setup the ensemble metaclassifier
classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

For more use cases see the label relations exploration guide.

fit_predict(X, y)¶

Performs clustering on y and returns list of label lists

Builds a label graph using the provided graph builder’s transform method on y and then detects communities using the selected method.

Sets self.weights_ and self.graph_.

Parameters¶

X: None: currently unused, left for scikit compatibility
y: scipy.sparse: label space of shape (n_samples, n_labels)

Returns¶

array of arrays of label indexes (numpy.ndarray): label space division, each sublist represents labels that are in that community
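The weights_ attribute keeps edge weights in the same order as the graph's edges. The following is a minimal plain-Python sketch of that pairing, assuming the graph builder's weight map is a dict keyed by label-index pairs. The helper name edge_map_to_edge_list is hypothetical, and sorting the edges is a choice made here for reproducibility; this is not the library's implementation.

```python
def edge_map_to_edge_list(edge_map):
    """Split a {(i, j): weight} map into an edge list and aligned weights."""
    # Fix one edge order and reuse it for the weights, so that position k
    # in both lists refers to the same edge.
    edges = sorted(edge_map)
    weights = {'weight': [edge_map[edge] for edge in edges]}
    return edges, weights


# Hypothetical weight map for a label space with three labels.
edge_map = {(1, 2): 1.0, (0, 1): 2.0, (0, 2): 1.0}
edges, weights = edge_map_to_edge_list(edge_map)
print(edges)
print(weights)
```

Keeping the two lists aligned is what lets a graph library attach each weight to the right edge when the graph is built.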

class skmultilearn.cluster.LabelCooccurrenceGraphBuilder(weighted=None, include_self_edges=None, normalize_self_edges=None)¶

Bases: GraphBuilderBase

Base class providing API and common functions for all label co-occurrence based multi-label classifiers.

This graph builder constructs a Label Graph based on the output matrix where two label nodes are connected when at least one sample is labeled with both of them. If the graph is weighted, the weight of an edge between two label nodes is the number of samples labeled with these two labels. Self-edge weights contain the number of samples with a given label.

Parameters¶

weighted: bool: decide whether to generate a weighted or unweighted graph.
include_self_edges: bool: decide whether to include self-edges, i.e. label 1 - label 1, in the co-occurrence graph
normalize_self_edges: bool: if including self-edges, divide the (i, i) edge by 2.0; requires include_self_edges=True

References¶

If you use this graph builder please cite the clustering paper:

@Article{datadriven,
    author = {Szymański, Piotr and Kajdanowicz, Tomasz and Kersting, Kristian},
    title = {How Is a Data-Driven Approach Better than Random Choice in
    Label Space Division for Multi-Label Classification?},
    journal = {Entropy},
    volume = {18},
    year = {2016},
    number = {8},
    article_number = {282},
    url = {http://www.mdpi.com/1099-4300/18/8/282},
    issn = {1099-4300},
    doi = {10.3390/e18080282}
}

Examples¶

A full example of building a modularity-based label space division based on the Label Co-occurrence Graph and classifying with a separate classifier chain per subspace.

from skmultilearn.cluster import LabelCooccurrenceGraphBuilder, NetworkXLabelGraphClusterer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import GaussianNB

graph_builder = LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False, normalize_self_edges=False)
clusterer = NetworkXLabelGraphClusterer(graph_builder, method='louvain')
classifier = LabelSpacePartitioningClassifier(
    classifier = ClassifierChain(classifier=GaussianNB()),
    clusterer = clusterer
)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)

For more use cases see the label relations exploration guide.

transform(y)¶

Generate adjacency matrix from label matrix

This function generates a weighted or unweighted co-occurrence Label Graph adjacency matrix in dictionary of keys format based on input binary label vectors.

Parameters¶

y: numpy.ndarray or scipy.sparse: dense or sparse binary matrix with shape (n_samples, n_labels)

Returns¶

Dict[(int, int), float]: weight map with a tuple of label indexes as keys and the number of samples in which the two labels co-occurred as values
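To make the weight map format concrete, here is a minimal plain-Python sketch of co-occurrence counting over a dense binary label matrix given as nested lists. The function name cooccurrence_weight_map is hypothetical, and storing each unordered pair once with the smaller index first is an assumption of this sketch rather than a documented guarantee of the library.

```python
from itertools import combinations


def cooccurrence_weight_map(y, weighted=True, include_self_edges=False,
                            normalize_self_edges=False):
    """Build {(i, j): weight} from rows of 0/1 label indicators."""
    edge_map = {}
    for row in y:
        # Indexes of the labels assigned to this sample, in ascending order.
        labels = [j for j, value in enumerate(row) if value]
        pairs = list(combinations(labels, 2))
        if include_self_edges:
            pairs += [(j, j) for j in labels]
        for pair in pairs:
            if weighted:
                # Count one more sample in which the pair co-occurs.
                edge_map[pair] = edge_map.get(pair, 0.0) + 1.0
            else:
                # Unweighted graph: only record that the edge exists.
                edge_map[pair] = 1.0
    if include_self_edges and normalize_self_edges:
        for a, b in list(edge_map):
            if a == b:
                edge_map[(a, b)] /= 2.0
    return edge_map


# Three samples, three labels.
y = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
weights = cooccurrence_weight_map(y)
print(weights)
```

Labels 0 and 1 appear together in two samples, so their edge gets weight 2.0, while the other two pairs co-occur once each.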

class skmultilearn.cluster.MatrixLabelSpaceClusterer(clusterer=None, pass_input_space=False)¶

Bases: LabelSpaceClustererBase

Cluster the label space using a scikit-compatible matrix-based clusterer

Parameters¶

clusterer: sklearn.base.ClusterMixin: a clonable instance of a scikit-compatible clusterer, will be automatically put under self.clusterer.
pass_input_space: bool (default is False): whether to take X into consideration upon clustering, use only if you know that the clusterer can handle two parameters for clustering, will be automatically put under self.pass_input_space.

Example code for using this clusterer looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.cluster import MatrixLabelSpaceClusterer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier

# construct base forest classifier
base_classifier = RandomForestClassifier(n_estimators=1030)

# setup problem transformation approach with sparse matrices for random forest
problem_transform_classifier = LabelPowerset(classifier=base_classifier,
    require_dense=[False, False])

# setup the clusterer
clusterer = MatrixLabelSpaceClusterer(clusterer=KMeans(n_clusters=3))

# setup the ensemble metaclassifier
classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

fit_predict(X, y)¶

Clusters the output space

The clusterer’s fit_predict method is executed on either X and y.T vectors (if self.pass_input_space is true) or just y.T to detect clusters of labels.

The transposition of label space is used to align with the format expected by scikit-learn classifiers, i.e. we cluster labels with label assignment vectors as samples.

Returns¶

array of arrays of label indexes (numpy.ndarray): label space division, each sublist represents labels that are in that community
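The transposition step and the conversion of cluster assignments into a label space division can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, and the assignments list stands in for whatever a scikit-compatible clusterer's fit_predict would return for the transposed matrix.

```python
def transpose(matrix):
    """Turn an (n_samples, n_labels) matrix into (n_labels, n_samples)."""
    return [list(column) for column in zip(*matrix)]


def group_labels_by_cluster(assignments):
    """Convert one cluster id per label into a list of label-index lists."""
    clusters = {}
    for label_index, cluster_id in enumerate(assignments):
        clusters.setdefault(cluster_id, []).append(label_index)
    return [clusters[cluster_id] for cluster_id in sorted(clusters)]


# Four samples, three labels.
y = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
]

# After transposition each row is one label's assignment vector,
# so the labels play the role of samples for the clusterer.
label_vectors = transpose(y)

# Hypothetical clusterer output: one cluster id per label.
assignments = [0, 1, 0]
division = group_labels_by_cluster(assignments)
print(label_vectors)
print(division)
```

Here labels 0 and 2 share a cluster id and end up in the same sublist, while label 1 forms its own cluster.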

class skmultilearn.cluster.RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap)¶

Bases: LabelSpaceClustererBase

Randomly divides the label space into equally-sized clusters

This method divides the label space by drawing without replacement a desired number of equally sized subsets of label space, in a partitioning or overlapping scheme.

Parameters¶

cluster_size: int: desired size of a single cluster, will be automatically put under self.cluster_size.
cluster_count: int: number of clusters to divide into, will be automatically put under self.cluster_count.
allow_overlap: bool: whether to allow overlapping clusters or not, will be automatically put under self.allow_overlap.

Examples¶

The following code performs random label space partitioning.

from skmultilearn.cluster import RandomLabelSpaceClusterer

# assume X,y contain the data, example y contains 5 labels
cluster_count = 2
cluster_size = y.shape[1]//cluster_count # == 2
clr = RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap=False)
clr.fit_predict(X,y)
# Result:
# array([list([0, 4]), list([2, 3]), list([1])], dtype=object)

Note that the leftover labels that did not fit in cluster_size x cluster_count classifiers will be appended to an additional last cluster of size at most cluster_size - 1.
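The non-overlapping scheme with a leftover cluster can be sketched in plain Python as follows. The function random_partition and its seed parameter are hypothetical and only illustrate the idea; the library's own sampling procedure may differ.

```python
import random


def random_partition(label_count, cluster_size, cluster_count, seed=None):
    """Shuffle label indexes and cut them into equally sized clusters."""
    rng = random.Random(seed)
    labels = list(range(label_count))
    rng.shuffle(labels)

    # Take cluster_count consecutive slices of cluster_size labels each.
    clusters = [
        labels[i * cluster_size:(i + 1) * cluster_size]
        for i in range(cluster_count)
    ]

    # Labels that did not fit go into one additional, smaller cluster.
    leftover = labels[cluster_size * cluster_count:]
    if leftover:
        clusters.append(leftover)
    return clusters


# Five labels split into two clusters of size two, plus one leftover label.
partition = random_partition(label_count=5, cluster_size=2, cluster_count=2, seed=0)
print(partition)
```

Because the labels are shuffled once and then sliced, every label lands in exactly one cluster, which is what makes the result a partition.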

You can also use this class to get a random division of the label space, even with multiple overlaps:

from skmultilearn.cluster import RandomLabelSpaceClusterer

cluster_size = 3
cluster_count = 5
clr = RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap=True)
clr.fit_predict(X,y)

# Result
# array([[2, 1, 3],
#        [3, 0, 4],
#        [2, 3, 1],
#        [2, 3, 4],
#        [3, 4, 0],
#        [3, 0, 2]])

Note that you will never get the same label subset twice.

fit_predict(X, y)¶

Cluster the output space

Parameters¶

X: None: currently unused, left for scikit compatibility
y: scipy.sparse: label space of shape (n_samples, n_labels)

Returns¶

array of arrays of label indexes (numpy.ndarray): label space division, each sublist represents labels that are in that community

Cite us

If you use scikit-multilearn-ng in your research and publish it, please consider citing scikit-multilearn:

@ARTICLE{2017arXiv170201460S,
    author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
    title = "{A scikit-based Python environment for performing multi-label classification}",
    journal = {ArXiv e-prints},
    archivePrefix = "arXiv",
    eprint = {1702.01460},
    primaryClass = "cs.LG",
    keywords = {Computer Science - Learning, Computer Science - Mathematical Software},
    year = 2017,
    month = feb,
}
