skmultilearn.cluster package
The skmultilearn.cluster module gathers label space clustering methods.
Name | Description |
---|---|
FixedLabelSpaceClusterer | Return a predefined fixed clustering, usually driven by expert knowledge |
MatrixLabelSpaceClusterer | Cluster the label space using a scikit-compatible matrix-based clusterer |
GraphToolLabelGraphClusterer | Fits a Stochastic Block Model to the Label Graph and infers the communities |
StochasticBlockModel | A Stochastic Blockmodel class |
IGraphLabelGraphClusterer | Clusters label space using igraph community detection |
RandomLabelSpaceClusterer | Randomly divides label space into equally-sized clusters |
NetworkXLabelGraphClusterer | Cluster label space with NetworkX community detection |
- class skmultilearn.cluster.FixedLabelSpaceClusterer(clusters=None)
Bases: LabelSpaceClustererBase
Return a fixed label space partition
This clusterer takes a predefined fixed clustering of the label space and returns it in fit_predict as the label space division. This is useful for employing expert knowledge about label space division or partitions in ensemble classifiers such as LabelSpacePartitioningClassifier or MajorityVotingClassifier.
Parameters
- clusters: array of arrays of int
provided partition of the label space in the form of a numpy array of numpy arrays of indexes for each partition, e.g. [[0,1],[2,3]]
An example use of the fixed clusterer with a label partitioning classifier to train random forests for a set of subproblems defined upon expert knowledge:
    from skmultilearn.ensemble import LabelSpacePartitioningClassifier
    from skmultilearn.cluster import FixedLabelSpaceClusterer
    from skmultilearn.problem_transform import LabelPowerset
    from sklearn.ensemble import RandomForestClassifier

    # assumes X_train, y_train and X_test are already loaded
    classifier = LabelSpacePartitioningClassifier(
        classifier=LabelPowerset(
            classifier=RandomForestClassifier(n_estimators=100),
            require_dense=[False, True]
        ),
        require_dense=[True, True],
        # the constructor parameter is named `clusters` (see the signature above)
        clusterer=FixedLabelSpaceClusterer(clusters=[[1, 2, 3], [0, 4]])
    )

    # train
    classifier.fit(X_train, y_train)

    # predict
    predictions = classifier.predict(X_test)
- fit_predict(X, y)
Returns the provided label space division
Parameters
- X: None
currently unused, left for scikit compatibility
- y: scipy.sparse
label space of shape (n_samples, n_labels)
Returns
- array of arrays of label indexes (numpy.ndarray)
label space division, each sublist represents labels that are in that community
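Because the fixed clusterer returns whatever division it was given, it can be worth checking beforehand that an expert-provided division really is a partition of the label space. The following is a minimal sketch under that assumption; `check_partition` is a hypothetical helper written for illustration and is not part of scikit-multilearn.

```python
def check_partition(clusters, n_labels):
    """Return True if every label index 0..n_labels-1 appears in exactly one cluster.

    Hypothetical helper for illustration, not a scikit-multilearn function.
    """
    seen = [label for cluster in clusters for label in cluster]
    return sorted(seen) == list(range(n_labels))


# the division used in the example above covers labels 0..4 exactly once
print(check_partition([[1, 2, 3], [0, 4]], n_labels=5))  # True

# label 1 appears in two clusters, so this is an overlapping division
print(check_partition([[0, 1], [1, 2]], n_labels=3))  # False
```

Overlapping divisions are not necessarily wrong (they may be intended for voting ensembles), so a check like this is only meaningful when a strict partition is expected.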
- class skmultilearn.cluster.IGraphLabelGraphClusterer(graph_builder, method)
Bases: LabelGraphClustererBase
Clusters the label space using igraph community detection methods
This clusterer constructs an igraph representation of the Label Graph generated by graph builder and detects communities in it using community detection methods from the igraph library. Detected communities are converted to a label space clustering. The approach has been described in this paper concerning data-driven label space division.
Parameters
- graph_builder: a GraphBuilderBase inherited transformer
the graph builder to provide the adjacency matrix and weight map for the underlying graph
- method: string
the community detection method to use, this clusterer supports the following community detection methods:
Method name string | Description |
---|---|
fastgreedy | Detecting communities with largest modularity using incremental greedy search |
infomap | Detecting communities through information flow compressing simulated via random walks |
label_propagation | Detecting communities from colorings via multiple label propagation on the graph |
leading_eigenvector | Detecting communities with largest modularity through adjacency matrix eigenvectors |
multilevel | Recursive community detection with largest modularity step by step maximization |
walktrap | Finding communities by trapping many random walks |
Attributes
- graph_: igraph.Graph
the igraph Graph object containing the graph representation of graph builder's adjacency matrix and weights
- weights_: { 'weight': list of values in edge order of graph edges }
edge weights stored in a format recognizable by the igraph module
Note
This clusterer is GPL-licenced and will taint your code with GPL restrictions.
References
If you use this clusterer please cite the igraph paper and the clustering paper:
@Article{igraph,
title = {The igraph software package for complex network research},
author = {Gabor Csardi and Tamas Nepusz},
journal = {InterJournal},
volume = {Complex Systems},
pages = {1695},
year = {2006},
url = {http://igraph.org},
}
@Article{datadriven,
author = {Szymański, Piotr and Kajdanowicz, Tomasz and Kersting, Kristian},
title = {How Is a Data-Driven Approach Better than Random Choice in Label Space Division for Multi-Label Classification?},
journal = {Entropy},
volume = {18},
year = {2016},
number = {8},
article_number = {282},
url = {http://www.mdpi.com/1099-4300/18/8/282},
issn = {1099-4300},
doi = {10.3390/e18080282}
}
Examples
Example code for using this clusterer with a classifier looks like this:

    from sklearn.ensemble import RandomForestClassifier
    from skmultilearn.problem_transform import LabelPowerset
    from skmultilearn.cluster import IGraphLabelGraphClusterer, LabelCooccurrenceGraphBuilder
    from skmultilearn.ensemble import LabelSpacePartitioningClassifier

    # construct base forest classifier
    base_classifier = RandomForestClassifier(n_estimators=1000)

    # construct a graph builder that will include
    # label relations weighted by how many times they
    # co-occurred in the data, without self-edges
    graph_builder = LabelCooccurrenceGraphBuilder(
        weighted=True,
        include_self_edges=False
    )

    # setup problem transformation approach with sparse matrices for random forest
    problem_transform_classifier = LabelPowerset(
        classifier=base_classifier,
        require_dense=[False, False]
    )

    # setup the clusterer to use, we selected the fast greedy modularity-maximization approach
    clusterer = IGraphLabelGraphClusterer(graph_builder=graph_builder, method='fastgreedy')

    # setup the ensemble metaclassifier
    classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

    # train (assumes X_train, y_train and X_test are already loaded)
    classifier.fit(X_train, y_train)

    # predict
    predictions = classifier.predict(X_test)
For more use cases see the label relations exploration guide.
- fit_predict(X, y)
Performs clustering on y and returns list of label lists
Builds a label graph using the provided graph builder's transform method on y and then detects communities using the selected method.
Sets self.weights_ and self.graph_.
Parameters
- X: None
currently unused, left for scikit compatibility
- y: scipy.sparse
label space of shape (n_samples, n_labels)
Returns
- array of arrays of label indexes (numpy.ndarray)
label space division, each sublist represents labels that are in that community
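To make the "list of values in edge order" format of `weights_` more concrete, here is a minimal pure-Python sketch of how a graph builder's weight map could be split into an edge list and a matching weight dictionary. The function name `edge_map_to_graph_args` is hypothetical, and the sketch is an illustration of the described format, not the library's implementation.

```python
def edge_map_to_graph_args(edge_map):
    """Split a {(i, j): weight} map into an edge list and a weight dict.

    Hypothetical helper for illustration. The i-th weight belongs to the
    i-th edge, which is what "edge order" means in the attribute description.
    """
    edges = list(edge_map)  # dicts preserve insertion order
    weights = [float(edge_map[edge]) for edge in edges]
    return edges, {"weight": weights}


# weight map in the shape produced by a graph builder's transform method
edge_map = {(0, 1): 3.0, (1, 2): 1.0, (0, 2): 2.0}
edges, weights = edge_map_to_graph_args(edge_map)

print(edges)    # [(0, 1), (1, 2), (0, 2)]
print(weights)  # {'weight': [3.0, 1.0, 2.0]}
```

With python-igraph installed, values of this shape can be handed to the `igraph.Graph` constructor through its `edges` and `edge_attrs` arguments.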
- class skmultilearn.cluster.LabelCooccurrenceGraphBuilder(weighted=None, include_self_edges=None, normalize_self_edges=None)
Bases: GraphBuilderBase
Base class providing API and common functions for all label co-occurrence based multi-label classifiers.
This graph builder constructs a Label Graph based on the output matrix where two label nodes are connected when at least one sample is labeled with both of them. If the graph is weighted, the weight of an edge between two label nodes is the number of samples labeled with these two labels. Self-edge weights contain the number of samples with a given label.
Parameters
- weighted: bool
decide whether to generate a weighted or unweighted graph.
- include_self_edges: bool
decide whether to include self-edges, i.e. label 1 - label 1, in the co-occurrence graph
- normalize_self_edges: bool
if including self edges, divide the (i, i) edge by 2.0, requires include_self_edges=True
References
If you use this graph builder please cite the clustering paper:
@Article{datadriven,
author = {Szymański, Piotr and Kajdanowicz, Tomasz and Kersting, Kristian},
title = {How Is a Data-Driven Approach Better than Random Choice in Label Space Division for Multi-Label Classification?},
journal = {Entropy},
volume = {18},
year = {2016},
number = {8},
article_number = {282},
url = {http://www.mdpi.com/1099-4300/18/8/282},
issn = {1099-4300},
doi = {10.3390/e18080282}
}
Examples
A full example of building a modularity-based label space division based on the Label Co-occurrence Graph and classifying with a separate classifier chain per subspace.

    from skmultilearn.cluster import LabelCooccurrenceGraphBuilder, NetworkXLabelGraphClusterer
    from skmultilearn.ensemble import LabelSpacePartitioningClassifier
    from skmultilearn.problem_transform import ClassifierChain
    from sklearn.naive_bayes import GaussianNB

    graph_builder = LabelCooccurrenceGraphBuilder(
        weighted=True,
        include_self_edges=False,
        normalize_self_edges=False
    )
    clusterer = NetworkXLabelGraphClusterer(graph_builder, method='louvain')
    classifier = LabelSpacePartitioningClassifier(
        classifier=ClassifierChain(classifier=GaussianNB()),
        clusterer=clusterer
    )

    # assumes X_train, y_train and X_test are already loaded
    classifier.fit(X_train, y_train)
    prediction = classifier.predict(X_test)
For more use cases see the label relations exploration guide.
- transform(y)
Generate adjacency matrix from label matrix
This function generates a weighted or unweighted co-occurrence Label Graph adjacency matrix in dictionary of keys format based on input binary label vectors
Parameters
- y: numpy.ndarray or scipy.sparse
dense or sparse binary matrix with shape (n_samples, n_labels)
Returns
- Dict[(int, int), float]
weight map with a tuple of label indexes as keys and the number of samples in which the two co-occurred as values
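The weight map can be illustrated with a small pure-Python sketch that counts label co-occurrences in a dense binary matrix given as a list of rows. This is a simplified illustration of the behavior described above, not the library's implementation; the function name `cooccurrence_edge_map` is hypothetical, and self-edge normalization is left out.

```python
def cooccurrence_edge_map(y, weighted=True, include_self_edges=False):
    """Build a {(i, j): weight} map with i <= j from binary label rows.

    Hypothetical helper for illustration only.
    """
    edge_map = {}
    for row in y:
        # indexes of the labels assigned to this sample, in increasing order
        present = [j for j, value in enumerate(row) if value]
        for pos, a in enumerate(present):
            start = pos if include_self_edges else pos + 1
            for b in present[start:]:
                if weighted:
                    # weight = number of samples labeled with both a and b
                    edge_map[(a, b)] = edge_map.get((a, b), 0.0) + 1.0
                else:
                    # unweighted graph: every existing edge has weight 1.0
                    edge_map[(a, b)] = 1.0
    return edge_map


y = [
    [1, 1, 0],  # sample labeled with labels 0 and 1
    [1, 1, 1],  # sample labeled with all three labels
    [0, 0, 1],  # sample labeled with label 2 only
]

print(cooccurrence_edge_map(y))
# {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}
```

Labels 0 and 1 co-occur in two samples, so their edge has weight 2.0; the third sample carries a single label and therefore contributes no edge unless self-edges are included.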
- class skmultilearn.cluster.MatrixLabelSpaceClusterer(clusterer=None, pass_input_space=False)
Bases: LabelSpaceClustererBase
Cluster the label space using a scikit-compatible matrix-based clusterer
Parameters
- clusterer: sklearn.base.ClusterMixin
a clonable instance of a scikit-compatible clusterer, will be automatically put under self.clusterer.
- pass_input_space: bool (default is False)
whether to take X into consideration upon clustering, use only if you know that the clusterer can handle two parameters for clustering, will be automatically put under self.pass_input_space.
Example code for using this clusterer looks like this:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cluster import KMeans
    from skmultilearn.problem_transform import LabelPowerset
    from skmultilearn.cluster import MatrixLabelSpaceClusterer
    from skmultilearn.ensemble import LabelSpacePartitioningClassifier

    # construct base forest classifier
    base_classifier = RandomForestClassifier(n_estimators=1030)

    # setup problem transformation approach with sparse matrices for random forest
    problem_transform_classifier = LabelPowerset(
        classifier=base_classifier,
        require_dense=[False, False]
    )

    # setup the clusterer
    clusterer = MatrixLabelSpaceClusterer(clusterer=KMeans(n_clusters=3))

    # setup the ensemble metaclassifier
    classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

    # train (assumes X_train, y_train and X_test are already loaded)
    classifier.fit(X_train, y_train)

    # predict
    predictions = classifier.predict(X_test)
- fit_predict(X, y)
Clusters the output space
The clusterer's fit_predict method is executed on either X and y.T vectors (if self.pass_input_space is true) or just y.T to detect clusters of labels.
The transposition of label space is used to align with the format expected by scikit-learn classifiers, i.e. we cluster labels with label assignment vectors as samples.
Returns
- array of arrays of label indexes (numpy.ndarray)
label space division, each sublist represents labels that are in that community
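The idea of clustering labels "with label assignment vectors as samples" can be sketched without any dependencies: transpose the label matrix so that each row describes one label, obtain one cluster id per label, and group label indexes by that id. In the sketch below the cluster ids are hard-coded stand-ins for what a scikit-compatible clusterer's `fit_predict` would return on the transposed matrix, and `assignments_to_partition` is a hypothetical helper, not a scikit-multilearn function.

```python
def assignments_to_partition(assignments):
    """Group label indexes by cluster id.

    `assignments[i]` is the cluster id of label i. Hypothetical helper
    for illustration only.
    """
    clusters = {}
    for label_index, cluster_id in enumerate(assignments):
        clusters.setdefault(cluster_id, []).append(label_index)
    # return clusters ordered by cluster id
    return [clusters[cluster_id] for cluster_id in sorted(clusters)]


# 3 samples x 4 labels
y = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]

# transpose: one row per label, one column per sample
y_t = [list(column) for column in zip(*y)]
print(y_t)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]

# stand-in for clusterer.fit_predict(y_t): one cluster id per label
assignments = [0, 0, 1, 1]
print(assignments_to_partition(assignments))  # [[0, 1], [2, 3]]
```

Labels 0 and 1 have identical assignment vectors, which is why a matrix-based clusterer would be expected to place them in the same cluster.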
- class skmultilearn.cluster.RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap)
Bases: LabelSpaceClustererBase
Randomly divides the label space into equally-sized clusters
This method divides the label space by drawing without replacement a desired number of equally sized subsets of label space, in a partitioning or overlapping scheme.
Parameters
- cluster_size: int
desired size of a single cluster, will be automatically put under self.cluster_size.
- cluster_count: int
number of clusters to divide into, will be automatically put under self.cluster_count.
- allow_overlap: bool
whether to allow overlapping clusters or not, will be automatically put under self.allow_overlap.
Examples
The following code performs random label space partitioning.
    from skmultilearn.cluster import RandomLabelSpaceClusterer

    # assume X,y contain the data, example y contains 5 labels
    cluster_count = 2
    cluster_size = y.shape[1]//cluster_count # == 2
    clr = RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap=False)
    clr.fit_predict(X,y)

    # Result:
    # array([list([0, 4]), list([2, 3]), list([1])], dtype=object)
Note that the leftover labels that did not fit in cluster_size x cluster_count classifiers will be appended to an additional last cluster of size at most cluster_size - 1.
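The non-overlapping scheme with a leftover cluster can be sketched in pure Python as follows. This is a simplified illustration of the behavior described above, not the library's implementation; `random_partition` and its `seed` parameter are hypothetical, and the sketch assumes that the leftover labels number fewer than cluster_size.

```python
import random


def random_partition(n_labels, cluster_size, cluster_count, seed=None):
    """Randomly split label indexes into cluster_count clusters of cluster_size.

    Labels that do not fit are placed in one additional, smaller cluster.
    Hypothetical helper for illustration only.
    """
    leftover_count = n_labels - cluster_size * cluster_count
    if leftover_count < 0 or leftover_count >= cluster_size:
        raise ValueError("labels do not fit the requested cluster layout")

    rng = random.Random(seed)
    labels = list(range(n_labels))
    rng.shuffle(labels)  # drawing without replacement

    clusters = [
        labels[i * cluster_size:(i + 1) * cluster_size]
        for i in range(cluster_count)
    ]
    if leftover_count:
        # the remaining labels form the last cluster of size < cluster_size
        clusters.append(labels[cluster_size * cluster_count:])
    return clusters


partition = random_partition(n_labels=5, cluster_size=2, cluster_count=2, seed=0)
print([len(cluster) for cluster in partition])  # [2, 2, 1]
```

For 5 labels, a cluster size of 2 and a cluster count of 2, the result always has three clusters of sizes 2, 2 and 1, matching the shape of the output shown in the example above; which labels land in which cluster depends on the random draw.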
You can also use this class to get a random division of the label space, even with multiple overlaps:
    from skmultilearn.cluster import RandomLabelSpaceClusterer

    cluster_size = 3
    cluster_count = 5
    clr = RandomLabelSpaceClusterer(cluster_size, cluster_count, allow_overlap=True)
    clr.fit_predict(X,y)

    # Result
    # array([[2, 1, 3],
    #        [3, 0, 4],
    #        [2, 3, 1],
    #        [2, 3, 4],
    #        [3, 4, 0],
    #        [3, 0, 2]])
Note that you will never get the same label subset twice.
- fit_predict(X, y)
Cluster the output space
Parameters
- X: None
currently unused, left for scikit compatibility
- y: scipy.sparse
label space of shape (n_samples, n_labels)
Returns
- array of arrays of label indexes (numpy.ndarray)
label space division, each sublist represents labels that are in that community
Cite us
If you use scikit-multilearn-ng in your research and publish it, please consider citing scikit-multilearn:
@ARTICLE{2017arXiv170201460S,
author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
title = "{A scikit-based Python environment for performing multi-label classification}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1702.01460},
primaryClass = "cs.LG",
keywords = {Computer Science - Learning, Computer Science - Mathematical Software},
year = 2017,
month = feb,
}