Large Classifier Systems in Bio- and Cheminformatics

Jörg Wicker: Large Classifier Systems in Bio- and Cheminformatics. Technische Universität München, 2013.

Abstract

Large classifier systems are machine learning algorithms that use multiple
classifiers to improve the prediction of target values in advanced
classification tasks. Although learning problems in bio- and
cheminformatics commonly provide data in schemes suitable for large
classifier systems, they are rarely used in these domains. This thesis
introduces two new classifiers incorporating systems of classifiers
using Boolean matrix decomposition to handle data in a schema that
often occurs in bio- and cheminformatics.

The first approach, called MLC-BMaD (multi-label classification using
Boolean matrix decomposition), uses Boolean matrix decomposition to
decompose the labels in a multi-label classification task. The
decomposed matrices are a compact representation of the information
in the labels (first matrix) and the dependencies among the labels
(second matrix). The first matrix is used in a further multi-label
classification while the second matrix is used to generate the final
matrix from the predicted values of the first matrix.
MLC-BMaD was evaluated on six standard multi-label data sets. The
experiments showed that MLC-BMaD can perform particularly well on data
sets with a high number of labels and a small number of instances, and
that it can outperform standard multi-label algorithms.
Subsequently, MLC-BMaD is extended to a special case of
multi-relational learning by treating the labels not as simple
labels but as instances. The algorithm, called ClassFact
(Classification factorization), uses both matrices in a multi-label
classification. Each label represents a mapping between two
instances.
Experiments on three data sets from the domain of bioinformatics show
that ClassFact can outperform the baseline method, which merges the
relations into one, on hard classification tasks.
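
The MLC-BMaD pipeline described above can be illustrated with a minimal sketch. It is not the thesis implementation: the factor matrices are picked by hand instead of being computed by a Boolean matrix decomposition algorithm, and a 1-nearest-neighbour lookup stands in for the multi-label classifier. All names (`boolean_product`, `predict_latent_1nn`, the toy data) are assumptions made for illustration.

```python
import numpy as np


def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is 1 if a[i, k] and b[k, j] are both 1 for some k."""
    return (a.astype(int) @ b.astype(int) > 0).astype(int)


# Toy label matrix Y (4 instances x 4 labels), chosen by hand.
Y = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])

# Hand-picked factors (assumption: a real run would obtain them from a
# Boolean matrix decomposition algorithm).
# A: instances x latent labels (compact label representation, "first matrix").
# B: latent labels x original labels (label dependencies, "second matrix").
A = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [0, 0]])
B = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])

# For this toy example the decomposition is exact.
assert np.array_equal(boolean_product(A, B), Y)


def predict_latent_1nn(X_train, A_train, X_test):
    """Stand-in multi-label learner: copy the latent labels of the nearest training instance."""
    dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return A_train[dist.argmin(axis=1)]


# Toy features for the four training instances and two unseen instances.
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
X_test = np.array([[0.9, 0.9], [0.1, 0.8]])

# Step 1: predict the latent labels (first matrix) for the new instances.
A_pred = predict_latent_1nn(X_train, A, X_test)
# Step 2: map them back to the original labels with the second matrix.
Y_pred = boolean_product(A_pred, B)
print(Y_pred)
```

In this sketch the learner only has to predict two latent labels instead of four original ones, which hints at why the approach may help when there are many labels and few instances.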

Furthermore, large classifier systems are applied to two cheminformatics
data sets. The first is used to predict the environmental fate of
chemicals by predicting biodegradation pathways; the second is a data
set from the domain of predictive toxicology. In biodegradation
pathway prediction, I extend a knowledge-based system and incorporate
a machine learning approach to predict a probability for
biotransformation products based on the structure- and knowledge-based
predictions of products, which are based on transformation rules. The
use of multi-label classification improves the performance of the
classifiers and extends the number of transformation rules that can be
covered.
For the prediction of toxic effects of chemicals, I applied large
classifier systems to the ToxCast™ data set, which maps
toxic effects to chemicals. As the given toxic effects are not easy to
predict due to missing information and a skewed class
distribution, I introduce a filtering step in the multi-label
classification, which identifies the labels that are usable in
multi-label prediction and excludes the others from the
prediction. Experiments show
that this approach can improve upon the baseline method using binary
classification, as well as multi-label approaches using no filtering.
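
The idea of a label filtering step can be sketched as follows. The abstract does not state the criterion the thesis uses to decide which labels are usable, so the rule below is purely hypothetical: a label is kept only if enough of its values are observed and its minority class is not too rare. The function name `usable_labels` and both thresholds are assumptions for illustration.

```python
import numpy as np


def usable_labels(Y, min_observed=0.5, min_minority=0.1):
    """Return a Boolean mask over label columns (hypothetical criterion).

    Y is an instances x labels matrix with entries 1, 0, or NaN (missing).
    A label is kept if at least `min_observed` of its values are present
    and its rarer class makes up at least `min_minority` of the observed values.
    """
    observed = ~np.isnan(Y)
    frac_observed = observed.mean(axis=0)
    n_observed = observed.sum(axis=0)
    positives = np.nansum(Y, axis=0)
    # Guard against division by zero for labels with no observed values.
    pos_rate = positives / np.maximum(n_observed, 1)
    minority = np.minimum(pos_rate, 1.0 - pos_rate)
    return (frac_observed >= min_observed) & (minority >= min_minority)


# Toy label matrix: 5 chemicals x 3 toxic-effect labels (made-up values).
nan = np.nan
Y = np.array([[1.0, nan, 0.0],
              [0.0, nan, 0.0],
              [1.0, nan, 0.0],
              [0.0, 1.0, 0.0],
              [nan, 0.0, 0.0]])

mask = usable_labels(Y)
Y_filtered = Y[:, mask]  # only these labels would be passed to the multi-label learner
print(mask)
```

Here the second label is dropped because most of its values are missing, and the third because it has no positive examples; only the first label would be passed on to the multi-label classifier.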

The presented results show that large classifier systems can play a
role in future research challenges, especially in bio- and
cheminformatics, where data sets frequently consist of more complex
structures and data can be rather small in terms of the number of
instances compared to other domains.

BibTeX

@phdthesis{wicker2013large,
title = {Large Classifier Systems in Bio- and Cheminformatics},
author = {Jörg Wicker},
url = {http://mediatum.ub.tum.de/node?id=1165858},
year  = {2013},
date = {2013-01-01},
school = {Technische Universität München},
abstract = {Large classifier systems are machine learning algorithms that use multiple 
classifiers to improve the prediction of target values in advanced 
classification tasks. Although learning problems in bio- and 
cheminformatics commonly provide data in schemes suitable for large 
classifier systems, they are rarely used in these domains. This thesis 
introduces two new classifiers incorporating systems of classifiers 
using Boolean matrix decomposition to handle data in a schema that 
often occurs in bio- and cheminformatics. 
 
The first approach, called MLC-BMaD (multi-label classification using 
Boolean matrix decomposition), uses Boolean matrix decomposition to 
decompose the labels in a multi-label classification task. The 
decomposed matrices are a compact representation of the information 
in the labels (first matrix) and the dependencies among the labels 
(second matrix). The first matrix is used in a further multi-label 
classification while the second matrix is used to generate the final 
matrix from the predicted values of the first matrix. 
MLC-BMaD was evaluated on six standard multi-label data sets. The 
experiments showed that MLC-BMaD can perform particularly well on data 
sets with a high number of labels and a small number of instances, and 
that it can outperform standard multi-label algorithms. 
Subsequently, MLC-BMaD is extended to a special case of 
multi-relational learning by treating the labels not as simple 
labels but as instances. The algorithm, called ClassFact 
(Classification factorization), uses both matrices in a multi-label 
classification. Each label represents a mapping between two 
instances. 
Experiments on three data sets from the domain of bioinformatics show 
that ClassFact can outperform the baseline method, which merges the 
relations into one, on hard classification tasks. 
 
Furthermore, large classifier systems are applied to two cheminformatics 
data sets. The first is used to predict the environmental fate of 
chemicals by predicting biodegradation pathways; the second is a data 
set from the domain of predictive toxicology. In biodegradation 
pathway prediction, I extend a knowledge-based system and incorporate 
a machine learning approach to predict a probability for 
biotransformation products based on the structure- and knowledge-based 
predictions of products, which are based on transformation rules. The 
use of multi-label classification improves the performance of the 
classifiers and extends the number of transformation rules that can be 
covered. 
For the prediction of toxic effects of chemicals, I applied large 
classifier systems to the ToxCast\texttrademark{} data set, which maps 
toxic effects to chemicals. As the given toxic effects are not easy to 
predict due to missing information and a skewed class 
distribution, I introduce a filtering step in the multi-label 
classification, which identifies the labels that are usable in 
multi-label prediction and excludes the others from the 
prediction. Experiments show 
that this approach can improve upon the baseline method using binary 
classification, as well as multi-label approaches using no filtering. 
 
The presented results show that large classifier systems can play a 
role in future research challenges, especially in bio- and 
cheminformatics, where data sets frequently consist of more complex 
structures and data can be rather small in terms of the number of 
instances compared to other domains.},
keywords = {application, biodegradation, bioinformatics, cheminformatics, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity},
pubstate = {published},
tppubtype = {phdthesis}
}