2018 |
||
| 1. | Stönner, Christof; Edtbauer, Achim; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Wicker, Jörg; Williams, Jonathan Proof of concept study: Testing human volatile organic compounds as tools for age classification of films Journal Article In: PLOS One, 13 (10), pp. 1-14, 2018. Tags: application, atmospheric chemistry, breath analysis, cinema data mining, data mining, emotional response analysis, machine learning, movie analysis, smell of fear, sof, time series @article{Stonner2018, title = {Proof of concept study: Testing human volatile organic compounds as tools for age classification of films}, author = {Christof Stönner and Achim Edtbauer and Bettina Derstorff and Efstratios Bourtsoukidis and Thomas Klüpfel and Jörg Wicker and Jonathan Williams}, doi = {10.1371/journal.pone.0203044}, year = {2018}, date = {2018-10-11}, journal = {PLOS One}, volume = {13}, number = {10}, pages = {1-14}, publisher = {Public Library of Science}, abstract = {Humans emit numerous volatile organic compounds (VOCs) through breath and skin. The nature and rate of these emissions are affected by various factors including emotional state. Previous measurements of VOCs and CO2 in a cinema have shown that certain chemicals are reproducibly emitted by audiences reacting to events in a particular film. Using data from films with various age classifications, we have studied the relationship between the emission of multiple VOCs and CO2 and the age classifier (0, 6, 12, and 16) with a view to developing a new chemically based and objective film classification method. We apply a random forest model built with time independent features extracted from the time series of every measured compound, and test predictive capability on subsets of all data.
It was found that most compounds were not able to predict all age classifiers reliably, likely reflecting the fact that current classification is based on perceived sensibilities to many factors (e.g. incidences of violence, sex, antisocial behaviour, drug use, and bad language) rather than the visceral biological responses expressed in the data. However, promising results were found for isoprene which reliably predicted 0, 6 and 12 age classifiers for a variety of film genres and audience age groups. Therefore, isoprene emission per person might in future be a valuable aid to national classification boards, or even offer an alternative, objective, metric for rating films based on the reactions of large groups of people.}, keywords = {application, atmospheric chemistry, breath analysis, cinema data mining, data mining, emotional response analysis, machine learning, movie analysis, smell of fear, sof, time series}, pubstate = {published}, tppubtype = {article} } | |
2017 |
||
| 2. | Latino, Diogo; Wicker, Jörg; Gütlein, Martin; Schmid, Emanuel; Kramer, Stefan; Fenner, Kathrin Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data Journal Article In: Environmental Science: Processes & Impacts, 2017. Tags: application, biodegradation, cheminformatics, data mining, enviPath, multi-label classification, REST, web services @article{latino2017eawag, title = {Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data}, author = {Diogo Latino and Jörg Wicker and Martin Gütlein and Emanuel Schmid and Stefan Kramer and Kathrin Fenner}, doi = {10.1039/C6EM00697C}, year = {2017}, date = {2017-01-01}, journal = {Environmental Science: Processes & Impacts}, publisher = {The Royal Society of Chemistry}, abstract = {Developing models for the prediction of microbial biotransformation pathways and half-lives of trace organic contaminants in different environments requires as training data easily accessible and sufficiently large collections of respective biotransformation data that are annotated with metadata on study conditions. Here, we present the Eawag-Soil package, a public database that has been developed to contain all freely accessible regulatory data on pesticide degradation in laboratory soil simulation studies for pesticides registered in the EU (282 degradation pathways, 1535 reactions, 1619 compounds and 4716 biotransformation half-life values with corresponding metadata on study conditions). We provide a thorough description of this novel data resource, and discuss important features of the pesticide soil degradation data that are relevant for model development.
Most notably, the variability of half-life values for individual compounds is large and only about one order of magnitude lower than the entire range of median half-life values spanned by all compounds, demonstrating the need to consider study conditions in the development of more accurate models for biotransformation prediction. We further show how the data can be used to find missing rules relevant for predicting soil biotransformation pathways. From this analysis, eight examples of reaction types were presented that should trigger the formulation of new biotransformation rules, e.g., Ar-OH methylation, or the extension of existing rules e.g., hydroxylation in aliphatic rings. The data were also used to exemplarily explore the dependence of half-lives of different amide pesticides on chemical class and experimental parameters. This analysis highlighted the value of considering initial transformation reactions for the development of meaningful quantitative-structure biotransformation relationships (QSBR), which is a novel opportunity offered by the simultaneous encoding of transformation reactions and corresponding half-lives in Eawag-Soil. Overall, Eawag-Soil provides an unprecedentedly rich collection of manually extracted and curated biotransformation data, which should be useful in a great variety of applications.}, keywords = {application, biodegradation, cheminformatics, data mining, enviPath, multi-label classification, REST, web services}, pubstate = {published}, tppubtype = {article} } | |
2016 |
||
| 3. | Wicker, Jörg; Fenner, Kathrin; Kramer, Stefan A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction Incollection In: Lässig, Jörg; Kersting, Kristian; Morik, Katharina (Ed.): Computational Sustainability, pp. 75-97, Springer International Publishing, Cham, 2016, ISBN: 978-3-319-31858-5. Tags: application, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways, multi-label classification @incollection{wicker2016ahybrid, title = {A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction}, author = {Jörg Wicker and Kathrin Fenner and Stefan Kramer}, editor = {Jörg Lässig and Kristian Kersting and Katharina Morik}, url = {http://dx.doi.org/10.1007/978-3-319-31858-5_5}, doi = {10.1007/978-3-319-31858-5_5}, isbn = {978-3-319-31858-5}, year = {2016}, date = {2016-04-21}, booktitle = {Computational Sustainability}, pages = {75-97}, publisher = {Springer International Publishing}, address = {Cham}, abstract = {One of the main tasks in chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate for each biotransformation rule of the system.
Since the application of a rule then depends on a threshold for the probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results from leave-one-out cross-validation show that a recall and precision of approximately 0.8 can be achieved for a subset of 13 transformation rules. The set of used rules is further extended using multi-label classification, where dependencies among the transformation rules are exploited to improve the predictions. While the results regarding recall and precision vary, the area under the ROC curve can be improved using multi-label classification. Therefore, it is possible to optimize precision without compromising recall. Recently, we integrated the presented approach into enviPath, a complete redesign and re-implementation of UM-PPS.}, keywords = {application, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways, multi-label classification}, pubstate = {published}, tppubtype = {incollection} } | |
| 4. | Wicker, Jörg; Lorsbach, Tim; Gütlein, Martin; Schmid, Emanuel; Latino, Diogo; Kramer, Stefan; Fenner, Kathrin enviPath - The Environmental Contaminant Biotransformation Pathway Resource Journal Article In: Nucleic Acids Research, 44 (D1), pp. D502-D508, 2016. Tags: application, biodegradation, cheminformatics, computational sustainability, data mining, enviPath, linked data, machine learning, metabolic pathways, multi-label classification @article{wicker2016envipath, title = {enviPath - The Environmental Contaminant Biotransformation Pathway Resource}, author = {Jörg Wicker and Tim Lorsbach and Martin Gütlein and Emanuel Schmid and Diogo Latino and Stefan Kramer and Kathrin Fenner}, editor = {Michael Galperin}, url = {http://nar.oxfordjournals.org/content/44/D1/D502.abstract}, doi = {10.1093/nar/gkv1229}, year = {2016}, date = {2016-01-01}, journal = {Nucleic Acids Research}, volume = {44}, number = {D1}, pages = {D502-D508}, abstract = {The University of Minnesota Biocatalysis/Biodegradation Database and Pathway Prediction System (UM-BBD/PPS) has been a unique resource covering microbial biotransformation pathways of primarily xenobiotic chemicals for over 15 years. This paper introduces the successor system, enviPath (The Environmental Contaminant Biotransformation Pathway Resource), which is a complete redesign and reimplementation of UM-BBD/PPS. enviPath uses the database from the UM-BBD/PPS as a basis, extends the use of this database, and allows users to include their own data to support multiple use cases. Relative reasoning is supported for the refinement of predictions and to allow its extensions in terms of previously published, but not implemented machine learning models. User access is simplified by providing a REST API that simplifies the inclusion of enviPath into existing workflows. An RDF database is used to enable simple integration with other databases.
enviPath is publicly available at https://envipath.org with free and open access to its core data.}, keywords = {application, biodegradation, cheminformatics, computational sustainability, data mining, enviPath, linked data, machine learning, metabolic pathways, multi-label classification}, pubstate = {published}, tppubtype = {article} } | |
| 5. | Williams, Jonathan; Stönner, Christof; Wicker, Jörg; Krauter, Nicolas; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Kramer, Stefan Cinema audiences reproducibly vary the chemical composition of air during films, by broadcasting scene specific emissions on breath Journal Article In: Scientific Reports, 6, 2016. Tags: application, atmospheric chemistry, causality, cheminformatics, data mining, emotional response analysis, smell of fear, sof @article{williams2015element, title = {Cinema audiences reproducibly vary the chemical composition of air during films, by broadcasting scene specific emissions on breath}, author = {Jonathan Williams and Christof Stönner and Jörg Wicker and Nicolas Krauter and Bettina Derstorff and Efstratios Bourtsoukidis and Thomas Klüpfel and Stefan Kramer}, url = {http://www.nature.com/articles/srep25464}, doi = {10.1038/srep25464}, year = {2016}, date = {2016-01-01}, journal = {Scientific Reports}, volume = {6}, publisher = {Nature Publishing Group}, abstract = {Human beings continuously emit chemicals into the air by breath and through the skin. In order to determine whether these emissions vary predictably in response to audiovisual stimuli, we have continuously monitored carbon dioxide and over one hundred volatile organic compounds in a cinema. It was found that many airborne chemicals in cinema air varied distinctively and reproducibly with time for a particular film, even in different screenings to different audiences. Application of scene labels and advanced data mining methods revealed that specific film events, namely "suspense" or "comedy" caused audiences to change their emission of specific chemicals. These event-type synchronous, broadcasted human chemosignals open the possibility for objective and non-invasive assessment of a human group response to stimuli by continuous measurement of chemicals in air.
Such methods can be applied to research fields such as psychology and biology, and be valuable to industries such as film making and advertising.}, keywords = {application, atmospheric chemistry, causality, cheminformatics, data mining, emotional response analysis, smell of fear, sof}, pubstate = {published}, tppubtype = {article} } | |
2015 |
||
| 6. | Wicker, Jörg; Krauter, Nicolas; Derstorff, Bettina; Stönner, Christof; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Williams, Jonathan; Kramer, Stefan Cinema Data Mining: The Smell of Fear Inproceedings In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235-1304, ACM, New York, NY, USA, 2015, ISBN: 978-1-4503-3664-2. Tags: application, atmospheric chemistry, breath analysis, causality, cinema data mining, data mining, emotional response analysis, movie analysis, smell of fear, sof, time series @inproceedings{wicker2015cinema, title = {Cinema Data Mining: The Smell of Fear}, author = {Jörg Wicker and Nicolas Krauter and Bettina Derstorff and Christof Stönner and Efstratios Bourtsoukidis and Thomas Klüpfel and Jonathan Williams and Stefan Kramer}, url = {https://wicker.nz/nwp-acm/authorize.php?id=N10031 http://doi.acm.org/10.1145/2783258.2783404}, doi = {10.1145/2783258.2783404}, isbn = {978-1-4503-3664-2}, year = {2015}, date = {2015-01-01}, booktitle = {Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, pages = {1235-1304}, publisher = {ACM}, address = {New York, NY, USA}, organization = {ACM}, series = {KDD '15}, abstract = {While the physiological response of humans to emotional events or stimuli is well-investigated for many modalities (like EEG, skin resistance, ...), surprisingly little is known about the exhalation of so-called Volatile Organic Compounds (VOCs) at quite low concentrations in response to such stimuli. VOCs are molecules of relatively small mass that quickly evaporate or sublimate and can be detected in the air that surrounds us. The paper introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves.
To do so, we measured the VOCs from a movie theatre over a whole month in intervals of thirty seconds, and annotated the screened films by a controlled vocabulary compiled from multiple sources. To gain a better understanding of the data and to reveal unknown relationships, we have built prediction models for so-called forward prediction (the prediction of future VOCs from the past), backward prediction (the prediction of past scene labels from future VOCs) and for some forms of abductive reasoning and Granger causality. Experimental results show that some VOCs and some labels can be predicted with relatively low error, and that hints for causality with low p-values can be detected in the data.}, keywords = {application, atmospheric chemistry, breath analysis, causality, cinema data mining, data mining, emotional response analysis, movie analysis, smell of fear, sof, time series}, pubstate = {published}, tppubtype = {inproceedings} } | |
2013 |
||
| 7. | Wicker, Jörg Large Classifier Systems in Bio- and Cheminformatics PhD Thesis Technische Universität München, 2013. Tags: application, biodegradation, bioinformatics, cheminformatics, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity @phdthesis{wicker2013large, title = {Large Classifier Systems in Bio- and Cheminformatics}, author = {Jörg Wicker}, url = {http://mediatum.ub.tum.de/node?id=1165858}, year = {2013}, date = {2013-01-01}, school = {Technische Universität München}, abstract = {Large classifier systems are machine learning algorithms that use multiple classifiers to improve the prediction of target values in advanced classification tasks. Although learning problems in bio- and cheminformatics commonly provide data in schemes suitable for large classifier systems, they are rarely used in these domains. This thesis introduces two new classifiers incorporating systems of classifiers using Boolean matrix decomposition to handle data in a schema that often occurs in bio- and cheminformatics. The first approach, called MLC-BMaD (multi-label classification using Boolean matrix decomposition), uses Boolean matrix decomposition to decompose the labels in a multi-label classification task. The decomposed matrices are a compact representation of the information in the labels (first matrix) and the dependencies among the labels (second matrix). The first matrix is used in a further multi-label classification while the second matrix is used to generate the final matrix from the predicted values of the first matrix. MLC-BMaD was evaluated on six standard multi-label data sets, the experiments showed that MLC-BMaD can perform particularly well on data sets with a high number of labels and a small number of instances and can outperform standard multi-label algorithms.
Subsequently, MLC-BMaD is extended to a special case of multi-relational learning, by considering the labels not as simple labels, but instances. The algorithm, called ClassFact (Classification factorization), uses both matrices in a multi-label classification. Each label represents a mapping between two instances. Experiments on three data sets from the domain of bioinformatics show that ClassFact can outperform the baseline method, which merges the relations into one, on hard classification tasks. Furthermore, large classifier systems are used on two cheminformatics data sets, the first one is used to predict the environmental fate of chemicals by predicting biodegradation pathways. The second is a data set from the domain of predictive toxicology. In biodegradation pathway prediction, I extend a knowledge-based system and incorporate a machine learning approach to predict a probability for biotransformation products based on the structure- and knowledge-based predictions of products, which are based on transformation rules. The use of multi-label classification improves the performance of the classifiers and extends the number of transformation rules that can be covered. For the prediction of toxic effects of chemicals, I applied large classifier systems to the ToxCast™ data set, which maps toxic effects to chemicals. As the given toxic effects are not easy to predict due to missing information and a skewed class distribution, I introduce a filtering step in the multi-label classification, which finds labels that are usable in multi-label prediction and does not take the others in the prediction into account. Experiments show that this approach can improve upon the baseline method using binary classification, as well as multi-label approaches using no filtering.
The presented results show that large classifier systems can play a role in future research challenges, especially in bio- and cheminformatics, where data sets frequently consist of more complex structures and data can be rather small in terms of the number of instances compared to other domains.}, keywords = {application, biodegradation, bioinformatics, cheminformatics, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity}, pubstate = {published}, tppubtype = {phdthesis} } | |
2010 |
||
| 8. | Hardy, Barry; Douglas, Nicki; Helma, Christoph; Rautenberg, Micha; Jeliazkova, Nina; Jeliazkov, Vedrin; Nikolova, Ivelina; Benigni, Romualdo; Tcheremenskaia, Olga; Kramer, Stefan; Girschick, Tobias; Buchwald, Fabian; Wicker, Jörg; Karwath, Andreas; Gütlein, Martin; Maunz, Andreas; Sarimveis, Haralambos; Melagraki, Georgia; Afantitis, Antreas; Sopasakis, Pantelis; Gallagher, David; Poroikov, Vladimir; Filimonov, Dmitry; Zakharov, Alexey; Lagunin, Alexey; Gloriozova, Tatyana; Novikov, Sergey; Skvortsova, Natalia; Druzhilovsky, Dmitry; Chawla, Sunil; Ghosh, Indira; Ray, Surajit; Patel, Hitesh; Escher, Sylvia Collaborative development of predictive toxicology applications Journal Article In: Journal of Cheminformatics, 2 (1), pp. 7, 2010, ISSN: 1758-2946. Tags: application, cheminformatics, data mining, machine learning, REST, toxicity @article{hardy2010collaborative, title = {Collaborative development of predictive toxicology applications}, author = {Barry Hardy and Nicki Douglas and Christoph Helma and Micha Rautenberg and Nina Jeliazkova and Vedrin Jeliazkov and Ivelina Nikolova and Romualdo Benigni and Olga Tcheremenskaia and Stefan Kramer and Tobias Girschick and Fabian Buchwald and Jörg Wicker and Andreas Karwath and Martin Gütlein and Andreas Maunz and Haralambos Sarimveis and Georgia Melagraki and Antreas Afantitis and Pantelis Sopasakis and David Gallagher and Vladimir Poroikov and Dmitry Filimonov and Alexey Zakharov and Alexey Lagunin and Tatyana Gloriozova and Sergey Novikov and Natalia Skvortsova and Dmitry Druzhilovsky and Sunil Chawla and Indira Ghosh and Surajit Ray and Hitesh Patel and Sylvia Escher}, url = {http://www.jcheminf.com/content/2/1/7}, doi = {10.1186/1758-2946-2-7}, issn = {1758-2946}, year = {2010}, date = {2010-01-01}, journal = {Journal of Cheminformatics}, volume = {2}, number = {1}, pages = {7}, abstract = {OpenTox provides an interoperable, standards-based Framework for the support of 
predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. 
The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.}, keywords = {application, cheminformatics, data mining, machine learning, REST, toxicity}, pubstate = {published}, tppubtype = {article} } | |
| 9. | Wicker, Jörg; Fenner, Kathrin; Ellis, Lynda; Wackett, Larry; Kramer, Stefan Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach Journal Article In: Bioinformatics, 26 (6), pp. 814-821, 2010. Tags: application, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways @article{wicker2010predicting, title = {Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach}, author = {Jörg Wicker and Kathrin Fenner and Lynda Ellis and Larry Wackett and Stefan Kramer}, url = {http://bioinformatics.oxfordjournals.org/content/26/6/814.full}, doi = {10.1093/bioinformatics/btq024}, year = {2010}, date = {2010-01-01}, journal = {Bioinformatics}, volume = {26}, number = {6}, pages = {814-821}, publisher = {Oxford University Press}, abstract = {Motivation: Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this article, we propose a hybrid knowledge- and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate for each biotransformation rule of the system. As the application of a rule then depends on a threshold for the probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results: Results from leave-one-out cross-validation show that a recall and precision of ∼0.8 can be achieved for a subset of 13 transformation rules. Therefore, it is possible to optimize precision without compromising recall. 
We are currently integrating the results into an experimental version of the UM-PPS server. Availability: The program is freely available on the web at http://wwwkramer.in.tum.de/research/applications/biodegradation/data. Contact: kramer@in.tum.de}, keywords = {application, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways}, pubstate = {published}, tppubtype = {article} } | |
2008 |
||
| 10. | Wicker, Jörg; Fenner, Kathrin; Ellis, Lynda; Wackett, Larry; Kramer, Stefan Machine Learning and Data Mining Approaches to Biodegradation Pathway Prediction Inproceedings In: Bridewell, Will; Calders, Toon; de Medeiros, Ana Karla; Kramer, Stefan; Pechenizkiy, Mykola; Todorovski, Ljupco (Ed.): Proceedings of the Second International Workshop on the Induction of Process Models at ECML PKDD 2008, 2008. Tags: application, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways @inproceedings{wicker2008machine, title = {Machine Learning and Data Mining Approaches to Biodegradation Pathway Prediction}, author = {Jörg Wicker and Kathrin Fenner and Lynda Ellis and Larry Wackett and Stefan Kramer}, editor = {Will Bridewell and Toon Calders and Ana Karla de Medeiros and Stefan Kramer and Mykola Pechenizkiy and Ljupco Todorovski}, url = {http://www.ecmlpkdd2008.org/files/pdf/workshops/ipm/9.pdf}, year = {2008}, date = {2008-01-01}, booktitle = {Proceedings of the Second International Workshop on the Induction of Process Models at ECML PKDD 2008}, keywords = {application, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways}, pubstate = {published}, tppubtype = {inproceedings} } | |