
I am a senior lecturer at the School of Computer Science of the University of Auckland and CTO of enviPath. I am a member of the Machine Learning Group at UoA. My main research area is machine learning and its application to bioinformatics, cheminformatics, computational sustainability, and privacy. My approach to research is to take interesting and challenging questions from other research areas and develop new machine learning methods that address them, potentially advancing not only the field of machine learning but also the area it is applied to. In my career, I have worked on diverse machine learning topics including autoencoders, Boolean matrix decomposition, inductive databases, multi-label classification, privacy-preserving data mining, adversarial learning, and time series analysis.
I am currently looking for PhD, Honours, or Masters students. If you are interested in any of my research areas, please contact me by email.
Recent Publications
Journal Articles
Stepišnik, Tomaž; Škrlj, Blaž; Wicker, Jörg; Kocev, Dragi: A comprehensive comparison of molecular feature representations for use in predictive modeling. Computers in Biology and Medicine, 130, 104197, 2021, ISSN 0010-4825. DOI: 10.1016/j.compbiomed.2020.104197.

Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The ones that stand out most are Spectrophores, which give significantly worse performance than other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert-based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.
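To give a concrete sense of what such a feature representation looks like in practice, the sketch below computes MACCS fingerprints (one of the best-performing simple representations in the comparison) and fits a random forest on them. It is a minimal illustration only: RDKit and scikit-learn are assumed, and the molecules, labels, and model choice are hypothetical placeholders rather than the paper's actual benchmark pipeline.

```python
# Minimal sketch (assumes RDKit and scikit-learn are installed); the molecules
# and property values are toy placeholders, not the paper's benchmark data.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]   # hypothetical molecules
y = np.array([0.2, 1.5, 0.7, 1.1])                     # hypothetical property values

# Turn each molecule into a 167-bit MACCS fingerprint vector.
mols = [Chem.MolFromSmiles(s) for s in smiles]
X = np.array([list(MACCSkeys.GenMACCSKeys(m)) for m in mols])

# Fit a simple regressor on the fingerprint features.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:2]))
```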
Roeslin, Samuel; Ma, Quincy; Juárez-Garcia, Hugon; Gómez-Bernal, Alonso; Wicker, Jörg; Wotherspoon, Liam: A machine learning damage prediction model for the 2017 Puebla-Morelos, Mexico, earthquake. Earthquake Spectra, 36(2), 314-339, 2020. DOI: 10.1177/8755293020936714.

The 2017 Puebla, Mexico, earthquake event led to significant damage in many buildings in Mexico City. In the months following the earthquake, civil engineering students conducted detailed building assessments throughout the city. They collected building damage information and structural characteristics for 340 buildings in the Mexico City urban area, with an emphasis on the Roma and Condesa neighborhoods where they assessed 237 buildings. These neighborhoods are of particular interest due to the availability of seismic records captured by nearby recording stations, and preexisting information from when the neighborhoods were affected by the 1985 Michoacán earthquake. This article presents a case study on developing a damage prediction model using machine learning. It details a framework suitable for working with future post-earthquake observation data. Four algorithms able to perform classification tasks were trialed. Random forest, the best performing algorithm, achieves more than 65% prediction accuracy. The study of the feature importance for the random forest shows that the building location, seismic demand, and building height are the parameters that influence the model output the most.
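As a rough illustration of the kind of workflow described in this abstract (not the authors' actual code or data), a random forest classifier can be trained on building attributes and its feature importances inspected with scikit-learn. The feature names and damage labels below are hypothetical stand-ins.

```python
# Hypothetical sketch of a damage-classification workflow with scikit-learn;
# the features and labels are synthetic stand-ins, not the survey data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["latitude", "longitude", "seismic_demand", "building_height", "year_built"]
X = rng.normal(size=(340, len(feature_names)))   # stand-in for 340 assessed buildings
y = rng.integers(0, 3, size=340)                 # stand-in damage classes (e.g. none/moderate/severe)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by importance, mirroring the kind of analysis reported in the paper.
for name, importance in sorted(zip(feature_names, clf.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")
```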
Conference Papers
Chester, Andrew; Koh, Yun Sing; Wicker, Jörg; Sun, Quan; Lee, Junjae: Balancing Utility and Fairness against Privacy in Medical Data. In: IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1226-1233, IEEE, 2020. DOI: 10.1109/SSCI47803.2020.9308226.

There are numerous challenges when designing algorithms that interact with sensitive data, such as medical or financial records. One of these challenges is privacy. However, there is a tension between privacy, utility (model accuracy), and fairness. While de-identification techniques such as generalisation and suppression have been proposed to enable privacy protection, they come at a cost, specifically to fairness and utility. Recent work on fairness in algorithm design defines fairness as a guarantee of similar outputs for "similar" input data. This notion is discussed in connection with de-identification. In contrast to other work, which investigates the trade-off between privacy and the utility of the data or the accuracy of the model overall, this research investigates the trade-off between privacy, fairness, and utility. We investigate the effects of two standard de-identification techniques, k-anonymity and differential privacy, on both utility and fairness, and propose two measures to calculate the privacy-utility and privacy-fairness trade-offs. Although other research has provided privacy guarantees with respect to utility, this research focuses on the trade-offs at set de-identification levels and relies on the guarantees provided by the privacy preservation methods. We discuss the effects of de-identification on data with different characteristics, class imbalance and outcome imbalance. We evaluated this on synthetic datasets and standard real-world datasets. As a case study, we analysed the Medical Expenditure Panel Survey dataset.
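To make the de-identification side of this trade-off concrete, the sketch below checks whether a table satisfies k-anonymity over a set of quasi-identifiers, i.e. whether every combination of quasi-identifier values is shared by at least k records. It is a simplified illustration using pandas with hypothetical column names, not the evaluation code used in the paper.

```python
# Simplified k-anonymity check with pandas; the columns are hypothetical examples
# of quasi-identifiers in a medical dataset, not a real schema.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

records = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "post_code": ["1010",  "1010",  "1021",  "1021",  "1021"],
    "diagnosis": ["A", "B", "A", "C", "B"],   # sensitive attribute, not a quasi-identifier
})

print(is_k_anonymous(records, ["age_band", "post_code"], k=2))  # True
```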
Dost, Katharina; Taskova, Katerina; Riddle, Pat; Wicker, Jörg: Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias. In: 2020 IEEE International Conference on Data Mining (ICDM), IEEE, 2020. Forthcoming.

Machine learning typically assumes that training and test set are independently drawn from the same distribution, but this assumption is often violated in practice, which creates a bias. Many attempts to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias? In contrast to prior work, this paper introduces a new method, Imitate, to identify and mitigate selection bias in the case that we may not know if (and where) a bias is present, and hence no ground-truth information is available. Imitate investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a selection bias where these areas are underrepresented in the sample. We demonstrate the effectiveness of the proposed method on both synthetic and real-world datasets. We also point out limitations and future research directions.
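The core intuition (comparing the observed data density with a fitted Gaussian and flagging regions where the data falls short) can be sketched in one dimension as below. This is a heavily simplified illustration of that idea, not the authors' Imitate implementation; the histogram-based density estimate, the threshold, and the truncated sample are all assumptions made for the example.

```python
# 1-D sketch of the intuition behind Imitate: fit a Gaussian to a biased sample,
# then look for regions where the observed density falls short of the Gaussian.
# This simplification is illustrative only, not the authors' implementation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A biased sample: values above 1 were systematically excluded.
full = rng.normal(loc=0.0, scale=1.0, size=5000)
biased = full[full < 1.0]

# Fit a Gaussian to the biased data and compare it with a histogram density estimate.
mu, sigma = biased.mean(), biased.std()
edges = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 41)
hist, _ = np.histogram(biased, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
gauss = stats.norm.pdf(centers, mu, sigma)

# Bins where the Gaussian clearly exceeds the data density hint at underrepresented regions.
deficit = np.clip(gauss - hist, 0.0, None)
suspicious = centers[deficit > 0.05]
print("possibly underrepresented around:", suspicious)
```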
Roeslin, Samuel; Ma, Quincy; Chigullapally, Pavan; Wicker, Jörg; Wotherspoon, Liam: Feature Engineering for a Seismic Loss Prediction Model using Machine Learning, Christchurch Experience. In: 17th World Conference on Earthquake Engineering, 2020. Forthcoming.

The city of Christchurch, New Zealand, experienced four major earthquakes (Mw > 5.9) and multiple aftershocks between 4 September 2010 and 23 December 2011. This series of earthquakes, commonly known as the Canterbury Earthquake Sequence (CES), induced over NZ$40 billion in total economic losses. Liquefaction alone led to building damage in 51,000 of the 140,000 residential buildings, with around 15,000 houses left impractical to repair. Widespread damage to residential buildings highlighted the need for improved seismic prediction tools and a better understanding of the factors influencing damage. Fortunately, due to New Zealand's unique insurance setting, up to 80% of the losses were insured. Over the entire CES, insurers received more than 650,000 claims. This research project employs multi-disciplinary empirical data gathered during and prior to the CES to develop a seismic loss prediction model for residential buildings in Christchurch using machine learning. The intent is to develop a procedure for deriving insights from post-earthquake data that is subject to continuous updating, to enable identification of critical parameters affecting losses, and to apply such a model to establish a priority building stock for risk mitigation measures. This paper describes the complex data preparation process required for the application of machine learning techniques. It covers the production of a merged dataset with information from the Earthquake Commission (EQC) claim database, building characteristics from RiskScape, seismic demand interpolated from GeoNet strong motion records, liquefaction occurrence from the New Zealand Geotechnical Database (NZGD), and soil conditions from Land Resource Information Systems (LRIS).
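The data preparation step described here is essentially a sequence of joins across heterogeneous sources. A hypothetical sketch of that kind of merge with pandas is shown below; the file names and key column are placeholders, not the actual EQC, RiskScape, GeoNet, NZGD, or LRIS schemas.

```python
# Hypothetical sketch of merging several sources into one modelling table with pandas;
# file names and the "property_id" key are placeholders, not the real schemas.
import pandas as pd

claims       = pd.read_csv("eqc_claims.csv")            # one row per claim
buildings    = pd.read_csv("riskscape_buildings.csv")   # building characteristics
shaking      = pd.read_csv("geonet_demand.csv")         # interpolated seismic demand
liquefaction = pd.read_csv("nzgd_liquefaction.csv")     # liquefaction observations
soils        = pd.read_csv("lris_soils.csv")            # soil conditions

merged = (
    claims
    .merge(buildings, on="property_id", how="left")
    .merge(shaking, on="property_id", how="left")
    .merge(liquefaction, on="property_id", how="left")
    .merge(soils, on="property_id", how="left")
)

# Drop rows missing the (hypothetical) target and save a tidy modelling table.
merged = merged.dropna(subset=["loss_ratio"])
merged.to_csv("ces_modelling_table.csv", index=False)
```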