Tampere University of Technology

TUTCRIS Research Portal

Regularization in Machine Learning with Applications in Biology

Research output: Book/Report; Doctoral thesis; Collection of Articles

Details

Original language: English
Publisher: Tampere University
Number of pages: 81
Volume: 59
ISBN (Electronic): 978-952-03-1085-1
ISBN (Print): 978-952-03-1084-4
Publication status: Published - 17 May 2019
Publication type: G5 Doctoral dissertation (article)

Publication series

Name: Tampere University Dissertations
Volume: 59
ISSN (Print): 2489-9860
ISSN (Electronic): 2490-0028

Abstract

In recent years, data-intensive science has played an increasingly essential role in biological discovery and biomedical science. The explosion of information in biology poses challenges in organizing data, discovering relevant information in the data, extracting salient features, and providing a comprehensive understanding of the overall biological process. Traditional manual approaches are no longer feasible because biological data are heterogeneous, complex, and unstructured. Therefore, a framework of efficient, robust, automated, and fast data-intensive methods and pipelines is required to handle and explore these data.

The field of machine learning provides a comprehensive array of computational tools for analyzing such data. The aim of this thesis is to learn a simple model from such biological data, where the learned model gives an overview of the underlying data-generating process. This, in turn, should allow the extraction of salient features for predictive analysis of unobserved data. In this thesis, we approach the biological problem from the perspective of the dimensionality of the data. High-dimensional data refers to data in which the number of data points is larger than the number of features describing them, or vice versa, or both. High dimensionality can lead to overfitting, a phenomenon in which the model performs poorly in the predictive analysis of unobserved data.

To this end, regularization, a strategy that adds information to the learning problem or modifies the learning algorithm, is indispensable for increasing the generalization capability of the model. The results of this study indicate that regularized versions of simple linear models often outperform more sophisticated methods. Moreover, the implicit automated feature-selection capability of sparse regularized parameter estimation makes a significant contribution to the thesis. We also utilize a powerful ensemble tree-based method, random forest, which is effective for discovering nonlinear relationships among features as well as for providing feature rankings.
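The feature-selection effect of sparse regularization can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes an L1-penalized (lasso) least-squares objective solved by coordinate descent, and the toy design matrix, noise values, penalty `lam`, and all function names are hypothetical choices made for illustration only.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Toy data (assumed): four mutually orthogonal +/-1 features, eight samples.
X = [
    [(-1) ** (i & 1),
     (-1) ** ((i >> 1) & 1),
     (-1) ** ((i >> 2) & 1),
     (-1) ** ((i & 1) ^ ((i >> 1) & 1))]
    for i in range(8)
]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.05]
# Only the first two features carry signal; the other two are irrelevant.
y = [3.0 * row[0] - 2.0 * row[1] + e for row, e in zip(X, noise)]

w_hat = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w_hat) if wj != 0.0]
print("coefficients:", w_hat)
print("selected features:", selected)
```

In this toy setting the coefficients of the two irrelevant features are set exactly to zero, while the two informative coefficients are shrunk toward zero by roughly `lam`. This is the sense in which a sparse penalty performs feature selection as a by-product of parameter estimation.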

Another important aspect of the learning process considered in this thesis is model selection, i.e., the selection of one model from a hypothesized set of candidate models. The set of candidate models is constructed by setting different values for the models’ hyperparameters before starting the learning process. It is shown that an alternative Bayesian approach is computationally faster and has lower error rates than the traditional approaches to model selection, such as grid search and cross-validation. Moreover, we propose a closed-form expression for the area under the receiver operating characteristic curve, a performance metric, in the context of a linear classifier.
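For orientation, the traditional baseline mentioned above (a grid search over hyperparameter values, scored by cross-validation) can be sketched as follows. This shows only the baseline, not the Bayesian alternative studied in the thesis. The single-feature ridge model, the data, the grid, and the function names are all assumptions made for the example.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def k_fold_mse(xs, ys, lam, k=4):
    """Average held-out mean squared error over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        w = fit_ridge_1d([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx], lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx) / len(test_idx)
    return total / k


# Toy data (assumed): a noisy line with slope close to 2.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [-4.1, -2.9, -2.2, -0.8, 1.1, 1.9, 3.2, 3.8]

# Each grid value defines one candidate model, fixed before training starts.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: k_fold_mse(xs, ys, lam) for lam in grid}
best_lam = min(cv_errors, key=cv_errors.get)
print("cross-validated errors:", cv_errors)
print("selected hyperparameter:", best_lam)
```

The cost of this baseline grows with the grid size times the number of folds, since every candidate is refit on every training split. That cost is what faster model-selection approaches aim to avoid.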

In addition, we consider the unsupervised machine-learning paradigm, in which the ground truth of the biological data is not provided, as in microarray gene expression profiling. The results show that clustering methods can be used effectively to explore the data and discover similarities. Our results indicate that careful selection of machine-learning approaches can yield powerful yet simple computational models and analyses that provide new and useful insights into heterogeneous biological applications.
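As a small illustration of clustering unlabeled data, the sketch below runs k-means on toy two-dimensional profiles. The abstract does not say which clustering methods the thesis uses, so k-means is an assumption, as are the data, the initial centers, and the function names.

```python
def assign(points, centers):
    """Label each point with the index of its nearest center (squared Euclidean distance)."""
    return [
        min(range(len(centers)),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
        for point in points
    ]


def k_means(points, centers, n_iter=10):
    """Alternate assignment and mean-update steps for a fixed number of iterations."""
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        centers = new_centers
    # Recompute labels so they match the final centers.
    return assign(points, centers), centers


# Toy profiles (assumed): two well-separated groups, no labels provided.
profiles = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
            [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels, centers = k_means(profiles, centers=[profiles[0], profiles[3]])
print("cluster labels:", labels)
print("cluster centers:", centers)
```

With these well-separated groups the first three profiles fall into one cluster and the last three into the other. Real expression data are rarely this clean, and the result generally depends on the initial centers and the chosen number of clusters.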
