TUTCRIS - Tampere University of Technology


Regularization in Machine Learning with Applications in Biology



Publisher: Tampere University
ISBN (electronic): 978-952-03-1085-1
ISBN (print): 978-952-03-1084-4
Status: Published - 17 May 2019
OKM publication type: G5 Doctoral dissertation (article-based)


Name: Tampere University Dissertations
ISSN (print): 2489-9860
ISSN (electronic): 2490-0028


Over recent years, data-intensive science has been playing an increasingly essential role in biological discovery and biomedical science. The explosion of information in biology poses challenges in organizing data, discovering relevant information from the data, extracting salient features, and providing a comprehensive understanding of the overall biological process. Traditional manual approaches are no longer feasible due to the heterogeneous, complex, and unstructured nature of biological data. Therefore, a framework of efficient, robust, automated, and fast data-intensive methods and pipelines is required to handle and explore these data.

The field of machine learning provides a comprehensive array of computational tools for analyzing such data. The aim of this thesis is to learn a simple model from such biological data, where the learned model gives an overview of the underlying data-generating process. This, in turn, should allow the extraction of salient features for predictive analysis of unobserved data. In this thesis, we approach the biological problems from the perspective of the dimensionality of the data. High-dimensional data refers to the situation where the number of data points is larger than the number of features describing them, or vice versa, or where both are large. High dimensionality increases the risk of overfitting, a phenomenon in which the model fits the observed data well but performs poorly in predictive analysis of unobserved data.

To this end, regularization, a strategy that adds additional information or modifies the learning algorithm, is indispensable for increasing the generalization capability of the model. The results of this study indicate that a regularized version of simple linear models often outperforms more sophisticated methods. Moreover, the implicit automated feature-selection capabilities of sparse regularized parameter estimation have made a significant contribution to the thesis. We also utilize a powerful ensemble tree-based method, random forest, which is effective for discovering nonlinear relationships among features as well as for providing feature rankings.
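As a rough illustration of how sparse regularization can perform implicit feature selection, the following minimal sketch fits an L1-penalized (lasso) linear model by cyclic coordinate descent. It is not taken from the thesis: the function names, the toy data, and the penalty value are assumptions chosen for illustration, and the objective used is the common form (1/(2n)) * ||y - Xw||^2 + penalty * ||w||_1.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; return 0.0 if it lies within [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + penalty * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            rho = 0.0      # correlation of feature j with the residual excluding feature j
            norm_sq = 0.0  # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm_sq += X[i][j] ** 2
            if norm_sq > 0:
                w[j] = soft_threshold(rho / n, penalty) / (norm_sq / n)
            else:
                w[j] = 0.0
    return w


# Hypothetical toy data: the response depends (noisily) on the first feature only.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.4], [5.0, 0.1]]
y = [2.1, 3.9, 6.1, 7.8, 10.1]

weights = lasso_coordinate_descent(X, y, penalty=0.5)
print(weights)
```

On this toy data the weight of the second, uninformative feature is set exactly to zero, while the weight of the first feature is shrunk slightly below its least-squares value. This is the sense in which an L1 penalty selects features as a by-product of estimation.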

Another important aspect of the learning process considered in this thesis is model selection, i.e., the selection of one model from a hypothesized set of possible models. The set of candidate models is constructed by setting different values for the models' hyperparameters before starting the learning process. It is shown that an alternative Bayesian approach is computationally faster and has lower error rates than traditional approaches to model selection, such as grid search and cross-validation. Moreover, we propose a closed-form expression for the area under the receiver operating characteristic curve, a performance metric, in the context of a linear classifier.
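For readers unfamiliar with the metric, the sketch below computes the standard empirical estimate of the area under the ROC curve (the Mann-Whitney statistic) from the scores of a linear classifier. This is not the closed-form expression proposed in the thesis, which the abstract does not state; the weights, bias, samples, and function names are hypothetical values chosen for illustration.

```python
def linear_score(weights, bias, features):
    """Decision value of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as one half."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical linear classifier and labeled samples (label 1 = positive class).
weights, bias = [1.0, -0.5], 0.0
samples = [[0.2, 0.2], [0.5, 0.2], [0.4, 0.1], [0.9, 0.2]]
labels = [0, 0, 1, 1]

scores = [linear_score(weights, bias, x) for x in samples]
auc = empirical_auc(scores, labels)
print(auc)  # 0.75: three of the four positive-negative pairs are ranked correctly
```

The pairwise form costs time proportional to the number of positive-negative pairs, which is fine for small data sets; a rank-based computation would be the usual choice for larger ones.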

In addition, we consider the unsupervised machine-learning paradigm in which the ground truth of biological data is not provided, such as for microarray gene expression profiling. The results show that clustering methods can be used effectively to explore the data and discover similarities. Our results indicate that careful selection of the machine-learning approaches can create powerful, yet simple computational modeling and analysis that can provide new and useful insights into heterogeneous biological applications.
