Regularization in Machine Learning with Applications in Biology
|Status||Published - 17 May 2019|
|Name||Tampere University Dissertations|
The field of machine learning provides a comprehensive array of computational tools for analyzing biological data. The aim of this thesis is to learn a simple model from such data, where the learned model gives an overview of the underlying data-generating process. This, in turn, should allow the extraction of salient features for predictive analysis of unobserved data. In this thesis, we approach the biological problem from the perspective of the dimension of the data. High-dimensional data refers to settings in which the number of data points is much larger than the number of features describing them, or vice versa, or in which both are large. High dimensionality carries the risk of overfitting, a phenomenon in which the model fits the observed data well but performs poorly in predictive analysis of unobserved data.
To this end, regularization, a strategy that adds information to the learning problem or modifies the learning algorithm, is indispensable for increasing the generalization capability of the model. The results of this study indicate that regularized versions of simple linear models often outperform more sophisticated methods. Moreover, the implicit, automated feature selection provided by sparse regularized parameter estimation makes a significant contribution to the thesis. We also use a powerful tree-based ensemble method, random forest, which is effective for discovering nonlinear relationships among features and for ranking features.
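As an illustration of how sparse regularization performs implicit feature selection, the following minimal sketch fits an L1-regularized (lasso) linear model by cyclic coordinate descent. It is not the thesis's implementation: the function names, the toy data, and the penalty value `lam=0.5` are all assumptions chosen for illustration. The setting has more features (50) than data points (30), and only the first two features influence the response.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values in [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Hypothetical toy data: 30 samples, 50 features, only features 0 and 1 matter.
rng = random.Random(0)
n, p = 30, 50
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 3.0 * row[1] + 0.1 * rng.gauss(0, 1) for row in X]

w = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("features with nonzero weight:", selected)
```

The soft-thresholding step sets small coefficients to exactly zero, so the indices in `selected` act as an automatically chosen feature subset. A larger `lam` gives a sparser model at the cost of more shrinkage in the retained coefficients.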
Another important aspect of the learning process considered in this thesis is model selection, i.e., the selection of one model from a hypothesized set of candidate models. The candidate set is constructed by setting different values for the models’ hyperparameters before the learning process begins. We show that an alternative Bayesian approach is computationally faster and has lower error rates than traditional approaches to model selection, such as grid search and cross-validation. Moreover, we propose a closed-form expression for the area under the receiver operating characteristic curve, a performance metric, in the context of a linear classifier.
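For readers unfamiliar with the metric, the sketch below computes the standard empirical (rank-based) estimate of the area under the ROC curve from the scores of a linear classifier. This is not the closed-form expression proposed in the thesis, which the abstract does not state; the weights, data, and function names are hypothetical.

```python
def linear_scores(X, w, b=0.0):
    """Decision scores w . x + b of a linear classifier for each row of X."""
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in X]


def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: five samples, two features, fixed weights.
X_toy = [[1.0, 2.0], [2.0, 0.5], [0.0, 1.0], [3.0, 3.0], [0.5, 0.0]]
labels = [1, 1, 0, 1, 0]
w_toy = [1.0, -0.5]

scores = linear_scores(X_toy, w_toy)
auc = empirical_auc(scores, labels)
print("empirical AUC:", auc)
```

In this toy example, five of the six positive-negative pairs are ranked correctly, so the estimate is 5/6. The pairwise loop is quadratic in the sample size, and sorting-based variants are preferable for large data sets.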
In addition, we consider the unsupervised machine-learning paradigm, in which no ground truth is provided for the biological data, as in microarray gene expression profiling. The results show that clustering methods can be used effectively to explore the data and discover similarities. Our results indicate that careful selection of machine-learning approaches can yield powerful yet simple computational models and analyses that provide new and useful insights into heterogeneous biological applications.