Portfolio EDEND: 2014-09-15

PB - Oxford University Press
PY - 2015/1/1
Y1 - 2015/1/1
N2 - Motivation: MS2-GFP tagging of RNA is currently the only method to measure intervals between consecutive transcription events in live cells. For this, new transcripts must be accurately detected from intensity time traces. Results: We present a novel method for automatically estimating RNA numbers and production intervals from temporal data of cell fluorescence intensities that reduces uncertainty by exploiting temporal information. We also derive a robust variant that is more resistant to outliers caused, e.g., by RNAs moving out of focus. Using Monte Carlo simulations, we show that the quantification of RNA numbers and production intervals is generally improved compared with previous methods. Finally, we analyze data from live Escherichia coli and show statistically significant differences from previous methods. The new methods can be used to quantify numbers and production intervals of any fluorescent probes that are present in low copy numbers, are brighter than the cell background, and degrade slowly. Availability: Source code is available under the Mozilla Public License at http://www.cs.tut.fi/%7ehakkin22/jumpdet/. Contact:
UR - http://www.scopus.com/inward/record.url?scp=84922352843&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btu592
DO - 10.1093/bioinformatics/btu592
M3 - Article
VL - 31
SP - 69
EP - 75
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 1
ER -
TY - JOUR
T1 - Majorization-minimization for manifold embedding
AU - Yang, Zhirong
AU - Peltonen, Jaakko
AU - Kaski, Samuel
PY - 2015
Y1 - 2015
N2 - Nonlinear dimensionality reduction by manifold embedding has become a popular and powerful approach both for visualization and as preprocessing for predictive tasks, but more efficient optimization algorithms are still crucially needed. Majorization-Minimization (MM) is a promising approach that monotonically decreases the cost function, but it remains unknown how to tightly majorize the manifold embedding objective functions such that the resulting MM algorithms are efficient and robust. We propose a new MM procedure that yields fast MM algorithms for a wide variety of manifold embedding problems. In our majorization step, two parts of the cost function are respectively upper bounded by quadratic and Lipschitz surrogates, and the resulting upper bound can be minimized in closed form. For cost functions amenable to such QL-majorization, the MM yields monotonic improvement and is efficient: in experiments, the newly developed MM algorithms outperformed five state-of-the-art optimization approaches in manifold embedding tasks.
UR - http://www.scopus.com/inward/record.url?scp=84954311496&partnerID=8YFLogxK
M3 - Article
VL - 38
SP - 1088
EP - 1097
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
SN - 1532-4435
ER -
TY - JOUR
T1 - NetBioV
T2 - An R package for visualizing large network data in biology and medicine
AU - Tripathi, Shailesh
AU - Dehmer, Matthias
AU - Emmert-Streib, Frank
PY - 2014/4/2
Y1 - 2014/4/2
N2 - NetBioV (Network Biology Visualization) is an R package that allows the visualization of large network data in biology and medicine. The purpose of NetBioV is to enable an organized and reproducible visualization of networks by emphasizing or highlighting specific structural properties that are of biological relevance.
UR - http://www.scopus.com/inward/record.url?scp=84911403383&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btu384
DO - 10.1093/bioinformatics/btu384
M3 - Article
VL - 30
SP - 2834
EP - 2836
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 19
ER -
TY - JOUR
T1 - Gene Sets Net Correlations Analysis (GSNCA)
T2 - A multivariate differential coexpression test for gene sets
AU - Rahmatallah, Yasir
AU - Emmert-Streib, Frank
AU - Glazko, Galina
PY - 2014/2/1
Y1 - 2014/2/1
N2 - Motivation: To date, gene set analysis approaches primarily focus on identifying differentially expressed gene sets (pathways). Methods for identifying differentially coexpressed pathways also exist but are mostly based on aggregated pairwise correlations or other pairwise measures of coexpression. Instead, we propose Gene Sets Net Correlations Analysis (GSNCA), a multivariate differential coexpression test that accounts for the complete correlation structure between genes. Results: In GSNCA, weight factors are assigned to genes in proportion to the genes' cross-correlations (intergene correlations). The problem of finding the weight vectors is formulated as an eigenvector problem with a unique solution. GSNCA tests the null hypothesis that for a gene set there is no difference in the weight vectors of the genes between two conditions. In simulation studies and analyses of experimental data, we demonstrate that GSNCA captures changes in the structure of genes' cross-correlations rather than differences in the averaged pairwise correlations. Thus, GSNCA infers differences in coexpression networks while bypassing method-dependent steps of network inference.
As an additional result from GSNCA, we define hub genes as the genes with the largest weights and show that these genes frequently correspond to major and specific pathway regulators, as well as to genes that are most affected by the biological difference between two conditions. In summary, GSNCA is a new approach for the analysis of differentially coexpressed pathways that also evaluates the importance of the genes in the pathways, thus providing unique information that may result in the generation of novel biological hypotheses.
UR - http://www.scopus.com/inward/record.url?scp=84893275855&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btt687
DO - 10.1093/bioinformatics/btt687
M3 - Article
VL - 30
SP - 360
EP - 368
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 3
ER -
TY - JOUR
T1 - Structured orthogonal families of one and two strata prime basis factorial models
AU - Rodrigues, Paulo C.
AU - Moreira, Elsa E.
AU - Jesus, Vera M.
AU - Mexia, João T.
PY - 2014
Y1 - 2014
N2 - The models in structured families correspond to the treatments of a fixed effects base design π; the action of π on the fixed effects parameters of the models is studied. Analyzing such families enables the study of the action of nesting factors on the effects and interactions of nested factors. When π has an orthogonal structure, the family of models is said to be orthogonal. The models in the family can have one, two or more strata. Models with more than one stratum are obtained through nesting of one-stratum models. A general treatment of the case in which the base design has an orthogonal structure is presented, and special emphasis is given to families of prime basis factorial models. These last models are, as is well known, widely used in fertilization trials.
KW - Factorial designs
KW - Families of models
KW - Nested models
KW - Orthogonal models
KW - Two strata models
UR - http://www.scopus.com/inward/record.url?scp=84903887341&partnerID=8YFLogxK
U2 - 10.1007/s00362-013-0507-0
DO - 10.1007/s00362-013-0507-0
M3 - Article
VL - 55
SP - 603
EP - 614
JO - Statistical Papers
JF - Statistical Papers
SN - 0932-5026
IS - 3
ER -
TY - JOUR
T1 - Information retrieval perspective to meta-visualization
AU - Peltonen, Jaakko
AU - Lin, Ziyuan
PY - 2013
Y1 - 2013
N2 - In visual data exploration with scatter plots, no single plot is sufficient to analyze complicated high-dimensional data sets. Given numerous visualizations created with different features or methods, meta-visualization is needed to analyze the visualizations together. We address how to arrange numerous visualizations onto a meta-visualization display so that their similarities and differences can be analyzed. We introduce a machine learning approach to optimize the meta-visualization, based on an information retrieval perspective: two visualizations are similar if the analyst would retrieve similar neighborhoods between data samples from either visualization. Based on this approach, we introduce a nonlinear embedding method for meta-visualization: it optimizes the locations of visualizations on a display so that visualizations giving similar information about the data are close to each other.
KW - Meta-visualization
KW - Neighbor embedding
KW - Nonlinear dimensionality reduction
UR - http://www.scopus.com/inward/record.url?scp=84908485499&partnerID=8YFLogxK
M3 - Article
VL - 29
SP - 165
EP - 180
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
SN - 1532-4435
ER -
TY - JOUR
T1 - Gene set analysis for self-contained tests
T2 - Complex null and specific alternative hypotheses
AU - Rahmatallah, Y.
AU - Emmert-Streib, F.
AU - Glazko, G.
PY - 2012/12
Y1 - 2012/12
N2 - Motivation: The analysis of differentially expressed gene sets has become routine in the analysis of gene expression data. There is a multitude of tests available, ranging from aggregation tests that summarize gene-level statistics for a gene set to true multivariate tests that account for intergene correlations. Most of them detect complex departures from the null hypothesis, but when the null hypothesis is rejected, the specific alternative leading to the rejection is not easily identifiable.
Results: In this article we compare the power and Type I error rates of minimum-spanning-tree (MST)-based non-parametric multivariate tests with several multivariate and aggregation tests that are frequently used for pathway analyses. In our simulation study, we demonstrate that MST-based tests have power that is, for many settings, comparable with that of conventional approaches, but outperform them in specific regions of the parameter space corresponding to biologically relevant configurations. Further, we find for simulated and for gene expression data that MST-based tests discriminate well against shift and scale alternatives. As a general result, we suggest a two-step practical analysis strategy that may increase the interpretability of experimental data: first, apply the most powerful multivariate test to find the subset of pathways for which the null hypothesis is rejected, and second, apply MST-based tests to these pathways to select those that support specific alternative hypotheses.
UR - http://www.scopus.com/inward/record.url?scp=84870441671&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bts579
DO - 10.1093/bioinformatics/bts579
M3 - Article
VL - 28
SP - 3073
EP - 3080
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 23
ER -
TY - JOUR
T1 - A comparison between joint regression analysis and the AMMI model
T2 - A case study with barley
AU - Pereira, Dulce G.
AU - Rodrigues, Paulo C.
AU - Mejza, Stanislaw
AU - Mexia, João T.
PY - 2012/2
Y1 - 2012/2
N2 - Joint regression analysis (JRA) and additive main effects and multiplicative interaction (AMMI) models are compared in order to (i) assess their ability to describe genotype-by-environment interaction effects and (ii) evaluate the agreement between the winners of mega-environments obtained from the AMMI analysis and the genotypes in the upper contour of the JRA. An iterative algorithm is used to obtain the environmental indexes for JRA, and standard multiple comparison procedures are adapted for genotype comparison and selection. This study includes three data sets from a spring barley (Hordeum vulgare L.) breeding programme carried out between 2004 and 2006 in the Czech Republic.
The results from both techniques are integrated in order to advise plant breeders, farmers and agronomists on better genotype selection and prediction for new years and/or new environments.
KW - AMMI models
KW - joint regression analysis
KW - mega-environments
KW - multiple comparisons
KW - spring barley
KW - zigzag algorithm
UR - http://www.scopus.com/inward/record.url?scp=84856990001&partnerID=8YFLogxK
U2 - 10.1080/00949655.2011.615839
DO - 10.1080/00949655.2011.615839
M3 - Article
VL - 82
SP - 193
EP - 207
JO - Journal of Statistical Computation and Simulation
JF - Journal of Statistical Computation and Simulation
SN - 0094-9655
IS - 2
ER -
TY - JOUR
T1 - A stochastic model for survival of early prostate cancer with adjustments for leadtime, length bias, and over-detection
AU - Wu, Grace Hui Min
AU - Auvinen, Anssi
AU - Yen, Amy Ming Fang
AU - Hakama, Matti
AU - Walter, Stephen D.
AU - Chen, Hsiu Hsi
PY - 2012/1
Y1 - 2012/1
N2 - To compare survival between screen-detected and clinically detected cancers, we applied a series of non-homogeneous stochastic processes to deal with leadtime, length bias, and over-detection, using full information on detection modes obtained from the Finnish randomized controlled trial of prostate cancer screening. The results show that after 9-year follow-up the hazard ratio of prostate cancer death for screen-detected cases against clinically detected cases increased from 0.24 (95% CI: 0.16-0.35) without correction for these biases, to 0.76 after correction for leadtime and length biases, and finally to 1.03 (95% CI: 0.79-1.33) after a further adjustment for over-detection. Adjustment for leadtime and length bias but not over-detection led to a 24% reduction in prostate cancer death as a result of the prostate-specific antigen test. The further calibration of over-detection indicates no gain in survival for screen-detected prostate cancers (excluding over-detected cases, treated as stayers in the mover-stayer model) as compared with the control group in the absence of screening, which is considered as the mover. However, whether the model assumption on over-detection is robust should be validated with other data sets and longer follow-up.
KW - Leadtime and length bias
KW - Mass screening
KW - Prostate neoplasms
KW - Prostate-specific antigen
KW - Stochastic processes
UR - http://www.scopus.com/inward/record.url?scp=84862974650&partnerID=8YFLogxK
U2 - 10.1002/bimj.201000107
DO - 10.1002/bimj.201000107
M3 - Article
VL - 54
SP - 20
EP - 44
JO - Biometrical Journal
JF - Biometrical Journal
SN - 0323-3847
IS - 1
ER -
TY - JOUR
T1 - Clustering-based method for developing a genomic copy number alteration signature for predicting the metastatic potential of prostate cancer
AU - Pearlman, Alexander
AU - Campbell, Christopher
AU - Brooks, Eric
AU - Genshaft, Alex
AU - Shajahan, Shahin
AU - Ittman, Michael
AU - Bova, G. Steven
AU - Melamed, Jonathan
AU - Holcomb, Ilona
AU - Schneider, Robert J.
AU - Ostrer, Harry
PY - 2012
Y1 - 2012
N2 - The transition of cancer from a localized tumor to a distant metastasis is not well understood for prostate and many other cancers, partly because of the scarcity of tumor samples, especially metastases, from cancer patients with long-term clinical follow-up. To overcome this limitation, we developed a semi-supervised clustering method using tumor genomic DNA copy number alterations to classify each patient into inferred clinical outcome groups of metastatic potential. Our data set comprised 294 primary tumors and 49 metastases from 5 independent cohorts of prostate cancer patients.
The alterations were modeled based on Darwin's evolutionary selection theory, and the genes overlapping these altered genomic regions were used to develop a metastatic potential score for a prostate cancer primary tumor. The proteins encoded by some of the predictor genes promote escape from anoikis, an apoptosis pathway deregulated in metastases. We evaluated the metastatic potential score together with other clinical predictors available at diagnosis using a Cox proportional hazards model and show that our proposed score was the only significant predictor of metastasis-free survival. The metastasis gene signature and associated score could be applied directly to copy number alteration profiles from patient biopsies positive for prostate cancer.
UR - http://www.scopus.com/inward/record.url?scp=84864941918&partnerID=8YFLogxK
U2 - 10.1155/2012/873570
DO - 10.1155/2012/873570
M3 - Article
JO - Journal of Probability and Statistics
JF - Journal of Probability and Statistics
SN - 1687-952X
M1 - 873570
ER -
TY - JOUR
T1 - Reorientational versus Kerr dark and gray solitary waves using modulation theory
AU - Assanto, Gaetano
AU - Marchant, T. R.
AU - Minzoni, Antonmaria A.
AU - Smyth, Noel F.
PY - 2011/12/9
Y1 - 2011/12/9
N2 - We develop a modulation theory model based on a Lagrangian formulation to investigate the evolution of dark and gray optical spatial solitary waves for both the defocusing nonlinear Schrödinger (NLS) equation and the nematicon equations describing nonlinear beams, nematicons, in self-defocusing nematic liquid crystals. Since it has an exact soliton solution, the defocusing NLS equation is used as a test bed for the modulation theory applied to the nematicon equations, which have no exact solitary wave solution. We find that the evolution of dark and gray NLS solitons, as well as nematicons, is entirely driven by the emission of diffractive radiation, in contrast to the evolution of bright NLS solitons and bright nematicons. Moreover, the steady nematicon profile is nonmonotonic due to the long-range nonlocality associated with the perturbation of the optic axis. Excellent agreement is obtained with numerical solutions of both the defocusing NLS and nematicon equations. The comparisons for the nematicon solutions raise a number of subtle issues relating to the definition and measurement of the width of a dark or gray nematicon.
UR - http://www.scopus.com/inward/record.url?scp=84555189254&partnerID=8YFLogxK
U2 - 10.1103/PhysRevE.84.066602
DO - 10.1103/PhysRevE.84.066602
M3 - Article
VL - 84
JO - Physical Review E
JF - Physical Review E
SN - 1539-3755
IS - 6
M1 - 066602
ER -
TY - JOUR
T1 - BACOM
T2 - In silico detection of genomic deletion types and correction of normal cell contamination in copy number data
AU - Yu, Guoqiang
AU - Zhang, Bai
AU - Bova, G. Steven
AU - Xu, Jianfeng
AU - Shih, Ie Ming
AU - Wang, Yue
PY - 2011/6
Y1 - 2011/6
N2 - Motivation: Identification of somatic DNA copy number alterations (CNAs) and significant consensus events (SCEs) in cancer genomes is a main task in discovering potential cancer-driving genes such as oncogenes and tumor suppressors.
The recent development of SNP array technology has facilitated studies of copy number changes at a genome-wide scale with high resolution. However, existing copy number analysis methods are oblivious to normal cell contamination and cannot distinguish between the contributions of cancerous and normal cells to the measured copy number signals. This contamination could significantly confound downstream analysis of CNAs and affect the power to detect SCEs in clinical samples. Results: We report here a statistically principled in silico approach, Bayesian Analysis of COpy number Mixtures (BACOM), to accurately estimate genomic deletion type and normal tissue contamination, and accordingly recover the true copy number profile in cancer cells. We tested the proposed method on two simulated datasets, two prostate cancer datasets and The Cancer Genome Atlas high-grade ovarian dataset, and obtained very promising results supported by the ground truth and biological plausibility. Moreover, based on a large number of comparative simulation studies, the proposed method gives significantly improved power to detect SCEs after in silico correction of normal tissue contamination. We develop a cross-platform, open-source Java application that implements the whole pipeline of copy number analysis of heterogeneous cancer tissues, including relevant processing steps. We also provide an R interface, bacomR, for running BACOM within the R environment, making it straightforward to include in existing data pipelines.
UR - http://www.scopus.com/inward/record.url?scp=79957859881&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btr183
DO - 10.1093/bioinformatics/btr183
M3 - Article
VL - 27
SP - 1473
EP - 1480
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 11
M1 - btr183
ER -
TY - JOUR
T1 - Dynamics of coupled repressilators
T2 - The role of mRNA kinetics and transcription cooperativity
AU - Potapov, I.
AU - Volkov, E.
AU - Kuznetsov, A.
PY - 2011/3/4
Y1 - 2011/3/4
N2 - Oscillatory regulatory networks have been discovered in many cellular pathways.
An especially challenging area is the study of the dynamics of cellular oscillators interacting with one another in a population. Synchronization is only one outcome of such interaction, and the simplest. It has been suggested that the outcome depends on the structure of the network: phase-attractive (synchronizing) and phase-repulsive coupling structures have been distinguished for regulatory oscillators. In this paper, we question this separation. We study an example of two interacting repressilators (artificial regulatory oscillators based on cyclic repression). We show that changing the cooperativity of transcription repression (the Hill coefficient) and the reaction timescales dramatically alters the synchronization properties. The network becomes birhythmic: it chooses between in-phase and antiphase synchronization. Thus, the type of synchronization is not characteristic of the network structure. However, we conclude that the specific scenario of emergence and stabilization of synchronous solutions is much more characteristic. UR - http://www.scopus.com/inward/record.url?scp=79953153880&partnerID=8YFLogxK U2 - 10.1103/PhysRevE.83.031901 DO - 10.1103/PhysRevE.83.031901 M3 - Article VL - 83 JO - Physical Review E JF - Physical Review E SN - 1539-3755 IS - 3 M1 - 031901 ER - TY - JOUR T1 - Generative modeling for maximizing precision and recall in information visualization AU - Peltonen, Jaakko AU - Kaski, Samuel PY - 2011 Y1 - 2011 N2 - Information visualization has recently been formulated as an information retrieval problem, where the goal is to find similar data points based on the visualized nonlinear projection, and the visualization is optimized to maximize a compromise between (smoothed) precision and recall. We turn the visualization into a generative modeling task where a simple user model parameterized by the data coordinates is optimized, neighborhood relations are the observed data, and straightforward maximum likelihood estimation corresponds to Stochastic Neighbor Embedding (SNE). While SNE maximizes pure recall, adding a mixture component that "explains away" misses allows our generative model to focus on maximizing precision as well. The resulting model is a generative solution to maximizing tradeoffs between precision and recall. The model outperforms earlier models in terms of precision and recall and in external validation by unsupervised classification. UR - http://www.scopus.com/inward/record.url?scp=84862299625&partnerID=8YFLogxK M3 - Article VL - 15 SP - 579 EP - 587 JO - Journal of Machine Learning Research JF - Journal of Machine Learning Research SN - 1532-4435 ER - TY - JOUR T1 - Revealing differences in gene network inference algorithms on the network level by ensemble methods AU - Altay, Gökmen AU - Emmert-Streib, Frank PY - 2010/5/25 Y1 - 2010/5/25 N2 - Motivation: The inference of regulatory networks from large-scale expression data holds great promise because of the potentially causal interpretation of these networks. However, due to the difficulty of establishing reliable methods based on observational data, there is so far only incomplete knowledge about the possibilities and limitations of such inference methods in this context. Results: In this article, we conduct a statistical analysis investigating differences and similarities of four network inference algorithms, ARACNE, CLR, MRNET and RN, with respect to local network-based measures. We employ ensemble methods that allow us to assess the inferability down to the level of individual edges.
Our analysis reveals the bias of these inference methods with respect to the inference of various network components and hence provides guidance for the interpretation of regulatory networks inferred from expression data. Further, as an application, we predict the total number of regulatory interactions in human B cells and hypothesize about the role of Myc and its targets in molecular information processing.
UR - http://www.scopus.com/inward/record.url?scp=77954484005&partnerID=8YFLogxK U2 - 10.1093/bioinformatics/btq259 DO - 10.1093/bioinformatics/btq259 M3 - Article VL - 26 SP - 1738 EP - 1744 JO - Bioinformatics JF - Bioinformatics SN - 1367-4803 IS - 14 M1 - btq259 ER - TY - JOUR T1 - Exploratory analysis of spatiotemporal patterns of cellular automata by clustering compressibility AU - Emmert-Streib, Frank PY - 2010/2/8 Y1 - 2010/2/8 N2 - In this paper we study the classification of spatiotemporal patterns of one-dimensional cellular automata (CA), where the classification comprises CA rules together with their initial conditions. We propose an exploratory analysis method based on the normalized compression distance (NCD) of spatiotemporal patterns, which is used as the dissimilarity measure for a hierarchical clustering. Our approach differs in the following respects. First, the classification of spatiotemporal patterns is comparative, because the NCD explicitly evaluates the difference in compressibility between two objects, e.g., strings corresponding to spatiotemporal patterns. This is in contrast to all other measures applied so far in a similar context, which are essentially univariate. Second, Kolmogorov complexity, which underlies the NCD, is used in the classification of CA with respect to their spatiotemporal patterns. Third, our method is semiautomatic, allowing us to investigate hundreds or thousands of CA rules or initial conditions simultaneously to gain insight into their organizational structure. Our numerical results are not only plausible, confirming previous classification attempts, but also shed light on the intricate influence of random initial conditions on the classification results. UR - http://www.scopus.com/inward/record.url?scp=76749153776&partnerID=8YFLogxK U2 - 10.1103/PhysRevE.81.026103 DO - 10.1103/PhysRevE.81.026103 M3 - Article VL - 81 JO - Physical Review E JF - Physical Review E SN - 1539-3755 IS - 2 M1 - 026103 ER - TY - JOUR T1 - Unite and conquer T2 - Univariate and multivariate approaches for finding differentially expressed gene sets AU - Glazko, Galina V. AU - Emmert-Streib, Frank PY - 2009/9 Y1 - 2009/9 N2 - Motivation: Recently, many univariate and several multivariate approaches have been suggested for testing differential expression of gene sets between different phenotypes. However, despite a wealth of literature studying their performance on simulated and real biological data, there is still a need to quantify their relative performance when they test different null hypotheses.
Results: In this article, we compare the performance of univariate and multivariate tests on both simulated and biological data. In the simulation study we demonstrate that high correlations affect the power of univariate and multivariate tests equally. In addition, for most of them the power is similarly affected by the dimensionality of the gene set and by the percentage of genes in the set whose expression changes between the two phenotypes. The application of different test statistics to biological data reveals that three statistics (sum of squared t-tests, Hotelling's T2, N-statistic), testing different null hypotheses, find some common but also some complementary differentially expressed gene sets under specific settings. This demonstrates that, owing to their complementary null hypotheses, the tests project onto different aspects of the data, and for the analysis of biological data it is beneficial to use all three tests simultaneously instead of focusing exclusively on just one. U2 - 10.1093/bioinformatics/btp406 DO - 10.1093/bioinformatics/btp406 M3 - Article VL - 25 SP - 2348 EP - 2354 JO - Bioinformatics JF - Bioinformatics SN - 1367-4803 IS - 18 ER - TY - JOUR T1 - Fault tolerance of information processing in gene networks AU - Emmert-Streib, Frank AU - Dehmer, Matthias PY - 2009/2/15 Y1 - 2009/2/15 N2 - The major objective of this paper is to study the fault tolerance of gene networks. For single gene knockouts we investigate the disturbance of the communication abilities of gene networks globally. For our study we use directed scale-free networks that resemble important properties of gene networks, e.g., signaling or transcriptional regulatory networks, as well as metabolic networks, and we define a Markov chain on the network to model the communication dynamics. This allows us to evaluate the spread of information in the network and hence to detect differences in gene-to-gene communication due to single gene knockouts, asymptotically, via the limiting stationary distributions governed by the Markov chain. Further, we study the connection between the global effect of the perturbations and local properties of the network topology by means of statistical hypothesis tests. KW - Information processing KW - Markov chain KW - Robustness KW - Scale-free network UR - http://www.scopus.com/inward/record.url?scp=57349185507&partnerID=8YFLogxK U2 - 10.1016/j.physa.2008.10.032 DO - 10.1016/j.physa.2008.10.032 M3 - Article VL - 388 SP - 541 EP - 548 JO - Physica A: Statistical Mechanics and Its Applications JF - Physica A: Statistical Mechanics and Its Applications SN - 0378-4371 IS - 4 ER - TY - GEN T1 - Performance of Variable Partial Factor approach in a slope design AU - Knuuti, Mika AU - Länsivaara, Tim PY - 2019 Y1 - 2019 N2 - Most design codes have moved from the traditional total factor of safety method to the partial factor approach, aiming to cover the uncertainties better. The target has been to reach more consistent safety levels, but this has not always been achieved. This has raised more interest in reliability-based design (RBD) and its applications. In this paper, the performance of two partial factor approaches was compared from the reliability point of view: Eurocode 7 design approach 3 and the proposed Variable Partial Factor approach. The results show that the partial factor method with fixed partial factors cannot fully cover the uncertainties related to the design. The partial factors should depend on the level of uncertainty of the parameters.
The results also show that RBD can be applied in a designer-friendly way. In addition, some challenges in the determination of the characteristic values were pointed out. U2 - 10.22725/ICASP13.475 DO - 10.22725/ICASP13.475 M3 - Conference contribution BT - 13th International Conference on Applications of Statistics and Probability in Civil Engineering (ICASP13), Seoul, South Korea, May 26-30, 2019 ER -
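Editor's note: the cellular-automata entry above clusters space-time patterns by their normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length under a standard compressor. The following minimal sketch is not code from any of the papers listed; the toy patterns and function names are invented for illustration, and zlib stands in for whichever compressor the authors actually used.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    # C(x): length of x under a standard compressor (zlib at max level).
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy stand-ins for flattened CA space-time patterns: a periodic run,
# a copy of it, and an incompressible random run of the same length.
rng = random.Random(0)
periodic = b"01" * 400
periodic_copy = b"01" * 400
noisy = bytes(rng.choice(b"01") for _ in range(800))

# Structurally similar patterns get a smaller NCD than dissimilar ones,
# which is what makes the NCD usable as a hierarchical-clustering dissimilarity.
print(ncd(periodic, periodic_copy) < ncd(periodic, noisy))
```

Feeding the pairwise NCD matrix of many such runs into any off-the-shelf hierarchical clustering routine reproduces the kind of comparative, semiautomatic grouping the abstract describes.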