Tampere University of Technology

TUTCRIS Research Portal

Defining Data Science by a Data-Driven Quantification of the Community

Research output: Contribution to journal › Article › Scientific › peer-review


Original language: English
Pages (from-to): 235-251
Journal: Machine Learning and Knowledge Extraction
Publication status: Published - 19 Dec 2018
Publication type: A1 Journal article-refereed


Data science is a new academic field that has received much attention in recent years. One reason for this is that our increasingly digitalized society generates more and more data in all areas of life and science, and we are urgently seeking ways to deal with this problem. In this paper, we investigate the academic roots of data science. We use Google Scholar data on scientists who have an interest in data science, together with their citations, to perform a quantitative analysis of the data science community. Furthermore, to decompose the data science community into its major defining factors, which correspond to the most important research fields, we introduce a statistical regression model that is fully automatic and robust with respect to subsampling of the data. This statistical model allows us to define the ‘importance’ of a field as its predictive ability. Overall, our method provides an objective answer to the question ‘What is data science?’.
