Tampere University of Technology

TUTCRIS Research Portal

Preventing keystroke based identification in open data sets

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Details

Original language: English
Title of host publication: L@S 2017 - Proceedings of the 4th (2017) ACM Conference on Learning at Scale
Publisher: ACM
Pages: 101-109
Number of pages: 9
ISBN (Electronic): 9781450344500
Publication status: Published - 12 Apr 2017
Publication type: A4 Article in a conference publication
Event: ACM Conference on Learning @ Scale
Duration: 1 Jan 2000 → …

Conference

Conference: ACM Conference on Learning @ Scale
Period: 1/01/00 → …

Abstract

Large-scale courses such as Massive Open Online Courses (MOOCs) can be a rich data source for researchers. Ideally, the data gathered in such courses would be openly available to all researchers, so that studies could be easily replicated and novel studies conducted on existing data. However, very fine-grained data such as source code snapshots can contain hidden identifiers. For example, distinct typing patterns that identify individuals can be extracted from such data. Hence, simply removing explicit identifiers such as names and student numbers is not sufficient to protect the privacy of the users who supplied the data. At the same time, removing all keystroke information would significantly decrease the value of the shared data. In this work, we study how keystroke data from a programming context could be modified to prevent keystroke-latency-based identification while still retaining information that can be used, for example, to infer programming experience. We first investigate the degree of anonymization required to render identification of students based on their typing patterns unreliable. We then study whether the modified keystroke data can still be used to infer the students' programming experience, as a case study of whether the anonymized typing patterns retain at least some informative value. We show that it is possible to modify the data so that keystroke-latency-based identification is no longer accurate while the programming experience of the students can still be inferred; that is, the data still has value to researchers. In a broader context, our results indicate that information and anonymity are not necessarily mutually exclusive.
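
The abstract does not state which transformation is applied to the keystroke data. As a rough illustration only, the sketch below assumes one plausible approach: computing the latency between consecutive key presses and rounding it down into coarse buckets, so that fine timing detail is lost while the general typing pace remains. The event format, the function names and the 100 ms bucket size are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical event format (an assumption): (key, press time in milliseconds).
Keystroke = Tuple[str, int]
Latency = Tuple[str, int]


def digraph_latencies(events: List[Keystroke]) -> List[Latency]:
    """Return (digraph, latency_ms) for each pair of consecutive key presses."""
    return [
        (first_key + second_key, second_time - first_time)
        for (first_key, first_time), (second_key, second_time)
        in zip(events, events[1:])
    ]


def coarsen(latencies: List[Latency], bucket_ms: int) -> List[Latency]:
    """Round each latency down to the start of its bucket.

    This is an assumed anonymization step, not the paper's method:
    larger buckets discard more of the timing detail.
    """
    if bucket_ms <= 0:
        raise ValueError("bucket_ms must be positive")
    return [
        (digraph, (latency // bucket_ms) * bucket_ms)
        for digraph, latency in latencies
    ]


# Made-up example: a student typing "int ".
events = [("i", 0), ("n", 95), ("t", 240), (" ", 310)]
raw = digraph_latencies(events)
anonymized = coarsen(raw, bucket_ms=100)

print(raw)
print(anonymized)
```

The bucket size is the tuning knob in this sketch: it would have to be chosen empirically, large enough that identification becomes unreliable and small enough that the latencies still separate novice from experienced programmers.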

Keywords

  • Data anonymization, Data privacy, Keystroke dynamics, Programming experience inference, Source code snapshots
