TUTCRIS - Tampere University of Technology


Preventing keystroke based identification in open data sets



Title: L@S 2017 - Proceedings of the 4th (2017) ACM Conference on Learning at Scale
ISBN (electronic): 9781450344500
DOI - permanent links
Status: Published - 12 April 2017
OKM publication type: A4 Article in a conference publication
Event: ACM Conference on Learning @ Scale
Duration: 1 January 2000 → …




Large-scale courses such as Massive Open Online Courses (MOOCs) can be a rich data source for researchers. Ideally, the data gathered in such courses would be openly available to all researchers, so that studies could be easily replicated and novel studies could be conducted on existing data. However, very fine-grained data such as source code snapshots can contain hidden identifiers. For example, distinct typing patterns that identify individuals can be extracted from such data. Hence, simply removing explicit identifiers such as names and student numbers is not sufficient to protect the privacy of the users who have supplied the data. At the same time, removing all keystroke information would significantly decrease the value of the shared data. In this work, we study how keystroke data from a programming context could be modified to prevent keystroke-latency-based identification while still retaining information that can be used, for example, to infer programming experience. We first investigate the degree of anonymization required to render identification of students based on their typing patterns unreliable. We then study whether the modified keystroke data can still be used to infer the students' programming experience, as a case study of whether the anonymized typing patterns retain at least some informative value. We show that it is possible to modify the data so that keystroke-latency-based identification is no longer accurate while the programming experience of the students can still be inferred, i.e. the data still has value to researchers. In a broader context, our results indicate that information and anonymity are not necessarily mutually exclusive.
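
To make the idea of modifying keystroke data more concrete, the following is a minimal sketch of one possible coarsening step: latencies between consecutive keystrokes are rounded down into fixed-width buckets, so that fine timing differences between individuals are blurred while rough typing speed remains visible. This is an illustration only and is not taken from the paper; the function names, the event format, and the 100 ms bucket width are all assumptions.

```python
from typing import List, Tuple

# Assumed event format (hypothetical): (key, timestamp in milliseconds).
Keystroke = Tuple[str, int]


def digraph_latencies(events: List[Keystroke]) -> List[Tuple[str, int]]:
    """Return (key pair, latency in ms) for each consecutive keystroke pair."""
    return [
        (a_key + b_key, b_time - a_time)
        for (a_key, a_time), (b_key, b_time) in zip(events, events[1:])
    ]


def coarsen(latencies: List[Tuple[str, int]], bucket_ms: int) -> List[Tuple[str, int]]:
    """Round each latency down to the start of its bucket.

    A wider bucket hides more of the individual timing detail, at the
    cost of discarding more information. The right width is an open
    parameter here, not a value reported by the source.
    """
    if bucket_ms <= 0:
        raise ValueError("bucket_ms must be positive")
    return [(pair, (latency // bucket_ms) * bucket_ms) for pair, latency in latencies]


# Hypothetical keystroke log for typing "int ".
events = [("i", 0), ("n", 95), ("t", 240), (" ", 610)]
raw = digraph_latencies(events)
coarse = coarsen(raw, bucket_ms=100)
print(raw)     # [('in', 95), ('nt', 145), ('t ', 370)]
print(coarse)  # [('in', 0), ('nt', 100), ('t ', 300)]
```

In this sketch the bucket width controls the trade-off described in the abstract: a very small width leaves the latencies nearly intact, while a very large width collapses them to almost no information.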