Unsupervised Topic Extraction from Privacy Policies

Companion Proceedings of the 2019 World Wide Web Conference (WWW’19 Companion), May 13–17, 2019, San Francisco, CA, USA

Bar Ilan University Faculty of Law Research Paper No. 19-10

6 Pages. Posted: 23 May 2019. Last revised: 12 Aug 2019.


David Sarne

Bar-Ilan University - Department of Computer Science

Jonathan Schler

Bar-Ilan University

Alon Singer

H-F & Co. Law Offices

Ayelet Sela

Bar Ilan University Faculty of Law

Ittai Bar-Siman-Tov

Bar-Ilan University Law Faculty

Date Written: April 23, 2019

Abstract

This paper proposes automatic topic modeling of large-scale corpora of privacy policies using unsupervised learning techniques.

Unsupervised learning offers several advantages for this task. Chief among them are the ability to analyze any new corpus with a fraction of the effort required by supervised learning, the ability to study how topics of interest change over time, and the ability to identify finer-grained topics of interest in privacy policies. Based on general principles of document analysis, we synthesize a cohesive framework for privacy policy topic modeling and apply it to a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even this moderate-size corpus yields fairly comprehensive insights into the focus and scope of current privacy policy documents. The extracted topics, their structure, and the applicability of the unsupervised approach are validated through an extensive comparison with findings reported in prior work that uses supervised learning (which depends heavily on manual annotation by experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also reveals some new topics of interest.
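As a rough illustration of what unsupervised topic extraction involves, the sketch below fits a small Latent Dirichlet Allocation (LDA) model with collapsed Gibbs sampling on a handful of policy-like sentences. The abstract does not name the algorithm used in the paper, so LDA is an assumption chosen here as a common stand-in. The toy sentences in `DOCS`, the helper names (`tokenize`, `fit_lda`, `top_words`), and all parameter values are hypothetical and are not taken from the paper or its corpus.

```python
import random
from collections import Counter

# Hypothetical toy sentences; NOT drawn from the paper's corpus.
DOCS = [
    "We collect location data and device identifiers",
    "Location data and device information are collected automatically",
    "We share personal information with third party advertisers",
    "Third party advertisers may receive personal information",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens longer than two characters."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in words if w.isalpha() and len(w) > 2]


def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (assumed stand-in, not the paper's method).

    Returns a list with one Counter of word counts per topic.
    """
    rng = random.Random(seed)
    tokens = [tokenize(d) for d in docs]
    vocab_size = len({w for doc in tokens for w in doc})
    doc_topic = [[0] * num_topics for _ in tokens]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(tokens):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample a topic in proportion to the conditional posterior.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=3):
    """Return up to n highest-count words for each topic."""
    return [
        [w for w, c in counts.most_common() if c > 0][:n]
        for counts in topic_word
    ]


if __name__ == "__main__":
    for k, words in enumerate(top_words(fit_lda(DOCS))):
        print(f"topic {k}: {', '.join(words)}")
```

No labels are supplied at any point: the topics emerge only from word co-occurrence, which is what allows the same procedure to be rerun on a new corpus without manual annotation. On a corpus of realistic size, a maintained library implementation would normally be preferable to this hand-written sampler.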

Keywords: Topic Modeling, Machine Learning, Unsupervised Learning, Artificial Intelligence, Privacy Policies, Data Privacy, Information Privacy, Mobile Applications, Google Play

JEL Classification: C00, C02, C18, C19, C38, C60, C62, C69, C80, C88

Suggested Citation

Sarne, David and Schler, Jonathan and Singer, Alon and Sela, Ayelet and Bar-Siman-Tov, Ittai, Unsupervised Topic Extraction from Privacy Policies (April 23, 2019). Companion Proceedings of the 2019 World Wide Web Conference (WWW’19 Companion), May 13–17, 2019, San Francisco, CA, USA, Bar Ilan University Faculty of Law Research Paper No. 19-10, Available at SSRN: https://ssrn.com/abstract=3376558

David Sarne

Bar-Ilan University - Department of Computer Science

Israel

Jonathan Schler

Bar-Ilan University

Ramat Gan, 52900
Israel

Alon Singer

H-F & Co. Law Offices

Israel

Ayelet Sela

Bar Ilan University Faculty of Law

Faculty of Law
Ramat Gan, 52900
Israel

Ittai Bar-Siman-Tov (Contact Author)

Bar-Ilan University Law Faculty

Faculty of Law
Ramat Gan, 52900
Israel
972-3-7387071 (Phone)
972-3-7384096 (Fax)

HOME PAGE: http://law.biu.ac.il/en/node/1726#tabs-tabset-1

