ISG Talks are sponsored by Couchbase.
Vishal Chakraborty: Much Ado About Data-Undo: Semantically Meaningful Data Erasure
December 1, 2023 @ 1:00 pm - 2:00 pm
Data regulations such as GDPR and CCPA are increasingly being adopted globally to protect against unsafe data management practices. Such regulations are often ambiguous (with multiple valid interpretations) when it comes to defining the expected dynamic behaviour of data processing systems. We will argue and show that it is possible to represent regulations such as GDPR formally as invariants using a small set of data processing concepts that capture system behaviour. When such concepts are grounded, i.e., provided with a single unambiguous interpretation, systems can achieve compliance by demonstrating that the system actions they implement maintain the invariants that represent the regulations. To illustrate our vision, we propose Data-CASE, a simple yet powerful model that comprises (a) key data processing concepts and (b) a set of invariants that describe regulations in terms of these concepts.
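As a rough illustration of treating a regulation as an invariant that every system action must preserve, the following Python sketch checks a single retention-style invariant after each action. It is not the Data-CASE model: the `Record`, `State`, `retention_invariant`, and `apply_checked` names, and the use of a logical clock with an `erase_by` deadline, are all assumptions made for this example.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class Record:
    value: str
    erase_by: int  # hypothetical deadline (logical time) for erasing the record


@dataclass
class State:
    now: int = 0  # hypothetical logical clock
    store: Dict[str, Record] = field(default_factory=dict)


def retention_invariant(state: State) -> bool:
    """Grounded interpretation (assumed): no stored record outlives its deadline."""
    return all(state.now <= rec.erase_by for rec in state.store.values())


def apply_checked(state: State,
                  action: Callable[[State], None],
                  invariant: Callable[[State], bool]) -> State:
    """Apply an action to a copy of the state; reject it if the invariant breaks."""
    candidate = copy.deepcopy(state)
    action(candidate)
    if not invariant(candidate):
        raise ValueError("action would violate the invariant")
    return candidate


# Illustrative system actions, each expressed as a function that mutates a State.
def insert(key: str, rec: Record) -> Callable[[State], None]:
    return lambda s: s.store.__setitem__(key, rec)


def erase(key: str) -> Callable[[State], None]:
    return lambda s: s.store.pop(key, None)


def tick(to: int) -> Callable[[State], None]:
    return lambda s: setattr(s, "now", to)
```

In this sketch, advancing the clock past a record's deadline is rejected unless the record has been erased first, which is one possible way to read "system actions maintain the invariants".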
Next, we use Data-CASE to study different interpretations of data erasure, a key component of almost all data regulations that exist today. We present a taxonomy of data erasure from the perspective of databases. Recent work has shown that in social media platforms and other applications where extensive data dependencies are present, data erasure is often implemented incorrectly or incompletely. Motivated by this, we formulate data erasure as a mechanism for preventing data leakage in databases by accounting for data dependencies such as logs, AI/ML models, materialized views, etc. We propose a SQL-like language to express such data dependencies, which serve as input to the data erasure mechanism. We show that the decision variant of our problem is NP-complete and present algorithms to optimize overheads such as the cost of data, time taken, and number of additional erasures. We evaluate our implementations in PostgreSQL by analysing the overheads (time, space, additional erasures, and computation) of offering semantically meaningful data erasure.
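To make the role of data dependencies concrete, the sketch below computes which derived artifacts an erasure request would also have to touch. It is only a plain reachability traversal over a hand-written dependency map, not the SQL-like language or the optimization algorithms from the talk; the `DEPENDENTS` map, the item names, and the `erasure_set` function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each item lists the items derived from it
# (for example a log entry, a materialized view, or a trained model).
DEPENDENTS: Dict[str, List[str]] = {
    "users.row42": ["audit_log.entry7", "mv_active_users", "model_v3"],
    "mv_active_users": ["dashboard_cache"],
}


def erasure_set(target: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Return the target plus everything transitively derived from it."""
    seen = {target}
    queue = deque([target])
    while queue:
        item = queue.popleft()
        for dep in dependents.get(item, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


affected = erasure_set("users.row42", DEPENDENTS)
```

Here `affected` contains the erased row together with the log entry, the materialized view, the model, and the cache built from the view. A real mechanism would additionally decide, per item, whether to erase, recompute, or retrain, which is where the overheads mentioned above arise.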