A platform engineer's dirty secret: deleting users is hard

There's rightly been a lot of discussion over the last few weeks about GDPR and its companion, the ePrivacy Directive. Internally, tech companies are scrambling: the architecture changes needed to comply affect every single user. Although companies might offer the related user-facing features only to people in the EU, the underlying data layers don't meaningfully differentiate between users from different countries, so the changes need to apply to everyone.

This is great news for proponents of individual privacy here in the US. I definitely count myself in that number.

One requirement making waves under the hood is that users must be able to have their data completely deleted from a service. This isn't as easy as it might sound: a user's personal information typically isn't stored in one spot in a database, and isn't cleanly separated from other users' information. Finding it and then ensuring it is removed without harming anyone else's experience is non-trivial in large systems, so perhaps understandably, most developers simply deactivate a user instead, leaving their data trail largely intact (but publicly inaccessible). That's not enough under this legislation.
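To make the difference concrete, here is a minimal sketch in Python using an in-memory SQLite database. The schema (users, posts, comments) and the function names are hypothetical, invented for illustration; a real service will have many more tables, plus logs, backups, and search indexes to think about. It also makes one policy choice that is far from the only option: comments left by other people on an erased user's posts are removed along with those posts.

```python
import sqlite3


def make_db():
    """Build a tiny, made-up schema with two users who interact."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT,
                            active INTEGER DEFAULT 1);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                            body TEXT);
        CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER,
                               author_id INTEGER, body TEXT);
    """)
    db.executemany("INSERT INTO users (id, email) VALUES (?, ?)",
                   [(1, "alice@example.com"), (2, "bob@example.com")])
    db.executemany("INSERT INTO posts (id, author_id, body) VALUES (?, ?, ?)",
                   [(1, 2, "Bob's post"), (2, 1, "Alice's post")])
    db.executemany(
        "INSERT INTO comments (id, post_id, author_id, body) VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "Alice commenting on Bob's post"),
         (2, 2, 2, "Bob commenting on Alice's post")])
    db.commit()
    return db


def deactivate_user(db, user_id):
    """The cheap path: flip a flag and leave every row in place."""
    with db:
        db.execute("UPDATE users SET active = 0 WHERE id = ?", (user_id,))


def erase_user(db, user_id):
    """The harder path: remove the user's rows from every table that holds them."""
    with db:  # one transaction, so a failure doesn't leave a half-deleted user
        # The user's own comments, plus comments hanging off the user's posts
        # (a policy choice in this sketch; tombstoning is another option).
        db.execute(
            "DELETE FROM comments WHERE author_id = ? "
            "OR post_id IN (SELECT id FROM posts WHERE author_id = ?)",
            (user_id, user_id))
        db.execute("DELETE FROM posts WHERE author_id = ?", (user_id,))
        db.execute("DELETE FROM users WHERE id = ?", (user_id,))


def count_personal_rows(db, user_id):
    """Count rows still attributed to the user across all three tables."""
    queries = [
        "SELECT COUNT(*) FROM users WHERE id = ?",
        "SELECT COUNT(*) FROM posts WHERE author_id = ?",
        "SELECT COUNT(*) FROM comments WHERE author_id = ?",
    ]
    return sum(db.execute(q, (user_id,)).fetchone()[0] for q in queries)
```

After deactivate_user, every row attributed to the user is still sitting in the database; only after erase_user does count_personal_rows drop to zero. Even in this toy version, the deletion has to know about every table that mentions the user, which is exactly what gets hard as a system grows.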

The same may apply to files. Some years ago, researchers discovered that photos deleted from Facebook were lingering on the company's servers. It can be easier and cheaper just to remove access to a file than to physically remove it from disk. Content Delivery Networks also pose a problem: these are widely employed to optimize download speeds for content like photos and videos. This involves making copies of those files at "edge" locations that are geographically close to users around the world, so if you're accessing a photo from Australia, you'll probably download it from an Australian node on the CDN. Sometimes, those copies linger long after the original files are deleted.
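A toy model shows why edge copies linger. This is only a sketch under simple assumptions: the Origin and EdgeNode classes, the time-to-live caching policy, and the delete_everywhere helper are all hypothetical, and time is passed in explicitly as a number of seconds so the behavior is easy to follow. Real CDNs each have their own purge or invalidation mechanisms, which this does not attempt to represent.

```python
class Origin:
    """Stands in for the service's own file storage."""

    def __init__(self):
        self.files = {}

    def put(self, path, data):
        self.files[path] = data

    def delete(self, path):
        self.files.pop(path, None)


class EdgeNode:
    """Stands in for one CDN edge location with a time-to-live cache."""

    def __init__(self, origin, ttl_seconds):
        self.origin = origin
        self.ttl_seconds = ttl_seconds
        self.cache = {}  # path -> (data, expires_at)

    def get(self, path, now):
        entry = self.cache.get(path)
        if entry is not None and now < entry[1]:
            return entry[0]  # served from the edge; origin is never consulted
        # Missing or expired: drop any stale copy and ask the origin again.
        self.cache.pop(path, None)
        data = self.origin.files.get(path)
        if data is not None:
            self.cache[path] = (data, now + self.ttl_seconds)
        return data

    def purge(self, path):
        self.cache.pop(path, None)


def delete_everywhere(origin, edges, path):
    """Delete at the origin AND explicitly purge every edge copy."""
    origin.delete(path)
    for edge in edges:
        edge.purge(path)
```

With a one-hour time-to-live, a photo fetched through an edge node and then deleted only at the origin is still served from that node until the hour runs out. Calling delete_everywhere instead removes it immediately, at the cost of having to know about, and reach, every edge that might hold a copy.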

Engineers are incentivized to provide fast, reliable implementations of required features and move on to the next thing. Storage is incredibly cheap, while processing time is less so. That means, in general, that they're likely to take the cheap, easy path and simply deactivate access to content rather than removing it. That's fine from a user experience perspective, but not from a user privacy and data rights perspective. GDPR, ePrivacy, and related legislation provide a much-needed stick to make content deletion do what the user expects it to do.

This sort of transparency of action is vital if we're going to have any sort of privacy online: if a user deletes content, they reasonably have the expectation that the content will really be deleted. If access is restricted to a few people, the user reasonably has the expectation that only those people can access it. Anything else is a breach of trust, no matter which terms may be hidden in the depths of the privacy policy. And if legislation is needed to bring about this transparency, then so be it.