While AI Accelerates, the Data Owner is Left in the Dust

Written by John Ackerly | Oct 16, 2024 2:37:52 PM

Solutions like Claude and ChatGPT are vacuums for information, relentlessly sucking up data from a vast landscape. These tools are also advancing by leaps and bounds in very short intervals. This leaves us in a precarious position: We must figure out how to protect and prioritize data owners in relation to large language models (LLMs) before those owners are forgotten completely.

There have been several high-profile instances where data owners have been neglected as generative AI providers race to mature their models rather than consider the who, what, where and when of those models’ training data. LinkedIn is under fire for automatically including user data in AI model training without asking for permission. Users must opt out manually to turn the feature off. The social network has been buzzing with disgruntled posters ever since word got out about this clear violation of privacy.

It’s not just individuals’ private information that’s at stake: Organizations’ public-facing data also serves as training material. A few months ago, Condé Nast, the parent organization of major publications like The New Yorker and WIRED, issued a cease-and-desist letter to Perplexity, accusing the controversial AI search engine of plagiarism.

In May, actress Scarlett Johansson challenged Sam Altman and ChatGPT, claiming her likeness was used for an AI personal assistant without her consent.

When famous stars and prominent brands are struggling to maintain ownership of their data, you can imagine how daunting this battle can be for the average individual — whether they’re an artist, a writer, or a casual internet user who simply wants their own information to remain theirs. This is a fight that is only just beginning. Frankly, we’ve barely made it out of the first round, as this will undoubtedly be a constant clash in the rocketship AI advancement journey that we are all a part of. Unfortunately, it seems that clear-cut legal policies and agreements are not enough to keep ownership of data.

So what can be done? Can data governance exist in the age of artificial intelligence?

The answer is yes. The “how” requires a multi-pronged approach:

1. Data owners at all levels of prominence, individuals and organizations alike, need to raise their voices about this issue. If left unchecked, AI companies will continue to satiate the global appetite for rapid development and forgo the rights of those whose data is fueling their LLMs. There must be a constant call for transparency about language model data sources, whether you’re a Hollywood A-lister, a publishing conglomerate, or a private citizen who wishes to live in a secure world.

2. Beyond simply speaking up, the best path for wide-scale data governance is through micro-security at the data level. Imagine the ability to tag certain data as accessible to these LLMs, and to tag other data so that LLMs cannot access or ingest it. Applying this level of control to the data itself will ultimately allow data owners to actually own what is truly theirs.

Today’s public LLMs are opaque in terms of the sources they use to generate content. We need better data provenance and chain of custody tied to the data that fuels these models’ outputs — and we need a way to hold generative AI companies accountable when their products violate copyright or privacy.

To protect our data — whether sensitive, proprietary, or simply private — we need to start thinking about each data object, not the network or device that holds it. Just as there has been a massive paradigm shift from monolithic applications to microservices in cloud computing, we are now seeing a similar shift when it comes to data security: It’s much more flexible, manageable, and realistic to bring security controls closer to the data rather than focusing on a single fortified perimeter.

Because the truth is that data will inevitably move outside of those boundaries, and when it does, it needs demonstrable guardrails for who can access it, when, and for what purpose. Technology standards like the open Trusted Data Format can facilitate these guardrails via encryption, attribute-based access control (ABAC) and data auditing, arming data owners with tools that may shield them from the AI information-harvesting vortex.
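To make the idea of guardrails that travel with the data a little more concrete, here is a minimal sketch of an attribute-based access check with an audit trail. It is an illustration under stated assumptions, not the Trusted Data Format or any vendor's API: the class names, the purpose strings, and the "licensed-partner" attribute are all hypothetical, and a real implementation would also encrypt the data so the policy cannot simply be ignored.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TaggedData:
    """A data object that carries its own access policy (hypothetical model)."""
    name: str
    allowed_purposes: frozenset     # purposes the owner permits
    required_attributes: frozenset  # attributes a requester must hold


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    purpose: str
    attributes: frozenset


audit_log = []  # every decision is recorded, whether allowed or denied


def check_access(data: TaggedData, request: AccessRequest) -> bool:
    """Allow only if the purpose is permitted and all required attributes are present."""
    allowed = (
        request.purpose in data.allowed_purposes
        and data.required_attributes <= request.attributes
    )
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data.name,
        "requester": request.requester,
        "purpose": request.purpose,
        "allowed": allowed,
    })
    return allowed


# The owner permits human review by licensed partners, and nothing else.
portfolio = TaggedData(
    name="portfolio.pdf",
    allowed_purposes=frozenset({"human-review"}),
    required_attributes=frozenset({"licensed-partner"}),
)

crawler = AccessRequest("example-crawler", "llm-training", frozenset())
editor = AccessRequest("example-editor", "human-review", frozenset({"licensed-partner"}))

crawler_allowed = check_access(portfolio, crawler)  # denied: purpose not permitted
editor_allowed = check_access(portfolio, editor)    # allowed: purpose and attributes match
```

The point of the sketch is that the decision depends on who is asking and for what purpose, not on which network or device holds the file, and that the owner ends up with a record of every request, including the ones that were refused.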

Watermarking is another important tool for tracking digital provenance. By applying a digital watermark to sensitive data, you can monitor its distribution and use, which makes unauthorized usage easier to identify. Watermarks can also be used to identify AI-generated content like deepfakes, which may be created by leveraging unauthorized data.

3. It is the duty of security and privacy leaders to ramp up education about data on a mainstream scale. Unfortunately, through endless breaches and security incidents, we’ve become desensitized to our data being accessed without our knowledge or permission. Let’s be frank: We cannot expect the average individual to know all about the technology I listed in the previous point, let alone have the ability to implement it across their full digital footprint. The industry has long missed the opportunity to lend a helping hand to private citizens and enable them to control their data, wherever it may live. We must advocate for the data owner through education, knowledge-sharing, and building tools that put privacy at the center. We’re seeing an example of this with LinkedIn’s AI policy: Savvy users are posting to educate their networks and spread the word about the new data-gathering feature and how to turn it off. But the onus isn’t just on the user: It’s also on the organization to respect its customers by respecting their data.

We’re building the future every single day with how we conceptualize generative AI and data governance. This is a complex topic, but the path to empowering data owners can be a simple one. By continuing to raise red flags where needed, using the right tools and technology, and prioritizing education on the subject, we can continue to drive development of this transformative technology while respecting the rights and privacy of data owners everywhere.
