AI vs Everything Else: Intellectual Property, Data Privacy, Proprietary Knowledge, and Synthetic Data

Amartya Mandal
5 min read · May 24, 2023

Over the past few months, I’ve been tumbling down the rabbit hole of GenAI. Trust me when I say, it’s not your usual tech tussle. It’s like a 24/7 philosophy class that never ends. I mean, sure, I’ve seen and lived through the PC era and the iPhone moment, but this GenAI thing? It’s a whole different beast.

When you grapple with most tech, you master the ins and outs, contribute something cool, and then march on.

But with GenAI, you’ve got to go back to school and take a deep dive into the nature of human intelligence before you can even question the ethics of how these models learn.

Today though, we’re not going into all that. I’ve been trying to wrap my head around a very specific problem that revolves around data privacy, security, intellectual property rights, copyright, and patent issues related to GenAI.

I even had a sit-down with ChatGPT’s Plus web-browsing model (how meta is that?).

Let me paint you a picture

Imagine Company “A”, the big kid on the block in the healthcare sector. They’ve got a ton of domain knowledge and a bunch of patented software solutions. Now, they want to supercharge their products by integrating an AI foundation model (or some sort of fine-tuned version) into their systems for a killer user experience. But here’s the rub: these products are also being used by other organizations, like Company “B” and Company “C”, and they’re dealing with sensitive patient data.

As Company “A” integrates an AI foundation model into its systems, there’s a massive chance that the model could soak up knowledge from the proprietary information it comes across. Even if that data is only retained for a certain period, the model could, hypothetically, learn enough to rival Company “A” in its area of expertise. And this doesn’t even touch on the issues that could crop up when patient data comes into the picture.

What do they say?

Honestly, I’ve been trying to find answers everywhere. I’ve scoured articles from Google and Microsoft, sifted through OpenAI’s press releases, and yet clarity eludes me. For example (and I’m not cherry-picking), OpenAI says it retains data sent via the API for 30 days and doesn’t use it to improve the models. But how exactly does it implement this principle? If the API data isn’t included in the training data when new versions of the model are created, how does it ensure the model isn’t influenced by this data?

A data retention policy can be implemented to automatically delete this data after a certain period, in this case, 30 days. Automated deletion schedules are a standard part of modern database systems, so this part is relatively straightforward to implement from a technical standpoint.
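To make that concrete, here’s a minimal sketch of such a retention job, assuming a simple SQLite table named api_requests with a received_at timestamp. The schema and names are my own illustration, not OpenAI’s actual implementation.

```python
import sqlite3

RETENTION_DAYS = 30  # assumption mirroring the stated 30-day window

def purge_expired_rows(conn: sqlite3.Connection) -> int:
    """Delete stored API payloads older than the retention window."""
    cur = conn.execute(
        "DELETE FROM api_requests WHERE received_at < datetime('now', ?)",
        (f"-{RETENTION_DAYS} days",),
    )
    conn.commit()
    return cur.rowcount  # how many rows were purged

# Demo setup: a throwaway table with one fresh and one stale row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_requests (payload TEXT, received_at TEXT)")
conn.execute("INSERT INTO api_requests VALUES ('fresh', datetime('now'))")
conn.execute(
    "INSERT INTO api_requests VALUES ('stale', datetime('now', '-45 days'))"
)
print(purge_expired_rows(conn))  # prints 1: only the stale row is deleted
```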

The part about not using API data to improve the models is a bit trickier to understand. The crucial point is that the process of training a machine learning model and the process of using that model to generate predictions are separate.

When OpenAI says it doesn’t use API data to improve the models, it means that the data sent via the API isn’t included in the training data when new versions of the model are created. Remember, training an AI model like GPT-4 involves running an algorithm on a large dataset. If OpenAI excludes the API data from this dataset, then the API data doesn’t influence the resulting model.
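In pipeline terms, that exclusion could be as simple as filtering on data provenance when the training corpus is assembled. The sketch below is purely illustrative; the source field and its values are my assumptions, not OpenAI’s pipeline.

```python
from typing import Iterable, Iterator

def training_corpus(records: Iterable[dict]) -> Iterator[str]:
    """Yield only text whose provenance permits training use."""
    for record in records:
        if record.get("source") == "api":  # exclude API traffic outright
            continue
        yield record["text"]

# Toy corpus: one public document, one customer API payload.
records = [
    {"text": "public web page ...", "source": "web_crawl"},
    {"text": "customer prompt ...", "source": "api"},
]
print(list(training_corpus(records)))  # only the web_crawl text survives
```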

But questions still remain open

When considering the development of a software product by Company “A”, it’s often the domain knowledge, unique concepts, and patented mathematical algorithms or patterns that hold the most proprietary value, as opposed to the specific lines of code. From a technical perspective, sophisticated AI models have the capacity to mimic or simulate these concepts once exposed to Company “A”’s proprietary information via its API, subsequently generating synthetic data that mirrors the proprietary domain use cases. While this synthetic data is legally distinct from Company “A”’s original data, the question is: can that data be used to train another AI model, essentially turning it into a domain expert?
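Mechanically, the worry looks a lot like distillation: sample a teacher model that has been exposed to the proprietary prompts, then collect its outputs as a synthetic training set for a student model. Here’s a minimal sketch, where call_foundation_model is a hypothetical stand-in rather than any real API:

```python
import random

random.seed(42)

def call_foundation_model(prompt: str) -> str:
    """Hypothetical stand-in for a foundation-model completion call."""
    return f"synthetic completion for: {prompt} (variant {random.randint(0, 999)})"

def build_synthetic_dataset(seed_prompts, samples_per_prompt=5):
    """Collect teacher outputs as prompt/completion training pairs."""
    return [
        {"prompt": p, "completion": call_foundation_model(p)}
        for p in seed_prompts
        for _ in range(samples_per_prompt)
    ]

# Seed prompts shaped by exposure to the proprietary domain use cases.
seeds = ["Summarize a typical oncology intake workflow."]
dataset = build_synthetic_dataset(seeds)
print(len(dataset))  # 5 pairs: none copied verbatim, all domain-shaped
```

A student model fine-tuned on such a dataset never touches the original records, yet it can still absorb the domain patterns the teacher learned, which is exactly the gray area in question.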

So it boils down to the following few pointers

Understanding and Replicating Knowledge: AI models don’t “understand” information in the way humans do. They generate responses based on patterns they’ve learned from their training data. If they’re exposed to specific domain knowledge, they may produce outputs that mimic that knowledge.

Synthetic Data: AI can generate synthetic data, but this is a transformation of the input data, not a reproduction of it. If an AI model is given proprietary information, it can’t directly reproduce or ‘leak’ that information unless it was explicitly included in the input data. However, it could generate data that’s influenced by what it learned from the proprietary information, potentially creating a scenario where the AI appears to ‘know’ more about the domain than it should. And this is where I don’t yet have a clear-cut legal understanding.

Now think about the same story from the perspectives of Company “B” and Company “C”. Or consider the potential implications of an insurance company using AI to set premiums based on various factors, including genetic information, social and geographical characteristics, and health data.

Even an innocent agenda could amount to a redistribution of advantages based on data!

Even when using synthetic data, which is designed to mimic the patterns in the original data without disclosing any individual’s information, biases in the original data can still be learned and perpetuated by the AI system. This is a well-known challenge in machine learning and AI, often referred to as “algorithmic bias” or “bias in AI.”
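Here’s a toy demonstration, with entirely made-up numbers, of how a generator fitted to biased data reproduces that bias in its synthetic samples. The groups and the premium gap are fabricated for illustration only.

```python
import random

random.seed(0)

# Fabricated "historic" premiums: group "b" was systematically
# charged ~30 units more than group "a" (the bias to watch for).
original = [("a", random.gauss(100, 5)) for _ in range(1000)] + \
           [("b", random.gauss(130, 5)) for _ in range(1000)]

def group_mean(data, group):
    vals = [v for g, v in data if g == group]
    return sum(vals) / len(vals)

# A trivially simple "generator": sample synthetic premiums around
# each group's fitted mean, never exposing any original record.
def synthesize(data, n_per_group=1000):
    means = {g: group_mean(data, g) for g in ("a", "b")}
    return [(g, random.gauss(means[g], 5))
            for g in ("a", "b") for _ in range(n_per_group)]

synthetic = synthesize(original)
print(group_mean(synthetic, "a"))  # ~100
print(group_mean(synthetic, "b"))  # ~130: the gap survives synthesis
```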

It’s important to distinguish between two types of bias:

1. Bias based on accurate reflections of the real world. For example, certain diseases are more prevalent in certain age groups, and an AI trained on medical data will learn this pattern.

2. Bias based on unjust or discriminatory factors. For example, if an AI is trained on data that reflects discriminatory practices (like redlining in housing), it can learn and perpetuate these biases, even if the people using the AI aren’t aware of them, as the sketch below shows.
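Here’s a toy sketch of that second kind of bias, again with fabricated data: the protected attribute never appears as a feature, yet a zip code acting as its proxy carries the discriminatory pattern straight into whatever is learned.

```python
import random

random.seed(1)

# Fabricated history in which zip "Z1" was redlined: applications
# there were denied regardless of income. No protected attribute
# appears anywhere in the features.
rows = []
for _ in range(2000):
    zip_code = random.choice(["Z1", "Z2"])
    income = random.gauss(60, 10)
    denied = 1 if zip_code == "Z1" else (1 if income < 45 else 0)
    rows.append((zip_code, income, denied))

# Even the simplest "model", a per-zip denial rate, inherits the
# discriminatory pattern straight from the training data.
def denial_rate(data, zip_code):
    outcomes = [d for z, _, d in data if z == zip_code]
    return sum(outcomes) / len(outcomes)

print(denial_rate(rows, "Z1"))  # 1.0: redlining learned as a "pattern"
print(denial_rate(rows, "Z2"))  # far lower, driven only by income
```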

The legal issues around this are complex, and no clear guidance has been made available from any angle. AI-generated content typically belongs to the entity that owns the AI, not to the sources of its training data. However, if an AI is directly copying or replicating copyrighted material, that could potentially infringe copyright law. If the AI is merely generating similar or influenced content, the legal situation is less clear. In the case of patents, the situation could be even more complex: an AI couldn’t infringe a patent unless it’s directly implementing a patented method or system.
