Is Your Data Being Used to Train AI? How to Check and Opt Out
Find out if OpenAI, Meta, Google, or Anthropic use your data to train AI models. Step-by-step opt-out instructions for every major AI company.
The AI Training Data Problem
Every major AI company needs massive amounts of data to train its models. GPT-4, Gemini, Llama, Claude, Stable Diffusion, and Midjourney were all built on datasets scraped from the public internet — and that includes your personal information.
The data broker connection makes this worse than most people realize. Data broker profiles — which contain your name, home address, phone number, email, employment history, relatives, and more — sit on publicly accessible web pages. AI companies scrape these pages. Your personal information does not just end up in search results. It can end up embedded in the neural weights of AI models that billions of people interact with every day.
This guide covers which AI companies use your data, how to check what they have, and step-by-step instructions for opting out of each one.
Which AI Companies Use Your Data?
OpenAI (ChatGPT, GPT-4, DALL-E)
OpenAI trains its models on a combination of publicly available internet data, licensed datasets, and data generated by human trainers. If you have ever used ChatGPT, your conversations may be used to improve future models unless you explicitly opt out. OpenAI's web crawler, GPTBot, also scrapes publicly accessible websites — including data broker listings that contain your personal information.
What they collect from direct use:
- Chat conversations with ChatGPT
- Images uploaded to GPT-4 Vision or DALL-E
- Files uploaded through the Code Interpreter
- Feedback and ratings you provide
What they collect from the web:
- Publicly accessible web pages crawled by GPTBot
- Licensed datasets from publishers and data providers
Meta (Llama, AI Studio, Imagine)
Meta trains AI models using content shared on Facebook, Instagram, WhatsApp (non-E2E encrypted metadata), and Threads, plus publicly available internet data. In 2024, Meta expanded its AI training to include public posts from Facebook and Instagram by default — meaning your social media content is being used to train AI unless you opted out.
What they use:
- Public Facebook and Instagram posts, photos, and comments
- Public Threads posts
- Content shared with AI features (Meta AI assistant)
- Publicly available internet data
Google (Gemini, Imagen, DeepMind)
Google trains its AI models on a vast corpus of internet data, Google Search index data, YouTube transcripts, Google Books, and user interactions with Google products. Gemini (formerly Bard) conversations can be used for model improvement unless you disable the setting. Google's overall data collection across Search, Gmail, Maps, YouTube, and Android creates one of the most comprehensive training datasets in existence.
What they use:
- Gemini conversation history
- Publicly available web data
- YouTube video transcripts (public videos)
- Google Books content
- Data from Google Workspace (with enterprise exceptions)
Anthropic (Claude)
Anthropic takes a more conservative approach to training data. They primarily use publicly available internet data, licensed datasets, and conversations where users have explicitly opted in to data sharing. Anthropic does not use API conversations for training by default. Their Constitutional AI approach also incorporates human feedback data from paid contractors.
What they use:
- Publicly available internet data
- Licensed datasets
- Opt-in conversation data from free-tier Claude.ai
- Human feedback from RLHF training (contractors)
Stability AI (Stable Diffusion)
Stability AI trained Stable Diffusion on the LAION-5B dataset, which contains over 5 billion image-text pairs scraped from the public internet. This includes personal photos, artwork, medical images, and copyrighted material that was indexed by Common Crawl. The dataset included images from social media, personal websites, and photo-sharing platforms without explicit consent from the individuals depicted.
Midjourney
Midjourney trains its image generation model on a large dataset of images from the public internet. The company has been notably opaque about its exact data sources. CEO David Holz acknowledged in a 2022 interview that the training data included hundreds of millions of images scraped without explicit consent. Midjourney does not currently offer an opt-out mechanism for individuals whose images or data were included in training sets.
How to Check If Your Data Is in AI Training Sets
There is no single tool that definitively tells you whether your specific data was used to train a given AI model. However, there are several practical ways to assess your exposure.
Ask the AI about you. Try asking ChatGPT, Gemini, or Claude what they know about you by name. If an AI can provide specific biographical details — your employer, your city, your professional history — some version of your personal data was likely in its training set. This is not proof of direct broker data usage, but it indicates your information is in the model.
Check Have I Been Trained. The haveibeentrained.com website lets you search for images that were included in the LAION-5B dataset used to train Stable Diffusion and other image models. If you are a photographer, artist, or public figure, your images may appear.
Search data broker sites. If your personal information is publicly visible on people-search sites like Spokeo, BeenVerified, or Whitepages, it was almost certainly scraped by AI crawlers. These sites are indexed by search engines and easily accessible to web scrapers.
Check robots.txt compliance. Some data brokers block AI crawlers in their robots.txt files. Many do not. If a broker does not block GPTBot, CCBot (Common Crawl), or Google-Extended, its content — including your profile — is fair game for AI training.
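The robots.txt check above can be scripted with Python's standard library. A minimal sketch using urllib.robotparser; the ROBOTS_TXT sample is hypothetical (in practice you would fetch it from the broker's site, e.g. their /robots.txt path), and the three user-agent tokens are the AI crawlers named above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, standing in for what you would
# fetch from a data broker's /robots.txt. This sample blocks only GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]

def blocked_crawlers(robots_txt: str, agents=AI_CRAWLERS) -> dict:
    """Return {agent: True if blocked from the site root, else False}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() returns True when the agent is allowed, so invert for "blocked"
    return {agent: not parser.can_fetch(agent, "/") for agent in agents}

if __name__ == "__main__":
    for agent, blocked in blocked_crawlers(ROBOTS_TXT).items():
        print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

In this sample, only GPTBot is blocked; CCBot and Google-Extended fall through to the wildcard rule and remain free to crawl the site, which is the situation this guide warns about.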
Step-by-Step Opt-Out Instructions
Opting Out of OpenAI (ChatGPT)
For your conversation data:
- Log in to your ChatGPT account at chat.openai.com
- Click your profile icon in the bottom-left corner
- Select "Settings"
- Click "Data controls"
- Toggle off "Improve the model for everyone"
- This prevents future conversations from being used for training
For your web data (via CCPA request):
- Visit privacy.openai.com
- Submit a "Do Not Train" request under CCPA
- Provide your name and the specific data you want excluded
- OpenAI processes these within 45 days
Note: Opting out of conversation training does not remove data already used. It only prevents future data from being included.
Opting Out of Meta AI Training
For Facebook and Instagram data:
- Go to your Facebook or Instagram Settings
- Navigate to "Privacy" then "AI at Meta"
- Look for "AI Training" settings
- Submit an objection form through the "Right to Object" link
- Meta will review your request (EU/UK residents have stronger rights)
For US residents:
Meta makes this harder. US users do not have the same one-click opt-out that EU users received after GDPR complaints. Your options are:
- Submit a CCPA deletion request through Meta's privacy center (for California residents)
- Delete or make private all public posts
- Submit a data access request to see what Meta has
- File a complaint with your state attorney general if Meta refuses
Opting Out of Google AI Training
For Gemini conversations:
- Go to myactivity.google.com
- Click "Gemini Apps Activity"
- Toggle off "Gemini Apps Activity"
- Click "Delete all Gemini Apps activity" to remove history
- Confirm deletion
For broader Google AI usage:
- Go to myaccount.google.com/data-and-privacy
- Review "Web & App Activity" — this feeds Google's AI
- Toggle off "Web & App Activity" to stop future collection
- Pause "YouTube History" and "Location History"
- Use Google Takeout (takeout.google.com) to export your data before deleting
To block Google AI crawlers from your website:
Add two lines to your robots.txt file: "User-agent: Google-Extended" on one line, followed by "Disallow: /" on the next.
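For site owners, the same approach extends to the other crawlers covered in this guide. A sketch of a robots.txt that blocks the major AI training crawlers; these user-agent tokens match each company's published crawler names, but token names change, so verify each vendor's current documentation before relying on this:

```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

Note that robots.txt is voluntary: compliant crawlers honor it, but it is not an enforcement mechanism.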
Opting Out of Anthropic (Claude)
- If using Claude.ai, go to Settings then Privacy
- Look for the conversation improvement toggle and disable it
- API users: data is not used for training by default (no action needed)
- Submit a data deletion request through Anthropic's support page
Opting Out of Stability AI / LAION
- Visit haveibeentrained.com and search for your images
- If found, submit a removal request through the site
- Submit a CCPA request to Stability AI directly via their privacy page
- The Spawning.ai tool (spawning.ai/opt-out) provides a mechanism to flag content for removal from future training datasets
Midjourney
Midjourney currently offers no public opt-out mechanism for training data. Your options are:
- Submit a CCPA deletion request through their support
- Opt out of future Midjourney image generation featuring your likeness (limited effectiveness)
- Ensure your images are not publicly accessible on the web
The Data Broker Connection
Here is the part most AI opt-out guides miss entirely. Even if you opt out of every AI company's direct data collection, your personal information can re-enter AI training sets through data brokers.
The path works like this:
- Data brokers collect your personal information from public records, commercial sources, and scraped web data
- Brokers publish this information on publicly accessible people-search profiles
- AI web crawlers (GPTBot, CCBot, Google-Extended) scrape these profiles
- Your personal information becomes embedded in AI training datasets
Opting out of data brokers is arguably more important for AI privacy than opting out of AI companies directly. When your information is not publicly available on broker sites, AI crawlers cannot find and ingest it.
GhostMyData removes your profiles from 1,500+ data broker sites, which dramatically reduces the amount of personal data available to AI scrapers. This is not a theoretical benefit — it directly limits what AI models can learn about you from the public web.
CCPA Rights for AI Training Data
California's CCPA gives residents specific rights related to AI training data:
Right to Know. You can request that AI companies disclose what personal information they have collected about you and how it was used, including for model training.
Right to Delete. You can request deletion of personal information collected about you. However, this may not remove data already embedded in model weights — once information is used for training, "deleting" it from a neural network is technically complex.
Right to Opt Out. You can direct AI companies to stop selling or sharing your personal information, which includes using it for AI training purposes.
Right to Correct. You can request correction of inaccurate personal information, which is relevant when AI models generate false statements about you.
Several pending California bills specifically address AI training data rights, including requirements for training data transparency and opt-out mechanisms that apply retroactively.
Practical Steps to Reduce Your AI Exposure
- Remove yourself from data brokers. This is the single most effective action. GhostMyData scans 1,500+ broker sites and files removal requests on your behalf.
- Audit your public social media. Review what is publicly visible on every platform and set profiles to private or friends-only where possible.
- Opt out of each AI company individually. Follow the instructions above for every AI service you have used.
- Minimize new data creation. Be intentional about what you post, share, and allow apps to collect. Every piece of data you create online is a potential training input.
- Use robots.txt if you run a website. Block AI-specific crawlers from indexing your content.
- Submit CCPA requests. Even if you are not in California, many companies honor CCPA-style requests from all US residents.
- Monitor regularly. AI companies change their policies frequently. What you opted out of today may require a new opt-out tomorrow.
Automate Your Privacy with GhostMyData
AI companies will continue to scrape the web for training data. The most effective defense is reducing what they can find. GhostMyData removes your personal information from 1,500+ data broker sites — cutting off the primary source of personal data that feeds AI training pipelines, identity theft, phishing attacks, and more.
Start your free privacy scan to see which data brokers have your information and begin automated removal.
Frequently Asked Questions
Can I remove my data from an AI model after it has been trained?
Not in a meaningful way with current technology. Once your data is used to train a model, it becomes embedded in the model's neural weights. Deleting the original source does not erase the learned patterns. However, opting out prevents your data from being used in future training runs, and removing broker profiles prevents re-scraping.
Is it legal for AI companies to use my data for training?
The legal landscape is still evolving. Several lawsuits are pending (including class actions against OpenAI, Meta, and Stability AI). Under CCPA, you have the right to opt out of data use for training. The EU has clearer protections under GDPR. Currently, scraping publicly available data occupies a legal gray area in the US.
Does making my social media private protect me from AI training?
Partially. Private posts cannot be scraped by AI crawlers. However, data that was public at any point may already have been captured. Additionally, your data may still reach AI companies through data brokers, friend-of-friend data sharing, or platform-internal AI training (Meta uses public Facebook posts for Llama training).
Which AI company has the best privacy practices?
Anthropic (Claude) generally has the most conservative data practices — API conversations are not used for training by default, and they have been more transparent about their approach. OpenAI and Google offer explicit opt-out toggles. Meta is the least privacy-friendly, training on public social media content by default with limited opt-out options for US users.
How often should I check my AI data exposure?
At minimum, quarterly. AI companies update their data collection practices regularly, and new models are trained on fresh web scrapes. Continuous monitoring through a service like GhostMyData ensures your data broker profiles are removed and stay removed, reducing ongoing AI exposure automatically.
Related Reading
- How to Remove Yourself from OpenAI's AI Training Data
- How Scammers Get Your Personal Information
- How to Reduce Your Digital Footprint
- CCPA Data Deletion Request: Complete Guide
- Social Media Privacy Guide 2026
Ready to Remove Your Data?
Stop letting data brokers profit from your personal information. GhostMyData automates the removal process.