AI Is Coming for Your Social Media Data: Can You Do Anything About It?
Social media companies are increasingly partnering with artificial intelligence firms to monetize user data. But what steps can regular users take to safeguard their information?
Digital Media Sites Form Partnerships With Artificial Intelligence Firms
The controversy over using social media data to train generative AI models seems to be having little effect on social media firms’ practices of sharing user data.
The generative AI features that Meta revealed at Meta Connect in 2023 were already trained using social media data. These include Meta AI and other features, such as the ability to make WhatsApp stickers using AI.
In an article for the Meta Newsroom, Mike Clark wrote:
“Publicly shared posts from Instagram and Facebook—including photos and text—were part of the data used to train the generative AI models underlying the features we announced at Connect.”
Even in 2024, this trend shows no signs of abating. Reuters reports that Reddit and Google have struck an agreement to make Reddit’s content available for training artificial intelligence models.
In its S-1 filing for its initial public offering (IPO), Reddit confirmed on February 22, 2024, that the company is exploring potential licensing agreements. The document states:
Data from Reddit is essential to the development of modern AI and a lot of LLMs. We expect Reddit’s large collection of conversational data and expertise to be useful for training and enhancing LLMs.
Reddit is “in the early stages of allowing third parties to license access to search, analyse, and display historical and real-time data from our platform” to train LLMs, according to the document.
Meta and Reddit may be two of the best-known social networking sites, but they certainly aren’t the only ones whose user data is being used to train AI. 404 Media reports that WordPress.com and Tumblr are getting ready to sell user data to Midjourney and OpenAI.
Is There Any Way to Prevent Platforms From Selling Your Social Media Data to Train AI?
The publicly accessible material you have shared on social media platforms like Facebook, Instagram, Reddit, Tumblr, or WordPress.com has probably already been used to train LLMs without your knowledge.
As an example, the Washington Post’s search tool, which lists the sites included in Google’s C4 dataset (used to train Bard), shows that Reddit.com accounts for 7.9 million tokens.
Tumblr.com accounts for 1.6 million tokens. Even small personal blogs can be in the dataset: my own, which uses WordPress.com, accounts for 14,000 tokens.
Thanks to licensing partnerships between AI businesses and social media companies, this data will now be actively sold rather than passively scraped from the web.
What, then, are your options with regard to future processing? If you object to, or want to limit, how third-party information about you is used to train Meta’s generative AI models, you can use the new form that Meta has provided for data subject rights related to generative AI.
Importantly, you can’t use this form to object to Meta’s use of your data in its own generative AI training. In addition, when I submitted a support ticket through the form to object to the use of my personal data, I had to provide evidence that my personal information was already appearing in Meta’s generative AI outputs.
Additionally, Tumblr now gives you the option to stop your public blog posts from being shared with third parties. Go to your blog’s settings and scroll down to the Visibility section, then choose the option to disable third-party sharing for your blog.
On Instagram, one possible safeguard is to make your account private. Data scraping for LLMs seems to centre on public data, so this may help as a precaution, but it doesn’t ensure that your data won’t be exploited.
The same applies to X (Twitter): making your account private is only a precaution, and it won’t ensure that your data stays private.
In a joint statement, experts from across the globe, including national information commissioners, have offered advice on how consumers can protect themselves from the privacy risks posed by AI businesses’ data scraping. Their guidance includes:
Before you provide any personal information to a website, be sure to read its terms and privacy policy.
Refrain from sharing personal details, particularly those that might identify you, over the internet.
Review and adjust your privacy settings.
Be mindful of the impact your internet actions might have in the future.
If you suspect inappropriate data scraping, get in touch with the relevant social media firm or website. If you are not satisfied with the reply, file a complaint with the data protection authority in your area.