
AI text-to-image generators being trained on images of child sexual abuse: Study

A study by the Stanford Internet Observatory found 3,226 images of suspected child sexual abuse in an AI database called LAION, which is used to train popular text-to-image AI programs like Stable Diffusion.

(CN) — Deep inside a giant open-sourced artificial intelligence dataset used to train popular AI image-making systems are thousands of images of child sexual abuse, according to a study published Wednesday.   

The study, conducted by the Stanford Internet Observatory, which researches online abuse on the internet and social media, found 3,226 images of suspected child sexual abuse in an AI database called LAION. The German nonprofit of the same name makes open-sourced AI models and datasets of images and captions, combed from the internet, which are used to train popular text-to-image AI programs like Stable Diffusion.

LAION's database was also used to train Google's text-to-image AI model, Imagen, but during an audit of the dataset, developers found "a wide range of inappropriate content including pornographic imagery, racist slurs and harmful social stereotypes," and Google decided not to take the program public.

LAION stands for Large-scale Artificial Intelligence Open Network.

The researchers' previous work indicated that AI systems can and do generate images of child sexual abuse, but that research assumed the programs were combining concepts learned from images depicting nudity and explicit sex acts performed by adults with normal, benign images of children. This report found the AI models were instead trained on images of child sexual abuse.

While those images account for only a fraction of the billions of images in the database, the researchers say they likely influence the ability of AI tools trained on it to generate explicit content depicting children, including deepfakes based on real children and images of abuse victims that were posted online without their consent and swept into the dataset. Often, those images are sold and used for commercial purposes, the researchers added.

“The most obvious solution is for the bulk of those in possession of LAION‐5B‐derived training sets to delete them or work with intermediaries to clean the material. Models based on Stable Diffusion 1.5 that have not had safety measures applied to them should be deprecated and distribution ceased where feasible,” the researchers wrote in their study. 

They also suggested that AI programs trained on erotic images should not also be trained on material depicting children, and that, arguably, images of children should be excluded from all AI training. 

“Web‐scale datasets are highly problematic for a number of reasons even with attempts at safety filtering. Apart from CSAM, the presence of nonconsensual intimate imagery or ‘borderline’ content in such datasets is essentially certain — to say nothing of potential copyright and privacy concerns. Ideally, such datasets should be restricted to research settings only, with more curated and well‐sourced datasets used for publicly distributed models,” the researchers wrote in the study. 

A LAION spokesperson said the nonprofit is temporarily taking down its datasets "to ensure they are safe before republishing them."

"LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. We collaborate with universities, researchers and NGOs to improve these filters and are currently working with the Internet Watch Foundation to identify and remove content suspected of violating laws.

"We invite Stanford researchers to join LAION to improve our datasets and to develop efficient filters for detecting harmful content," the spokesperson wrote in an email.
