Training Generative AI: A Fair Use Perspective from Creative Commons

To me, among the many baffling concepts in the current stream of interest? concern? reaction? to generative AI (especially, more recently, the systems that generate written content) is how the concept of "training" confounds our past understanding of reuse. Even more so because, for the most part, we have no way of knowing how the content is actually created.

And on the side of how we can use, and help others use, generative content, our notions of attribution and licensing are stretched.

It’s a good thing Creative Commons is around. In a series of posts on AI under the umbrella of the CC campaign for “better sharing,” I found helpful insight in this recent post by Stephen Wolfson, which looks at the way fair use might be applied to AI content (though it is never as simple as a “rule”):

Among many good points, I want to key in on the parts where Stephen gives me a better awareness of what is going on behind the black curtain when Midjourney or Stable Diffusion spits out an image from a given text prompt. It gets interesting because there are no images in the LAION training dataset, just information about images, and nothing seems copied from the original in the way we usually think of copying:

Stability AI used a dataset called LAION to train Stable Diffusion, but this dataset does not actually contain images. Instead, it contains over 5 billion weblinks to image-text pairs. Diffusion models like Stable Diffusion and Midjourney take these inputs, add “noise” to them, corrupting them, and then train neural networks to remove the corruption. The models then use another tool, called CLIP, to understand the relationship between the text and the associated images. Finally, they use what are called “latent spaces” to cluster together similar data. With these latent spaces, the models contain representations of what images are supposed to look like, based on the training data, and not copies of the images in their training data.

I am fuzzy on what “latent spaces” means, but it feels like an effort to create a statistically similar result from a mixture of sources, not from any single one (?).
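To make the “add noise, then learn to remove it” idea a bit more concrete, here is a toy sketch of my own (not Stable Diffusion’s actual code, and the function name and schedule are made up for illustration). An “image” is just an array of numbers; the forward diffusion process blends it with Gaussian noise, more heavily as the timestep grows, and the model is trained to reverse that corruption:

```python
import numpy as np

def add_noise(image, t, num_steps=1000, seed=0):
    """Toy forward-diffusion step: blend an image with Gaussian noise.

    At small t the result stays close to the original; as t nears
    num_steps it becomes almost pure noise. A diffusion model is
    trained to undo this corruption, step by step.
    """
    rng = np.random.default_rng(seed)
    alpha = 1.0 - t / num_steps  # fraction of original signal that survives
    noise = rng.standard_normal(image.shape)
    return np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise

# A toy 4x4 "image" of constant pixel values
image = np.ones((4, 4))

lightly_noised = add_noise(image, t=10)   # still close to the original
heavily_noised = add_noise(image, t=990)  # nearly pure noise
```

The point of the sketch: what gets stored after training is not the image itself but what the network learned about removing noise, which is why the quoted post says the models contain representations rather than copies.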


Turning back to fair use, this method of using image-text combinations to train the AI model has an inherently transformative purpose from the original images and should support a finding of fair use. While these images were originally created for their aesthetic value, their purpose for the AI model is only as data. For the AI, these image-text pairs are only representations of how text and images relate. What the images are does not matter for the model — they are only data to teach the model about statistical relationships between elements of the images and not pieces of art.
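The “representations of how text and images relate” idea can also be sketched in a few lines. This is a hypothetical toy, not CLIP itself: real CLIP embeddings have hundreds of dimensions learned from millions of image-text pairs, while here I just invent tiny 4-number vectors to show how cosine similarity can measure whether a caption and an image point in the same direction:

```python
import numpy as np

def cosine_similarity(a, b):
    """How aligned two embedding vectors are (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional embeddings for illustration only
text_bicycle  = np.array([0.9, 0.1, 0.0, 0.2])
image_bicycle = np.array([0.8, 0.2, 0.1, 0.1])
image_teapot  = np.array([0.1, 0.9, 0.7, 0.0])

# The caption "bicycle" should sit closer to the bicycle
# image than to the teapot image in the shared space
sim_match    = cosine_similarity(text_bicycle, image_bicycle)
sim_mismatch = cosine_similarity(text_bicycle, image_teapot)
```

In this framing the model never needs the pixels as art; it only needs the statistical relationship between a text vector and an image vector, which is the “only as data” point Stephen is making.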

And again, to help understand how what these systems are doing is so different from our conceptions of putting images together:

The models do not store copies of the works in their datasets and they do not create collages from the images in its training data. Instead, they use the images only as long as they must for training.

I really encourage you to read the full post, and let us know: does this add clarity or raise more questions? We still cannot say with certainty that images one creates with Midjourney or Stable Diffusion truly fall under fair use (which always means leaving it to a court to arbitrate).

This approach feels right to me, but it does not preclude the chance that something these systems create will have a high degree of similarity to original works.

I am staying tuned to the Creative Commons series on AI and urge anyone interested to weigh in at their next round of public forums (happening in three time zones tomorrow).


Thanks for this thoughtful review @cogdog! If you liked Stephen’s post on fair use and generative AI, check out his new post from today: This Is Not a Bicycle: Human Creativity and Generative AI.

And if you didn’t see it before, CC CEO Catherine Stihler kicked off this series of blog posts on generative AI with: Better Sharing for Generative AI

You’ll see a link at the top of all three posts to join one of three community input sessions CC is holding on generative AI: Wed 22 Feb at 2:00–3:00 UTC, 14:00–15:00 UTC and 18:00–19:00 UTC. There are handy links to see which of those times might work for you.


Sorry I could not attend the session, but I am eager to hear the results. This series of posts is really powerful and insightful, kudos to the CC team.

I could not resist, knowing how bad DALL•E is when asked for text on an image.

The Treachery of AI Images generated by the DALL•E/OpenAI platform via prompt “Vintage poster in style of René Magritte labeled ‘This is not a bicycle’ with an image of a bicycle” by Alan Levine, who dedicates any rights he holds in the image to the public domain via CC0.

As the new saying goes, “Nos Inttlic TTS”