The public debate over AI has seriously heated up in the wake of new advances in the design and deployment of large generative AI models. In particular, artists have begun to protest the use of models that generate artwork. Meanwhile, software developers and writers seem to be a bit more split in their opinions on models that generate code or prose (though a developing lawsuit suggests a critical faction exists on the code front too).
A major rallying cry in the AI art discussion has been to implicate AI models — and/or AI developers — as stealing the art used for training data. When you look at examples of AI art created by specifically referencing an artist’s name, this accusation seems to carry some weight. Of course, this claim is complicated, as it could be seen as simultaneously making a moral claim and a legal claim about AI art.
On the legal side, it seems like we just don’t have a clear answer yet. In the short term, it doesn’t seem like agents of the state will be storming the server rooms where generative model weights “live”, but in the long term, who knows (it’s not out of the question).
On the moral side, it seems that the grievances have some merit. Proponents of the current “scrape it all” paradigm might argue that if artists didn’t want to contribute to training sets, they should not have put their art online.
However, a key point here is that most of the data that’s in training sets right now was put online by creators who did not, and for all intents and purposes could not, have meaningfully given consent to fuel these models. These technologies are innovative: they have capabilities that previous human technology did not have. There is a core tension between innovation over human-generated data and gaining mass consent, i.e. a public mandate. It just can’t be done. To truly keep the mandate of the public, each new innovation would require a new referendum.
Additionally, it's (as far as I know) undisputed that all of the headline-generating generative models thus far are entirely reliant on collective efforts from large groups of people. Without a large corpus of artwork or a large corpus of code, the new capabilities we've seen would not exist.
I’d like to argue that a very productive way to side-step the debate on the morality front and to move forward while we wait for updates on the legal front is to ground this discussion firmly in terms of who has leverage to impact these models.
We may see a live experiment in this realm very soon: StabilityAI will train a new version of Stable Diffusion that honors opt-out requests from artists (facilitated by Spawning). HuggingFace will train a new version of BigCode that honors opt-out requests from developers. In both cases, we can compare the capabilities with and without opt-out, and if enough people opt-out we’ll expect to see meaningful differences. We’re also seeing a protest on ArtStation, as artists post “No AI” images in hopes of harming future training sets. These can be seen as in-the-wild “data strikes” and data poisoning (in the case of ArtStation), and they open the door to highly curated opt-in training sets that can be seen as “conscious data contribution”.
It’s inevitable that we (speaking broadly) will not all agree on a clear definition of what constitutes "stealing" in the AI context. A leverage-based framing can help align artists, coders, writers, and whoever else is blindsided by the next leap in AI capabilities, even if individuals disagree on the moral framing. Furthermore, a leverage framing lends itself to asking falsifiable questions, such “Can a group of size X boost or harm the capabilities of model Y by some delta Z?” This is extremely critical, because it also allows us to ground this debate in empirical claims and to move forward with deliberation and bargaining. Even if one group vehemently opposes the idea that scraping data is stealing and another vehemently opposes the practice of scraping entirely, thinking in terms of who has leverage can help us avoid talking past each other.
For researchers, continuing to find ways to measure the value of training data at the collective scale can be highly effective in advancing these discussions, and is in line with promoting a data-centric approach to AI. Just as we might ask how hyperparameter choices affect system capabilities, we can also ask how system capabilities might change under different conditions of collective action. What can we achieve with a model that has a symbiotic relationship with communities of artists or coders?
On the organizing side, continuing to find ways to opt-out from fueling models you oppose — or opt-in to fueling models you’re excited about — will be of immense value. Each action can be seen as an in-the-wild experiment that can help us figure out how to work towards a better set of power dynamics for the AI ecosystem.
You can comment via this Notion public link.