Once again, we’ve had an eventful few weeks in the space of data-dependent computing! In this short post, I want to highlight a series of three news events that suggest we might be seeing a major shift towards data dignity, though there’s lot of work to be done to ensure these developments lead to benefits that are broadly shared.
Major user-generated content platforms charging for their user-generated content
First, we saw that, in very quick succession, Reddit and Stack Overflow announced plans to change the way platform data flows to large-scale data users (i.e., AI operators like OpenAI). Reddit and StackOverflow have become quite important data sources for academic research. Thus, it made a good deal of sense when AI operating firms started using Reddit data for early language model training.
The leadership of these firms have, nearly in sync in terms of timing, alluded to changing this status quo. Reddit and StackExchange want AI operators to start paying for data that has so far followed something of a free-for-all paradigm.
Quotes from the CEOs sound a lot like data dignity talking points
The statements from the chief executives are quite promising. I’m going to include several quotes in full.
In his interview with Mike Isaac of the New York Times, Steve Huffman of Reddit said:
“The Reddit corpus of data is really valuable… but we don’t need to give all of that value to some of the largest companies in the world for free.”
He further highlighted the unique value Reddit users provide: “There’s a lot of stuff on the site that you’d only ever say in therapy, or A.A., or never at all” and directly called for returning value: “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with.”
In his blog post, Prashanth Chandrasekar, stated “If AI models are powerful because they were trained on open source or publicly available code, we want to craft models that reward the users who contribute and keep the knowledge base we all rely on open and growing, ensuring we remain the top destination for knowledge on new technologies in the future.”
Several quotes from this post (which was controversial, primarily because of allusions to using generative AI on the platform) really resonated with the thinking I’ve been advocating for. “AI systems are, at their core, built upon the vast wealth of human knowledge and experiences.” Chandrasekar even directly calls out that “Allowing AI models to train on the data developers have created over the years, but not sharing the data and learnings from those models with the public in return, would lead to a tragedy of the commons”.
Finally, there’s a nice image in the original post with the caption, “AI is built on our collective knowledge, and we must all participate in building its future”.
Firms with Data Leaning On Firms Who Want Data
This is an example of data leverage by firms. These firms have something AI operators want — data — so they have leverage to demand payments or concessions. It’s not (yet) the hopeful vision of grassroots data leverage I’ve laid out in some my work, in which groups of Reddit and StackOverflow contributors threaten to withhold, alter, or redirect data unless they get paid or stipulations are placed on how resulting AI systems are used.
However, the above quotes really do suggest that firm leadership wants to reward users down the line. If you’re skeptical that this is driven purely by a sense of corporate responsibility, there’s a very rational explanation here as well: the more money platforms make from data, the more they *need* to keep their core data contributors happy.
If Reddit and SO are relying a continuous supply of high quality data to get money from OpenAI, Microsoft, Google, and company, this means that overnight, users of these platforms gain a sort of second-order leverage.
Users who Create Data Leaning on Firms Who Sell that Data
I think there’s a lot of value in noting the similarities between contributions to these platforms and labor (specifically, cartographic labor). Extending the metaphor, the value of Reddit and SO depends a lot of this data labor (but also on the paid labor of engineers, ad sales, etc.). The more that these platforms make off any data deals, the more leverage users have over firm leadership, in the same way that if I run a business that suddenly starts making a bunch of products that use steel, the steelworkers union’s suddenly has a lot more leverage over my business overnight.
Yes, licensing might mitigate this leverage; there’s a “Gotcha!” with Reddit’s not-so-friendly terms: “When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world.”
(TLDR: Reddit can do anything they want with your Reddit posts and comments).
On the other hand, StackOverflow has quite open terms (its all Creative Commons). But both platforms have already been scraped and crystallized as weights. The main value proposition that the platforms can offer data-hungry AI operators is this: if you want to quickly adapt to news, trends, gossip, cultural shifts, and new programming languages, you’re going to need the data our users have yet to produce. This means users always have a lever here: stop or alter their data-contributing behavior going forward.
Europe’s Proposed Legislation
On April 27th, Sam Schechner of the Wall Street Journal reported that European Union legislators have drafted legislation requiring that AI operators disclose any copyrighted materials used in training (see more coverage of this concern here).
If passed, this legislation could open the door for even more organizations — beyond Reddit and StackOverflow — to try seeking payment or credit for data. The article notes that this legislation won’t actually provide the final word on whether the current mode of scraping is legal (arguably in Europe, it’s not).
Importantly, while this legislation is still just a draft, news coverage suggests it seems fairly likely to pass. This could mean some AI operators simply stop offering services in the EU, but it will probably lead to a substantial increase in generative AI dataset documentation compared to the current GPT-4 status quo (i.e., not so much).
Will a Small Group of Higher-ups Reap all the Rewards? Data Coalitions Might Help.
On one hand, these developments clearly favor data creators over data users. However, one serious concern is that all these developments will primarily empower large organizations and not individual data creators. For instance, we can imagine one future in which a small group of higher-ups at Reddit, StackOverflow, and large media conglomerates are the primary beneficiaries of data payments.
This is not a foregone conclusion, however. As noted above, if Reddit, StackOverflow, or a record company suddenly lean on OpenAI and see an influx of generative AI’s profits, there is an immediate avenue for the true upstream source of data — individual users and creators — to lean on the platform. The people at the top of the river will always have the final say in this kind of arrangements: the models and platforms truly are downstream of users.
Ideally, the bargaining positions of users and creators might be solidified by increased prominence and formal recognition of data cooperatives. In the case of media, this arguably already exists in the form of organizations like the Screen Actors Guild (if production companies get some profits from AI companies that use frames from movies, SAG could in turn push to get some of these profits from the production companies). Put simply, some creative industries already have organizations that support the labor of content creators.
However, unless this kind of organizing is extended to more domains of “creation”, in many contexts it may be the case that a small group of higher-ups do reap the majority the benefits of these changes.