
Data Leverage Recap: December 2022 - April 2023

The waves never stop; photo from Unsplash contributor Photoholgic.

This post will be a short review of the blog so far. There are two goals here: to provide a quick catch-up for anyone who’s new to the blog, and to give me a chance to reflect on how these ideas have held up against a barrage of AI- and tech-related product releases, research outputs, and other news.

The Paradox of Reuse, Language Models Edition (Dec 1, 2022)

Summary:

In this post, I discuss the concern that language model apps could erode their foundations. Platforms like StackExchange and Wikipedia provide infrastructure and incentives for users to participate in the creation and sharing of knowledge. These platforms need traffic and users, which creates a key concern: if generative AI systems like ChatGPT (which rely on StackExchange and Wikipedia for training data!) are good enough that users replace their StackExchange visit with a ChatGPT visit, could GPT-4 hinder our ability to train GPT-5?

How the Key Points Hold Up

On March 17th, we got to see some early evidence of this effect. In the linked Tweet, Dominik Gutt describes preliminary evidence of a negative effect of LLMs on Q&A activity.

About a week later, a similar point was made by an authoritative source: Peter Nixey, a top 2% StackOverflow contributor, highlighted the concern that LLMs may prevent users like him from contributing to SO, warning that “When it comes time to train GPTx it risks drinking from a dry riverbed.”

Finally, on April 17th, StackOverflow’s CEO wrote a blog post discussing generative AI. While the post was controversial in the community for alluding to integrating generative AI into the platform, I was excited to see direct references to the importance of SO training data and the potential tragedy of the commons at play here.

What’s next

The core argument in this piece (and the similar arguments linked above) relies on assumptions about how people use LLMs and online platforms. It’s certainly possible to imagine a scenario in which LLMs benefit both users and online platforms (a point to which we’ll return shortly!). For instance, if LLMs primarily answer what would otherwise be duplicate questions, reducing the need for humans to flag duplicates and freeing up more time to answer interesting questions, this could be great (though I think it’s unlikely without substantial effort).

One direction for future work is to use some combination of agent-based modeling and continued empirical investigation to try to specify the conditions necessary for positive-sum outcomes.
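To make that concrete, here’s a minimal sketch of what such an agent-based model might look like. Everything in it (the parameters, the traffic-routing rule, the assumption that the LLM’s pull on traffic tracks the health of the human knowledge base) is a hypothetical illustration, not an empirical estimate:

```python
import random

# Toy agent-based model of a Q&A platform coexisting with an LLM.
# All parameters and update rules below are illustrative assumptions.

def simulate(steps=100, n_users=1000, dup_fraction=0.4, seed=0):
    """Each step, users ask questions and the LLM absorbs some traffic.
    dup_fraction is the share of diverted questions that would have been
    duplicates on the platform anyway."""
    rng = random.Random(seed)
    knowledge = 1.0  # relative health of the human knowledge base
    for _ in range(steps):
        questions = sum(1 for _ in range(n_users) if rng.random() < 0.1)
        # Assumption: the LLM's pull on traffic scales with the quality
        # of the human knowledge base it was trained on.
        to_llm = sum(1 for _ in range(questions)
                     if rng.random() < 0.7 * min(knowledge, 1.0))
        # Only diverted *novel* questions cost the platform new knowledge;
        # diverted duplicates would have added nothing new anyway.
        diverted_novel = to_llm * (1 - dup_fraction)
        answered_on_platform = questions - to_llm
        knowledge = max(
            knowledge + 0.001 * (answered_on_platform - diverted_novel), 0.0)
    return knowledge

print(simulate(dup_fraction=0.9))  # LLM mostly absorbs duplicates: healthy
print(simulate(dup_fraction=0.1))  # LLM diverts mostly novel questions: decay
```

Even a toy like this makes the key condition visible: the outcome hinges on whether diverted questions would have been duplicates or novel contributions.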

I’ll also be keeping an eye out for more empirical work in this space.

ChatGPT is Awesome and Scary: You Deserve Credit for the Good Parts (and Might Help Fix the Bad Parts) (Dec 4, 2022)

Summary:

You and most everyone you know probably helped build the new wave of generative AI technologies like ChatGPT. This post provides an overview of the specific details we know about past GPT training data sources, and how we can use them to engage in some educated guesswork about the data underlying ChatGPT, GPT-4, Bing chat, and more.

How the Key Points Hold Up

The public is still mostly in the dark regarding specific ChatGPT pre-training details and the collection of human feedback data. However, the sources highlighted in the original post still hold up.

OpenAI’s stance on sharing information about training data suggests it may be hard to do this kind of data documentation going forward. I do think we can still learn a lot about ChatGPT from studying relatively more open models and datasets like LLaMA and The Pile. I’d be surprised if there are massive deviations in pre-training data collection strategies.

This post from Vicki Boykis provides a similar perspective on the ChatGPT training data, as well as info about the model architecture and other details.

What’s next

I think there’s a lot of value in learning and sharing as much as we can about the role of data in producing generative AI. Please reach out if you’re interested in collaborating. Out of necessity, this may involve focusing on open competitors to ChatGPT.

AI Artist or AI Art Thief? Innovation, Public Mandates, and the Case for Talking in Terms of Leverage

Summary:

In this post, I review ongoing discussions about how generative AI can be seen as “stealing” in a moral or legal sense. The core argument here: whenever we have a major new innovation, it’s practically impossible for the broad public to consent to contributing to these technologies, so they cannot come out the gate with a public mandate. I also argue that although it may be tough to get agreement on exactly which ML and AI technologies are stealing, it can be very productive to talk in terms of which groups have leverage to impact model capabilities.

How the Key Points Hold Up

I stand by everything here. The original post states that we were waiting for a firm legal answer, and as of April 2023 that’s still true. The implications here could be huge.

I still stand by the value of a leverage-based framing (it’s a pretty big part of my dissertation).

What’s next

I’m in the process of iterating on a website for crowdsourcing and highlighting tools that help you use “data levers”.

Another angle for making concrete progress on this issue is through the development of Responsible AI Licenses for Data. If you’re interested, please consider getting involved in RAIL.

AI Technologies are System Maps, and You are a Cartographer

Summary:

Much of my academic work has been at least partially motivated by an argument for thinking of data as labor. Here, I make the case that this “data as X” metaphor is even stronger when we look to cartographic labor. The comparison with map-making can be especially useful for thinking about when time is on the side of the laborer.

How the Key Points Hold Up

I’ve been reflecting more on this “Data isn’t labor” post from Sebastian Benthall. It makes some compelling arguments against the data-as-labor metaphor in the context of search engines and Google. I think it’s worth discussing the tensions between different “data as X” metaphors.

On the technical side, I haven’t seen anything yet that makes me seriously rethink these arguments. I do think it’s worth making a distinction between the predictive value of a “data as (cartographic) labor” theory (can we actually predict how data economies work with a labor lens?) and the aspirational value of this metaphor (we should try to foster a sort of data labor solidarity to support collective action).

What’s next

I believe this remains a ripe conceptual lens for both academic research in the data governance space, and for public-facing arguments about data’s value. I plan to continue developing the idea, and have some work in the oven (stay tuned). The planned datalevers.org FAQ may also help here.

Plural AI Data Alignment

Summary:

I propose a definition for measuring when an AI system is aligned with a group of people in terms of data agency:

“An AI system is more aligned with a coalition if members (1) know how their data contributions flow to that system, (2) can reason about how changes to data flow might impact AI capabilities, and (3) have agency to reconfigure these data flows.”
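To illustrate (but not operationalize!) the definition, here’s a toy encoding of the three criteria. The `DataFlow` fields and the scoring rule are hypothetical constructs I’m using purely for exposition; the definition above doesn’t commit to any particular measurement scheme:

```python
from dataclasses import dataclass

# Hypothetical illustration of the three-part definition above. The fields
# and the scoring rule are expository assumptions, not a proposed standard.

@dataclass
class DataFlow:
    source: str           # e.g. "forum posts"
    destination: str      # e.g. "pre-training corpus"
    documented: bool      # (1) members know this flow exists
    effect_legible: bool  # (2) members can reason about its impact
    reconfigurable: bool  # (3) members have agency to change it

def alignment_score(flows: list[DataFlow]) -> float:
    """Fraction of criteria satisfied, averaged over a coalition's flows.
    Higher means the system is 'more aligned' with the coalition."""
    if not flows:
        return 0.0
    per_flow = [(f.documented + f.effect_legible + f.reconfigurable) / 3
                for f in flows]
    return sum(per_flow) / len(per_flow)

flows = [
    DataFlow("forum posts", "fine-tuning set", True, False, False),
    DataFlow("edit history", "pre-training corpus", True, True, True),
]
print(alignment_score(flows))  # (1/3 + 3/3) / 2 = 0.666...
```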

How the Key Points Hold Up

I still think this framing can be important. An open question I remain unsure about is whether it’s useful to frame this as part of “alignment” or as something else entirely. There is a growing movement for consentful AI, and these ideas are much more naturally aligned with that movement than with any one “AI Alignment” faction. Let me know what you think!

What’s next

I plan to continue developing this definition and trying to find venues and communities that find it useful. I also plan to use some version of this in academic work at some point.

Bing Rewards for the AI Age

Summary:

A proposal for giving people credits for their data contributions that can be used to query expensive generative AI systems. A key idea: if we set up a credit system well, we might be able to account for the incentives faced by generative AI operators, online platforms that host and facilitate the creation of data, and individual people.
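As a back-of-the-napkin sketch of how the accounting might work (every constant below is a made-up placeholder, not a real price or exchange rate):

```python
# Hypothetical credit ledger for a "credits for data contributions" scheme.
# Every constant here is a made-up placeholder, not a real price.

CREDITS_PER_ACCEPTED_ANSWER = 50     # credits earned per contribution
CREDITS_PER_QUERY = 1                # credits spent per generative query
OPERATOR_COST_PER_QUERY_USD = 0.002  # assumed inference cost per query
PLATFORM_SHARE = 0.2                 # cut routed to the hosting platform

def settle(accepted_answers: int, queries: int) -> dict:
    """Toy settlement for one contributor: credits earned for accepted
    answers, credits spent on queries, and the implied serving cost split
    between the AI operator and the platform that hosted the data."""
    earned = accepted_answers * CREDITS_PER_ACCEPTED_ANSWER
    spent = queries * CREDITS_PER_QUERY
    serving_cost = queries * OPERATOR_COST_PER_QUERY_USD
    return {
        "credit_balance": earned - spent,
        "operator_cost_usd": round(serving_cost * (1 - PLATFORM_SHARE), 4),
        "platform_payment_usd": round(serving_cost * PLATFORM_SHARE, 4),
    }

# Three accepted answers fund 150 queries under these placeholder rates.
print(settle(accepted_answers=3, queries=150))
```

The interesting design question is whether rates like these can be set so that operators, platforms, and contributors all come out ahead.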

How the Key Points Hold Up

This one’s changed a little bit. Costs matter here (even for back-of-the-napkin examples), and we’ve already seen changes in API pricing. We also still don’t know exactly how the operational costs of generative AI systems are changing, which could have a big impact on the viability of this proposal.

It’s very promising to see the StackOverflow CEO making statements like “If AI models are powerful because they were trained on open source or publicly available code, we want to craft models that reward the users who contribute and keep the knowledge base we all rely on open and growing, ensuring we remain the top destination for knowledge on new technologies in the future.” I believe this suggests this kind of idea is quite plausible.

What’s next

The main thing I’ll be on the lookout for is any generative AI firm trying something like this out.

I think there’s also room here to use agent-based modeling to understand the strengths and weaknesses of this kind of system.

If you happen to work on a platform for online communities, and want to implement something like this, please do let me know!

Comments?

You can also read this article via a Notion public link.