
---
title: "Theories of Copyright: AI Output"
tags:
  - ai
  - legal
  - copyright
  - essay
  - misc
date: 2024-11-02
lastmod: 2024-11-02
draft: true
---

Generated output may infringe on the copyright of the training data.

First, generated output is certainly not copyrightable. The US is extremely strict about the human authorship requirement for protection. If an AI is seen as the creator, the requirement is obviously not satisfied, and the human "pushing the button" probably isn't enough either. But does the output infringe on the training data? It depends.

## Human Authorship

According to the US Copyright Office, AI-generated works do not satisfy the human authorship requirement. This makes them uncopyrightable, but more importantly, it also gives legal weight to the distinction between the human and AI learning process.

## Summaries

This is probably the most direct non-technical refutation possible of the "AI understands what it trains on" argument. I also think it's the most important aspect of current generative models for me to highlight. The question: if an AI can't understand what it reads, how does it choose which parts of a work should be included in a summary of that work? A book, an article, an email?

Once again, the answer is mere probability. In training, the model learns which word should come after another: a continuation is more "correct" the more often that sequence of words occurs in its training data. And in generation, if a work spends more words on a particular subject than on its actual conclusion, the subject given the most attention is what the model will include in the summary.
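The "mere probability" point can be sketched with a toy bigram model. To be clear, this is a hypothetical, drastically simplified stand-in for a transformer (the corpus and all names are made up), but the core move is the same: "correct" just means "occurred more often in training."

```python
from collections import Counter, defaultdict

# Toy corpus standing in for training data (hypothetical example text).
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# "More correct" simply means "occurred more often in training":
# after "the", both "cat" and "mat" appeared twice, so either is equally likely.
print(follows["the"].most_common())
```

Nothing in those counts knows what a cat *is*; the model only knows which strings tend to sit next to which other strings.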

Empirical evidence of this can be found in the excellent post *When ChatGPT Summarizes, it Actually does Nothing of the Kind*. It's funny how this single approach is responsible for nearly all of the problems with generative AI, from the decidedly unartistic way it "creates" to its [[Essays/plagiarism#1 Revealing what's behind the curtain|revealing what's behind the curtain]].

## Dr. Edgecase, or how I learned to stop worrying (about AI) and love the gig worker

So how do corporations try to solve the problem? Human-performed microtasks.

AI can get things wrong; that's not new. Take a look at this:

![[limmygpt.png]]

Slight variance in semantics, same answer, because it's the most popular string of words in response to that pattern of prompt. Again, nothing new. Yet GPT-4 will get it right. This probably isn't due to an advancement in the model. My theory is that OpenAI looks at the failures published on the internet (sites like ShareGPT, Twitter, etc.) and has remote validation gig workers (already a staple in AI) "correct" the model's responses to that sort of query. In effect, corporations could be exploiting (yes, exploiting) developing countries to create a massive network of edge cases to fix the actual model's plausible-sounding-yet-wrong responses.

- This paragraph does border on conspiracy theory. However, which is more likely:
    - Company in the competitive business of wowing financial backers leverages existing business contacts to massively boost user-facing performance of their product as a whole at little added cost; or
    - Said company finds a needle of improvement over their last haystack in an even bigger haystack that enables the most expensive facet of their product to do more of the work.

> [!question]
> I won't analyze this today, but who owns the human-authored content of these edge cases? They're probably expressive and copyrightable.

## Expression and Infringement

It can be said that anything a human produces is just a recombination of everything that person's ever read. Similarly, that process is a simplified understanding of how an AI trains.

However, everything a person has ever read is stored as concepts floating around in their brain. My brain doesn't keep a specific person's explanation of a transformer model architecture on hand, or even particular phrases from that explanation. It has a "visual" and emotional linkage of ideas that other regions of my brain, leveraging vocabulary, put to paper when I explain it. An AI stores the words that occurred in its corpus and can be considered responsive to the prompt. It may also store the words that succeeded the prompt as the next portion of a work containing both the prompt and the output. N-grams, not neurons.
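"N-grams, not neurons" can be illustrated the same way: when a model's "knowledge" is just stored sequence counts, generating from it replays its training text. Again a hypothetical toy (the corpus and the greedy decoding rule are my own simplifications, not how any real chatbot is implemented):

```python
from collections import Counter, defaultdict

# The "model" here is nothing but word-sequence counts from one source text.
corpus = "a transformer maps tokens to vectors and attention mixes those vectors".split()
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, length):
    """Greedily emit the most frequent continuation at each step."""
    out = [start]
    for _ in range(length):
        if not follows[out[-1]]:
            break  # no recorded continuation; stop
        out.append(follows[out[-1]].most_common(1)[0][0])
    return " ".join(out)

# With a single source text, "generation" reproduces the training data verbatim.
print(generate("a", 9))
# → "a transformer maps tokens to vectors and attention mixes those"
```

A real model blends counts from billions of documents, which blurs the replay, but the mechanism is retrieval-by-statistics, not recollection of concepts.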

The key difference: to say a human brain makes a work by recombining its input is metaphor; to say an AI recombines works is technologically accurate. When you ask, a chatbot goes and looks at the secret code and shows you the photograph it corresponds to.

Naturally, there are cases where a human and an AI would reach approximately the same factual response to the same question. So what makes some AI output infringement? The same thing that makes some human responses copyright infringement: reproduction of a copyrighted work. The difference is that some human responses would be copyrightable in themselves, because they don't reproduce enough of a specific work (or of multiple works) to be considered either an ordinary derivative or a compilation derivative. ==ugh, this is hard==

## Detour: An Alternative Argument

There's a more concise and less squishy argument that generative AI output infringes on its training dataset.

Recall that AI output taken right from the model (straight from the horse's mouth) is not copyrightable according to the USCO. If the model's input is copyrighted, and the output can't be copyrighted, then nothing in the AI "black box" adds to the final product; it's literally just the training data reproduced and recombined. Et voilà: infringement.

This isn't to say that anything uncopyrightable necessarily infringes something else, but it does mean that a defendant's likelihood of prevailing on a fair use defense could be minimal. Additionally, the simpler argument makes damages far harder to prove in terms of apportionment.

Note that there are many conclusions in the USCO guidance, so you should definitely read the whole thing if you're looking for a complete understanding of the (very scarce) actual legal coverage of AI issues so far.

## Further Reading