Quartz sync: Nov 2, 2024, 4:47 PM

content/Dict/integrity.md (new file, 23 lines)
@@ -0,0 +1,23 @@
---
title: Integrity
tags:
- glossary
- misc
- ai
- resources
date: 2024-10-23
lastmod: 2024-10-23
draft: false
---
Integrity in the academic context can be thought of as multiple qualities of a work that, when put together, establish the argument that people should take you seriously. This is irrespective of the content of the work; it need not contribute anything new or objectively/subjectively valuable to the field in order to have integrity. In large part, works with integrity are:

### Authoritative
The work's conclusion is logically sound as a result of all of the below. The work is self-critical, and if it has logical gaps or counter-arguments, they are mentioned and possibly rebutted. If necessary, it's critical of the question presented.

### Credible
Things that the work holds out as facts are indeed true. If they are drawn from prior work, then the reader needs to be able to examine the reasoning behind those facts, so the work should avoid [[Essays/plagiarism|plagiarism]].

### Rigorous
The methodology used is widely accepted, has an explained basis in prior work, or is laid out in a way that persuades the reader of its veracity. The analysis does not contain any unaddressed logical gaps. The argument is not made in bad faith.

## Further Reading
[[Misc/ai-integrity|AI lacks integrity]]. It won't take issue with a fallacious question, it is not credible (even if the particular model cites works), and it has no true explanation of its methodology.
- Nerdy sidebar: LRM ("large reasoning model") is a misnomer, as "chain-of-reasoning" output is generated at a second step after the actual output. A reinforcement learner selects a reasoning theory from a group based on how reasonable it sounds that the output was achieved from that chain of reasoning, not because the output *did* encode that reasoning.

@@ -16,34 +16,33 @@ lastmod: 2024-09-06

> CW: US law and politics; memes
>
> **This site contains my own opinion in a personal capacity, and is not legal advice, nor is it representative of anyone else's opinion.** Not every citation is an endorsement, and none of the authors I cite have endorsed this work.
> - Also a reminder that I won’t permit inputting my work in whole or part into an LLM.

I've seen many news articles and opinion pieces recently that support training generative AI and LLMs (such as ChatGPT/GPT-4, LLaMa, and Midjourney) on the broader internet as well as more traditional copyrighted works. The general sentiment from the industry and some critics is that training should not consider the copyright holders for all of the above. For now, this will be less of a response to any one article and more of a collection of points of consideration that tie together common threads in public perception. I intend for this to become comprehensive over time.

My opinion here boils down to three main points. **Under existing US law**:

- Training a generative AI model on copyrightable subject matter without authorization is copyright infringement (and the proprietors of the model should be responsible);
- Generating something based on copyrightable subject matter is copyright infringement (and the proprietors and users of the model should each be able to be held responsible); and
- Fair use is not a defense to either of the above.

I discuss policy and speculative points at the end of this entry. Certain policy points are instead made in my [[Essays/plagiarism|🅿️ essay on plagiarism]], and links to that entry will be labeled with 🅿️.

## Prologue: why these arguments are popping up

<img src="/Attachments/but-he-can.jpg" alt="'I know, but he can' meme, with the RIAA defeating AI art for independent illustrators" style="height: 30em;margin: 0% 25%" loading="lazy">

In short, there's a growing sentiment against copyright in general. Copyright can enable centralization of rights when paired with a capitalist economy, which is what we've historically experienced with the advent of copyright repositories like record labels and publishing companies. It's even statutorily enshrined as the "work-for-hire" doctrine. AI has the potential to be an end-run around these massive corporations' rights, which many see as a benefit.

However, this argument forgets that intangible rights are not *yet* so centralized that independent rights-holders have ceased to exist. While AI will indeed affect central rights-holders, it will also harm individual creators and diminish the bargaining power of those who choose to work with central institutions. I see AI as a neutral factor to the disestablishment of copyright. Due to my roots in the indie music and open-source communities, I'd much rather keep their/our/**your** present rights intact.

Unfortunately, because US copyright law is so easily abused, I think the most likely outcome is that publishers/centralized rights holders get their due, and individual creators get the shaft. This makes me sympathetic to arguments against specific parts of the US's copyright regime as enforced by the courts, such as the DMCA or the statutory language of fair use. We as a voting population have the power to compel our representatives to enact reforms that take the threat of ultimate centralization into account. We can even work to break down what's already here. But I don't think that AI should be the impetus for arguments against the system as a whole.

Finally, remember that perfect is the enemy of good enough. While we're having these discussions about how to regulate GenAI, unregulated use is causing real economic and personal harm to creators and underrepresented minorities.

## The Tech/Legal Argument
Fair warning, this section is going to be the most law-heavy, and probably pretty tech-heavy too. Feel free to skip [[#The First Amendment and the "Right to Read"|-> straight to the policy debates.]]

The field is notoriously paywalled, but I'll try to link to publicly available versions of my sources whenever possible. Please don't criticize my sources in this section unless I actually can't rely on one (*i.e.*, a case has been overruled or a statute has been repealed/amended). This is my interpretation of what's here, and again, not legal advice or a professional opinion. **Seek legal counsel before acting/refraining from action re: AI**. Whether a case is binding on you personally doesn't weigh in on whether its holding is the nationally accepted view.

The core tenet of copyright is that it protects original expression, which the Constitution authorizes regulation of as "works of authorship." This means **you can't copyright facts**. It also results in two logical ends of the spectrum of arguments made by plaintiffs (seeking protection) and defendants (arguing that enforcement is unnecessary in their case). For example, you can't be sued for using the formula you read in a math textbook, but if you scan that math textbook into a PDF, you might be found liable for infringement because your reproduction contains the way the author wrote and arranged the words and formulas on the page.

By far the most common legal argument against training as infringement is that the AI only extracts facts, not the author's expression, from a work. But that position assumes that the AI is capable of first differentiating the two, and then separating them in a way analogous to the human mind's.

### Training

<img src="/Attachments/common_crap.svg" alt="Common Crawl logo edited to say 'common crap' instead" style="padding:0% 5%">

@@ -51,7 +50,7 @@ One common legal argument against training as infringement is that the AI extrac

Everything AI starts with a dataset. And most AI models will start with the easiest, most freely available resource: the internet. Hundreds of different scrapers exist with the goal of collecting as much of the internet as possible to train modern AI (or previously, machine learners, neural networks, or even just classifiers/cluster models). I think that just acquiring data without authorization to train an AI on it is copyright infringement standing by itself.

> [!info]
> Acquiring data for training is an unethical mess even independent of copyright concerns. **In human terms**, scrapers like Common Crawl will take what they want, without asking (unless you know the magic word to make it go away, or just [[Projects/Obsidian/digital-garden#Block the bot traffic!|block it from the get-go]] like I do), and without providing immediately useful services in return like a search engine. For more information on the ethics of AI datasets, read my take on [[Essays/plagiarism#AI shouldn't disregard the need for attribution|🅿️ the need for AI attribution]], and have a look at the work of [Dr. Damien Williams](https://scholar.google.com/citations?user=riv547sAAAAJ&hl=en) ([Mastodon](https://ourislandgeorgia.net/@Wolven)).
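
For the curious, the "magic word" is a crawler opt-out in `robots.txt`. A minimal sketch (`CCBot` and `GPTBot` are the published user-agents for Common Crawl's and OpenAI's crawlers; honoring the file is entirely voluntary on the scraper's end):

```text
# robots.txt: ask AI crawlers to stay out (compliance is voluntary)
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /
```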

The first reason that it's copyright infringement? [*MAI Systems v. Peak Computer*](https://casetext.com/case/mai-systems-corp-v-peak-computer-inc). It holds that RAM copying (*i.e.*, moving a file from somewhere to a computer's memory) is an unlicensed copy. As of today, it's still good law, for some reason. Every single file you open in Word or a PDF reader, and every webpage in your browser, is moved to your memory before it gets displayed on the screen. Bring it up at trivia night: just using your computer is copyright infringement! It's silly and needs to be overruled going forward, but it's what we have right now. And it means that a bot drinking from the firehose is committing infringement on a massive scale.
- I'm very aware that this is a silly argument, but it is an argument and it is precedent.

@@ -60,41 +59,47 @@ But then a company actually has to train an AI on that data. What copyright issu

[The Chinese Room](https://plato.stanford.edu/entries/chinese-room/) is a philosophical exercise authored by John Searle where the (in context, American) subject is locked in a room and receives symbols in Chinese slipped under the door. A computer program tells the subject what Chinese outputs to send back out under the door based on patterns and combinations of the input. The subject does not understand Chinese. Yet to an observer of Searle's room, it **appears** as if whoever is inside it has a firm understanding of the language.

Searle's exercise was at the time an extension of the Turing test. He designed it to refute the theory of "Strong AI." At the time that theory was well-named, but today the AI it was talking about is not even considered AI by most. The hypothetical Strong AI was a computer program capable of understanding its inputs and outputs, and importantly *why* it took each action to solve a problem, with the ability to apply that understanding to new problems (much like our modern conception of Artificial General Intelligence). A Weak AI, on the other hand, is just the Chinese Room: taking inputs and producing outputs among defined rules. Searle reasoned that the "understanding" of a Strong AI was inherently biological, thus one could not presently exist.
- Note that some computer science sources like [IBM](https://www.ibm.com/topics/strong-ai) have taken to using Strong AI to denote only AGI, which was a sufficient, not necessary, quality of a philosophical "intelligent" intelligence like the kind Searle contemplated.

Generative AI models from different sources are architected in a variety of different ways, but they all boil down to one abstract process: tuning an absurdly massive number of parameters to values that produce the most desirable output. (Note: [CGP Grey's video on AI](https://www.youtube.com/watch?v=R9OHn5ZF4Uo) and its follow-up are mainly directed towards neural networks, but do apply to LLMs, and do a great job illustrating this.) This process requires a gargantuan stream of data to use to calibrate those parameters and then test the model. How it parses that incoming data suggests that, even if we ignore the method of acquisition, the AI model still infringes the input. (A toy sketch of the tuning process follows the sidebar.)
- Sidebar: you're nearly guaranteed not to find the optimal combination of several billion parameters, each tunable to several decimals. When I say "desirable," I really mean "good enough."
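
The sketch, with one parameter and made-up numbers standing in for billions of parameters and a real corpus (`loss`, the data, and the learning rate are all hypothetical):

```python
# Toy sketch: "training" nudges a parameter w until the model y = w * x
# fits the data well enough. Real models tune billions of parameters and
# stop at "good enough," not provably optimal.

def loss(w, data):
    # squared error of a one-parameter "model" y = w * x
    return sum((w * x - y) ** 2 for x, y in data)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # pretend training set
w, lr = 0.0, 0.01
for _ in range(200):
    # numerical gradient; real frameworks compute this analytically at scale
    grad = (loss(w + 1e-5, data) - loss(w - 1e-5, data)) / 2e-5
    w -= lr * grad

print(round(w, 2))  # ~2.04: near the best fit, reached by iteration
```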

At the risk of bleeding the [[#Generation]] section into this one, generative AI training creates a sophisticated next-word predictor based on the words it has read and written previously.

This training is deterministic. It's a pure, one-way, data-to-model transformation (one part of the process for which "transformer models" are named). The words are ingested and converted into one of various types of formal representations to comprise the model. It's important to remember that given a specific work and a step of the training process, it's always possible to calculate by hand the resulting state of the model after training on that work. The "black box" that's often discussed in connection with AI refers to the final state of the model, when it's no longer possible to tell what effects the data ingested at earlier steps had on the model.
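
As a minimal sketch of that determinism (a bigram counter standing in for the far more complex representations real models use), the model state after ingesting a text is a pure function of that text:

```python
# Minimal sketch: a deterministic data-to-model transformation. Same input,
# same resulting model state, every time; you could tally it by hand.
from collections import Counter, defaultdict

model = defaultdict(Counter)

def train(model, text):
    words = text.split()
    for a, b in zip(words, words[1:]):
        model[a][b] += 1  # the exact word sequence on the page is what's encoded

train(model, "the cat sat on the mat")
print(model["the"])  # Counter({'cat': 1, 'mat': 1})
```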

In the model, if some words are more frequently associated together, then that association is more "correct" to generate in a given scenario than other options. A parameter sometimes called "temperature" determines how far the model will stray from the correct next word. And the only data to determine whether an association *is* correct would be that training input. This means that an AI trains only on the words as they are on the page. Training can't have some external indicator of semantics that a secondary natural-language processor on the generation side could provide. If it could, it would need some encoding—some expression—that it turns the facts into. Instead, it just incorporates the word as it read it in, and the data about the body of text it was contained in. Training thus can't be analogized to human learning processes, because **when an AI trains by "reading" something, it isn't reading for the *forest*; it's reading for the *trees***. Idea and expression are meaningless distinctions to AI.
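
A sketch of the temperature knob under the same toy assumptions (raw frequencies stand in for a real model's learned weights; the counts are hypothetical):

```python
# Sketch: temperature-scaled sampling over next-word frequencies. Low
# temperature hugs the most "correct" continuation; high temperature strays.
import math
import random

def sample(counts, temperature=1.0):
    words = list(counts)
    logits = [math.log(counts[w]) / temperature for w in words]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]  # numerically stable softmax
    return random.choices(words, weights=weights)[0]

counts = {"mat": 8, "roof": 2, "moon": 1}  # hypothetical training frequencies
print(sample(counts, temperature=0.2))  # almost always "mat"
print(sample(counts, temperature=2.0))  # strays far more often
```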

As such, modern generative AI, like the statistical data models and machine learners before it, is a Weak AI. And weak AIs use weak AI data. Here's how that translates to copyright.
- Sidebar: this point doesn't consider an AI's ability to summarize a work since the section focuses on how the *training* inputs are used rather than how the output is generated from real input. This is why I didn't want to get into generation in this section. It's confusing, but training and generation are merely linked concepts rather than direct results of each other when talking about machine learning, especially when you introduce concepts like temperature in order to simulate creativity.
- ...I'll talk about that in the next section.

#### "The Law Part"
All of the previous analysis has been to establish how an AI receives data so that I can reason about how it *stores* that data. Every legal hypothesis about training is in this section, except fair use, which lives in [[#Fair Use|Policy: Fair Use]].

First, I think a very convoluted analogy is helpful here. Let's say I publish a book. Every page of this book is a different photograph. Some of the photos are public domain, but the vast majority are copyrighted, and I don't have authorization to publish those ones. Now, I don't just put the photos on the page directly; that would be copyright infringement! Instead, each page is a secret code that I derive from the photo (and all other photos already in the book) that I can decipher to show you the photo (if you ask me to, after you've bought the book). Is my book still copyright infringement? (A toy version of the secret code follows the bullets below.)
- Alternatively, I let you download the instructions on how to access a photo from the secret codes in the book onto your computer. Now, if an artist uses these instructions and gets their own photo, and they sue me, did I injure them or did they injure themselves?
    - This analogy relates to the standing argument in *Doe v. GitHub*.
- Related but ludicrous: suppose I'm not selling the book. I bought prints of all these photographs for myself, and if you ask me to, I'll show you a photograph that I bought. But since I only bought one photograph, if I'm showing you the photograph I bought, I can't be showing it to someone else at the same time. This *is* considered copyright infringement?!?! At least, that's what *Hachette v. Internet Archive* tells us.
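
The toy version, with base64 standing in for the model's far more convoluted encoding; the point is that a reversible derivation of the photo is not the absence of the photo:

```python
# Toy sketch of the analogy: each "page" stores an encoding derived from the
# photo rather than the photo itself, yet the original is reproducible on demand.
import base64

photo = b"\x89PNG...imagine copyrighted pixels here..."

page = base64.b64encode(photo)      # what the "book" actually contains
print(page[:24])                    # looks nothing like the photo

recovered = base64.b64decode(page)  # "deciphered" when you ask
assert recovered == photo           # ...but it *is* the photo
```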

In copyright, reproduction of expression is infringement. And I believe that inputting a work into a generative AI creates an infringing derivative of the work, because it reproduces both the facts and expression of that work in a way that you could do by hand. Eventually, the model is effectively a compilation of all works passed in. Finally—on a related topic—there is nothing copyrightable in how the model has arranged the works in that compilation, even if every work trained on is authorized.

Recall that training on a work incorporates its facts and the way the author expressed those facts into the model. When the training process takes a model and extracts weights on the words within, it's first reproducing copyrightable expression, and then creating something directly from the expression. You can analogize the model at this point to a translation (a [specifically recognized](https://www.law.cornell.edu/uscode/text/17/101#:~:text=preexisting%20works%2C%20such%20as%20a%20translation) type of derivative) into a language the AI can understand. But where a normal translation would be copyrightable (if authorized) because the human translating a work has to make expressive choices and no two translations are exactly equal, an AI's model would not be. A given AI will always produce the same translation for a work it's been given; it's not a creative process. Even if every work trained on expressly authorized the training, I don't think the resulting AI model would be copyrightable. And absent authorization, it's infringement.
- I desperately want Adobe to sue someone for appropriating their new model now so I can see if this theory holds up. The fight might turn on an anti-circumvention question, because if it's not a copyrightable work, there's no claim from circumventing protections on that work.

As the AI training scales and amasses even more works, it starts to look like a compilation, another type of derivative work. Normally, the expressive component of an authorized compilation is in the arrangement of the works. Here, the specific process of arrangement is predetermined, and encompasses only uncopyrightable material. I wasn't able to find precedent on whether a deterministically-assembled compilation of uncopyrightable derivatives passes the bar for protection, but that just doesn't sound good. Maybe there's some creativity in the process of creating the algorithms for layering the model (related: is code art?).
- There's a thread running through this and a few other points: because the iteration is on such a gargantuan scale, it discounts the fact that you could (over a period of years) theoretically recreate the exact compilation by hand following the AI's steps, and that the arrangement is completely fungible in that way. This is one facet of how GenAI is well suited to helping a person avoid liability. More in the [[#Policy]] section.

The Northern District of California has actually considered this infringing-derivative argument in *Kadrey v. Meta*. They called it "nonsensical", and based on how it was presented in that case, I don't blame them. Looking at how much technical setup I needed to properly make this argument, I'd have some serious difficulty compressing this all into something a judge could read (even ignoring court rule word limits) or that I could orate concisely to a jury. I'm open to suggestions on a more digestible way to persuade people of this point, since the *Kadrey* plaintiffs also failed.

#### Detour: point for the observant
The idea and expression being indistinguishable to an AI may make one immediately think of merger doctrine. That argument looks like: the idea inherent in the work trained on merges with its expression, so that segment of the training data must not be copyrightable. However, that argument would not be a correct reading of the doctrine. [*Ets-Hokin v. Skyy Spirits, Inc.*](https://casetext.com/case/ets-hokin-v-skyy-spirits-inc) suggests that the doctrine is more about disregarding the types of works that are low-expressivity by default, and that this "merger" is just a nice name to remember the actual test by. Confusing name, easy doctrine.
- Yet somehow this doctrine doesn't extend to RGB colors. I'll die on the hill that you shouldn't be able to copyright a hex code the same way you can't copyright an executable binary. I know, small specific part of US copyright doctrine that I'm sympathetic to arguments against, moving on.

### Generation
The model itself is only one side of the legal AI coin. What of the output? First, it's certainly not copyrightable. The US is extremely strict when it comes to the human authorship requirement for protection. If an AI is seen as the creator, the requirement is obviously not satisfied. And the human "pushing the button" probably isn't enough either. But does the output infringe the training data? It depends.

#### Human Authorship
As an initial matter, AI-generated works do not satisfy the human authorship requirement. This makes them uncopyrightable, but more importantly, it also gives legal weight to the distinction between the human and AI learning process. Like I mentioned in the training section, it's very difficult to keep discussions of training and generation separate because they're related concepts, and this argument is a perfect example of that challenge.

#### Summaries
This is probably the most direct non-technical refutation of the "AI understands what it trains on" argument possible. I also think it's the most important aspect of current generative models for me to highlight. **The question**: If an AI can't understand what it reads, how does it choose what parts of a work should be included in a summary of that work? A book, an article, an email?

Once again, the answer is mere probability. In training, the model is told which word is more "correct" to come after another by how many times that sequence of words occurs in its training data. And in generation, if more of the work mentions a particular subject than the actual conclusion of the work, the subject given the most attention will be what the model includes in a summary.
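
A crude sketch of that selection pressure (real systems score tokens with learned weights rather than raw counts; the toy document and scoring below are illustrative only):

```python
# Sketch: pick the "summary" sentence by word-frequency score alone. The
# most-discussed subject wins, whether or not it's the work's actual conclusion.
import re
from collections import Counter

def summarize(text):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(s):
        words = re.findall(r"[a-z']+", s.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)
    return max(sentences, key=score) + "."

doc = ("The study spends pages on methodology. The methodology section "
       "describes the survey methodology in detail. The conclusion, briefly, "
       "is that the effect is small.")
print(summarize(doc))  # returns a methodology sentence, not the conclusion
```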

@@ -105,29 +110,34 @@ So how do corporations try to solve the problem? Human-performed [microtasks](ht

AI can get things wrong; that's not new. Take a look at this:

![[limmygpt.png|Question for chatgpt: Which is heavier, 2kg of feathers or 1kg of lead? Answer: Even though it might sound counterintuitive, 1 kilogram of lead is heavier than 2 kilograms of feathers...]]

Slight variance in semantics, same answer, because it's the most popular string of words to respond to that pattern of a prompt. Again, nothing new. Yet GPT-4 will get it right. This probably isn't due to an advancement in the model. My theory is that OpenAI looks at the failures published on the internet (sites like ShareGPT, Twitter, etc) and has remote validation gig workers ([already a staple in AI](https://www.businessinsider.com/amazons-just-walk-out-actually-1-000-people-in-india-2024-4)) "correct" the model's responses to that sort of query. In effect, corporations could be exploiting ([yes, exploiting](https://www.noemamag.com/the-exploited-labor-behind-artificial-intelligence/)) developing countries to create a massive **network of edge cases** to fix the actual model's plausible-sounding-yet-wrong responses.
- This paragraph does border on conspiracy theory. However, which is more likely:
    - Company in the competitive business of *wow*ing financial backers leverages existing business contacts to massively boost user-facing performance of their product as a whole at little added cost; or
    - Said company finds a needle of improvement over their last haystack in an even *bigger* haystack that enables the most expensive facet of their product to do more of the work.

> [!question]
> I won't analyze this today, but who owns the human-authored content of these edge cases? They're *probably* expressive and copyrightable.

#### Expression and Infringement; "The law part" again
Like training, generation also involves reproduction of expression. But where a deterministic process creates training's legal issues, generation is problematic for its *non*-deterministic output.

It can be said that anything a human produces is just a recombination of everything that person's ever read. Similarly, that description is a simplified understanding of how an AI trains.

==MORE==

However, everything a *person* has ever read is stored as concepts, floating around in their brain. My brain doesn't have a specific person's explanation of a transformer model architecture prepped, or even particular phrases from that explanation. It has a "visual" and emotional linkage of **ideas** that other regions of my brain leverage vocabulary to put to paper when I explain it. An AI stores words that occurred in its corpus that can be considered responsive to the prompt. It may also have words that succeeded the prompt as the next portion in a containing work of both the prompt and the output. N-grams, not neurons.

The key difference: talking about a human brain making a work by recombining its input is **metaphor**; talking about an AI recombining a work is **technologically accurate**. A chatbot goes to look at the secret code and shows you the photograph it corresponds to when you ask it to.

Naturally, there are occurrences where a human and an AI would reach approximately the same factual response if you asked them the same question. So what makes some AI output infringement? The same thing that makes some human responses copyright infringement: reproduction of a copyrighted work. But the difference is that some human responses would be copyrightable in themselves because they don't reproduce enough of a specific work or multiple works to be considered either an ordinary derivative or a compilation derivative. ==ugh this is hardddd==

#### Detour: actual harm caused by specific uses of AI models
My bet for a strong factor when courts start applying fair use tests to AI output: **harm**. { *and I actually wrote this before the [[Essays/no-ai-fraud-act|No AI FRAUD Act]]'s negligible-harm provision was published, -ed.* } Here's a quick list of uses that probably do cause harm, some of them maybe even harmful *per se* (definitely harmful without even looking at specific facts).
- Election fraud and misleading voters, including even **more** corporate influence on US elections ([not hypothetical](https://www.washingtonpost.com/elections/2024/01/18/ai-tech-biden/) [in the slightest](https://web.archive.org/web/20240131220028/https://openai.com/careers/elections-program-manager), [and knowingly unethical](https://www.npr.org/2024/01/19/1225573883/politicians-lobbyists-are-banned-from-using-chatgpt-for-official-campaign-busine))
    - [Claiming genuine media is AI-generated](https://www.washingtonpost.com/politics/2024/03/13/trump-video-ai-truth-social/), misleading voters?
- Other fraud, like telemarketing/robocalls, phishing, etc
- Competition with actual artists and authors (I am VERY excited to see where trademark law evolves around trademarking one's art or literary style. Currently, the arguments are weak and listed in the mini-argument section.)
- Obsoletes human online workforces in tech support, translation, etc
- [[Essays/plagiarism##1 Revealing what's behind the curtain|🅿️ Reinforces systemic bias]]
- [Violates the GDPR on a technological level](https://www.theregister.com/2024/04/29/openai_hit_by_gdpr_complaint/)
    - I also think being unable to delete personal data that it *has* acquired and not just hallucinated is a big problem generally

#### Detour 2: An Alternative Argument
There's a more concise and less squishy argument that generative AI output infringes on its training dataset.

Recall that AI output taken right from the model (straight from the horse's mouth) is [not copyrightable according to USCO](https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence). If the model's input is copyrighted, and the output can't be copyrighted, then there's nothing in the AI "black box" that adds to the final product, so it's literally *just* the training data reproduced and recombined. Et voilà, infringement.

@@ -135,7 +145,7 @@ This isn't to say that anything uncopyrightable will infringe something else, bu

Note that there are many conclusions in the USCO guidance, so you should definitely read the whole thing if you're looking for a complete understanding of the (very scarce) actual legal coverage of AI issues so far.

### Where do we go from here?
Well, getting to evaluation of the above by courts would be a start. Right now, courts are ducking AI issues left and right on standing and pleading grounds. Once there's more solid coverage of the legal arguments on the merits, whether the law *should* be enforced will become prudent.

# Policy
These arguments will be more or less persuasive to different people. I think there's a lot more room for discussion here because they become relevant to the future direction of the law as well as current enforcement. The most important debate is up first, but the others are not particularly ordered.

@@ -143,6 +153,8 @@ These arguments will be more or less persuasive to different people. I think the

> More topics under this section forthcoming! I work and edit in an alternate document and copy over sections as I finish them.

## Fair Use
In modern copyright practice, this defense seems to be the pivotal question. It's probably going to be the exact same in AI.

Whenever a legal doctrine has strong roots in collective consciousness and policy, there's an epistemological question about how to approach the issue. The debate asks: in the abstract, should the courts protect what *descriptively is* considered within the bounds of protection, or what *ought to be* recognized by society as deserving protection?
- Nerd sidebar: This debate is common in criminal law. For example, examine the reasonable expectation of privacy. *Are* members of the public actually concerned with police access to the data on their phone, or do they think they have nothing to hide? *Should* they be? Recent cases on searches and third party access trend towards analysis under the latter, more paternalistic position.

@@ -152,7 +164,7 @@ Because it's such an alien technology to the law, I'd argue that generative AI's

US fair use doctrine has four factors, of which three can speak to whether it ought to be enforced.

### Purpose and character of the use

Training is conducted at a massive scale. Earlier, I mentioned the firehose.

But for generated output, this factor gets messier. Criticism or comment? Of/on who/what? I can think of one use that would be fair use, but only to defend the person using the model to generate text: criticism of the model itself, or demonstration that it can reproduce copyrighted works. Not to mention if a publisher actually sued a person for *using* a generative AI, that would Streisand Effect the hell out of whatever was generated.

### Nature of training data

@@ -92,7 +92,7 @@ Below is a statement from Scarlett Johansson:
>
> When I heard the released demo, I was shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference. Mr. Altman even insinuated that the similarity was intentional, tweeting a single word "her" - a reference to the film in which I voiced a chat system, Samantha, who forms an intimate relationship with a human.
>
> Two days before the ChatGPT 4.0 demo was released, Mr. Altman contacted my agent, asking me to reconsider. Before we could connect, the system was out there.
>
> As a result of their actions, I was forced to hire legal counsel, who wrote two letters to Mr. Altman and OpenAI, setting out what they had done and asking them to detail the exact process by which they created the "Sky" voice. Consequently, OpenAI reluctantly agreed to take down the "Sky" voice.
>

@@ -5,26 +5,31 @@ tags:
- misc
- seedling
date: 2024-09-14
lastmod: 2024-10-23
draft: false
---
Recent studies reveal that the use of AI is becoming increasingly common in academic writings. On [Google Scholar](https://misinforeview.hks.harvard.edu/article/gpt-fabricated-scientific-papers-on-google-scholar-key-features-spread-and-implications-for-preempting-evidence-manipulation/), and on [arXiv](https://arxiv.org/abs/2403.13812); but most shockingly, on platforms like Elsevier's [Science Direct](http://web.archive.org/web/20240315011933/https://www.sciencedirect.com/science/article/abs/pii/S2468023024002402) (check the Introduction). Elsevier supposedly prides itself on its comprehensive review process, which ensures that its publications are of the highest quality. More generally, the academic *profession* insists that it possesses what I call [[Dict/integrity|integrity]]: rigor, attention to detail, and authority or credibility. But AI is casting light on a greater issue: **does it**?

## Competing framings
I think there are two ways of framing the emergence of the problem.

### 1: Statistical (not dataset) bias and sample size
The first framing is simple proportionality. Journal submission numbers have increased rapidly for decades: [Atherosclerosis](https://www.atherosclerosis-journal.com/article/S0021-9150(13)00456-5/abstract)
![[Attachments/papers.png|graph showing almost an exponential trend in paper submission from 1970 (about 5 million) to 2013 (over 40 million).]]

This trend has remained consistent in the past decade; I just couldn't find as nice a graphic. [science.org](https://science.org/content/page/journal-metrics) shows nearly 40 thousand submissions over its corpus in 2023. Thus, it's natural that more low-quality papers would "slip through the cracks" or similar.

I'm not interested in doing the statistical analysis (especially because it would probably require creating a quantitative analysis metric for integrity, which I super don't want to spend time on), but this is just one hypothesis.

I am, however, aware of the argument that there's nothing inherently bad about a rise in submissions in and of itself. I agree insofar as such a claim would be unsubstantiated. For example, the above Atherosclerosis article uses the graph as evidence of what it calls "filter failure." I don't think that's the case. Instead, it may be indicative of a different systematic problem being *solved*: barriers to access are coming down as access to education improves. But AI removes this cause for celebration, because by degrading the integrity of the journals by which we measure the progress, it lowers the importance of that achievement.

### 2: We've Always Done It This Way
Second, which I find more persuasive (but shocking), is the question: has it just always been like this?

There's a possibility that the thought of academic integrity is just a facade meant to preserve the aforementioned barriers to access. I do understand how it can be seen as hegemonic in nature. Detail requires (paid) time, intellectual rigor requires education, credibility requires access to information, authority requires experience...

If that's the case, I'd definitely advocate for a shift in academic norms. Think of it like moving the Venn diagram circles around a bit in a way that we can accomplish all of:
- Preserving the thought of journals as having the new conception of integrity;
- Continuing to classify AI works as lacking integrity; and
- Dismantling the bad-faith barriers to access presented by the old normative definition.

## Further Reading
In my view, a critical component of academic or purportedly informative works is the establishment of authority/credibility in a way that's verifiable by other people. I have an incomplete [[Essays/plagiarism|essay on plagiarism]] where I explore this facet.

I subscribe to a style of writing called "academ*ish*" voice on this site, documented by [Ink and Switch](https://inkandswitch.notion.site/Academish-Voice-0d8126b3be5545d2a21705ceedb5dd45). Pointing out all the ways that even this less-serious style is fundamentally incompatible with AI-generated text is left as an exercise for the reader.

@@ -14,7 +14,9 @@ It goes without saying that anything herein constitutes my own opinion and not t

## Attribution
Feel free to properly reference any of the content within, in your own gardens or work. Don't plagiarize. A link to the page you used is just fine.

**Do not input my work into an online or offline generative AI for any purpose, including to train or update the model, explore alternate positions to mine, or to converse with the work.**
- If you need an alternative explanation to understand my argument, please ask me! I run this site to practice explaining these topics.
- If you @ me with "I asked chatgpt the same question and here's what it said", it's going to say more about your ability to effectively converse on these subjects than it will about the merits of the AI's position.

## Privacy/Terms of Use
- I don't run analytics of any kind on this site.
- I don't share any of my content with third parties, nor do I consent to third party use of my content in which I retain copyright.
@@ -6,7 +6,7 @@ tags:
- programming
- difficulty-easy
- seedling
draft: false
date: 2024-07-25
lastmod: 2024-09-05
---

@@ -1,20 +1,19 @@
---
title: 10/24 - Summary of Changes
draft: false
tags:
- "#update"
date: 2024-10-03
lastmod: 2024-11-01
---
## Housekeeping
Happy spooky season!

As of posting this, I've been on the fediverse/socialweb for almost a year. The amount of computer-interested folks on there is just amazing for community and troubleshooting. It's not without severe problems by way of moderation and groups platforming abusers, but on an instance with superb moderation, you'll see much less of it. I can't applaud the quality of [Treehouse](https://social.treehouse.systems)'s moderation enough; our instance admin has decades of experience dating back to IRC and makes the experience so much nicer.

## Pages
- New: [[Misc/ai-integrity|Academic Integrity and AI]]. A nice little preview before I drop the infringement page...
- New: [[Essays/write-something|You should write something!]]
- New for real: [[Projects/keyboards|A Mechanical Keyboard Journey]]
- Content Update: [[Projects/nvidia-linux|Nvidia on Linux]]

## Status Updates
-

## Helpful Links
[[todo-list|Site To-Do List]] | [[Garden/index|Home]]