Quartz sync: Dec 12, 2024, 11:39 AM

bfahrenfort 2024-12-12 11:39:16 -06:00
parent 0b3c84a3f4
commit 520d7982bf
11 changed files with 151 additions and 41 deletions

View File

@ -36,7 +36,7 @@ Training is a deterministic process. It's a pure, one-way, data-to-model transfo
Training can't be analogized to human learning processes, because when an AI trains by "reading" something, it isn't reading for the *forest*; it's reading for the *trees*. In the model, if some words are more frequently associated together, then that association is more "correct" to generate in a given scenario than other options. A parameter sometimes called "temperature" determines how far the model will stray from the correct next word. And the only data to determine whether an association *is* correct would be that training input. This means that an AI trains only on the words as they are on the page. Training can't have some external indicator of semantics the way a secondary natural-language processor on the generation side could. If it could, it would need some encoding—some expression—that it turns the facts into. Instead, it just incorporates the word as it read it in, and the data about the body of text it was contained in.
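To make the "temperature" point concrete, here's a minimal sketch of temperature-scaled sampling (my own toy illustration; real models do this over vocabularies of tens of thousands of tokens, and "logits" is just the standard name for the raw association scores):

```python
import math
import random

def sample_next_word(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Pick the next word from raw association scores.

    Low temperature sharpens the distribution toward the most "correct"
    next word; high temperature lets the model stray further from it.
    """
    scaled = {word: score / temperature for word, score in logits.items()}
    top = max(scaled.values())
    # Softmax (shifted by the max for numerical stability).
    exps = {word: math.exp(s - top) for word, s in scaled.items()}
    total = sum(exps.values())
    probs = {word: e / total for word, e in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Toy association scores for the blank in "the quick brown ___":
print(sample_next_word({"fox": 3.0, "dog": 1.0, "cat": 0.5}, temperature=0.2))
```

At `temperature=0.2` this returns "fox" almost every time; at `temperature=2.0` the "wrong" words start showing up. That knob is the entire extent of the model's "creativity."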
Some transformer models include a structure called a Multi-Layer Perceptron ("MLP" { *training is magic -ed.* }), which is often simplified as "the place where the AI stores facts." However, it's just another matrix-based component of the model with different math that makes it better at preserving a certain type of word association: mathematically, most word generation is linear (really linear-and-tomfoolery, but whatever) on the probability-of-occurrence scale. An MLP corrects this mathematical limitation by adding "layers" of generation that roughly preserve associations in non-linearly separable data, a class which *includes* facts. As such, the model performs better if MLPs get more authority over its output in portions where it makes sense to give them that control (and determining that "where" is yet another black box of training). If you've ever seen an AI hallucinate a falsehood in the next sentence after it's been trained on the correct answer, you know that the MLP isn't really storing facts. (There's a toy sketch of the non-linear-separability point after the notes below.)
- Phrases like "authority over the output" really belong in a generation section. It's probably an intuitive enough concept to be included here without further context though.
- Sidebar: Taking this to its logical extreme and demonstrating that self-attention (or any sort of attention component, really) is not a substitute for short-term memory would solidify the fact that generative AI training cannot be likened to a human's capacity to process and store information.
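If the "non-linearly separable" jargon is opaque, here's the promised toy sketch (mine, not any production architecture): a single matrix multiply can never compute XOR, but add one hidden layer with a non-linearity and it falls out immediately. That is the entire mathematical trick an MLP contributes.

```python
import numpy as np

# A toy two-layer perceptron computing XOR, the classic function that is
# NOT linearly separable: no single linear (matrix-multiply) layer can
# represent it, but linear -> non-linearity -> linear can.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])          # first linear layer
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])           # second linear layer

def mlp(x: np.ndarray) -> float:
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU non-linearity between layers
    return float(hidden @ W2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mlp(np.array(x)))            # prints 0, 1, 1, 0
```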
@ -47,17 +47,17 @@ As such, idea and expression are meaningless distinctions to AI.
<img src="/Attachments/common_crap.svg" alt="Common Crawl logo edited to say 'common crap' instead" style="padding:0% 5%">
A very big middle finger to the Common Crawl dataset, whose CCBot still tries to scrape this website. [[Projects/Obsidian/digital-garden#Block the bot traffic!|Block the bot traffic]]. If I had the time or motivation, I would find a way to redirect these bots, instead of blocking them, to an AI-generated fanfiction featuring characters from The Bee Movie.
## Generation
Generative AI training, for LLMs, creates a sophisticated next-word predictor that generates text based on the words it has read and written previously.
In the case of image models, it creates an interpolator that starts from a noise pattern and moves ("diffuses") values until they resemble portions of its training data. Specifically, portions which it has been told have roughly parallel expression to the prompt given to it by the user.
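Here's a heavily simplified sketch of that noise-to-values loop. The "denoiser" below is a hard-coded stand-in pointing at a fixed target; in a real model it's a trained network conditioned on the prompt, working over millions of pixels.

```python
import random

TARGET = [0.1, 0.9, 0.4, 0.7]   # stand-in for "portions of training data"

def fake_denoiser(pixels: list[float]) -> list[float]:
    """Stand-in network: points from the current values toward the target."""
    return [t - p for p, t in zip(pixels, TARGET)]

pixels = [random.random() for _ in TARGET]   # start from pure noise
for _ in range(50):
    step = fake_denoiser(pixels)
    pixels = [p + 0.1 * s for p, s in zip(pixels, step)]   # "diffuse" the values

print([round(p, 2) for p in pixels])   # ~TARGET: the noise has moved until it matches
```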
This is the reason that the term "hallucination" is misleading: **all AI-generated text is "hallucinated."** Some of it just happens to be "shaped" like reliable information. Many discrete procedures are bolted onto the back of the model to bring the reliability numbers up, but they do nothing to affect the originality of the work.
[[Misc/generation-copyright|Generated output may infringe the training data]].
## Other/emerging terminology
"Retrieval-augmented generation" (RAG) partitions off a specific set of a model's training data as the "knowledge body", which the model will attempt to copy-paste from when responding to your questions. It's implemented by skewing the weights of the training data, and searching the output back in the knowledge body to find the source of the output.
"Retrieval-augmented generation" (RAG) partitions off a specific set of a model's training data as the "knowledge body", which the model will attempt to copy-paste from when responding to your questions. It's implemented by skewing the weights of the training data to favor the knowledge body, and then using the output to search in the knowledge body for its source. In other words, the AI isn't saying "I got this information from this source", it's going "here is a statement" and then before outputting the statement to the user, it plugs it into an exact-match search of its dataset to find original.
"Deep document understanding" is the name of a tool to classify regions of a file. It's a misnomer, this is not in and of itself an 'understanding' any more than drawing circles around your tax return boxes would be.

View File

@ -0,0 +1,19 @@
---
title: Methodologies
tags:
- meta
- project
- glossary
- misc
- legal
- programming
date: 2024-12-01
lastmod: 2024-12-01
draft: true
---
This is my brain's approach to problems, as I understand it. It's like that section in a scientific paper where the authors describe their method.
## Method
### Programming
Many of my mental tricks here are just a faster version of writing the algorithms out on a whiteboard or a sticky note. It's a skill I picked up in undergrad.
I think in terms of visual motion a lot. I typically think of windows and slices as the matrix notation brackets actually moving in space through a field of inputs. If I have to iterate around the field, I picture a 'caret' of sorts at the current step.

content/Atomic/rss.md Normal file
View File

@ -0,0 +1,38 @@
---
title: RSS
tags:
- rss
- foss
- meta
- glossary
- difficulty-easy
date: 2024-11-26
lastmod: 2024-11-26
draft: false
---
RSS, or Really Simple Syndication, is the most convenient and privacy-conscious way to subscribe to a website, a social media account, and more. No site analytics, no page loads, no JavaScript, no ads. All the website can see is that you pulled one file from it, yet you can still read all the content you wanted from the website in a compact format.
On the technology side, think of a social media account. When you scroll someone's posts on their account page, the website is pulling those posts from a database. But what if it instead stored only the X most recent posts' content and metadata in an easy-to-access text file that users cache locally, rather than requiring complicated backend infrastructure? That's RSS.
- And in fact, on the Fediverse, [there's an RSS feed for every Mastodon account](https://fedi.tips/following-mastodon-and-fediverse-accounts-through-rss/).
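If you've never peeked inside one, a feed really is just a short text file. Here's a minimal, made-up RSS document and the handful of standard-library Python lines it takes to read it:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS feed: the X most recent posts as plain XML.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Garden</title>
    <link>https://example.com</link>
    <item>
      <title>Newest post</title>
      <link>https://example.com/newest</link>
      <pubDate>Thu, 12 Dec 2024 11:39:00 -0600</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
```

One HTTP GET for that file is all a reader ever sends; everything else happens locally.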
This makes it good for uses like news or podcasts, which don't need to incentivize engagement after you see new content.
RSS isn't dead (despite [suggestions to the contrary haha](https://rss-is-dead.lol)).
There is also a derivative of RSS called Atom, which accomplishes much the same thing with slightly different syntax.
## Getting started with RSS - User
Find an [[Programs I Like/rss-readers|RSS reader]] and go to a web page you'd like to follow.
They might advertise that they have a feed (perhaps with an icon like this: <img class="bf-icon" src="https://upload.wikimedia.org/wikipedia/en/4/43/Feed-icon.svg">). If not, try pasting the homepage link into your feed reader to let it auto-discover the feeds, or click your "add to RSS" bookmark if the reader had you set one up.
If all else fails, they still might have a feed! Try these common feed paths, and if one shows a page that isn't just the homepage again or a "this page doesn't exist" error, paste it into your feed reader (the sketch after this list automates the check):
- `site.com/index.xml`
- `site.com/index.rss`
- `site.com/feed.xml`
- `site.com/feed.rss`
- `site.com/rss.xml`
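And here's a small script that probes those paths for you (a hypothetical helper using only the standard library; the content sniffing is deliberately crude):

```python
import urllib.request

COMMON_PATHS = ["index.xml", "index.rss", "feed.xml", "feed.rss", "rss.xml"]

def find_feed(site: str) -> str | None:
    """Try the common feed locations and return the first that looks like one."""
    for path in COMMON_PATHS:
        url = f"{site.rstrip('/')}/{path}"
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                head = resp.read(512).decode("utf-8", errors="replace")
        except OSError:
            continue
        if "<rss" in head or "<feed" in head:   # RSS, or its Atom derivative
            return url
    return None

print(find_feed("https://example.com"))
```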
You can also turn email lists into RSS feeds to unclutter your email with [Kill The Newsletter](https://kill-the-newsletter.com).
## RSS for developers
I contribute to RSS feed integration; read my thoughts on it [[Projects/rss-foss|here]].

View File

@ -24,49 +24,66 @@ The most important debate is up first, but the others are not particularly order
## Fair Use
In modern copyright practice, this defense seems to be the pivotal question. It's probably going to be the exact same in AI.
I choose to link training fair use and generation fair use. Generation "uses" the works as encoded in the statistical model, which were naturally part of the data used for training. Technologically, they aren't different "uses" of the same data, they're just steps in a process (you're always going to generate with a trained model). Thus, if training is found to be fair use, generation would be fair use as well. **However**, there are arguments for fair use that would absolve the user who generated the content, yet still hold the proprietor of the model liable! This is another facet of copyright that needs to be uniquely applied to AI, as infringement is bilateral, yet fair use is more complex.
Whenever a legal doctrine has strong roots in collective consciousness and policy, there's an epistemological question about how to approach the issue. The debate asks: in the abstract, should the courts protect what *descriptively is* considered within the bounds of protection, or what *ought to be* recognized by society as deserving protection?
- Nerd sidebar: This debate is common in criminal law. For example, examine the reasonable expectation of privacy. *Are* members of the public actually concerned with police access to the data on their phone or do they think they have nothing to hide? *Should* they be? Recent cases on searches and third party access trend towards analysis under the latter, more paternalistic position.
In fair use, the first ("empirical") perspective teaches that fair use should only extend to concepts analogous to prior enforcement which has been accepted in the collective consciousness. In contrast, the second ("normative") perspective would disregard comparison with enforcement in favor of comparison with societal values.
Because it's such an alien technology to the law, I'd argue that generative AI's fair use should be analyzed in view of the normative approach. Even so, I don't think AI training or generation should be considered fair use.
US fair use doctrine has four factors, of which two speak most directly to whether it ought to be enforced.
### Purpose and character of the use
Training is conducted at a massive scale. Earlier, I mentioned the firehose. This isn't relevant to the (not discussed) amount-and-substantiality factor because that operates on a per-work basis.
But for generated output, this factor gets messier. Is the work "criticism or comment"? Of/on who/what? I can think of one use that would be fair use, but only to defend the person using the model to generate text: criticism of the model itself, or demonstration that it can reproduce copyrighted works. Not to mention, if a publisher actually sued a person for *using* a generative AI, that would Streisand Effect the hell out of whatever was generated.
### Market value, or competition
And most importantly (especially in recent years), let's talk about the competitive position of an AI model. This is directly linked to the notion that AI harms independent artists, and is the strongest reason for enforcement of copyright against AI in my opinion.
Interestingly, I think the USCO Guidance [[#Detour 2 An Alternative Argument|talked about in the Generation section]] is instructive. It analogizes prompting a model to commissioning art, which applies well to a discussion of competition. AI lets me find an artist and say to them, "I want a Warhol, but I don't want to pay Warhol prices"; or "I want to read Harry Potter, but I don't want to give J.K. Rowling my money \[for good reason\]." The purpose of AI's "work product" is solely to compete with human output.
- "I want a contract, but I can't afford a lawyer."
- "I want a website, but I don't know how to program."
A problem I have not researched in detail is the level of competency in an alternative that a plaintiff needs to prove in order to establish that the infringer does compete with the underlying work. Today, many people see AI as the intermediate step on the scale between the average proficiency of an individual at any given task (painting, photography, poetry, *shudder* legal matters) and that of an expert in that field. Does AI need to be "on the level" of that expert in order to be considered a competitor? It certainly makes a stronger argument for infringement if it is, as with creative mediums. But does this hold up with legal advice, where it will produce output but (in my opinion) sane professionals should tell you that AI doesn't know the first thing about the field?
Professionals might not even see AI as a competitor. Say a client comes to a software engineer and says "I made a website with AI, can you look it over and touch it up?" That poor engineer's first impression is that the work will take significantly longer, since it is now entirely the worst task of all: code review. Nonetheless, I think the professionals' opinion and the actual competitive potential shouldn't be considered. The important fact is that **genAI has the impression of an alternative, and does indeed lead consumers at the individual and enterprise levels to use it instead of a human**, regardless of its efficacy.
Note that there are very valid criticisms of being resistant to a technology solely because of the "AI is gonna take our jobs" sentiment. I think there are real parallels between that worry and a merits analysis of the competition factor. So if you're persuaded that AI skepticism is FUD over potentially losing one's job, that would probably mean that you disagree with my evaluation of this factor.
### Final thoughts on fair use
I didn't see fit to engage with it in detail here, but the amount and substantiality of the use would probably have some effect on AI depending on the plaintiff. ==BIG PUBLISHER VS ARTIST?==
## Who's holding the bag?
https://www.wsj.com/tech/ai/the-ai-industry-is-steaming-toward-a-legal-iceberg-5d9a6ac1?St=5rjze6ic54rocro&reflink=desktopwebshare_permalink
At some point, this AI experiment is going to go very wrong. Someone is going to use AI to cause harm, whether knowingly or negligently, and there will be a lawsuit. AI could be attractive to corporate employees because, in some cases, it may allow them to dodge accountability for decisions made by the AI. But should that hold up legally?
The two potential wrongdoers are the proprietor of the AI model and the entity that used the AI model for the alleged harm. ==WIP==
### Detour: Section 230 (*again*)
Well, here it is once more. When the proprietor is a website (or, really, an "interactive computer service"), and a service user uses the model in a way that results in harm, it will invariably involve a Section 230 issue.
I think that you can identify a strangely inverse relationship between fair use and § 230 immunity. If the content is directly what was put in (and is not fair use), then it's user content, and Section 230 immunity applies. If the content produced by an AI is *not* just the user's content and is in fact transformative fair use, then it's the website's content, not user content, and the website can be sued for the effects of its AI. Someone makes an investment decision based on the recommendation of ChatGPT? Maybe it's financial advice. I won't bother engaging the effects further here. I have written about § 230 and AI [[no-ai-fraud-act#00230: Incentive to Kill|elsewhere]], albeit in reference to AI-generated user content *hosted* by the platform.
## The First Amendment and the "Right to Read"
This argument favors allowing GAI to train on the entire corpus of the internet, copyright- and attribution-free, and bootstraps GAI output into being lawful as well. The position most commonly taken is that the First Amendment protects a citizen's right to information, and that there should be an analogous right for generative AI.
The right to read, at least in spirit, is still being enforced today. Even the 5th Circuit (!!!) believed that this particular flavor of First Amendment claim would be likely to succeed on appeal after prevailing at the trial level. [*Book People v. Wong*](https://law.justia.com/cases/federal/appellate-courts/ca5/23-50668/23-50668-2024-01-17.html), No. 23-50668 (5th Cir. 2024) (not an AI case). It also incorporates principles from intellectual property law. Notably, this argument states that one can read the content of a work without diminishing the value of the author's expression (*i.e.*, ideas aren't copyrightable). As such, the output of an AI is not taking anything from an author that a human wouldn't take when writing something based on their knowledge.
I take issue with the argument on two points that stem from the same technological foundation.
First, as a policy point, the argument incorrectly humanizes current generative AI. There are no characteristics of current GAI that would warrant the analogy between a human reading a webpage and an AI training on that webpage. Even emerging tools like the improperly named [Deep Document Understanding](https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md) —which claim to ingest documents "as \[a\] human being"—are just classifiers on stochastic data at the technical level, and are not actual "understanding."
Second, and more technically, [[Atomic/gen-ai#Training|the training section]] is my case for why an AI does not learn in the same way that a human does in the eyes of copyright law. ==more==
But for both of these points, I can see where the confusion comes from. The previous leap in machine learning was called "[[Atomic/neural-network|neural networks]]", which definitely evokes a feeling that it has something to do with the human brain. Even more so when the techniques from neural network learners are used extensively in transformer models (that's those absurd numbers of parameters mentioned earlier).
## Points of concern, or "watch this space"
These are smaller points that would cast doubt on the general zeitgeist around the AI boom that I found compelling. These may be someone else's undeveloped opinion, or a point that I don't think I could contribute to in a valuable way. Many are spread across the fediverse; others are blog posts or articles. Others still would be better placed in a Further Reading section, ~~but I don't like to tack on more than one post-script-style heading.~~ { *ed.: [[#Further Reading|so that was a fucking lie]]* }. If any become more temporally relevant, I may expand on them.
- [Cartoonist Dorothy's emotional story re: Midjourney and exploitation against author intent](https://socel.net/@catandgirl/111766715711043428)
- [Misinformation worries](https://mas.to/@gminks/111768883732550499)
- [Large Language Monkeys](https://arxiv.org/abs/2407.21787): another very new innovation in generative AI is called "repeated sampling." It literally just has the AI generate output multiple times and decide which among those is the most correct (see the sketch after this list). This is more stochastic nonsense, and again not how a human learns, despite OpenAI marketing GPT-o1 (which uses the technique) as being capable of reason. See also [[Atomic/gen-ai#Other/emerging terminology|emerging AI technology]].
- Stronger over time
- One of the lauded features of bleeding-edge AI is its increasingly perfect recall from a dataset. So you're saying that as AI gets more advanced, it'll be easier for it to exactly reproduce what it was trained on? Sounds like an even better case for copyright infringement.
- Inevitable harm
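As promised in the Large Language Monkeys bullet, here's what "repeated sampling" amounts to in code (a toy sketch using a majority vote to "decide which is most correct"; `generate` is a stand-in for a real, noisy model call):

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Stand-in for one sampled model call; sometimes wrong on purpose."""
    return random.choice(["4", "4", "4", "5", "3"])

def repeated_sampling(prompt: str, k: int = 25) -> str:
    samples = [generate(prompt) for _ in range(k)]
    # The "deciding" step is a vote over the outputs: statistics about
    # the answers, not reasoning about the question.
    return Counter(samples).most_common(1)[0][0]

print(repeated_sampling("What is 2 + 2?"))   # almost always "4"
```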

View File

@ -20,5 +20,10 @@ However, this argument forgets that intangible rights are not *yet* so centraliz
Unfortunately, because US copyright law is so easily abused, I think the most likely outcome is that publishers/centralized rights holders get their due, and individual creators get the shaft. This makes me sympathetic to arguments against specific parts of the US's copyright regime as enforced by the courts, such as the DMCA or the statutory language of fair use. We as a voting population have the power to compel our representatives to enact reforms that take the threat of ultimate centralization into account. We can even work to break down what's already here. But I don't think that AI should be the impetus for arguments against the system as a whole.
Finally, remember that perfect is the enemy of good enough. While we're having these discussions about how to regulate GenAI, unregulated use is causing real economic and personal [[Atomic/gen-ai#Causes for concern|harm]] to creators, underrepresented minorities, and consumers as a whole. I am personally in favor of courts reaching substantive issues sooner rather than later. As with Section 230, Congress works best in a reflective context, where it can proscribe an approach it doesn't like rather than prescribe an approach without prior experimentation.
## Further Reading
- [[Atomic/gen-ai|Generative AI, explained]]
- [[Misc/training-copyright|Copyright applied to training]]
- [[Misc/generation-copyright|Copyright applied to output]]
- [[Essays/normative-ai|Why copyright ought to be applied to AI]]
- [[Essays/no-ai-fraud-act|No AI FRAUD Act bill, Section 230, and platforms]]

View File

@ -38,19 +38,26 @@ Slight variance in semantics, same answer because it's the most popular string o
## Expression and Infringement
It can be said that anything a human produces is just a recombination of everything that person's ever read. Similarly, that process is a simplified understanding of how an AI trains.
However, everything a *person* has ever read is stored as concepts, floating around in their brain. My brain doesn't have a specific person's explanation of a transformer model architecture prepped, or even particular phrases from that explanation. It has a "visual" (sorry, folks with aphantasia) and emotional linkage of **ideas** that other regions of my brain, leveraging vocabulary, put to paper when I explain it. An AI stores words that occurred in its corpus that can be considered responsive to the prompt. It may also have words that succeeded the prompt as the next portion in a work containing both the prompt and the output. N-grams, not neurons.
The key difference: talking about a human brain making a work by recombining its input is **metaphor**; talking about an AI recombining a work is **technologically accurate**. A chatbot goes to look at the secret code and shows you the photograph it corresponds to when you ask it to.
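To make "n-grams, not neurons" concrete, here is the world's smallest "trained model" (a bigram counter; real models are unfathomably bigger, but the thing being stored is still word-followed-word statistics, not ideas):

```python
from collections import Counter, defaultdict

corpus = "the model stores words the model read in training".split()

# "Training": count which word followed which.
counts: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# "Generation": replay the most frequent association, nothing more.
word = "the"
for _ in range(3):
    word = counts[word].most_common(1)[0][0]
    print(word, end=" ")   # -> model stores words
```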
Naturally, there are occurrences where a human and an AI would reach approximately the same factual response if you asked them the same question. So what makes some of AI output infringement? The same thing that makes some human responses copyright infringement: reproduction of a copyrighted work. But the difference is that some human responses would be copyrightable in themselves because they don't copy "enough" of a specific work to be considered an infringing reproduction of the expression.
- "enough" is messy by design. Expressive works take so many forms that it's a fool's errand to try and cover all edge cases, that's what we have judges for.
- Some of the pertinent "enough" reproductions are: word-for-word copy-and-paste or reupload; infringing derivatives; and reproduction of style+tone.
- One of the ways to avoid an "enough" ruling is, of course, [[Essays/normative-ai#Fair Use|Fair use]]. I'd argue that fair use allows you to reproduce quantitatively larger portions of a work without infringement, but it does literally encode a quantity factor because you can consider the amount of the work reproduced.
## Detour: An Alternative Argument
There's a more concise and less squishy argument that generative AI output infringes on its training dataset.
Recall that AI output taken right from the model (straight from the horse's mouth) is [not copyrightable according to USCO](https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence). If the model's input is copyrighted, and the output can't be copyrighted, then there's nothing in the AI "black box" that adds to the final product, so it's literally *just* the training data reproduced and recombined. Et voila, infringement.
This argument is not without its drawbacks. First, it does not say that anything uncopyrightable will infringe something else. It does, however, mean that the defendant's likelihood of prevailing on a fair use defense could be minimal.
Additionally, the simpler argument makes damages infinitely harder to prove. Okay, you're infringing; *whose work*? How much? It would effectively shift the complexity of analysis onto the backend. Where typically parties will employ a damages expert (a statistician, an accountant, or otherwise), they'll also have to find a data scientist to testify along the lines of "x% of the dataset being from y source equals z% in the output" so the damages expert can use that information. So even though it's a simpler argument, it requires breaking a lot more new ground.
Note that there are many conclusions in the USCO guidance (and my favorite analogy, that genAI is like a commission artist), so you should definitely read the whole thing if you're looking for a complete understanding of the (very scarce) actual legal coverage of AI issues so far.
## Further Reading
- Sibling entry on [[Misc/training-copyright|training and copyright]]
- Who should be responsible for the harm caused by a generated work? [[Essays/normative-ai#Who's holding the bag?]]

View File

@ -11,6 +11,7 @@ draft: true
---
> [!important] Note
> **Seek legal counsel before acting/refraining from action re: copyright liability**.
> Nothing you read on the internet is a substitute for legal advice, and this post is no exception.
The field is notoriously paywalled (the field is also [[Essays/law-school|broken]]), but I'll try to link to publicly available versions of my sources whenever possible. The content of this entry is my interpretation, and is not legal advice or a professional opinion. Whether a case is binding on you personally doesn't weigh in on whether its holding is the nationally accepted view.

View File

@ -0,0 +1,22 @@
---
title: MM/YY - Summary of Changes
draft: true
tags:
- "#update"
date: 2024-12-12
lastmod: 2024-12-12
---
I've made the difficult decision to divide my massive AI essay, which approached 10k words at its most verbose, into a more digestible atomic format. You can pick and choose the rabbit holes you go down. Start at [[Atomic/gen-ai|Generative AI]].
## Pages
- New: **The AI Essay**
- [[Misc/ai-prologue|Prologue]]
- [[Atomic/gen-ai|Atomic: Generative AI]]
- [[Atomic/neural-network|Atomic: Neural Network]]
- [[Resources/copyright|Basic Principles of Copyright]]
- [[Essays/normative-ai|Why Copyright Should Apply to AI]]
- [[Misc/training-copyright|Theories of Copyright: AI Training]]
- [[Misc/generation-copyright|Theories of Copyright: AI Output]]
## Status Updates
-
## Helpful Links
[[todo-list|Site To-Do List]] | [[Garden/index|Home]]

View File

@ -1,6 +1,6 @@
---
title: 11/24 - Summary of Changes
draft: false
tags:
- "#update"
date: 2024-11-02
@ -9,18 +9,8 @@ lastmod: 2024-11-30
## Housekeeping
Mariah Carey is thawing. May God have mercy, for she has none.
## Pages
- Content Update (Wayland is now discussed first in light of new testing!): [[Projects/nvidia-linux|Nvidia on Linux]]
- New: [[Atomic/rss|Atomic: RSS]] (reflected on homepage)
- New: [[Programs I Like/rss-readers|RSS Readers]]
- List update: [[Projects/rss-foss|Toward RSS]]
## Status Updates

View File

@ -17,7 +17,7 @@ On my little corner of the internet, I document my adventures in tech and compla
# Welcome!
You're on a [[Atomic/what-is-a-garden|Digital Garden]] dedicated to open-source use and contribution, legal issues in tech, and more.
For a monthly list of what's new on the site, subscribe to the [Updates RSS feed](/Updates.xml).<sup><a class="internal" href="/Atomic/rss">What's this?</a></sup>
## Important Links
[[curated|\(Optional\) Start here\!]] | [[Misc/disclaimers|Disclaimers/Terms of Use]] | [[/Updates|Monthly Changelog]], [[todo-list|Up Next]] | <a rel="me" href="https://social.treehouse.systems/@be_far">Mastodon</a>

View File

@ -86,12 +86,16 @@ h5 {
}
p {
  margin-top: 2px;
  margin-bottom: 16px;
  padding-top: 2px;
  text-indent: 8px;
}
// :is(h1, h2, h3, h4, .tags *) + p {
// text-indent: 16px;
// }
p:has(+ ul) {
  margin-bottom: 2px;
}
@ -117,8 +121,8 @@ footer > p {
padding-bottom: 4px;
}
.content-meta > span {
  // margin-left: 8px;
  // margin-right: 8px;
}
ul {
@ -194,3 +198,10 @@ ul {
#toc-content ul {
margin: 0px 0px;
}
// Custom for my site
.bf-icon {
  max-width: 0.8em;
  max-height: 0.8em;
  margin: 0px;
}