Orygin 18 hours ago

Great article but I don't really agree with their take on GPL regarding this paragraph:

> The spirit of the GPL is to promote the free sharing and development of software [...] the reality is that they are proceeding in a different vector from the direction of code sharing idealized by GPL. If only the theory of GPL propagation to models walks alone, in reality, only data exclusion and closing off to avoid litigation risks will progress, and there is a fear that it will not lead to the expansion of free software culture.

The spirit of the GPL is the freedom of the user, not the code being freely shared. The virality is a byproduct to ensure the software is not stolen from its users. If you just want your code to be shared and used without restrictions, use MIT or some other license.

> What is important is how to realize the “freedom of software,” which is the philosophy of open source

Freedom of software means nothing. Freedoms are for humans, not immaterial code. Users get the freedom to enjoy the software how they like. Washing the code through an AI to purge it of its license goes against the open source philosophy. (I know this may be a mistranslation, but it goes in the same direction as the rest of the article.)

I also don't agree with the argument that since a lot of things are included in the model, the GPL code is only a small part of the whole, and that makes it okay. Well, if I take one GPL function and include it in my project, no matter the project's size, I would have to license it as GPL. Where is the line? Why would my software, which contains only a single function, not be fair use?

  • faxmeyourcode 5 hours ago

    There are many misconceptions about the GPL, GNU, and the free software movement. I love the idealism of free software, and you hit the nail on the head.

    Below are the four freedoms for those who are interested. Straight from the horse's mouth: https://www.gnu.org/philosophy/free-sw.html

        The freedom to run the program as you wish, for any purpose (freedom 0).
    
        The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
    
        The freedom to redistribute copies so you can help others (freedom 2).
    
        The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.
  • froh 15 hours ago

    > The spirit of the GPL is the freedom of the user, not the code being freely shared.

    who do you mean by "user"?

    the spirit is that the person who actually uses the software also has the freedom to modify it, and that the users receiving these modifications have the same rights.

    is that what you meant?

    and while technically that's the spirit of the GPL, the license is not only about users, but about a _relationship_, that of the user and the software and what the user is allowed to do with the software.

    it thus makes sense to talk about "software freedom".

    last but not least, about a single GPL function --- many GPL _libraries_ are licensed under the less restrictive LGPL.

    • m463 13 hours ago

      I don't think you understand the GPL.

      > "the user is allowed to do with the software"

      The GPL does not restrict what the user does with the software.

      It can be USED for anything.

      But it does restrict how you redistribute it. You have responsibilities if you redistribute it. You must provide the source code, and pass on the same freedoms you received to the users you redistribute it to.

      • gizajob 13 hours ago

        Thinking on it, though: if the models are trained on any GPL code, then one could consider that they contain that GPL code, and are constantly and continually updating and modifying it, so everything the model subsequently outputs and distributes should come under the GPL too. It's far from sufficient that, say, OpenAI have a page on their website to redistribute the code they consume in their models, if such code becomes part of the training data that is resident in memory every time the model produces new code for users. In the spirit of the GPL, all that derivative code seems to also come under the GPL, and has to be made available for free, even if upon every request the generated code is somehow novel or unique to that user.

        • marcus_holmes 7 hours ago

          Riffing on this:

          If the LLM can reproduce the entire GPL'd code, with licence and attribution intact, then that would satisfy the GPL, correct?

          If the LLM can invent new code, inspired by but not copied from the GPL'd code, that new code does not require a GPL licence.

          This is essentially the same as we humans do: I read some GPL code and go "huh, neat architecture!" and then a year later solve a similar problem using an architecture inspired by that code. This is not copying, and does not require me to GPL the code I'm producing. But if I copy-paste a function from the GPL code into my code base, I need to respect the licence conditions and GPL at least part of my code base.

          I think the argument that the author is talking about is if the model itself should be GPL'd because it contains copies of GPL'd code that can be reproduced. I don't buy this because that GPL code is not being run as part of the model's functioning. To use an analogy: if I create a code storage system, and then use it to store some GPL code, I don't have to GPL the code storage system itself. As long as it can reproduce the GPL code together with its licence and attribution, then the GPL is not being infringed at any point. The system is not using or running the GPL code itself, it is just storing the GPL code. This is what the LLM is doing.

        • AnthonyMouse 7 hours ago

          > Thinking on it, though: if the models are trained on any GPL code, then one could consider that they contain that GPL code, and are constantly and continually updating and modifying it, so everything the model subsequently outputs and distributes should come under the GPL too.

          If you ask a model to output a task scheduler in C, and the training data contained a GPL-licensed implementation of the Fibonacci function in Haskell, the output isn't likely to bear a lot of resemblance to that input. It might even be unrelated enough that adding that function to the training data doesn't affect what the model outputs for that prompt at all.

          The nasty thing in terms of using code generated by these things is that if you ask the model to output a task scheduler in C, and the training data contained a GPL-licensed implementation of a task scheduler in C, the output plausibly could bear a strong resemblance to that input. Without you knowing that. And then if you go incorporate that into something you're redistributing, what happens?

          • mistrial9 7 hours ago

            The fundamental architecture of networks, compilers, disk operating systems, databases, and more is implemented in GPL-family licensed code; high-value targets to acquire and master.

      • froh 8 hours ago

        at first I thought you'd go into the nuance of GPLv2 vs. v3, or LGPL vs. GPL vs. AGPL? patents, tivoization, cloud use?

        :-)

        I agree; I didn't make any statement about what you can do with the software as long as you are licensed to use it

        you are allowed to build atomic bombs, nuclear power plants, tanks, whatever.

        but only as long as you comply, i.e. give your downstream the freedoms you've received.

        if you fail at that, you're no longer allowed to use the software for anything.

        see section 8 Termination for details

        https://www.gnu.org/licenses/gpl-3.0.html#license-text

  • themafia 11 hours ago

    > The virality is a byproduct to ensure the software is not stolen from its users.

    If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.

    > Freedom of software means nothing.

    Software is information. Does "freedom of information" mean nothing? I think you're narrowing concepts here into something not particularly useful or reflective of reality.

    > Users get the freedom to enjoy the software how they like.

    The freedom is to modify the code for my own purposes. This is not at all required to plainly "enjoy" the software. I instead "enjoy a particular benefit."

    > Why would my software which only contains a single function not be fair use?

    Because fair use implies educational, informational, or transformative outputs. Your software is none of those things.

    • Brian_K_White 8 hours ago

      "If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way."

      Yes you are. You are just deprived of something you apparently don't recognize or value, but that doesn't make it ok.

      The original author was also stolen from and that doesn't rely on your understanding or perception.

      The original author set some terms. Those terms were not money, but they are terms exactly like money. They said "you can have this, and the only price is that you have to make the source, and the further right to redistribute, available to any user you hand a binary to."

      Well MS handed you a binary and did not also hand you the source or the right to redistribute.

      That stole from you, from the original author, and from me, who might otherwise have benefited from your own derivative work. The fact that you personally were apparently never going to make use of something they owe you doesn't change the fact that they owe you, and the original author, and me.

      • jorl17 5 hours ago

        It is a tale as old as time, and one which no doubt all of us repeat at some point in our lives. There are hundreds of clichéd books, hundreds of songs, and thousands of letters that echo this sentiment.

        We are rarely capable of valuing the freedoms we have never been deprived of.

        To be privileged is to live at the quiet centre of a never-ending cycle: between taking a freedom for granted (only to eventually lose it), and fighting for that freedom, which we by then so desperately need.

        And as Thomas Paine put it: "Those who expect to reap the blessings of freedom, must, like men, undergo the fatigues of supporting it."

    • inlined 10 hours ago

      As a user I suffer from not being able to freely use or derive my own work from Microsoft’s

      • reactordev 10 hours ago

        This. People conflate consumer with user. A user in the sense of the GPL is a programmer or technical person for whom the software (including source) is intended.

        Not necessarily a “user of an app” but a user of this “suite of source code”.

        • Brian_K_White 8 hours ago

          Except really the whole point is that it explicitly and actively makes no distinction. Every random user has 100% of the same rights as any developer or vendor.

      • CamperBob2 8 hours ago

        At this point they've contributed a reasonably-fair share of open-source code themselves.

        No one benefits from locking up 99.999% of all source code, including most of Microsoft's proprietary code and all GPL code.

        No one.

        When it comes to AI, the only foreseeable outcome to copyright maximalism is that humans will have to waste their time writing the same old shit, over and over, forever less one day [1], because muh copyright!!!1!

        1: https://en.wikipedia.org/wiki/Copyright_Term_Extension_Act

        • Retric 5 hours ago

          > only foreseeable outcome to copyright maximalism

          Nahh, AI companies had plenty of money to pay for access; they simply chose not to.

          • CamperBob2 5 hours ago

            Clearing those rights, which don't actually exist yet, would have been utterly impossible for any amount of money. Thousands of lawyers would tie up the process in red tape until the end of time.

            • Retric 4 hours ago

              The basic premise of the economy is that people do stuff for money. Any rights holder debating with their publishing house or whatever just means they don’t get paid. Some trivial number of people would opt out, but most authors or their estates would happily take an extra few hundred dollars per book.

              YouTube, on the other hand, has permission from everyone uploading videos to make derivative works, barring some specific deal with a movie studio etc.

              Now there are a few exceptions, like large GPL works, but again, diminishing returns here: you don’t need to train on literally everything.

    • faxmeyourcode 4 hours ago

      > If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.

      The user in this example is deprived of freedoms 1, 2, and 3 (and probably freedom 0 as well if there are terms on what machines you can run the derivative binary on).

      Read more here: https://www.gnu.org/philosophy/free-sw.html

      Whether or not the user values these freedoms is another thing entirely. As the software author, licensing your code under the GPL is making a conscious effort to ensure that your software is and always will be free (not just as in beer) software.

  • CamperBob2 14 hours ago

    The GPL arose from Stallman's frustration at not having access to the source code for a printer driver that was causing him grief.

    In a world where he could have just said "Please create a PDP-whatever driver for an IBM-whatever printer," there never would have been a GPL. In that sense AI represents the fulfillment of his vision, not a refutation or violation.

    I'd be surprised if he saw it that way, of course.

    • belorn 12 hours ago

      The safeguards will prevent the AI from reproducing the proprietary drivers for the IBM-whatever printer, and it will not provide code that breaks the DRM that exists to prevent third-party drivers from working with the printer. There will, however, be no such safeguards or filters to prevent IBM from writing a proprietary driver for their next printer, using existing GPL drivers as a building block.

      Code will only ever go in one direction here.

      • CamperBob2 12 hours ago

        Then we'd better stop fighting against AI, and start fighting against so-called "safeguards."

        • belorn 11 hours ago

          I wish you luck. The music industry basically won their fight in forcing safeguards against AI music. The film industry is gaining laws regulating AI film actors. The code-generating AIs are only training on freely accessible code and not proprietary code. There are multiple laws being made against AI porn all over the world (or possibly already on the books).

          What we should fight is Rules For Thee but Not for Me.

          • CamperBob2 8 hours ago

            > The music industry basically won their fight in forcing safeguards against AI music. The film industry is gaining laws regulating AI film actors. The code-generating AIs are only training on freely accessible code and not proprietary code. There are multiple laws being made against AI porn all over the world (or possibly already on the books).

            Yeah, well, we'll see what our friends in China have to say about all that.

        • throwaway290 8 hours ago

          "we better stop fighting against CCTVs everywhere and start fighting against them used for indiscriminate surveillance"

          • AnthonyMouse 6 hours ago

            That's the inverse. Mass surveillance is bad so it should be banned, vs. using AI to thwart proprietary lock-in is good and so shouldn't be banned.

            But also, is the inverse even wrong? If some store has a local CCTV that keeps recordings for a month in case someone robs them, there is no central feed/database and no one else can get them without a warrant, that's not really that objectionable. If Amazon pipes the feed from every Ring camera to the government, that's very different.

            • throwaway290 5 hours ago

              > If some store has a local CCTV

              By "everywhere" I obviously don't mean "on your private property", I mean "everywhere" as in "on every street corner and so on".

              If people are OK with their government putting CCTVs on every lamp post on the promise that they are "secure" and "not used to aggregate data and track people" and "only with warrant" then it's kind of "I told you so" when (not if) all of those things turn out to be false.

              > using AI to thwart proprietary lock-in is good and so shouldn't be banned.

              It's shortsighted, because whoever runs LLMs isn't doing it to help you thwart lock-in. It might for now, but they don't care about anything beyond now: they steal content as fast as they can and lose billions yearly to make sure they are too big to fail. Once they are too big, they will tighten the screws, and they'll literally have the freedom to do whatever they want as long as it's legal.

              And, surprise, helping people thwart lock-in is much less legal (in addition to much less profitable) than preventing people from thwarting it.

              It's kind of bizarre to see people thinking these LLM operators will be somehow on the side of freedom and copyleft considering what they are doing.

    • saurik 13 hours ago

      But that isn't the same code that you were running before. And like, let's not forget GPLv3: "please give me the code for a mobile OS that could run on an iPhone" does not in any way help me modify the code running on MY iPhone.

      • CamperBob2 12 hours ago

        Sure it does. Just tell the model to change whatever you want changed. You won't need access to the high-level code, any more than you need access to the CPU's microcode now.

        We're a few years away from that, but it will happen unless someone powerful blocks it.

        • dzaima 9 hours ago

          I believe the point was that iPhones don't even allow running custom code even if you have the code, whereas GPLv3 mandates that any conveyed form of a work must be replaceable by the user. So unless LLMs easily spit out an infinite stream of 0-days to exploit to circumvent that, they won't help here.

    • dzaima 9 hours ago

      In said hypothetical world, though, the whatever-driver would also have been written by LLMs; and, if the printer or whatever is non-trivial and made by a typical large company, by many LLM instances with a sizable amount of token spending over a long period of time.

      So getting your own LLM rewrite to an equivalent point (or, rather, less buggy as that's the whole point!) would be rather expensive; at the absolute very least, certainly more expensive than if you still had the original source code to reference or modify (even if an LLM is the thing doing those). Having the original source code is still just strictly unconditionally better.

      Never mind the question of how you even get your LLM to reverse-engineer, interact with, and observe the physical hardware of your printer, and the ink wasted while debugging the reinvention of what the original driver already did correctly.

      • AnthonyMouse 6 hours ago

        Now I'm kind of curious: if you give an LLM the disassembly of a proprietary firmware blob and tell it to turn it into human-readable source code, how good is it at that?

        You could probably even train one to do that in particular. Take existing open source code and its assembly representations as training data and then treat it like a language translation task. Use the context to guess what the variable names were before the original compiler discarded them etc.
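
        As a rough sketch of that translation-task idea (Python; gcc on the PATH is assumed, and the directory layout and JSONL format are made up for illustration), the pairing step could look something like:

            # Sketch: build (assembly, source) training pairs for a
            # "decompilation as translation" model. Assumes gcc is installed;
            # paths and the JSONL output format are illustrative only.
            import json
            import pathlib
            import subprocess

            def compile_to_asm(c_file: pathlib.Path) -> str | None:
                """Compile a C file to assembly with gcc -S; None on failure."""
                asm_file = c_file.with_suffix(".s")
                result = subprocess.run(
                    ["gcc", "-S", "-O2", "-o", str(asm_file), str(c_file)],
                    capture_output=True,
                )
                if result.returncode != 0:
                    return None
                return asm_file.read_text()

            def build_pairs(src_dir: str, out_path: str) -> None:
                """Emit one JSON record per file: {"asm": ..., "source": ...}."""
                with open(out_path, "w") as out:
                    for c_file in pathlib.Path(src_dir).rglob("*.c"):
                        asm = compile_to_asm(c_file)
                        if asm is None:
                            continue
                        record = {"asm": asm, "source": c_file.read_text()}
                        out.write(json.dumps(record) + "\n")

            if __name__ == "__main__":
                build_pairs("open_source_corpus/", "asm_to_c_pairs.jsonl")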

        • CamperBob2 5 hours ago

          Should be possible. A couple of years ago I used an earlier ChatGPT model to understand and debug some ARM assembly, which I'm not personally very familiar with.

          I can imagine that a process like what you describe, where a model is trained specifically on .asm / .c file pairs, would be pretty effective.

palata 18 hours ago

Genuine question: if I train my model with copyleft material, how do you prove I did?

Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.

I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.

  • bwfan123 16 hours ago

    Sometimes, LLMs actually generate copyright headers as well in their output - lol - like in this PR, which was the subject of a recent HN post [1]

    https://github.com/ocaml/ocaml/pull/14369/files#diff-062dbbe...

    [1] https://news.ycombinator.com/item?id=46039274

    • Chris_Newton 14 hours ago

      I once had a well-known LLM reproduce pretty much an entire file from a well-known React library verbatim.

      I was writing code in an unrelated programming language at the time, and the bizarre inclusion of that particular file in the output was presumably because the name of the library was very similar to a keyword I was using in my existing code, but this experience did not fill me with confidence about the abilities of contemporary AI. ;-)

      However, it did clearly demonstrate that LLMs with billions or even trillions of parameters certainly can embed enough information to reproduce some of the material they were trained on verbatim or very close to it.

    • quotemstr 14 hours ago

      So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.

      • ikawe 14 hours ago

        If your brain was distributed as software, I think it might?

      • voxl 13 hours ago

        There is a stupid presupposition that LLMs are equivalent to human brains, which they clearly are not. Stateless token generators are OBVIOUSLY not like human brains, even if you somehow contort the definition of intelligence to include them.

        • quotemstr 13 hours ago

          Even if they are not "like" human brains in some sense, are they "like" brains enough to be counted similarly in a legal environment? Can you articulate the difference as something other than meat parochialism, which strikes me as arbitrary?

          • strogonoff 7 hours ago

            If LLMs are like human minds enough, then legally speaking we are abusing thinking and feeling human-like beings possessing will and agency in ways radically worse than slavery.

            What is missing in the “if I can remember and recite a program, then they must be allowed to remember and recite programs” argument is that you choose to do it (and you have basic human rights and freedoms), and they do not.

          • voxl 10 hours ago

            All definitions are arbitrary if you're unwilling to couch them in human experience, because humans are the ones doing the defining. And my main difference is right there in my initial response: an LLM is a stateless function. At best, it is a snapshot of a human brain simulated on a computer, but at no point could it learn something new once deployed. This is the MOST CHARITABLE interpretation, which I don't even concede; in reality, it is not even a snapshot of a brain.

          • AlexandrB 13 hours ago

            All law is arbitrary. Intellectual property law perhaps most of all.

            Famously, the output from monkey "artists" was found to be non-copyrightable even though a monkey's brain is much more similar to ours than an LLM.

            [1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

            • quotemstr 13 hours ago

              If IP law is arbitrary, we get to choose between IP law that makes LLMs propagate the GPL and law that doesn't. It's a policy switch we can toggle whenever we want. Why would anyone want the propagates-GPL option, when this setting would make LLMs much less useful for basically zero economic benefit? That's the legal "policy setting" you choose when you basically want to stall AI progress, and it's not going to stall China's progress.

      • matheusmoreira 8 hours ago

        > Doesn't mean my brain is GPLed.

        It would be if they could get away with it. The likes of Disney would delete your memories of their films if they could get away with it. If you want to enjoy the film, you should have to pay them for the privilege, not recall the last time you watched it.

      • em-bee 11 hours ago

        not your brain, but the code you produce if it includes portions of GPL code that you remembered.

      • furyofantares 12 hours ago

        The question was "if I train my model with copyleft material, how do you prove I did?"

      • gspr 11 hours ago

        > So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.

        Your brain is part of you. Some might say it is your very essence. You are human. Humans have inalienable rights that sometimes trump those enshrined by copyright. One such right is the right to remember things you've read. LLMs are not human, and thus don't enjoy such rights.

      Moreover, your brain is not distributed to other people. It's more like a storage medium than a distribution medium. There is a lot less furore about LLMs that are just storage media, where neither they themselves nor their outputs are distributed. They're obviously not very useful.

        So your analogy is poor.

  • friendzis 17 hours ago

    > Genuine question: if I train my model with copyleft material, how do you prove I did?

    An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?

    In other words, even if your model was trained strictly on copyleft material but, properly prompted, outputs a copyrighted work, is it copyright infringement, and if so, by whom?

    Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic mouse with round black ears, red shorts and yellow boots." Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?

    • isodev 16 hours ago

      > even if your model was trained strictly on copyleft material

      That's not legal use of the material according to most copyleft licenses. That holds regardless of whether you end up trying to reproduce it. It's also quite immoral, even if technically-strictly-speaking-maybe-not-unlawful.

      • tpmoney 14 hours ago

        > That's not legal use of the material according to most copyleft licenses.

        That probably doesn't matter given the current rulings that training an AI model on otherwise legally acquired material is "fair use", because the copyleft license inherently only has power because of copyright.

        I'm sure at some point we'll see litigation over a case where someone attempts to make "not using the material to train AI" a term of the sales contract for something, but my guess would be that if that went anywhere it would be on the back of contract law, not copyright law.

      • friendzis 2 hours ago

      I was referencing words in the comment I was replying to; you can safely substitute "copyleft" with "public domain" and the argument still stands. Your comment, focusing on the minutiae of training, however, highlights how relevant the discussion around outputs in particular is.

        edit: wording.

  • david_allison 15 hours ago

    > Genuine question: if I train my model with copyleft material, how do you prove I did?

    It may produce it when asked

    https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...

    • chii 4 hours ago

      > It may produce it when asked

      that's not proof - it may also be intelligent enough to have produced similar expressions without the original training data.

      Not to mention that having knowledge of copyrighted material is not in violation of any known copyright law - after all, human brains also have the knowledge after learning it. The model, therefore, cannot be in violation regardless of what data was used to train it (as long as that data was not obtained illegally).

      If someone _chooses_ to use the LLM to reproduce Harry Potter, or some GPL'ed code, then that person would be in violation of the relevant copyright laws. The copyright owner needs to pursue that person, rather than the owner of the LLM. In the exact same way, if someone used Microsoft Word to reproduce Harry Potter, Microsoft would not have any liability.

  • blibble 17 hours ago

    > Genuine question: if I train my model with copyleft material, how do you prove I did?

    discovery via lawyers

  • uhfraid 6 hours ago

    > Like if there is no way to trace it back to the original material, does it make sense to regulate it?

    Training data extraction has seen some success; tracing should be possible for at least some of it.

    https://arxiv.org/abs/2311.17035
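
    For the tracing part, the crudest possible check is verbatim n-gram overlap between a model's output and a known corpus. A heavily simplified sketch in Python (the extraction attacks in the paper are far more involved; the corpus path and the 12-token window here are arbitrary, illustrative choices):

        # Crude sketch: flag verbatim overlap between model output and a known
        # corpus (e.g. a GPL codebase) via shared word n-grams. The window
        # size of 12 and the corpus path are arbitrary, illustrative choices.
        import pathlib

        def ngrams(tokens: list[str], n: int = 12) -> set[tuple[str, ...]]:
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

        def build_corpus_index(corpus_dir: str) -> set[tuple[str, ...]]:
            """Collect every 12-word window from every .c file in the corpus."""
            index: set[tuple[str, ...]] = set()
            for path in pathlib.Path(corpus_dir).rglob("*.c"):
                index |= ngrams(path.read_text(errors="ignore").split())
            return index

        def overlap_ratio(output: str, index: set[tuple[str, ...]]) -> float:
            """Fraction of the output's n-grams appearing verbatim in the corpus."""
            out_grams = ngrams(output.split())
            if not out_grams:
                return 0.0
            return len(out_grams & index) / len(out_grams)

        # Usage: a ratio well above zero suggests memorization, not synthesis.
        # index = build_corpus_index("gpl_corpus/")
        # print(overlap_ratio(generated_code, index))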

  • freedomben 18 hours ago

    I've thought about this as well, especially for the case where it's a company-owned product that is AGPLed. It's a really tough situation, because the last thing we want is competitors coming in and LLM-washing our code to benefit their own product. I think this is a real risk.

    On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM more power to them. For the applications though, I see that as a violation of the license.

    At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that, for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee of the users' freedom is more important to me than a theoretical threat. The one exception is anything that is truly at risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.

    I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.

  • ACCount37 17 hours ago

    You need low-level access to the AI in question, and a lot of compute, but for most AI types you can infer whether a given data fragment was in the training set.

    It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.

    Now, would that be enough to put the entire AI under GPL? I doubt it.
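
    For open-weights models, the simplest version of that inference signal is loss-based: text the model was trained on (especially text repeated many times) tends to get anomalously low loss. A minimal sketch, assuming the `transformers` library and an open model; the single suspect/control comparison is illustrative, and real membership-inference attacks calibrate against reference data:

        # Sketch of a loss-based membership-inference signal: memorized text
        # tends to receive unusually low next-token loss. "gpt2" is a stand-in
        # for whatever open-weights model is being audited.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")
        model.eval()

        def mean_token_loss(text: str) -> float:
            """Average next-token cross-entropy the model assigns to `text`."""
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            with torch.no_grad():
                outputs = model(**inputs, labels=inputs["input_ids"])
            return outputs.loss.item()

        suspect = "...GPL snippet suspected to be in the training set..."
        control = "...freshly written code the model cannot have seen..."

        # A suspect loss far below the control's hints at memorization; on its
        # own this is weak evidence and needs proper calibration.
        print(mean_token_loss(suspect), mean_token_loss(control))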

  • reactordev 10 hours ago

    By reverse inference and model inversion. We can determine what content a pathway has been trained on. We can find out if it’s been trained on GPL material.

  • PaulKeeble 17 hours ago

    It's why I stopped contributing to open source work. It's pretty clear in the age of LLMs that this breach of the license under which the code is written will be allowed to continue, and that open source code will be turned into commercial products.

  • LexiMax 9 hours ago

    > Genuine question: if I train my model with copyleft material, how do you prove I did?

    Discovery.

  • Animats 14 hours ago

    There's the other side of this issue. The current position of the U.S. Copyright Office is that AI output is not copyrightable, because the Constitution's copyright clause only protects human authors. This is consistent with the US position that databases and lists are not copyrightable.[1]

    Trump is trying to fire the head of the U.S. Copyright Office, but they work for the Library of Congress, not the executive branch, so that didn't work.[2]

    [1] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

    [2] https://apnews.com/article/trump-supreme-court-copyright-off...

  • basilgohar 18 hours ago

    Maybe we should require that training data be published, or at least referenced.

  • mistrial9 18 hours ago

    > Should I keep open sourcing my code now that the licence doesn't matter anymore?

    your LICENSE matters in similar ways to how it mattered before LLMs. LICENSE adherence is part of intellectual property law and practice. A popular engine may be popular, but not in all cases at all times. Do not despair!

  • luqtas 17 hours ago

    genuine question: why are you training your model with content whose license requirements will explicitly be violated if you do?

    • 1gn15 17 hours ago

      out of pure spite for hypocritical "hackers"

zamadatix 19 hours ago

The article goes deep into the two cases it deems most relevant, but really there is a wide swath of similar cases, all focused on drawing sharper borders than ever around what is essentially the question "exactly when does it become copyright violation?", with plenty of seemingly "obvious" answers that quickly conflict with each other.

I also have the feeling it will be much like Google LLC v. Oracle America, Inc.: much of this won't really be clearly resolved until the end of the decade. I'd also not be surprised if seemingly very different answers ended up bubbling up in the different cases, driven by the specifics of the domain.

Not a lawyer, just excited to see the outcomes :).

  • twoodfin 19 hours ago

    Ideally, Congress would just settle this basket of copyright concerns, as they explicitly have the power to do—and have done so repeatedly in the specific context of computers and software.

    • tpmoney 13 hours ago

      I've pitched this idea before but my pie in the sky hope is to settle most of this with something like a huge rollback of copyright terms, to something like 10 or 15 years initially. You can get one doubling of that by submitting your work to an official "library of congress" data set which will be used to produce common, clean, and open models that are available to anyone for a nominal fee and prevent any copyright claims against the output of those models. The money from the model fees is used to pay royalties to people with materials in the data set over time, with payouts based on recency and quantity of material, and an absolute cap to discourage flooding the data sets to game the payments.

      This solution to me amounts to an "everybody wins" situation, where producers of material are compensated, model trainers and companies can get clean, reliable data sets without having to waste time and energy scraping and digitizing it themselves, and model users can have access to a number of known "safe" models. At the same time, people not interested in "allowing" their works to be used to train AIs and people not interested in only using the public data sets can each choose to not participate in this system, and then individually resolve their copyright disputes as normal.

    • jeremyjh 18 hours ago

      What is ideal about getting more shitty laws written at the behest of massive tech companies? Do you think the DMCA is a good thing?

      • twoodfin 18 hours ago

        As opposed to waiting for uncertain court cases (based on the existing shitty laws) to play out for years, ultimately decided by unelected judges?

        Democracy is the worst system we’ve tried, except for all the others.

        (Also: The GPL can only be enforced because of laws passed by Congress in the late ‘70’s and early ‘80’s. And believe you me, people said all the same kinds of things about those clowns in Congress. Plus ça change…)

        • jeremyjh 18 hours ago

          Courts applying legal analysis to existing law and precedent is also an operation of democracy in action and lately they've been a lot better at it than legislators. I don't know if you've noticed, but the quality of our legislators has substantially deteriorated since the 80s, when 24-hour news networks became a thing. It got even worse after the Citizens United decision and social media became a thing. "No new laws" is really the safest path these days.

      • sidewndr46 16 hours ago

        The DMCA isn't intrinsically about copyright. It's a questionable attempt at a safe harbor provision that has horrible potential for abuse. I'm not even of the opinion that copyright for computer software is poorly executed. It's mostly software patents that don't make any sense to me: concepts that essentially every mathematics undergrad is familiar with get labels slapped on them and are called novel techniques. It's made worse by the fact that the patent office itself isn't equipped to perform any real review. There is no shortage of impossible devices patented each year in the perpetual-motion category.

myrmidon 19 hours ago

I honestly think that the most extreme take, that "any output of an LLM falls under all the copyright of all its training data", is not really defensible, especially when contrasted with human learning, and I would be curious to hear conflicting opinions.

My view is that copyright in general is a pretty abstract and artificial concept; thus the corresponding regulation needs to justify itself by being useful, i.e. encouraging and rewarding content creation.

/sidenote: Copyright as-is barely holds up there; I would argue that nobody (not even old established companies) is significantly encouraged or incentivised by potential revenue more than 20 years in the future (much less current copyright durations). The system also leads to bad resource allocation, with almost all the rewards ending up at a small handful of the most successful producers -- this effectively externalizes large portions of the cost of "raising" artists.

I view the AI overlap under the same lens -- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible), then the law/interpretation simply has to be changed.

  • jeremyjh 18 hours ago

    Anyone can very easily avoid training on GPL code. Yes, the model might not be as strong as one that is trained on it and released under the terms of the GPL, but to me that sounds like quite a good outcome if the best models are open source/open weight.

    It's all about whose outcomes are optimized.

    Of course, the law generally favors consideration of the outcomes for the massive corporations donating hundreds of millions of dollars to legislative campaigns.

    • myrmidon 17 hours ago

      Would it even actually help to go down that road though? IMO the expected outcome would simply be that AI training stalls for a bit while "unencumbered" training material is being collected/built up and you achieve basically nothing in the end, except creating a big ongoing logistical/administrative hassle to keep lawyers/bureaucrats fed.

      I think the redistribution effect (towards training material providers) from such a scenario would be marginal at best, especially long-term, and even that might be over-optimistic.

      I also dislike that stance because it seems obviously inconsistent to me-- if humans are allowed to train on copyrighted material without their output being generally affected, why not machines?

  • cardanome 11 hours ago

    > I view the AI overlap under the same lens -- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible), then the law/interpretation simply has to be changed

    Not sure about undesirable; I so wish we could just ban all generative AI.

    I feel profound sadness at having lost the world we had before generative AI became widespread. I really loved programming, and seeing my trade devalued with vibe coding is just heartbreaking. We will see mass unemployment, deep fakes, more AI-induced psychosis, a devaluing of human art. I hate this new world.

    It would be the morally correct thing to ban generative AI, as it only benefits corporations and doesn't improve people's lives but makes them worse.

    The training of the big LLMs has been criminal, whether we talk about GPL-licensed code or the millions of artists who never released their work under a specific license and would never have consented to it being used for training.

    I still think states will allow it and legalize the crime, because they believe that AI offers competitive advantages and they fear "falling behind". Plus military use.

    • redox99 9 hours ago

      In my opinion programming has never been this much fun. The vast vast majority of code is repetitive stuff that now is a breeze. I can build so much stuff now, and with more beautiful code because refactoring is effortless.

      I think it's like going from pre industrial revolution manual labor, to modern tools and machines.

      • joegibbs 8 hours ago

        I agree. Back before LLMs, I would be so tired at the end of the day, and it would take forever, typing out tedious stuff that I'd done before but slightly differently - making a form for some thing, or a page that displays something else. Now I can just go and tell it to make me a new page in the style of the last one that displays XYZ information instead, and it makes it in 20 seconds. Tell it to implement this algorithm for this data and it does it. It's great; it feels like going up a level in abstraction and just thinking about the bigger picture.

      • archagon 6 hours ago

        If the vast majority of your code is repetitive stuff, you are not using the right abstractions.

  • jay_kyburz 8 hours ago

    Reading your comment made me think about the other side of the equation. I think it's generally considered that AI-generated works are not themselves protected by copyright; I wonder if code with little to no human intervention becomes unlicensable.

    You don't have any rights to assert when you have AI write the code for you.

  • wizzwizz4 15 hours ago

    Human learning is materially different from LLM training. They're similar in that both involve providing input to a system that can, afterwards, produce output sharing certain statistical regularities with the input, including rote recital in some cases – but the similarities end there.

    • gruez 12 hours ago

      >Human learning is materially different from LLM training [...] but the similarities end there.

      Specifically, what "material differences" are there? The only arguments I've heard are around human exceptionalism (eg. "brains are different, because... they just are, ok?"), or giving humans a pass because they're not evil corporations.

      • octoberfranklin 7 hours ago

        Humans can generalize.

        LLMs just predict the statistically-most-likely token.

        • chii 4 hours ago

          human brains are just chemical reactions and electrical transmission between neurons too. You're comparing completely different layers of abstraction in your arguments.

    • IshKebab 11 hours ago

      Why? I'm pretty sure I can learn the lyrics of a song, and probabilistically output them in response to a prompt.

      Is the existence of my brain copyright infringement?

      The main difference I see (apart from that I bullshit way less than LLMs), is that I can't learn nearly as much as an LLM and I can't talk to 100k people at once 24/7.

      I think the real answer here is that AI is a totally new kind of copying, and it's useful enough that laws are going to have to change to accommodate that. What country is going to shoot itself in the foot so much by essentially banning AI, just so it can feel smug about keeping its 20th century copyright laws?

      Maybe that will change when you can just type "generate a feature length Pixar blockbuster hit", but I don't see that happening for quite a long time.

graemep 19 hours ago

The article repeatedly treats license and contract as though they are the same, even though the sidebar links to a post that discusses the difference.

A lot of it boils down to whether training an LLM is a breach of copyright of the training materials, which is not specific to the GPL or open source.

  • xgulfie 19 hours ago

    And the current norm that the trillion-dollar companies have lobbied for is that you can train on copyrighted material all you want, so that's the reality we are living in. Everything ever published is all theirs.

    • gruez 12 hours ago

      > And the current norm that the trillion-dollar companies have lobbied for is that you can train on copyrighted material all you want, so that's the reality we are living in. Everything ever published is all theirs.

      What "lobbied"? Copyright law hasn't materially changed since AI got popular, so I'm not sure where these lobbying efforts are showing up in. If anything the companies that have lobbied hard in the past (eg. media companies) are opposed to the current status quo, which seems to favor AI companies.

    • graemep 18 hours ago

      I am really surprised that media businesses, which are extremely influential around the world, have not pushed back against this more. I wonder whether they see the cost savings that they will get from the technology as a worthwhile trade-off.

      • gorbachev 18 hours ago

        They're busy trying to profit from it, rushing to enter into licensing agreements with the LLM vendors.

        • xgulfie 17 hours ago

          Yeah, the short-term win is to enter a licensing agreement so you get some cash for a couple of years; meanwhile, pray someone else with more money fights the legal battle to try and set a precedent for you

      • mr_toad 13 hours ago

        Several media companies have sued OpenAI already. So far, none have been successful.

    • rileymat2 18 hours ago

      All theirs, if they properly obtained the copy.

      This is a big difference that has already bitten them.

    • exasperaited 18 hours ago

      In practice it wouldn't matter a whit if they lobbied for it or not.

      Lobbying is for people trying to stop them; externalities are for the little people.

  • maxloh 19 hours ago

    To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.

    Once training is established as fair use, it doesn't really matter if the license is MIT, GPL, or a proprietary one.

    • blibble 19 hours ago

      fair use only applies in the United States (and Poland, and a very limited set of others)

      https://en.wikipedia.org/wiki/Fair_use#/media/File:Fair_use_...

      and it is certainly not part of the Berne Convention

      in almost every country in the world, even timeshifting using your VCR and ripping your own CDs is copyright infringement

      • gruez 12 hours ago

        Great, so the US and China can duke it out trying to create AGI or whatever, whereas most other countries are stuck in the past because of their copyright laws?

    • mongol 19 hours ago

      > To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.

      Is this legally settled?

      • 1gn15 17 hours ago

        Yes. There have been multiple court cases affirming fair use.

    • graemep 18 hours ago

      That is just the sort of point I am trying to make. That is a copyright law issue, not a contractual one. If the GPL is a contract then you are in breach of contract regardless of fair use or equivalents.

  • OneDeuxTriSeiGo 19 hours ago

    It's not specific to open source, but it's most clearly enforceable with open source, as there will be many contributors from many jurisdictions, with the one unifying factor being that they all made their copyrighted work available under the same license terms.

    With proprietary or, more importantly, single-owner code, it's far easier for this to end up in a settlement rather than being dragged out into an actual ruling, enforcement action, and establishment of precedent.

    That's the key detail. It's not specific to the GPL or open source, but if you want to see these orgs held to account and some precedent established, focusing on GPL and FOSS-licensed code is the clearest path to that.

  • kronicum2025 19 hours ago

    A GPL license is a contract in most other countries. Just not the US, probably.

    • graemep 18 hours ago

      That part of the article is about US cases, so it's US law that applies.

      > A GPL license is a contract in most other countries. Just not US probably.

      Not just the US. It may vary with the version of the GPL too. Wikipedia claims it's a civil law vs. common law country difference - not sure the citation shows that, though.

phplovesong 19 hours ago

We need a new license that forbids all training. That is the only way to stop big corporations from doing this.

  • maxloh 18 hours ago

    To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.

    If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to court.

    • justin_murray 18 hours ago

      This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.

      • michaelmrose 18 hours ago

        It isn't? You have to break the law to get it. It's publicly available like your TV is if I were to break into your house and avoid getting shot.

        • basilgohar 18 hours ago

          That isn't even remotely a sensible analogy. Equating copyright violation with stealing physical property is an extremely failed metaphor.

          • tpmoney 13 hours ago

            One of the craziest experiences in this "post-AI" world is to see how quickly a lot of people in the "information wants to be free" or "hell yes I would download a car" crowds pivoted to "stop downloading my car; just because it's on a public and openly available website doesn't make it free"

            • voidfunc 11 hours ago

              "Rules for thee, but not for me"

        • MangoToupe 18 hours ago

          Maybe you have some legalistic point that escapes comprehension, but I certainly consider my house to be very much private and the internet public.

    • colechristensen 18 hours ago

      I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.

      • conartist6 11 hours ago

        Isn't the court fight on fair use failing pretty hard on the prong that flooding the market with cheap copies eliminates the market for the original work?

    • LtWorf 14 hours ago

      Fair use was for citing and so on, not for ripping off 100% of the content.

      • maxloh 14 hours ago

        Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.

        This principle is also explicitly declared in US law:

        > In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)

        https://www.copyrightlaws.com/are-ideas-protected-by-copyrig...

        • LtWorf 10 hours ago

          Re-encoding a video file doesn't get rid of the copyright; likewise, doing some automatic processing on copyrighted material doesn't remove the copyright.

          The problem is that openai has too much money. But if I did what they are doing I'd get into massive legal troubles.

          • 1gn15 4 hours ago

            Not true. You can train on copyrighted material and post the resulting model on HuggingFace, and you won't get into trouble. Pinky promise.

  • mr_toad 13 hours ago

    Fair use doesn’t need a license, so it doesn’t matter what you put in the license.

    Generally speaking, licenses give rights (they literally grant license). They can't take rights away; only the legislature can do that.

  • tensor 13 hours ago

    So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.

  • munchler 18 hours ago

    By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.

    • psychoslave 18 hours ago

      Isn’t it the very reason why we need cleanroom software engineering:

      https://en.wikipedia.org/wiki/Cleanroom_software_engineering

      • mr_toad 13 hours ago

        If a human reads code, and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code and produce something similar, is a grey area.

    • bluefirebrand 9 hours ago

      There is absolutely no reason that LLMs (or Corporations) should have the same rights as humans

    • codedokode 18 hours ago

      Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is the name of a process in which the model developer downloads pirated material and processes it with an algorithm (computes parameters from it).

      Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.

      • 1gn15 17 hours ago

        > And a human artist doesn't need to steal million pictures to learn to draw.

        They... do? Not just pictures, but also real life data, which is a lot more data than an average modern ML system has. An average artist has probably seen- stolen millions of pictures from their social media feeds over their lifetime.

        Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.

  • conartist6 11 hours ago

    Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training, you consent to release of the model? And you clarify that the terms are contractual, and that training the model on the data represents implicit acceptance of the terms.

    • themafia 11 hours ago

      Taken to an extreme:

      "Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."

      It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.

      • d0mine 7 hours ago

        > cartel continues to profit

        It doesn't follow. The reverse is more likely: If you end prohibition, you end the mafia.

  • WithinReason 19 hours ago

    Wouldn't it still be legal to train on the data due to fair use?

    • cryptonector 2 hours ago

      Not if it's an EULA and you make the bot click through an "I agree" button.

    • gus_massa 18 hours ago

      I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.

      • justin_murray 18 hours ago

        Honest question: why don’t you think it is fair use?

        I can see how it pushes the boundary, but I can't lay out logic that it's not. The code has been published for the public to see. I'm always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn't they have kept it to themselves?

        These agents are just doing a more sophisticated, faster version of that same act.

        • gus_massa 17 hours ago

          Some projects, like Wine, forbid you to contribute if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.

          I don't remember the exact case now, but someone was cloning a program (Lotus123 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain-English text loophole was safe enough.

          [1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...

          > Who can't contribute to Wine?

          > Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.

          • seanmcdirmid 15 hours ago

            > I don't remember the exact case now, but someone was cloning a program (Lotus123 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain-English text loophole was safe enough.

            This is close to how I would actually recommend reimplementing a legacy system (owned by the re-implementer) with AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn't contain the legacy code (because that would pollute the new result). The analogy isn't too apt though, since there is a difference between having something in your context (which you can control and is very targeted) and the code the model was trained on (which all AI instances will share unless you use different models, and which isn't supposed to be targeted anyway).
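
            A rough sketch of that two-phase flow, where ask() is a hypothetical stand-in for whatever LLM API is in use (not a real library call):

                def reimplement(legacy_code: str) -> str:
                    # ask() is a hypothetical LLM call; any chat-completion API works here.
                    # Phase 1: a context that sees the legacy code produces only an English spec.
                    spec = ask("Write a complete plain-English specification for this system:\n" + legacy_code)
                    # Phase 2: a fresh context sees only the spec, never the legacy code.
                    return ask("Implement this specification from scratch:\n" + spec)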

        • mixedbit 18 hours ago

          Before LLMs, programmers had a pretty good intuition of what the GPL license allowed. It is of course clear that you cannot release a closed-source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around, but this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?

          • WithinReason 18 hours ago

            > this is pretty much what LLMs are doing

            I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?

            • mixedbit 17 hours ago

              Do you honestly believe there are people on this board who haven't used LLMs? Ridiculing someone you disagree with is a poor way to make an argument.

              • WithinReason 16 hours ago

                lots of people on this board are philosophically opposed to them so it was a reasonable question, especially in light of your description of them

        • conartist6 11 hours ago

          The fair use prong that's problematic is that the fair use can't decimate the value of the original work. It's the difference between me imitating your art style for a personal project and me making 1,000,000 copies of your art so that your art isn't worth much anymore. One is fair use; the other is exploitative extraction.

      • LtWorf 14 hours ago

        Just corporations, their shills, and people who think llms are god's gift to humanity disagree with you.

  • BeFlatXIII 16 hours ago

    How is that enforceable against the fly-by-night startups?

  • James_K 19 hours ago

    Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.

    • Orygin 19 hours ago

      My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights

      • tpmoney 13 hours ago

        In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.

      • fouronnes3 18 hours ago

        Not sure why the FSF or any other organization hasn't released a license like this years ago already.

        • amszmidt 18 hours ago

          Because it would violate freedom zero. Adding such terms to the GNU GPL would also mean that you can remove them, they would be considered "further restrictions" and can be removed (see section 7 of the GNU GPL version 3).

          • Orygin 18 hours ago

            Freedom 0 is not violated. GPL includes restrictions for how you can use the software, yet it's still open source.

            You can do whatever you want with the software, BUT you must do a few things. For GPL it's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "Models trained on this licensed work must be open source".

            Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.

            • amszmidt 18 hours ago

              That is not really correct: the GNU GPL doesn't have any terms whatsoever on how you can use or modify the program. You're free to make a GNU GPL program do anything (i.e., use it).

              I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.

              • Orygin 18 hours ago

                > You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

                "A work based on the program" can be defined to include AI models (just define it, it's your contract). "All of these conditions" can include conveying the AI model in an open source license.

                I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) onto the AI model that is derivative of the licensed code.

                Edit: I know it may not be the best section (the one after regarding non-source forms could be better) but in spirit, it's exactly the same imo as GPL forcing you to keep the GPL license on the work

                • amszmidt 17 hours ago

                  I think maybe you're mixing up distribution and running a program, at least taking your initial comment into account, "if you train/run/use a model, it must be open source".

                  • Orygin 17 hours ago

                    I should have been more precise: "If you train and distribute an AI model on this work, it must use the same license as the work".

                    Using AGPL as the base instead of GPL (where network access is distribution), any user of the software will have the rights to the source code of the AI model and weights.

                    My goal is not to impose more restrictions to the AI maker, but to guarantee rights to the user of software that was trained on my open source code.

    • amszmidt 18 hours ago

      It isn't that difficult: a license that restricts how the program is used is a non-free software license.

      "The freedom to run the program as you wish, for any purpose (freedom 0)."

      • Orygin 18 hours ago

        Yet the GPL imposes requirements on me, and we consider it free software.

        You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source/weight.

      • helterskelter 18 hours ago

        Running the program and analyzing the source code are two different things...?

        • amszmidt 18 hours ago

          In the context of Free Software, yes. Freedom one is about the right to study a program.

      • LtWorf 14 hours ago

        But training an AI on a text is not running it.

        • tpmoney 13 hours ago

          And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply. As it stands, the courts have found training an AI model to be a sufficiently transformative action and fair use which means the resulting output of that training is not a "copy" for the terms of copyright law.

          • LtWorf 9 hours ago

            > And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply.

            If I print a Harry Potter book in red ink, then I won't have any copyright issues?

            I don't think changing how the information is stored removes copyright.

            • tpmoney 8 hours ago

              If it is sufficiently transformative, yes it does. That's why "information" per se is not eligible for copyright, no matter what the NFL wants you to think. No, printing the entire text of a Harry Potter book in red ink is not likely to be viewed as sufficiently transformative. But if you take the entirety of that book and publish a list of every word and its frequency, it's extremely unlikely to be found a violation of copyright. If you publish a count of every word with the frequency weighted by what word came before it, you're also very likely not to be found to have violated copyright. If you distribute the MD5 sum of the file that is a Harry Potter book, you're also not likely to be found to have violated copyright. All of these are "changing how the information is stored".
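
              Each of those transformations is a few lines of standard-library Python; a sketch (book.txt is just a stand-in for the text):

                  import hashlib
                  from collections import Counter

                  text = open("book.txt").read()
                  words = text.split()

                  word_freq = Counter(words)                       # every word and its frequency
                  bigram_freq = Counter(zip(words, words[1:]))     # frequency weighted by the preceding word
                  digest = hashlib.md5(text.encode()).hexdigest()  # MD5 sum of the whole text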

    • tomrod 18 hours ago

      Model weights, source, and output.

  • scotty79 18 hours ago

    We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.

    • joegibbs 8 hours ago

      That wouldn't matter too much though - how often do you worry about competitors directly stealing your code? Either it's server-side, or it's obfuscated, or it's compiled. Anyway, there's rarely anything so special that it needs legal protection to keep it from being copied, and if an LLM produced it you can just use another LLM to copy the same feature. And if it's 99% LLM and 1% human, who's going to know which 1% is not safe to copy?

    • raincole 14 hours ago

      It's more or less already the case though. Pure AI-generated works without human touches are not copyrightable.

      • LtWorf 14 hours ago

        We need it to infect the rest, like the GPL does.

        • raincole 13 hours ago

          You probably misunderstood how the "infection" of GPL works (which is very common).

          If your closed-source project uses some GPL code, it doesn't automatically put your whole project in the public domain or under GPL. It just means you're infringing the right of the code author and they can sue you (for money and stopping using their code, not for making your whole project GPL).

          In the simplest terms, GPL is:

              if codebase.is_gpl_compatible:
                  gpl_code.give_permission(codebase)
              elif codebase.is_using(gpl_code):
                  raise CopyrightInfringement  # the copyright owner and the court deal with that under the usual copyright laws
          
          GPL can't do much more than that. A license over a piece of code cannot automatically change the copyright status of another piece of code. There simply isn't a legal framework for that.

          Similarly, AI code's copyleft status can't affect the rest of the codebase, unless we make new laws specifically saying that.

          Also similarly, even if GitHub lost the class action, it would NOT automatically release the model to the public under the GPL. It would open the possibility for all the GPL repo authors to ask Microsoft for compensation for stealing their code.

          • cryptonector 2 hours ago

            You can use GPL code in proprietary code. You just can't distribute said proprietary code if you don't also distribute its sources in accordance with the GPL, and that is how the "infection" happens.

          • em-bee 11 hours ago

            > It just means you're infringing the right of the code author and they can sue you (for money and stopping using their code, not for making your whole project GPL).

            they can sue you and settle for whatever you will accept that makes them happy.

            if you lose, then the alternative to making your code GPL is to make your code disappear, that is, you are no longer allowed to sell your product.

            consequently, if AI code is subject to the GPL then the rest of the codebase is too, or the alternative would be that it could not be distributed.

            • raincole 10 hours ago

              First of all, pure AI-generated code is uncopyrightable now. Uncopyrightable code can't be under GPL.

              Secondly, GPL can't "make your (proprietary) code disappear." Violating GPL is essentially just stealing code. One cannot distribute the version that includes stolen code. But they can remove the stolen part and replace it with their own code. Of course they still need to settle/pay for the previous infringement.

              GPL simply can't affect the copyright status of the rest of the codebase, because it's a license, not a contract. It cannot restrict the user's rights further than the copyright laws do.

              Again, it's a very common misunderstanding of GPL's "virality." There has been a decades-long debate about whether the GPL should be treated like a contract instead of a mere license, but there is no ruling giving it this special legal status (yet), at least in the US.

              [0]: https://lwn.net/Articles/61292/

              [1]: https://en.wikipedia.org/wiki/GNU_General_Public_License#Leg...

              • em-bee 10 hours ago

                > First of all, pure AI-generated code is uncopyrightable now. Uncopyrightable code can't be under GPL.

                if AI generates something that is equal to existing code, then the license of that code applies. the AI generated product as a whole can't be copyrighted, but the portions that reproduce copyrighted code retain the original copyright.

                > they can remove the stolen part and replace it with their own code

                sure, if they can do that, then they can distribute their code again. but until then they can't.

                • dragonwriter 9 hours ago

                  > if AI generates something that is equal to existing code, then the license of that code applies.

                  No, it doesn't, if the generation is independent of the existing code. If a person using AI takes existing code and makes a literal copy of it, then, yes, the copyright (and any license offer applicable in the circumstances) of the existing code may apply (it may also not, the same as with copies of portions of code made by other means). And it's less than clear (especially for small portions of code) whether such a copy has legally been made just because a work is in the training set.

                  Copyright protects against copying. It doesn't protect against someone creating the same content by means other than copying.

                  • em-bee 8 hours ago

                    > if the generation is independent of the existing code

                    well, that's the big question, isn't it? if the code is used for training AI and the AI reproduces the same code, is that really independent?

                    i don't think so.

                    > Copyright protects against copying. It doesn't protect against someone creating the same content by means other than copying.

                    if the code is the same, how do you prove it's not a copy?

                    it's the same problem as with plagiarism, isn't it?

                  • LtWorf 9 hours ago

                    If I read Harry Potter and randomly rewrite it, do you think I have a chance against Rowling?

                    • dragonwriter 8 hours ago

                      No, almost certainly not; it would be practically impossible if you reproduced the entire work, on top of evidence that you had perused it, because it would be very hard to convince a trier of fact that the duplication really was coincidence rather than copying. But it might be a very different story if you had read Harry Potter and then wrote another work that includes the text "Up!" she screeched. (which appears verbatim in the first volume of the series.)

                      • LtWorf 2 hours ago

                        And what if I reproduced just a chapter, or a few paragraphs?

    • palata 18 hours ago

      But then we would need a way to prove that some code was LLM generated, right?

      Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?

      • chii 4 hours ago

        you could have the inverse - proof that the code was _not_ LLM generated. It's like a mark of origin/country of origin for produce.

ljlolel 19 hours ago

And then also to all code made from the GPL'd AI model?

  • maxloh 19 hours ago

    A program's output is likely not owned by the program's authors. For example, if you create a document with Microsoft Word, you are the one who owns it, not Microsoft.

    • javcasas 18 hours ago

      You sure about that? Have you checked the 400-page EULA?

    • pessimizer 19 hours ago

      Unless the license says otherwise. The fact that Word doesn't (I wouldn't even be sure if that was true, honestly, especially for the online versions) doesn't mean anything.

      They could start selling a version of Word tomorrow that gives them the right to train from everything you type on your entire computer into any program. Or that requires you to relinquish your rights to your writing and to license it back from Microsoft, and to only be able to dispute this through arbitration. They could add a morals clause.

      • chii 4 hours ago

        > They could start selling a version of Word tomorrow ...

        they could, but would anyone agree to this new eula? If they did, then what's the problem?

    • LtWorf 14 hours ago

      If I take a song and convert it from .mp3 to .ogg, the resulting file has no copyright since it's the output of a program?

pessimizer 19 hours ago

I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.

Corporations have always talked about the virality of the GPL, sometimes but not always to the point of exaggeration; you'd think that after getting the proof of concept done, the AI companies would be running away at full speed from setting a bomb like that in their goldmine.

Putting in tons of commonly read books and scientific papers is safer; they can just eventually cross-license with the massive conglomerates that own everything. But the GPL is by nature hostile, and has been openly and specifically hostile from the beginning. With MIT and Apache, etc., you can just include a fistful of licenses to download, or even come up with architectures that track names to add for attribution-ware. But the GPL will obviously (and legitimately) claim to have relicensed the entire model and maybe all its output (unless they restricted it to LGPL.)

Wouldn't you just pull it out?

  • NateEag 18 hours ago

    If you were a thoughtful, careful, law-abiding business, yes.

    I submit the evidence suggests the genAI companies have none of those attributes.

  • NiloCK 18 hours ago

    Not crazy - there's a rational self-interest in doing this.

    But I'm not certain that the relevant players have the same consequence-fearing mindset that you do, and to be honest they're probably right. The theft is too great to calculate the consequences, and by the time it's settled, what are you gonna do - turn off Forster's machine?

    I hope you're right in at least some cases!

    • pessimizer 18 hours ago

      > by the time it's settled

      Why would the GPL settle? Even more, who is authorized to settle for every author who used the GPL? If the courts decided in favor of the GPL, which I think would be likely just because of the age and pervasiveness of the GPL, the AI companies would actually have to lobby Congress to write an exception to copyright rules for AI.

      A large part of the infrastructure of the world is built on the GPL, and the people who wrote it were clearly motivated by the protection that they thought that the GPL would give to what was often a charitable act, or even an act that would allow companies to share code without having to compete with themselves. I can't imagine too many judges just going "nope."

      • hananova 17 hours ago

        I think they meant "settled" as in "resolved."

        • pessimizer 15 hours ago

          I meant the same. I don't actually think that the GPL is an entity that can settle a court case; if I meant that I would have said the FSF or something. I mean that in order for it to resolve, a judge has to say that the GPL does not apply.

          If ultimately copyright holds up against the models*, the GPL will be a permanent holdout against any intellectual property-wide cross-licensing scheme. There's nobody to negotiate with other than the license itself, and it's not going to say anything it hasn't said before.

          * It hasn't done well so far, but Obama didn't appoint any SCOTUS judges so maybe the public has a chance against the corporations there.

  • ares623 14 hours ago

    Why do hard thing when easy thing do trick?

  • exasperaited 18 hours ago

    > I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.

    Haha no.

    https://windsurf.com/blog/copilot-trains-on-gpl-codeium-does...

    And just in the last two days, AI generating LGPL headers (which it could not do if identifiable LGPL code had been pulled from the training set) and misattributing authors:

    https://devclass.com/2025/11/27/ocaml-maintainers-reject-mas...

    • pessimizer 15 hours ago

      Thanks for the links.

      That first link shows people actively pulling out GPL code in 2023 and marketing around that fact, though. That's not great evidence that they're not doing it now, especially if testing whether GPL code is still in there seems to be as easy as prompting with an incomplete piece of it.

      I'd think that companies could amass a collection of all known GPL code and test for it regularly in order to refine their methods for keeping it out.

      > (which it could not do if identifiable LGPL code had been pulled from the training set)

      Are you sure about this? Linking to LGPL code is fine afaik. And why not train on code that linked to universally available libraries that are legal to use? Seems like one might even prefer it.

      Seems like this was rejected for size and slop reasons, not licensing. If the submitter of the PR isn't even fixing possibly hallucinated author names, it's obvious that they didn't really read it. Debugging vibe-coded stuff is like finding an indeterminate number of needles in a haystack.

      • habinero 7 hours ago

        They also cited potential legal reasons, aka fraud and copyright.

simgt 19 hours ago

What triggers me is how insistent Claude Code is on adding "co-authored by Claude" to commits, in spite of my settings and an instruction in CLAUDE.md. I wish all these tech bros were as willing to credit the human shoulders on which their products are built. But they'd be much less successful in our current system if they were that kind of people.

  • euazOn 19 hours ago

    Try changing the system prompt or switch to opencode [0] - they allegedly reverse engineered Claude Code, and so the performance you get with Claude models should be very similar to Claude Code.

    [0] https://github.com/sst/opencode

dmezzetti 19 hours ago

As someone who has spent a fair amount of time developing open source software, I will say I genuinely dislike copyleft and GPL.

For those who are into freedom, I don't see how dictating in such a manner how people may use what you build is in the spirit of free and open.

Just my opinion on it, to each their own on the matter.

  • myrmidon 18 hours ago

    I had a very similar view once, and have since understood that this is mainly a difference in perspective:

    It's easy as a developer to slip into a role where you want to build/package (maybe sell) some software product with minimal obligations. BSD-likes are obviously great there.

    But the GPL follows a different perspective: It tries to make sure that every user of any software product is always capable of tinkering with and changing it himself, and the more permissive licenses do not help there, because they don't prevent (or even discourage!) companies from just selling you stripped and obfuscated binary blobs that put you fully at the vendor's mercy.

    • dmezzetti 18 hours ago

      I understand people want to control what happens once they build something. Too often you see startups go with a permissive model only to switch to a more restrictive one once something like that happens. Then it ends up upsetting a lot of people.

      I'm of the opinion that what I build, I'm willing to share it and let others use it as they see fit even if it's not to my advantage.

      • myrmidon 18 hours ago

        I think the GPL mainly suffers with startups because it makes monetization pretty difficult. Some "commercial" uses of it are also giving it somewhat of an undeserved bad taste (when companies use it to benefit from free contributions while preventing competitors from getting any use out of it).

        My view is that every project and library where I can peruse the source is a gift/privilege. GPL restrictions I view as a small price to "pay it forward", and to keep that privilege for all wherever possible.

        • dmezzetti 18 hours ago

          Fair enough. You'd like to hope that there is a voluntary "pay it back and forward" mentality. But I understand that is a leap of faith with a lot of blind trust.

  • hgs3 15 hours ago

    Copyleft isn't about the software author's freedom, it's about the end-user's freedom. Copyleft grants the end-user the freedom to study and modify the code, i.e. the right to repair. Contrast this with closed-source software, which may incorporate permissively licensed code: the end-user has no right to study, no right to modify, and no right to repair. Ergo less freedom.

    • dmezzetti 14 hours ago

      I think it makes a lot of sense for hobby software and non-commercial software. It's just tough to do in a commercial setting for a number of reasons.

      So ultimately while good intentioned, you end up limiting how many people can use what you've built.

  • amenhotep 19 hours ago

    It's not dictating how you use what you build? It's dictating how you redistribute what you build on top of other people's work.

    • dmezzetti 19 hours ago

      Ok but I just have no interest in imposing restrictions on how people distribute what I build in such a manner either. That's just me.

      • mr_toad 13 hours ago

        What if they impose their own restrictions on people further down the line?

        • dmezzetti 10 hours ago

          Once it's out of my hands, so be it. The users can choose not to use derivatives.

  • gavinhoward 15 hours ago

    • em-bee 13 hours ago

      just a comment on this article, which may be unrelated to the point you want to make: gavin makes a fatal mistake in interpreting RMS's intent. he claims that RMS only wanted control over his own hardware. that is not true. RMS also wanted the right to share his code with others. the person who had the code for RMS's printer was not allowed to share that code. RMS wanted to ensure that the person who has the code is also allowed to share it. source available does not do that.

  • cdelsolar 19 hours ago

    I disagree as someone who has also spent a huge amount of time on open source software. It’s all GPL or AGPL :)

    • dmezzetti 19 hours ago

      That's your prerogative. It's just not for me and GPL is basically something I avoid when possible.

  • LtWorf 9 hours ago

    > As someone who has spent a fair amount of time developing open source software, I will say I genuinely dislike copyleft and GPL.

    GPL: Help the user

    MIT: Help some random company screw the users and save money not hiring people.

    Then again I see you're a founder at some AI company so I strongly doubt your motives and statement.

    • dmezzetti 9 hours ago

      I've spent years building something for free with no expectation of anything in return. Perhaps someone just doesn't believe in the GPL and has no ulterior motive for that.

      • gitaarik an hour ago

        What part don't you "believe" exactly? It's just a license with its particular use cases. It's up to you whether you find the use case appropriate in your circumstance. It depends on what your goal is.

  • pessimizer 18 hours ago

    As somebody who thinks that people currently own the code that they write, I wonder why you're in people's business who want to write GPL'd software.

    Are you complaining about proprietary software? I hear the restrictions are a lot tighter for Photoshop's source code, or iOS's, but for some reason you are one of the people who hate GPL as a hobby. Please don't show up whining about "spirits" when Amazon puts you out of business.

    • LtWorf 9 hours ago

      I opened the profile of the user and he's a founder of an AI company. I guess that explains it.

    • dmezzetti 18 hours ago

      I'm not in anyone's business just sharing my opinion on GPL. I understand why people go GPL / AGPL just not for me. To each their own if they want to go down that path.

rvnx 19 hours ago

GPL and copyright in general don't apply to billionaires, so pretty much a non-topic.

It's just a side cost of doing business, because asking for forgiveness is cheaper and faster than asking for permission.

  • throwaway198846 19 hours ago

    "Information wants to be free"? Many individuals pirated movies and games and got away with it. Of course two wrongs don't make a right and all that. Nonetheless one should be compensated for creating material that ai trained on for the same reasons copyright is compensated - to incentives people to produce it.

  • rando77 18 hours ago

    With an attitude like that they don't

pclmulqdq 19 hours ago

I thought the whole concept of a viral license was legally questionable to begin with. There haven't been cases about this, as far as I know, and GPL virality enforcement has just been done by the community.

  • omnicognate 19 hours ago

    The GPL was tested in court as early as 2006 [1] and plenty of times since. There are no serious doubts about its enforceability.

    [1] https://www.fsf.org/news/wallace-vs-fsf

    • zamadatix 19 hours ago

      I know it's not popular on HN to have anything but supportive statements around GPL, and I'm a big GPL supporter myself, but there is nuance in what is being said here.

      That case was important, but it's not about the virality. There have been no concluded court cases involving the virality portion causing the rest of the code to also be GPL'd, but there are plenty involving enforcement of the GPL on the GPL code itself.

      The distinction is important because the article is about the virality causing the whole LLM model to be GPL'd, not just about the GPL'd code itself.

      I'd like to think it wouldn't be a problem to enforce, but I've also never seen a court ruling truly about the virality portion to back that up either - which is all GP is saying.

      • omnicognate 18 hours ago

        There is no "virality", and the article's use of "propagation" to mean the same thing is wrong. The GPL doesn't "cause" anything to be GPLed that hasn't been explicitly licensed under the GPL by the owner of its copyright. The GPL grants a license to use the copyright material to which it applies. To satisfy the terms of that license for a particular use may require that you license other code under the GPL, but if you don't, the GPL can't magically make that code GPLed. You will, however, not be covered by the license, so unless your use is permitted for some other reason (eg. fair use or a different license you have been granted) your use of the original code will be a violation of copyright. All of this has been repeatedly tested in court.

        It's sad to see Microsoft's FUD still festering 20 years later.

        • zamadatix 15 hours ago

          Virality is a very good feature of the GPL and part of what makes it a meaningfully different choice than other open licenses; I don't know why you want to attribute that to Microsoft of all places.

          • omnicognate 15 hours ago

            A key pillar of Microsoft's FUD campaign against open source was that if you use GPL software you run the risk of inadvertently including some of it in your proprietary software and accidentally causing the whole thing to suddenly become open source against your horrified company's wishes. It was a lie then and it's a lie now. The comment I was replying to (along with many others on this post) indicates the brainworm lives on.

            The difference between a license and a contract may be too subtle for the denizens of HN to grasp in 2025 but I assure you it's not lost on the legal system. It's not lost on those of us who followed groklaw back in the day, either. Sad we have to live with an internet devoid of such joys now.

            • zamadatix 13 hours ago

              Another key pillar of Microsoft's FUD campaign was you have to open source any code modifications you write to a GPL codebase even if you don't want to. That doesn't make that feature of GPL a fallacy others must be too stupid to understand, it just means Microsoft was trying to make the promises of GPL seem bad when they were actually good. I.e. what Microsoft tried to scare people with is irrelevant to a discussion about what's in the GPL itself. Ironically, it's more akin to FUD than anything else in this conversation.

              I do miss groklaw, been far too long for something like that to appear again.

        • pessimizer 18 hours ago

          It's not Microsoft FUD, you're describing the license as viral too, but playing with words. The fact is that if you include GPL'd stuff in your stuff, that assemblage has to conform to the GPL's rules.

          You're basically saying "the GPL doesn't go back in time and relicense unrelated code." But nobody was ever claiming it does, and describing it as "viral" doesn't imply that it does. It's "viral" because code that you stick to it has to conform to its rules. It's good that the GPL is viral. I want it to be viral, I don't want people to be able to hide GPL'd code in a proprietary structure.

          • omnicognate 17 hours ago

            It's not just words, except to the extent the law is just words. You said there haven't been any cases involving the "virality portion" but there have. Just not under the "GPL makes other code GPLed" interpretation, because that, as we clearly agree, doesn't exist.

            What you're calling the "virality portion" says that one of the ways you *are* allowed to use the code is as part of other GPLed software. If you're going to look for court cases that explicitly "involve" that, it would have to be someone either:

            * using it as a defense, i.e. saying "we're covered by the GPL because the software we embedded this code in is GPL" (That will probably never happen because people don't sue GPLed projects for containing GPLed code), or

            * coming into line with the GPL by open sourcing their own code as part of resolving a case (The BusyBox case [2] was an example of that).

            If you just want cases where companies that were distributing GPL code in closed source software were prevented from doing so, the Cisco [1] and BusyBox [2] cases were both notable examples. That they were settled doesn't somehow make them a weaker "test of the GPL" - rather the companies involved didn't even attempt to argue that what they were doing was permitted. They came into line and coughed up. If you really must insist on one where the defendant dug in and the court ended up awarding damages, I don't think there have been any in the US but there has been one in France [3].

            As for "nobody was ever claiming it does", the "viral" wording has been used for as long as the GPL has been around as a scare tactic for introducing exactly that erroneous idea. Even in cases where people understand what the license says, it leads to subtle misunderstandings of the law, which is why the Free Software Foundation discourages its use. (Also, you literally said, in these exact words, "the virality causing the whole LLM model to be GPL'd".)

            [1] https://en.wikipedia.org/wiki/Free_Software_Foundation,_Inc.....

            [2] https://en.wikipedia.org/wiki/BusyBox#GPL_lawsuits

            [3] https://www.dlapiper.com/en/insights/publications/2024/03/wa...

            • zamadatix 15 hours ago

              I do greatly appreciate you talking about cases instead of leaving it at saying there isn't a part of the license and calling any discussion about it FUD.

              The Cisco case was about distributing GPL binaries, not linking it with the rest of the code base and the rest of that code base then needing to be GPL. It's a standard license enforcement unrelated to the unique requirements of GPL.

              The BusyBox case is probably the closest in the list, but as you already point out we didn't get a ruling to set precedent and instead got a settlement. It seems obvious what the ruling would be (to me at least), but settlement was explicitly not what is being talked about.

              Bringing in French courts, they issued fines - they didn't issue the type of order this article talks about which is about releasing the entire thing involved at the time with GPL.

              This isn't related to fear, uncertainty, or doubt about GPL. It's related to what has/hasn't already been ruled in the court systems handling this license before as the article skips past a bit. Even in the case we assume the courts will rule with what seems obvious (to me at least), it has a tangible difference in how these cases will be run, the assumptions they will have, and how long they will last.

              • omnicognate 14 hours ago

                TBC, I'm not talking about the article, which I've barely read but looks rather misguided as it seems to be talking about LLMs having to be GPLed because of training data, which is not something that would ever happen.

                It has never been the case that including GPL code in your software automatically makes your software GPL or even requires you to make it GPL. If you do get sued because you are distributing GPL code in a way that colloquially "violates the GPL" (technically, rather, in way that is not covered by the GPL or by fair use or any other licence, so it violates copyright) you might choose to GPL your code as a way of coming into compliance, but doing so is neither the only way to achieve compliance (you can instead remove the GPL code, and companies with significant investments in their proprietary code typically do that), nor a remedy for the harm done by your copyright violation to date, which you will typically have to remedy financially, via damages or a settlement.

                As for legal testing, you seem to be wanting a court to explicitly adjudicate against something so obviously wrong that in well over 20 years of FSF enforcement (edit: actually around 40 years) no company has been daft enough to try and argue it in court.

                It might help if you try and delineate exactly what sort of case you'd accept as proof of "enforceability" of "virality". I think it would have to be something like a company embedding GPL code in proprietary code and then trying to argue in court that doing so is explicitly permitted by the GPL, and sticking to their guns all the way to a verdict against them. I'm not sure whether that argument would be considered frivolous enough to get the lawyers involved censured, but I certainly doubt a judge would be impressed.

                If it helps make it any clearer, if in defending against a case like this your lawyer were to try and argue that the GPL is invalid and somehow just void, you should fire them immediately because they're trying to do the legal equivalent of shooting their own feet off. The GPL is what allows distribution of code, and allowing things is all it can do, because it is a license (not a contract). It can't forbid anything, and removing it from the equation can only decrease the set of things you are allowed to do with the copyrighted code.

                • zamadatix 12 hours ago

                  If you've barely read the article, then I can understand why you mistook what the comment was responding to regarding the virality portions of said article. The article is pretty much an argument that "because of court cases, this is how propagation of GPL to the whole LLM model could be forced in the future", and the comment was saying there is actually no case precedent at all about that, because it never gets there (cases usually end on something obvious with the copyright portions alone, or with a settlement to release the GPL portions).

                  Including GPL code in your app requires/results in different things depending on how you do that. E.g. the way Cisco did it with binaries is different from static linking, which is different from dynamic linking/syscalls/APIs, which is different from copying the code in directly. It's not possible to talk about it as generically as above, especially in the context of a discussion around an article adding a new method of interaction.

                  Yes, the point is precisely that the article explicitly asks about this point being tested in court rulings, and the comment is that it has never needed to go beyond settlement (usually not even that far). I also don't really agree with how the article assumes things around that in a few places, but that's neither here nor there to this point.

                  It's not that I want proof, it's that the article you admit to not reading sets out to look at court cases to "consider the path through which the theory of license propagation to AI models might be recognized in the future". In that regard it's pretty relevant to note that no past court case, nor really either of the two ongoing ones in the article, involves propagation of the license to the whole entity yet.

    • pclmulqdq 19 hours ago

      That case has little to do with the license itself and nothing to do with its virality.

      • omnicognate 18 hours ago

        As I said, that was merely the first of many. And there is no such thing as "virality" - see my answer to the sibling to your comment.

        The "enforceability" of the GPL was never in any doubt because it's not a contract and doesn't need to be "enforced". The license grants you freedoms you otherwise may not have under copyright. It doesn't deny you any freedoms you would otherwise have, and it cannot do so because it is not a contract. If the terms of the GPL don't apply to your use then all you have is the normal freedoms under copyright law, which may prohibit it. If so, any "enforcement" isn't enforcement of the GPL. It's enforcement of copyright, and there's certainly no doubt on the enforceability of that.

        For the GPL to "fail" in court it would have to be found to effectively grant greater freedoms than it was designed to do (or fewer, resulting in some use not being allowed when it should be, but that's not the sort of case being considered here). It doesn't, and it has repeatedly stood up in court as not granting freedoms beyond those intended.

        • pclmulqdq 16 hours ago

          Look at the "many" if you want to cite better cases about this.

          • omnicognate 16 hours ago

            I've cited further cases elsewhere in this thread. If you'd like to test this yourself feel free to repackage some GPL software as closed source, flog it to people and see what happens. I'm not your lawyer.

            https://news.ycombinator.com/item?id=46070191

            • pclmulqdq 11 hours ago

              I see more discussion in the sibling thread, but I am still waiting for a case actually relevant to the virality of the GPL. The article is about how use of GPL code can "infect" LLM weights and force their disclosure, and I am personally somewhat doubtful that is legally the case. I am also further convinced that this theory has never been given a real test.

  • CamouflagedKiwi 19 hours ago

    There have been a number of cases, which are linked from Wikipedia (https://en.wikipedia.org/wiki/GNU_General_Public_License#Leg...) - most recently Entr’Ouvert v. Orange had a strong judgement (under French law) in favour of the GPL.

    Conversely, to my knowledge there has been no court decision that indicates that the GPL is _not_ enforceable. I think you might want to be more familiar with the area before you decide if it's legally questionable or not.

    • pclmulqdq 19 hours ago

      I'm not suggesting that you avoid following it. I'm just not that convinced it's enforceable in the US. The French ruling is good, though.

  • iso1631 19 hours ago

    If you don't like the license, then don't accept it.

    You are then restricted by copyright just like with any other creation.

    If I include the source code of Windows into my product, I can't simply choose to re-license it to say public domain and give it to someone else, the license that I have from Microsoft to allow me to use their code won't let me - it provides restrictions. It's just as "viral" as the GPL.

    • pclmulqdq 19 hours ago

      I like the GPL. I just don't know how much you can actually enforce it.

      Also, "don't use my code" is not viral. If you break the MSFT license, you pay them, which is a very well-tested path in courts. The idea of forced public disclosure does not seem to be.

      • iso1631 18 hours ago

        How much do you pay them?

        If the GPL license didn't exist and you were instead just relying on copyright, then that's an injunction. You have to stop using the code you "stole" and pay reparations.

        In UK law, if you distribute copyright material in the course of a business you can be facing 10 years in prison and an unlimited fine.

        Sure, you can't get them to agree to the GPL; they could simply stop distributing and then turn up for their stint in prison and massive fine. In reality I suspect they would take the easy way out and comply with the license.

        • pclmulqdq 16 hours ago

          You pay them an amount determined by the court or your settlement, and you also have to stop using the code. This is how everything works.

          Corporations can't go to prison.

uyzstvqs 17 hours ago

Training is not redistribution. It's exactly the same as you, as a person, learning to program from proprietary secret code and then writing your own original code independently. Even if you repeat patterns and methods you've picked up from that proprietary learning material, it is by no means redistribution. The practical differentiator here is that you do not access the proprietary material during the creation of your own original work, similar in principle to a clean-room design. With AI/ML, it matters that training data is not accessed during inference, which it's not.

The other factor of copyright, which is relevant, is how material is obtained. If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use. If you don't want AI training to be done on your work, you need to put access to it behind explicit authentication with a legally-binding user agreement prohibiting that use-case. Do note that this would lose your project's status as open-source.

  • ndiddy 17 hours ago

    > Training is not redistribution. It's exactly the same as you, as a person, learning to program from proprietary secret code and then writing your own original code independently.

    Well, the difference is that copyright law applies to work fixed in a tangible medium of expression. This covers e.g. model weights on a hard drive, but not the human brain. If the model is able to reproduce others’ work verbatim (like the example the article brings up of the song lyrics) then under copyright law that’s unauthorized reproduction. It doesn’t matter that the data is expressed via probabilistic weights, because due to past lobbying/lawsuits by the software industry to get compiled binary code covered by copyright, reproduction can include copies that aren’t directly human readable.

    > If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use.

    There’s over 20 years of successful GPL infringement lawsuits over unlicensed use of publicly available GPL code that disagrees with this point.

  • luqtas 17 hours ago

    so basically we download the source files into the training weights and remove the LICENSE.MD, as it's exactly the same as a person learning to program from proprietary secret code and outputting code based on that for millions of people in a matter of seconds /s

    we also treat public goods found on the internet however we want, as if the World Intellectual Property Organization Copyright Treaty and the Berne Convention for the Protection of Literary and Artistic Works aren't real, or because we can, as we are operating in international waters, selling products to other sailors living exclusively in international waters /s

    • tpmoney 13 hours ago

      If you download GPL source code and run `wc` on its files and distribute the output of that, is that a violation of copyright and the GPL? What if you do that for every GPL program on github? What if you use python and numpy and generate a list of every word or symbol used in those programs and how frequently they appear? What if you generate the same frequency data, but also add a weighting by what the previous symbol or word was? What if you did that an also added a weighting by what the next symbol or word was? How many statistical analyses of the code files do you need to bundle together before it becomes copyright infringement?
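
      For concreteness, the wc-style aggregate over a whole codebase is just something like this (a sketch; the src/ path and *.c glob are made-up placeholders):

          from pathlib import Path

          # Aggregate line/word/character counts over every file, wc-style.
          lines = words = chars = 0
          for path in Path("src").rglob("*.c"):
              text = path.read_text(errors="ignore")
              lines += text.count("\n")
              words += len(text.split())
              chars += len(text)
          print(lines, words, chars)  # the entire published "analysis"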

      • sfink 12 hours ago

        The line is somewhere between running wc on the entire input and running gzip on the entire input.

        The fact that a slippery slope is slippery doesn't make it not a slope.

        • tpmoney 8 hours ago

          Of course there is a line. And everything we know about how AI models work points to them being on the ‘wc’ side of the line

          • sfink 7 hours ago

            Not the way I see it.

            The argument that GPL code is a tiny minority of what's in the model makes no sense to me. (To be clear, you're not making this argument.) One book is a tiny minority of an entire library, but that doesn't mean it's fine to copy that book word for word simply because you can point to a Large Library Model that contains it.

            LLMs definitely store pretty high-fidelity representations of specific facts and procedures, so for me it makes more sense to start from the gzip end of the slope and slide the other way. If you took some GPL code and renamed all the variables, is that suddenly ok? What if you mapped the code to an AST and then stored a representation of that AST? What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program but the specific control flow and variable names and comments are different? It would be the analogue of (lossy) perceptual coding for audio compression, only instead of "perceptual" it's "functional".

            This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.

            It also feels a lot closer to 'gzip' than 'wc', imho.
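
            The variable-renaming rung of that slope is mechanical enough to sketch (using Python's ast module; ast.unparse needs Python 3.9+):

                import ast

                class Renamer(ast.NodeTransformer):
                    """Rename every variable while leaving the program's structure untouched."""
                    def __init__(self):
                        self.names = {}

                    def visit_Name(self, node):
                        new_id = self.names.setdefault(node.id, f"v{len(self.names)}")
                        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)

                src = "total = 0\nfor item in items:\n    total = total + item\n"
                print(ast.unparse(Renamer().visit(ast.parse(src))))
                # prints the same program with the names replaced by v0, v1, v2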

            • tpmoney 7 hours ago

              > LLMs definitely store pretty high-fidelity representations of specific facts and procedures

              Specific facts and procedures are explicitly NOT protected by copyright. That's what made cloning the IBM BIOS legal. It's what makes emulators legal. It's what makes the retro-clone RPG industry legal. It's what made Google cloning the Java API legal.

              > If you took some GPL code and renamed all the variables, is that suddenly ok?

              Generally no, not sufficiently transformative.

              > What if you mapped the code to an AST and then stored a representation of that AST?

              Generally no, binary distribution of software is considered a violation of copyright.

              > What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program but the specific control flow and variable names and comments are different?

              This starts to get a lot fuzzier. De-compilation is legal. Creating programs that are functionally identical to other programs is (generally) legal. Creating an emulator for a system is legal. Copyright protects a specific fixed expression of a creative idea, not the idea itself. We don't want to live in the world where Wine is a copyright violation.

              > This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.

              And yet, so far no one has brought a legal case against the AI companies for being able to extract their copyright protected material from the models. The few early examples of that happening are things that model makers explicitly attempt to train out of their models. It's unwanted behavior that is considered a bug, not a feature. Further the fact that a machine is able to violate copyright does not in and of itself make the machine itself a violation of copyright. See also Xerox machines, DeCSS, Handbrake, Plex/Jellyfin, CD-Rs, DVRs, VHS Recorders etc.

              • sfink 6 hours ago

                > Specific facts and procedures are explicitly NOT protected by copyright.

                No argument there, and I'm grateful for the limits of copyright. That part was only for describing what LLM weights store -- just because the literal text is not explicitly encoded doesn't mean that facts and procedures aren't.

                > Copyright protects a specific fixed expression of a creative idea, not the idea itself.

                Right. Which is why it's weird to talk about the weights being derivative works. Weird but perhaps not wrong: if you look at the most clear-cut situation where the LLM is able to reproduce a big chunk of input bit-for-bit, then the fact that its basis of representation is completely different doesn't feel like it matters much. An image that is lossily compressed, converted to a bitstream, and encoded in DNA is very very different than the input, but if an image can be recovered that is indistinguishable or barely distinguishable from the original, I'd still call that copying and each intermediate step a significant but irrelevant transformation.

                > This starts to get a lot fuzzier. De-compilation is legal.

                I'm less interested in what the legal system is currently capable of concluding. I personally don't think the laws have caught up to the present reality, so present-day legality isn't the crucial determinant in figuring out how things "ought" to work.

                If an LLM is completely incapable of reproducing input text verbatim, yet could become so through targeted ablation (that does not itself incorporate the text in question!), then does it store that text or not?

                I'm not sure why I'm even debating this, other than for intellectual curiosity. My opinion isn't actually relevant to anyone. Namely: I think the general shape of how this ought to work is pretty straightforward and obvious, but (1) it does not match current legal reality, and more importantly, (2) it is highly inconvenient for many stakeholders (very much including LLM users). Not to mention that (3) although the general shape is pretty clear in my head, it involves many, many judgment calls, such as the ones we've been discussing here, and the general shape of how it ought to work isn't going to help make those calls.

                • tpmoney 5 hours ago

                  > An image that is lossily compressed, converted to a bitstream, and encoded in DNA is very, very different from the input, but if an image can be recovered that is indistinguishable or barely distinguishable from the original, I'd still call that copying, and each intermediate step a significant but irrelevant transformation.

                  Sure, as a broad rule of thumb that works. But the ability of a machine to produce a copyright violation doesn't mean the machine itself, or distributing the machine, is a copyright violation. To take an extreme example, if we put a room full of infinite monkeys at infinite typewriters and they generate a Harry Potter book, that doesn't mean Harry Potter is stored in the monkey room. If we have a random sound generator that produces random tones from the standard western musical note palette and it generates the bass line from "Under Pressure", that doesn't mean our random sound generator contains or is a copy of "Under Pressure", even if we encoded all the same information and procedures for generating those individual notes at those durations among the data and procedures we gave the machine.
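
                  To put rough numbers on the analogy, here's a sketch of such a generator; the 12-note alphabet and the durations are invented for illustration:

                      import random

                      NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
                      DURATIONS = [0.25, 0.5, 1.0]  # quarter, half, whole; illustrative

                      def random_phrase(length):
                          # the generator "knows" only the alphabet, not any particular song
                          return [(random.choice(NOTES), random.choice(DURATIONS)) for _ in range(length)]

                      # chance of emitting one specific 16-event phrase by pure luck:
                      p = (1 / (len(NOTES) * len(DURATIONS))) ** 16
                      print(f"{p:.1e}")  # ~1.3e-25: possible, but the phrase isn't stored anywhere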

                  > If an LLM is completely incapable of reproducing input text verbatim, yet could become so through targeted ablation (that does not itself incorporate the text in question!), then does it store that text or not?

                  I would argue not. Just as a Xerox machine doesn't contain the books you copy with it, Handbrake doesn't contain the DVDs you rip with it.

                  I would further argue that copyright infringement is inherently a "human" act. It's sort of encoded in the language we use to talk about it (e.g. "fair use"), but it's also something of an "if a tree falls in the woods" situation. If an LLM runs in an isolated room in an isolated bunker with no one around and generates verbatim copies of the Linux kernel, that frankly doesn't matter. On the other hand, if a Microsoft employee induces an LLM to produce verbatim copies of the Linux kernel, that does matter, especially if they did so with the intent to incorporate Linux kernel code into Windows. Not because of the LLM, but because a person made the choice to produce a copy of something they didn't have the right to copy. The method by which they accomplished that copy is less relevant than making the copy at all, and that in turn is less relevant than the intent of making that copy for a purpose not allowed by copyright law.

                  > I'm not sure why I'm even debating this, other than for intellectual curiosity.

                  Frankly, that's the only reason to debate anything. 99% of the time, you as an individual will never have the power to influence the actual legal decisions made. But an intellectually curious conversation is infinitely more useful, not just to you and me but to other readers, than another retread of the "AI is slop" / "you're just jealous you can't code your way out of a paper bag" arguments that pervade so much discussion around AI. Or, worse, yet another "I used an LLM for a clearly stupid thing and it was stupid" or "I used an LLM to replace all my employees and I'm sure it's going to go great" blog post. For whatever acrimony there might have been in our exchange here, I'm sorry; this sort of discussion is the only good way to exercise our thoughts on an issue and really test them out ourselves. It's easy to have a knee-jerk opinion. It's harder to support that opinion with a philosophy and reasoning.

                  For what it's worth, I view the LLM/AI world as the best opportunity we've had in decades to really rethink and scale back/change how we deal with intellectual property: the ever-expanding copyright terms, the sometimes bizarre protections of what seem to be blindingly obvious ideas. The technological age has demonstrated a number of weaknesses in the traditional systems and views. And frankly, I think it's also demonstrated that many prior predictions of certain doom if copyright wasn't strictly enforced have been overwrought, and even where they haven't been, the actual result has been better for more people. Famously, IBM would have very much preferred to win the BIOS copyright issue, but I think so many people in the modern computer and tech industry owe their careers to the effects of that decision. It might have been better for IBM if IBM had won; it's not clear at all that it would have been better for "[promoting] the Progress of Science and useful Arts".

                  We could live in a world where we recognize that LLMs and AIs are going to fundamentally change how we approach creative works. We could recognize that the intents of "[promoting] the Progress of Science and useful Arts" is still a relevant goal and something we can work to make compatible with the existence of LLMs and AI. To pitch my crazy idea again, we could:

                  1) Cut the terms of copyright substantially, back down to 10 or 15 years by default.

                  2) Offer a single extension that doubles that term, but only on the condition that the work is submitted to a central "library of congress" data set.

                  3) This could be used to produce known-good and clean data sets for AI companies and organizations to train models from, with the protection that any model trained from this data set is shielded from copyright infringement claims for works in the data set. Heck, we could even produce common models. This would save massive amounts of power and resources by removing the need for everyone who wants to be in the AI space to go out and acquire, digitize, and build their own library. The MNIST digits set is effectively the "hello world" data set for anyone learning computer vision; let's do that for all sorts of AI.

                  4) The data sets and models would be provided for a nominal fee; the fee would be used to pay royalties to people whose works are still under copyright and are in the data sets, proportional to the recency and quantity of work submitted. A cap would need to be put in place to prevent flooding the data set to game the royalties. These royalties would be part of recognizing the value the original works contributed to the data set, and would act as a further incentive to contribute works to the system, and to contribute them sooner.
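
                  A purely hypothetical sketch of how that split might be computed; the decay curve, the cap, and the names are all invented for illustration:

                      CAP_SHARE = 0.5  # hypothetical: no single contributor may claim more than half the pool

                      def royalty_shares(contributors, pool):
                          """contributors maps a name to the ages (in years) of their submitted works."""
                          # newer works weigh more: a brand-new work counts 1.0 and decays with age
                          weights = {name: sum(1.0 / (1 + age) for age in ages)
                                     for name, ages in contributors.items()}
                          total = sum(weights.values())
                          # cap each share; any capped-off remainder stays in the pool
                          return {name: pool * min(w / total, CAP_SHARE)
                                  for name, w in weights.items()}

                      print(royalty_shares({"alice": [0, 1], "bob": [9]}, pool=10_000.0))
                      # -> {'alice': 5000.0, 'bob': 625.0}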

                  We could build a system like this, or tweak it, or even build something else entirely. But only if we stop trying to cram how we treat AI and LLMs and the consequences of this new technology into a binary "allowed / not allowed" outcome as determined by an aging system that has long needed an overhaul.

                  So please, continue to debate for intellectual curiosity. I'd rather spend hours reading a truly curious exploration of this than another manifesto about "AI slop".