This is cool, but it's only the first part of extracting an ML model for use. The second part is reverse engineering the tokenizer and input transformations that are needed before passing the data to the model, and converting the output into a human-readable format.
It would be interesting if someone could detail the approach to decoding the pre- and post-processing steps around the model, and how to find the correct input encoding.
Boils down to "use Frida to find the arguments to the TensorFlow call beyond the model file"
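For what it's worth, a minimal sketch of that idea using Frida's Python bindings, hooking a TFLite call and dumping its arguments. The package name, and whether the app exports the TFLite C API symbol at all, are assumptions that vary per target, so treat this as a starting point rather than a recipe:

    # Hypothetical sketch: attach to an app over USB and log calls into TFLite's C API.
    # The package name and the exported symbol are assumptions and will differ per target.
    import sys
    import frida

    JS = """
    const invoke = Module.findExportByName(null, "TfLiteInterpreterInvoke");
    if (invoke !== null) {
        Interceptor.attach(invoke, {
            onEnter(args) {
                // args[0] is the TfLiteInterpreter*; from here you can inspect the
                // input tensors to see shapes, dtypes, and the preprocessed data.
                send("TfLiteInterpreterInvoke(interpreter=" + args[0] + ")");
            }
        });
    } else {
        send("TfLiteInterpreterInvoke not exported; hook the Java/JNI layer instead.");
    }
    """

    device = frida.get_usb_device()
    session = device.attach("com.example.targetapp")  # placeholder package name
    script = session.create_script(JS)
    script.on("message", lambda message, data: print(message))
    script.load()
    sys.stdin.read()  # keep the hooks alive until EOF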
The key here is that a binary model is just a bag of floats with primitively typed inputs and outputs.
It's ~impossible to write up more than what's here because either:
A) you understand reverse engineering and model basics, in which case the current content makes it clear you'd use Frida to figure out how the arguments are passed to TensorFlow,
or
B) you don't understand that this is a binary reverse engineering problem, even when shown Frida. If more content were provided, you'd see it as specific to a particular problem. Which it has to be. You'd also need a walkthrough by hand about batching, tokenization, and so on and so forth, too much for a write-up, and it'd be too confusing to follow for any other model.
TL;DR: a request for more content is asking for a reverse engineering article to give you a full education on model inference
The more impolite version of this basically says "If you can't figure out you're supposed to also use Frida to check the other arguments, you have no business trying." I agree, though, so I wrote the more polite version.
> TL;DR: a request for more content is asking for a reverse engineering article to give you a full education on model inference
I don't understand what you mean: I have no clue about anything related to reverse engineering, but I ported the mistral tokenizer to Rust and also wrote a basic CPU Llama training and inference implementation in Rust, so I definitely wouldn't need an intro to model inference…
You're also not the person I'm replying to, nor do you appear anywhere in this comment chain, so I've definitely not implied you need an intro to inference, so I'm even more confused than you :)
This is a good comment, but only in the sense that it documents that a model file doesn't run the model by itself.
An analogous situation is seeing a blog that purports to "show you code", where the code returns an object, and commenting "This is cool, but doesn't show you how to turn a function return value into a human readable format". More noise than signal.
The techniques in the article are trivially understood to also apply to discovering the input tokenization format, and Netron shows you the types of inputs and outputs.
Just having the shapes of the input and output is not sufficient: the image (in this example) needs to be normalized. It's presumably not difficult to find the exact numbers, but it is a source of errors when reverse engineering an ML model.
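To illustrate what has to be recovered, here is a minimal preprocessing sketch. The resolution, mean, and std constants below are placeholder assumptions, not the ones this app actually uses; getting them wrong is exactly the kind of silent error described above:

    # Hypothetical preprocessing for a vision model: the constants below are
    # assumptions for illustration and must be recovered from the target app.
    import numpy as np
    from PIL import Image

    INPUT_SIZE = (224, 224)                   # assumed fixed input resolution
    MEAN = np.array([0.485, 0.456, 0.406])    # assumed per-channel means
    STD = np.array([0.229, 0.224, 0.225])     # assumed per-channel stds

    def preprocess(path: str) -> np.ndarray:
        img = Image.open(path).convert("RGB").resize(INPUT_SIZE)
        x = np.asarray(img, dtype=np.float32) / 255.0   # scale to [0, 1]
        x = (x - MEAN) / STD                            # normalize per channel
        return x[np.newaxis, ...]                       # add batch dim: (1, H, W, 3)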
A lot of comments here seem to think that there's no novelty. I disagree. As a new ML engineer I am not very familiar with any reverse engineering techniques, and this is a good starting point. It's about ML yet simple enough to follow, and my 17-year-old cousin, who is eager to get into cyber security, would love this article. Maybe it's too advanced for him!
My general writing style is directed mainly towards my non-technical colleagues, whom I wish to inspire to learn about computers.
This is no novelty, by far; it is a pretty standard use case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just don't have it yet".
Especially in mobile apps, many devs get sloppy on their mobile APIs because you can't just open the developer tools.
I think you are starting off from the perfect direction, being a forward-engineer first, and then a reverse-engineer.
The community around Frida is a) a bit small and b) a bit unorganized/shadowy. You cannot find that many resources, at least I have not found them.
I would suggest you use Objection, explore an app, enumerate the classes with android hooking list classes or android hooking search classes, then dynamically watch and unwatch them. That is the quickest way to start; when you start developing your own scripts you can always check out code at https://codeshare.frida.re/.
For everything else, join the Frida Telegram chat, most of the knowledge sits there. I am also there, feel free to reach out to @altayakkus
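If you are curious what the class enumeration does under the hood, here is a rough sketch of the equivalent driven from Frida's Python bindings; the package name and the filter string are placeholders, not anything the article or Objection actually uses:

    # Rough equivalent of Objection's class enumeration, as a plain Frida script.
    # The package name and the "tensorflow" filter are placeholders for illustration.
    import sys
    import frida

    JS = """
    Java.perform(function () {
        Java.enumerateLoadedClasses({
            onMatch: function (name) {
                if (name.indexOf("tensorflow") !== -1) {
                    send(name);   // only report matching classes to keep output readable
                }
            },
            onComplete: function () {
                send("enumeration done");
            }
        });
    });
    """

    device = frida.get_usb_device()
    session = device.attach("com.example.targetapp")  # placeholder package name
    script = session.create_script(JS)
    script.on("message", lambda message, data: print(message))
    script.load()
    sys.stdin.read()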
Oh and btw, I would start with Android, even though iOS is fun too, and I would really really suggest getting a rooted phone/emulator. For the Android Studio Emulator you can use rootAVD (GitHub), just install Magisk Frida.
Installing the Frida gadget into APKs is a mess which you won't miss when you go root.
One thing I noticed in Gboard is that it uses homomorphic encryption to do federated learning of common words used among the public, to do encrypted suggestions.
E.g. there are two common spellings of bizarre which are popular on Gboard: bizzare and bizarre.
they have a very "interesting" definition of private data in the paper. it's so outlandish that if you buy their definition, there's zero value in the trained data. heh.
they also claim unsupervised users typing away is better than tagged training data, which explains the wild grammar suggestions in the top comment. guess the age of quantity over quality is finally peaking.
in the end it's the same as grammarly but without any verification of the ingested data, and calling the collection of user data "federation"
Author here, no clue about homomorphic (or whatever) encryption; what could certainly be done is some sort of encryption of the model, tied to the inference engine.
So e.g.: Apple CoreML issues a public key, the model is encrypted with that public key, and somewhere in a trusted computing environment the model is decrypted using a private key and then run for inference.
They should of course use multiple keypairs etc. but in the end this is just another obstacle in your way.
When you own the device, root it or even gain JTAG access to it, you can access and control everything.
And matrix multiplication is a computationally expensive process, into which I guess they won't add some sort of encryption technique for each and every cycle.
In principle, device manufacturers could make hardware DRM work for ML models.
You usually run inference on GPUs anyway, and they usually have some kind of hardware DRM support for video already.
The way hardware DRM works is that you pass some encrypted content to the GPU and get a blob containing the content key from somewhere, encrypted in a way that only this GPU can decrypt. This way, even if the OS is fully compromised, it never sees the decrypted content.
You're right about the TPM, I won't get the key out of it. It's a special ASIC which doesn't even have the silicon gates to give me the key.
But is the TPM doing matrix multiplication at 1.3 petaflops?
Or are you just sending the encrypted file to the TPM, getting the unencrypted file back from it, which I can intercept, be it on SPI or by gaining higher privileges on the core itself? Just like with this app, but lower down?
Whatever core executes the multiplications will be vulnerable in some way or another to a motivated attacker who has the proper resources. This is true for every hardware device, but the attack vector of someone jailbreaking a Nintendo Switch by using an electron microscope and an ion-beam mill is negligible.
If you are that paranoid about your AI models being stolen, they are presumably worth it, so some attacker will have enough motivation to power through.
Stealing the private key out of a GPU, which allows you to steal a lot of valuable AI models, is break-once-break-everywhere.
Apple's Secure Enclave is also just a TPM with different branding, or maybe an HSM, dunno.
I'll concede you are correct that whether the key is extractable or not doesn't really matter if the GPU will eventually need to store the decrypted model in memory.
However, if Nvidia or similar were serious about securing these models, I'd be pretty sure they could integrate the crypto into the hardware multipliers etc. such that the model doesn't need to be decrypted anywhere in memory.
But at this point there isn't much value in deploying models to the edge. Particularly the type of models they would really want to protect as they are too large.
The types of models deployed to edge devices (like the Apple ones) are generally quite small and frankly not too difficult (computationally) to reimplement.
I’m a huge fan of ML on device. It’s a big improvement in privacy for the user. That said, there’s always a chance for the user to extract your model, so on-device models will need to be fairly generic.
(and a bunch of people seem to be interested in the "IP" note, but I took it as just trying not to run into legal trouble for advertising "here's how you can 'steal' models!")
frida is an amazing tool - it has empowered me to do things that would have otherwise taken weeks or even months. This video is a little old, but the creator is also cracked https://www.youtube.com/watch?v=CLpW1tZCblo
It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.
EDIT: This isn't really an attack imo. If you are going to take "secrets" and shove them into a mobile app, they can't really be considered secret. I suppose it's a tradeoff - if you want to do this kind of thing client-side, the secret sauce isn't so secret.
To be honest, that was my first thought on reading that headline as well. Given that especially those large companies (but who knows how smaller ones got their training data) got a huge amount of backlash for their unprecedented collection of data all over the web, and not just there but everywhere else, it's kinda ironic to talk about intellectual property.
If you use one of those AI models as a basis for your AI model, the real danger could be that the owners of the originating data come after you at some point as well.
Standard corporate hypocrisy. "Rules for thee, not for me."
If you actually expected anything to be open about OpenAI's products, please get in touch, I have an incredible business opportunity for you in the form of a bridge in New York.
They got backlash, but (if I'm not mistaken) it was ruled that it's okay to use copyrighted works in your model.
So if a model is copyrighted, you should still be able to use it if you generate a different one based on it. I.e. copyright laundry. I assume this would be similar to how fonts work. You can copyright a font file, but not the actual shapes. So if you re-encode the shapes with different points, that's legal.
But, I don't think a model can be copyrighted. Isn't it the case that something created mechanically can't be copyrighted? It has to be authored by a person.
I find it weird that so many hackers go out of their way to approve of the legal claims of Big AI before it's even settled, instead of undermining Big AI. Isn't the hacker ethos all about decentralization?
Standard disclaimer. Like inserting a bunch of "hypothetically"s in a comment telling someone where to find some piece of abandoned media, where using an unsanctioned channel would entail infringing upon someone's intellectual property.
I understand that it's not very clear if the neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaks GPT-4o, it isn't magically public domain and everyone can just use it. I think lawmakers would start to sue AWS if they just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.
But it is also a "cover my ass" notice, as others said; I live in Germany and our law regarding "hacking" is quite ancient.
The simple fact that models are released under license, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.
It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.
1. the current, unproven-in-court legal understanding,
2. standard disclaimer to cover OP's ass
3. tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data
> reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data
Prevalent or not, phrased this way it's clear how nonsensical it is. The data isn't hurt or destroyed in the process of being trained on, nor does the process deprive the data owners of their data or the opportunity to monetize it the way they ordinarily would.
The right terms here are "learning from", "taking inspiration from", not "being a parasite".
(Now, feeling entitled to rent because someone invented something useful and your work accidentally turned out to be useful, infinitesimally, in making it happen - now that is wanting to be a parasite on society.)
I think the bad part of it is stripping consent from the original creators, after they published their work. I personally see it as an unfortunate side-effect of change. The artists of the future can create with AI already in mind, but this was not the privilege of the current, and previous generations.
Getting back to "learning from", I think the issue is not the learning part, but the recreation part. AI can churn out content at rates orders of magnitude higher than before, even compared to the age of Fiverr and similar tools and opportunities. This changes the dynamics of the interaction, because previously it took someone tens of hours to create something, and now it takes AI minutes. That is not participating in the same playing field, it's absolutely dominating it, completely changing it. That is something to have feelings about, especially if one's livelihood is also impacted. Data is not destroyed, and neither is its ownership, but people don't usually want the exact thing, they are content with a good enough thing, and this takes away a lot of power from the artists, whose work is the lifeblood of artistic AI in the first place.
So I don't think it's as nonsensical as you state. But I do understand that it's not cut and dried the other way around either. Gatekeeping culture is definitely not a humane thing to do. Culture comes and goes, intermingles, inspires and changes all the time, and people take from it and add to it all the time. Preserving copyright perfectly would neuter it, and slant the landscape even more towards the already powerful.
If I understand the position of major players in this field, downloading models in bulk and training an ML model on that corpus shouldn't violate anybody's IP.
IANAL, but this is not true: it would be a piece of the software. If there is a copyright on the app itself, it would extend to the model. Even models have licenses; for example, LLaMA is released under this license [1]
The fact that model creators assert that they are protected by copyright and offer licenses does not mean:
(1) That they are actually protected by copyright in the first place, or
(2) That the particular act described does not fall into an exception to copyright like fair use, exactly as many model creators assert that the exact same act done with the materials models are trained on does, rendering the restrictions of the license offered moot for that purpose.
The difference is that you pulling out a model is you potentially violating copyright, while the model itself being trained on copyrighted material is potentially them violating copyrights.
I.e. the first one concerns you, the other is none of your business.
Them potentially violating my copyrights is very much my business. But you're right, the difference is how much the respective parties have to spend on legal battles.
Simply showing up to court wearing a tshirt that says "what she said" probably wouldn't fly, but I like to imagine that any arguments made by them about their copyrights would be equally true of my copyrights.
At this point I'm mostly wondering if "you ripped me off first" is a viable legal defense to copyright battles where it's unclear if either party is distributing the works of the other. One thing is for sure though, if I were to do this as an individual, the discovery process would be much more expensive for them than me.
An example for legal reference might be convolution reverb. Basically it's a way to record what a fancy reverb machine does (using copyrighted, complex math algorithms) and cheaply recreate the reverb on my computer. It seems like companies can do this as long as they distribute protected reverbs separately from the commercial application. So Liquidsonics (https://www.liquidsonics.com/software/) sells reverb software but offers for free download the 'protected' convolution reverbs, specifically the Bricasti ones in dispute (https://www.liquidsonics.com/fusion-ir/reverberate-3/)
Also, while a SQL server can be copyright protected, a SQL database is not, by extension, given copyright protection/ownership to the SQL server software's creators.
Can you launder an AI model by feeding it to some other model or training process? After all, that is how it was originally created. So it cannot be any less legal...
There is a family of techniques, often called something like "distillation". There are also various synthetic training data strategies; it's a very active area of research.
As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.
That sentiment is ethically sound and logically robust and directionally consistent with any uniform application of the law as written.
But there is a group of people, growing daily in influence, who utterly reject such principles as either worthy or useful. This group of people is defined by the ego necessary to conclude that when the stakes are this high, the decisions should be made by them, that the ends justify the means on arbitrary antisocial behavior (c.f. the behavior of their scrapers) as long as this quasi-religious orgasm of singularity is steered by the firm hand that is willing and able to see it through.
That doesn’t distress me: L Ron Hubbard has that.
It distresses me that HN as a community refuses to stand up to these people.
For app developers considering tflite, a safer way would be to host the models on Firebase and delete them when their job is done. It comes with other features like versioning for model updates, A/B tests, lower APK size, etc.
https://firebase.google.com/docs/ml/manage-hosted-models
That wouldn't help against the technique explained in the article, would it? Since the model makes its way onto the device, it can be intercepted in a similar fashion.
I'm not quite sure I understand the firebase feature btw. From the docs, it's pretty much file storage with a dedicated API? I suppose you can use those models for inference in the cloud, but still, the storage API seems redundant.
I think the comment author means offering inference via Firebase, with the model never leaving the backend.
This works, just like ChatGPT works, but has the downsides of:
1. You have to pay for the compute for every inference
2. Your users can't access it offline
3. Your users will have to use a lot of data from their mobile network operator.
4. Your inference will be slower
And since SeeingAI runs inference every second, your and your customers' bills will be huge.
That's what I thought, but the link doesn't say anything about off-device inference, it's only about storing and retrieving the model. There's just one off-hand note about cloud inference.
In any case, yeah, you can avoid downloading the model to the device at all, but then you have to deal with the other angle - making sure the endpoint isn't abused.
Maybe a hybrid approach would work - infer just part of the model (some layers?) in the cloud, and then carry on the inference on the device? I'm not familiar with exactly how AI models look and work, but I feel like hiding even a tiny portion of the model would make it not usable in practice
Your second note is very interesting; having looked at the model myself, this is very plausible.
For models which use a lot of input nodes and a lot of "hidden layers", and in the end just perform a softmax, this may get infeasible because of the amount of data you would have to transfer.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?
Exactly. Put another way, TensorFlow is not an AI. You can build an AI in TensorFlow. You can also resize images in TensorFlow (using the traditional algorithms, not AI). I am not an expert, but as I understand it, it is common for vision models to require a fixed resolution input, and it is common for that resolution to be quite low due to resource constraints.
And similarly, translating those sentences into data points is still a derivative work, like transcribing music and then making a new recording is still derivative.
But as things are, the megacorps are training their LLMs on the commons while asserting "intellectual property" rights on the resulting weights. So, fuck them, and cheers to those who try to do something about this state of affairs.
Neither am I, yet I am still capable of reproducing copyrighted works to a level that most would describe as illegal.
> And before you knee-jerk "it's a compression algo!"
It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.
> I invite you to archive all your data with an LLMs "compression algo".
As long as we agree it is _my data_ and not yours.
> It's lossy compression, the same way a JPEG might be
Compression yes, but this is co-mingling as well. The entire corpus is compressed together, which identifies common patterns, and in the model they are essentially now overlapping.
The original document is represented statistically in the final model, but you’ve lost the ability to extract it closely. Instead you gain the ability to generate something statistically similar to a large number of original documents that are related or are structurally similar.
I’m just commenting, not disputing any argument about fair use.
and therefore everyone has the necessary rights to read works, the necessary rights to critique the works, including for commercial purposes, and the necessary rights to make derivative works, including for commercial purposes
You’re applying a double standard to LLM’s and human creators. Any human writer or artist or filmmaker or musician will be influenced by other people’s works, even while those works are still under copyright.
as a human being, and one that does music stuff, i don't download terabytes of other people's works from the internet directly into my brain. i don't have verbatim reproductions of people's work sitting around on a hard disk in my stomach/lungs/head/feet.
LLMs are not humans. They’re essentially a probabilistic compression algorithm (encode data into model weights/decode with prompt to retrieve data).
Do you ever listen to music? Is your music ever influenced by the music that you listen to? How do you imagine that works, in an information-theoretical sense, that fundamentally differs from an LLM?
Depending on how much music you've listened to, you very well may have "downloaded terabytes" of it into your brain. Your argument is specious.
Information on how large language models are trained is not hard to come by, there are numerous articles that cover this material. Even a brief skimming of this material will make it clear that the training of large language models is materially different in almost every way from how human beings "learn" and build knowledge. There are still many open questions around the process of how humans collect, store, retrieve and synthesize information.
There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data, the quality of output degrades greatly when novel input is provided. Is your argument that people fundamentally function in the same way? That would be a bold and novel assertion!
> There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data
If this were true, then you would be able to identify the specific work being "parroted" and you'd have a case for copyright infringement regardless of whether it was produced by an LLM at all. This isn't how LLMs work though. For instance, if an LLM's training data includes the complete works of a given author and then you prompt the LLM to write a story in the style of that author, it will actually write an original story instead of reproducing one of the stories in its training corpus. It won't be particularly good but it will be an original work.
It also isn't obvious whether or not, or to what degree, LLM training works differently from human learning. You yourself acknowledged that there are "many open questions" about how human learning works, so how can you be so confident that it's fundamentally different? It doesn't matter anyway because you can still apply the exact same standards to LLM output to judge whether it infringes copyright that you would to something that was produced by a human being.
some of that money that i pay to apple goes to the rights holders of that music for the copying and performance of their work through my speakers.
that’s a pretty big difference to how most LLMs are trained right there! i actually pay original creators some money.
-
i am a human being. you cannot reduce me down to some easy information theory.
an LLM is a tool. an algorithm. with the same random seed etc etc it will get the same results. it is not human.
you put me in the same room as yesterday i’ll behave completely differently.
-
i have listened to way more than terabytes of music in my life. doesn’t mean i have the ability to regurgitate any of it verbatim though. i’m crap at that stuff.
I don't see how this is a double standard. A person interacting with their culture is not comparable in any way. IMHO, it's kind of a wacky argument to make.
Can you elaborate on how it's not comparable? It seems obvious to me that it is -- they both learn and then create -- so what's the difference?
If I can hire an employee who draws on knowledge they learned from copyrighted textbooks, why can't I hire an AI which draws on knowledge it learned from copyrighted textbooks? What makes that argument "wacky" in your eyes?
It has never been argued that copyright law should apply to information that people learn, whether that be from reading books or newspapers, watching television, or appreciating art like paintings or photographs.
Unlike a person, a large language model is a product built by a company and sold by a company. While I am not a lawyer, I believe much of the copyright argument around LLM training revolves around the idea that copyrighted content should be licensed by the company training the LLM. In much the same way that people are not allowed to scrape the content of the New York Times website and then pass it off as their own content, so should OpenAI be barred from scraping the New York Times website to train ChatGPT and then selling the service without providing some dollars back to the New York Times.
You're not going to get an answer you find agreeable, because you're hoping for an answer that allows you to continue to treat the tool as chattel, without conferring to it the excess baggage of being an individuated entity/laborer.
You're either going to get: it's a technological, infinitely scalable process, and the training data should be considered what it is, which is intellectual property that should be licensed before being used.
...or... It actually is the same as human learning, and it's time we started loading these things up with other baggage to be attached to persons if we're going to accept it's possible for a machine to learn like a human.
There isn't a reasonable middle ground due to the magnitude of social disruption a chattel quasi-human technological human replacement would cause.
Chattel, as I'm using it, is in reference to the usage distinguishing an "ownable piece of property" from an employee.
Namely, a magic, technologically reproducible box that can be applied almost as effectively as a human hireling, but isn't a human hireling, is near infinitely more desirable in a capitalist system, since the black box is chattel and the hired human is not. The chattel has no natural rights, no claim to self sovereignty, and is an asset that is legally extant by virtue of the fact it is owned by the owner.
Chattel that are flexible enough to sidestep the legal burdens incurred by hiring a human to do the same job will naturally be converged upon, due to the capitalistic optimization function of minimizing unit input cost for output, over dollars and potential dollars as expressed through legal exposure.
Imagine you had two human-like populations. One made of plastic that aren't considered humans but property, i.e. are chattel. Then you have a bunch of people, with all the baggage that comes with it.
Hiring people/employing people is hard. Particularly in the U.S. and other jurisdictions where a great deal of responsibility for actually implementing regulations/ taxation/immigration and such is tacked onto being an employer/being able to hire.
As the gap between the capability of the chattel population and the human population closes, the more economic and workload sense it makes for the system to improve the chattel population under our current optimization strategy (given no pre-emptive work to cut off externality dumping). Humans are messy and complicated to work with. Often unpredictable. Chattel are easy to account for, especially when combined with "technical restraints". You have to fundamentally engage in negotiation with another human being to get them on board with working for you. You buy the chattel, and that's that. The chattel has no grounds to refuse service. Socially speaking, we don't even recognize its outputs as carrying any social weight, or resistance as anything but malfunctions.
Economics is the science around using access to resources as a means to get other people to work with you. Being chattel means you can cut out entirely all that complexity. You are resource. Not people.
Unironically, we need to have an answer to whether or not we are going to consider a sufficiently complex function imitator as something that requires a classification above "chattel" or controls around how we apply it in order to not self-destruct the economic equilibria in which we purport to exist. Because all it takes is removing or sufficiently obstructing the flow of value down from individuals who accrete the most of these wunder-chattel to render things so top heavy, most of the constraints/invariants of our socioeconomic systems as we know them become invalidated.
No, you’re missing the point of copyright. The point of copyright is to protect an exclusive right to copy, not the right to produce original works influenced by previous works. If an LLM produces original works that are influenced by the training data, that is not a violation of copyright. If it reproduces the training data verbatim, it is.
> The point of copyright is to protect an exclusive right to copy, not the right to produce original works influenced by previous works.
As I understand it, the definition of "the right to produce original works influenced by previous works" has been a slowly moving target in my lifetime. Think about the effects of the album Paul's Boutique by the Beastie Boys. They went wild with sampling and paid very little (zero?) to license those samples. Then there were a bunch of court cases in the US that decided that future samplers needed to license the samples from the original authors. However, the ability to create legal, derivative works is usually carefully defined in copyright law. Can you comment on this matter vis-à-vis LLMs?
> If an LLM produces original works that are influenced by the training data, that is not a violation of copyright.
I'm pretty sure if an LLM creates Paul's Boutique 2.0 in 2025 using incredible number of samples, then someone cannot sell it (or use it in a YouTube video) without first licensing those samples. I doubt very much someone could just "hide behind" an LLM and claim, "Oh, it is original, but derivative, work, created by an LLM." I doubt courts would allow that.
> I'm pretty sure if an LLM creates Paul's Boutique 2.0 in 2025 using incredible number of samples, then someone cannot sell it (or use it in a YouTube video) without first licensing those samples. I doubt very much someone could just "hide behind" an LLM and claim, "Oh, it is original, but derivative, work, created by an LLM." I doubt courts would allow that.
This isn’t how LLM’s work though. Samples are just that, literal samples that are copied from one work to another verbatim. LLM’s use training data to construct a predictive model of which tokens follow each other. You probably could get an LLM to use samples deliberately if you wanted to, but this isn’t how they typically work.
Regardless, at that point you’re just evaluating the claim of copyright infringement based on the nature of the work itself, which is exactly what I’m advocating, versus presuming that all LLM output is necessarily copyright infringement if any copyrighted material was used in training.
i weirdly agree with you, but also want to point out that “influenced by the training data” is doing some very heavy lifting there.
exactly how the new work is created is important when it comes to derivative works.
does it use a copy of the original work to create it, or a vague idea/memory of the original work’s composition?
when i make music it’s usually vague memories. i’d argue that LLMs have an encoded representation of the original work in their weights (along with all the other stuff).
but that’s the legal grey area bit. is the “mush” of model weights an encoded representation of works, or vague memories?
I don’t really think it matters because you can just compare the output to the input and apply the same standard, treating the process between the two as a black box.
As far as I’m concerned you are a black box. Just as I’m a black box from your perspective. In principle I could come over and vivisect your brain if you’d like, but I doubt you’d be interested, and I wouldn’t really want to incur the legal liability even if you were.
Besides, “black box” just means that your internal mental life and cognitive mechanism is opaque to me. It’s not like I’m calling you a p-zombie.
One is a collection of highly dithered data generated by machines, paid for by a business in order to financially gain from the copyrighted works and to replace any future need for copyrighted textbooks.
The other is a person learning from a copyrighted textbook in the legally protected manner, and for whom the textbook was written.
I don't think this question really makes any sense... In my opinion, it's kind of mish-mashing several things together.
"Can you elaborate on how it's not comparable?"
The process of individual people interacting with their culture is a vastly different process than that used to train large language models. In what ways to you think these processes have anything in common?
"It seems obvious to me that it is -- they both learn and then create -- so what's the difference?"
This doesn't seem obvious to me (obviously)! Maybe you can argue that an LLM "learns" during training, but that ceases once training is complete. For sure, there are work-arounds that meet certain goals (RAG, fine-tuning); maybe your already vague definition of "learning" could be stretched to include these? Still, comparing this to how people learn is pretty far-fetched. AFAICT, there's no literature supporting the view that there's any commonality here; if you have some I would be very interested to read it. :-)
Do they both create? I suspect not; an LLM is parroting back data from its training set. We've seen many studies showing that tested LLMs perform poorly on novel problem sets. This article was posted just this week:
The court is still out on the copyright issue, for the perspective of US law we'll have to wait on this one. Still, it's clear that an LLM can't "create" in any meaningful way.
And so on and so forth. How is hiring an employee at all similar to subscribing to an OpenAI ChatGPT plan? Wacky indeed!
Obviously, on the inside, the process that a person goes through in learning and creating, and the process that a LLM goes through in learning and creating, is very different. Nobody will dispute that.
But if they're learning from the same kinds of materials, and producing the same kind of output, then obviously the comparison can be made. And your idea that LLM's don't create seems obviously false.
So I have to conclude the two seem comparable, and someone would have to show why different legal principles around copyright ought to apply, when it's a simple question of input/output. Why should it matter if it's a human or algorithm doing the processing, from a copyright perspective? Nothing "wacky" about the question at all.
Human creators don't store that 'influence' in a digital, machine-accessible format generated directly from the copyrighted content though.
Although with the 'good news everyone, we built the torment nexus' trajectory of AI, my guess is at this point AI companies would just incorporate actual human brains instead of digital storage if that was the requirement.
Does that imply that if we invent brain upload technology, my weights carry every conflicting license and patent for everything I can quote or create? I don't like that precedent. I have complete rights over my noggin's contents. If I do quote a NYT article in its entirety, that could be infringement, but not copying my brain itself.
Your argument boils down to “we don’t know how brains work”, and it is a non-sequitur. It isn’t a violation of copyright law to create original works under the creative influence of works still under copyright.
The moment you earn money from it, that's not fair use anymore. When I last checked, unlimited access to said models was not free, plus it's not "research" anymore.
- Addenda -
For the interested parties, the law states the following [0].
Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
So, if you say that these factors can be flexed depending on the defendant, and can be just waved away to protect the wealthy, then it becomes something else; but given these factors, and how damaging this "fair use" is, I can certainly say that training AI models on a copyrighted corpus is not fair use in any way.
Of course, at the end of the day, IANAL & IANAJ. However, my moral compass directly bars the use of a copyrighted corpus in publicly accessible, for-profit models which deprive many people of their livelihoods.
From my perspective, people can whitewash AI training as they see fit to sleep sound at night, but this doesn't change anything from my PoV.
I really don't think it's that simple. I can read books and then earn money from applying what I learned in them. I can also study art and then make original art in the same or similar styles. If a person was doing this there would be no one claiming copyright infringement. The only difference is it's a machine doing it and not a person.
The nature of copyright and plagiarism boils down to paraphrasing, and so long as LLMs sufficiently paraphrase the content it's an open question whether it's copyright infringement and requires new law/precedent.
So the fact they are earning money is a red herring unless they are reproducing the exact same content without paraphrasing (with exception to commentary). E.g. they can quote part of a work while commenting on it.
Where they have gotten into trouble with e.g. NYT afaik is when the LLM reproduced a whole article word for word. I think they have all tried hard to prevent the LLM from ever doing that to avoid that legal risk.
> I can read books and then earn money from applying what I learned in them.
How many books can you read, understand and memorize in time T, and how many books can an AI ingest in the same time T?
If we're down to paraphrasing, watch this video [1], and think again.
Many models, given that you ask the correct questions, reproduce their training set with great accuracy, and this is only prevented with monkey patching, IIUC.
So, it's still a big mess, even if we don't add copyrighted corpus to the mix. Oh, BTW, datasets like "The Stack" are not clean as they claim. I have seen at least two non-permissively licensed code repositories inside that dataset.
I agree it's a big mess, that was kind of my point.
I am curious about the video, but am not compelled to spend 24 min watching it when you haven't summarized its thesis for me. The title of the video makes it seem adjacent at best to the points I was making. (Some automated flagging system =/= actual law)
I would be more nuanced on this matter. As I understand, in the US, fair use allows media to write critiques of cultural artefacts (sorry, I cannot think of a better, broad term). For example, you can include small quotes from the film script when writing a critique of it without requiring permission from the owner of the copyright. And, until the World Wide Web arrived to the masses in the mid-1990s, most critiques were published by commercial media outlets, such as a daily newspaper. They were certainly published by commercial, for-profit entities. That said, I think the intent of the fair use is very important to the courts, much more than the entity that is doing the fair use (newspaper, blogger, etc.).
Another weird carve-out for copyright law in the US: parody. Honestly, I don't know if other jurisdictions allow parody in the same protected manner.
Can you clarify this a bit. I presume you are talking about the tone more than the implied statement.
If the last sentence were explicit rather than implied, for instance
This article seems to be serving the growing prejudice against AI
Is that better? It is still likely to be controversial and the accuracy debatable, but it is at least sincere and could be the start of a reasonable conversation, provided the responders behave accordingly.
I would like people to talk about controversial things here if they do so in a considerate manner.
I'd also like to personally acknowledge how much work you do to defuse situations on HN. You represent an excellent example of how to behave. Even when the people you are talking to assume bad faith you hold your composure.
... Because if he did this with a model that's not open that's sure going to keep everyone happy and not result in lawsuit(s)...
The same method/strategy applies to closed tools and models too, although you should probably be careful if you've handed over a credit card for a decryption key to a service and try this ;)
Please don't cross into personal attack or otherwise break the site guidelines when posting here. Your post would be fine with just the first sentence.
I know it feels that way, but people's perceptions of each other online are so distorted that this is just a recipe for massive conflict. That's off topic on HN because it isn't interesting.
I'm not referring to people's perceptions. Some people write with clearly inflated self worth built into their arguments. If writing style isn't related to rules of writing then we're just welcoming chaos through the back door.
If we're at the point of defending people's literacy as a society then we've fallen into the Orwellian trap of goodspeak.
I'm not insulting people; I'm making a demonstrable statement that most people post with a view that they are always correct online. I see it from undergrad work too, and it gets shot down there as well for being either just wrong, or pretentious and wrong.
Not allowing people's egos to get a needed correction is a bad thing.
Using demonstrable right/wrong conversations as a stick to grind other axes however is unacceptable in any context.
People should always approach a topic with an "I am wrong" mindset and work backwards to establish that they're not, but almost nobody does, instead wading in with "my trusted source X knows better than you", which is tantamount to "my holy book Y says you should..." Anti-intellectualism at its finest.
> Some people write with clearly inflated self worth built into their arguments.
That's the kind of perception I'm talking about. I can tell you for sure, after all the years I've been doing this job, that such perceptions are anything but clear. They feel clear because that interpretation matches your priors, but such a feeling is not reliable, and when people use it as a basis for strongly-worded comments (e.g. "taking down a peg"), the result is conflict.
Sorry, I don't follow. How do you arrive at that implication? Why would someone having a pecuniary interest in something necessarily make them insincere?
Yes nothing wrong with cool software or showing people how to use it for useful things.
Sorry, I'm just kind of sick of the whole 'kool aid', 'rage against AI' thing a lot of people seem to have going on, and the way it's presented in the post. I have family members with vision impairment helped by this particular app, so it's a bit personal.
Nothing against opening stuff up and understanding how it works etc. I'd just rather see people build/train useful new models and stuff with the open datasets / models already available.
I guess AI kind of does pay my bills in a round about way.
Sadly companies will hoard datasets and model research in the name of competitive advantage. Obviously with this specific model Microsoft chose to make it open, but this is not always the case, and it's not uncommon to read papers or technical reports saying they trained on an "internal dataset"
Companies do have a lot of data, and some of that data might be useful for training AI, but >99% isn't. When companies do release a cool model or paper that doesn't have open data (as you point out, for competitive or other reasons, privacy etc.), people can then help build/collect similar open datasets. Unfortunately, companies generally don't owe you their data, and if they are in the business of making models they probably won't share the model either; the situation is similar to source code for proprietary LoB applications. But fortunately the best AI researchers mostly do like to share their knowledge, and because companies want to attract the best AI researchers they seem to generally allow researchers to publish if it's not too commercially sensitive. It could be worse: while the competitive situation has reduced some visibility into the cutting-edge science, lots of datasets and papers are still published.
In my view there was almost nothing like that in this article, besides the first sentence it went right into the technical stuff, which I liked. Compared to a lot of articles linked here it felt almost free from the battles between "AI" fashions.
It seems dang thinks I mistreated you somehow, if you agree I'm sorry, it wasn't my intention.
“ Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner.”
That’s not true, is it? It would be a copyright violation to distribute an extracted model, but you can do what you want with it yourself.
I'm not even sure if even the first part is true. Has it been determined that AI models are intellectual property? Machine generated content may not be copyrightable. It isn't just the output of generative AI that falls under this; the models themselves do too.
Can you copyright a set of coefficients for a formula? In the sense of a JPEG, it would be considered that the image being reproduced is the thing that has the copyright. Being the first to run the calculations that produce a compressed version of that data should not grant you any special rights to that compressed form.
An AI model is just a form of that writ large. When the models generalize and create new content, it seems hard to see how either the output or the model that generated it could be considered someone's property.
People possess models, I'm not sure if they own them.
There are however billions of dollars at play here and enough money can buy you whichever legal opinion you want.
If companies train on data they don't own and expect to own their model weights, that's hypocritical.
Model weights shouldn't be copyrightable if the training data was pilfered.
But this hasn't been tested because models are locked away in data centers as trade secrets. There's no opportunity to observe or copy them outside of using their outputs as synthetic data.
On that subject, training on model outputs should be fair use, and an area we should use legislation to defend access to (similar to web scraping provisions).
> If companies train on data they don't own and expect to own their model weights, that's hypocritical.
It's not hypocritical to follow a line of legal analysis which holds that copying material in the course of training AI on it is outside the scope of copyright protection (as, e.g., fair use in the US), but that the model weights resulting from the training are protected by copyright.
It may be wrong, and it may be convenient for the interests of the firms involved, but it is not self-inconsistent in the way required for it to be hypocrisy.
If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.
Educated human beings are not protected by copyright, hence neither should trained AI models. Conversely, if a copyrightable work is produced based on work which itself is copyrighted, the resulting work needs the consent of the original authors of the prior work.
> If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.
No one training (foundation) models makes that fair use argument by analogy; they make arguments that address the specific statutory and case law criteria for fair use (and frequently focus on the transformative character of the use). It's true that the analogy to a learning human is frequently made in internet fora by AI enthusiasts who aren't the people training models on vast scraped datasets. That argument is bunk for a number of reasons, but most critically because a human learning from material isn't fair use to begin with: a human brain isn't treated as a fixed medium, so learning in a human brain isn't legally a copy or derivative work that would violate copyright without the fair use exception, so it's not a use to which fair use analysis even applies, so you can't argue anything is fair use by analogy to that. But it's moot to any argument for hypocrisy by the big model makers, because they aren't using that argument to start with.
If I take 1000 books and count the distributions of the lengths of the words, and the covariance between the lengths of one word and the next word for each book, and how much this covariance matrix tends to vary across the different books, and other things like this, and publish these summaries, it seems fairly clear to me that this should count as fair use.
(Such a model/statistical-summary, along with a dictionary, could be used to generate nonsensical texts which have similar patterns in terms of just word lengths.)
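(Concretely, the kind of statistical summary I mean could be as simple as the following sketch; the function and variable names are purely illustrative, not from any real library:)

    # A toy version of the summary described above: per-book word-length
    # distributions plus the covariance between the lengths of adjacent words.
    from collections import Counter
    import numpy as np

    def word_length_summary(book_text: str):
        lengths = [len(w) for w in book_text.split()]
        distribution = Counter(lengths)                       # word-length histogram
        pairs = np.array(list(zip(lengths[:-1], lengths[1:])), dtype=float)
        covariance = np.cov(pairs.T) if len(pairs) > 1 else None  # 2x2: length vs. next length
        return distribution, covariance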
Should the resulting work be protected by copyright? I’m not entirely sure…
I guess one thing is, the specific numbers I obtain by doing this are not a consequence of any creative decision making on my part, which I think in some jurisdictions (I don’t remember which) plays a role in whether a work is copyrightable (I will use “copyrightable” as an abbreviation for “protected by copyright”. I don’t mean to imply a requirement that someone specifically registers for copyright.). (Iirc this makes it so phone books are copyrightable in some jurisdictions but not others?)
The particular choice of statistical analysis does seem like it may involve creative decision making, but that would just be about like, what analysis I describe, and how the numbers I publish are to be interpreted, not what the numbers are? (Analogous to the source code of an ML model, not the parameters.)
Here is another question: suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, but requires a large (expensive) amount of compute to produce, and which also uses a lot of randomness so that the result would be different each time it was done (but suppose also that there isn’t much point doing it multiple times at the same scale, as having two of this kind of data artifact wouldn’t be much more valuable than having one).
Should such data artifacts be protected by copyright or something like it?
Well, if copyright requires creative human decision making, then they wouldn’t be.
It seems like it would make sense to want it to be economically incentivized to create such data artifacts of larger sizes (to a point, of course - only as much as is justified by the value produced by their being available).
If such data artifacts can always be distributed without restriction, then ones that are publicly available would be public goods, and I guess only ones that are trade secrets would be private goods? It seems to me like having some mechanism to incentivize their creation and being-eventually-freely-distributed would be beneficial?
But maybe copyright isn’t the best way to do that? Idk.
> The particular choice of statistical analysis does seem like it may involve creative decision making
The selection and structuring of the training set may involve sufficient creativity to be copyrightable (as demonstrated by the existence of “compilation copyrights”), even if it is largely or even entirely composed of existing works, the statistical analysis part doesn't have to be the source of the creativity.
'Should the resulting work be protected by copyright? I’m not entirely sure…'
This has already been settled, hasn't it? Don't companies have to introduce 'flaws' in order for data sets to be 'protected'? Just compiled lists of facts can't be protected. Which is why things like election result companies have to rely on NDAs and not copyright protections to protect their services on election night.
> This has already been settled hasn't it? Don't companies have to introduce 'flaws' in order for data sets to be 'protected'?
No, flaws are generally introduced to make it easier to detect copies; if multiple flawless reference works covering the same data (road maps of the same region, for instance) exist, each is copyrightable without flaws to the extent any would be with flaws, but you can't prove that someone else copied yours without permission if copying any of the others would give the same result. With flaws, you can attribute the source that was copied more easily, but this isn't about being legally protected but about the practicality of enforcing that protection.
> suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, [...] It seems like it would make sense to want it to be economically incentivized to create such data artifacts of higher sizes [...] But maybe copyright isn’t the best way to do that? Idk.
The model weights are the result of an automated process, by definition, and thus not protected by copyright.
In my unusually-well-informed-on-copyright but not-a-lawyer opinion, without any new legislation on the subject, I suspect that the most likely scenario for intellectual property rights surrounding AI is that using other people's works for training probably falls under fair use, since it's extremely transformative (an AI that makes text and a textual work are very different things), and it's extremely difficult to argue that the AI, as it exists today, directly impacts the value of the original work.
The list of what training data to use is probably protected by copyright if hand-picked; otherwise it's just whatever web crawler they wrote to gather it.
The AI models, as in, the inference and training applications are protected by copyright, like any other application.
The architecture of a particular AI model can be protected by patents.
The weights, as the result of an automated process, are probably not protected by copyright.
> The model weights are the result of an automated process, by definition, and thus not protected by copyright.
Object code is the result of an automated process and is covered by the copyright on the source code.
Compilations are covered by copyright separate from that of the individual works, and it is arguable that a training set would be covered by a compilation copyright, and that the result of applying an automated training process to it would remain covered by that copyright.
I think it is fair to say that existing copyright law was not written to handle all of this. It was written for people who created works, and for other people who were using those works.
To substitute either party with a computer system and assume that the existing law still makes sense may be assuming too much.
There are certainly publicly available weights with restrictive licenses (eg some of the StableDiffusion stuff). I’d agree that it’d seem fairly perverse to say “our process for making this by slurping in a ton of copyright content was not copyright theft, but your use of it outside our restrictive license is”, but then I’m not a lawyer.
Now that you mention it, I'm quite surprised that none of the typical fanatical IP lawsuiters had sued arguing (reasonably I think) that the output of the LLMs is strongly suggestive that they have been trained on copyrighted materials. Get the lawsuit to discovery, and those data centers become fair game.
> Now that you mention it, I'm quite surprised that none of the typical fanatical IP lawsuiters had sued arguing (reasonably I think) that the output of the LLMs is strongly suggestive that they have been trained on copyrighted materials. Get the lawsuit to discovery, and those data centers become fair game.
No, there have been lawsuits, and the data centers have not been fair game because whether or not the models were trained on copyright-protected works is not generally in dispute. Discovery only applies to evidence relevant to facts in dispute.
> strongly suggestive that they have been trained on copyrighted materials
Given that everything -- including this comment -- is copyrighted unless it is (1) old or (2) deliberately put into the public domain, this is almost certainly true.
Isn’t this comment in the public domain? I presume that’s what I’m doing when I’m posting on a forum. If somebody copied and pasted something I wrote on here could I in theory use copyright law to restrict distribution? I think the law would say I published it on a public forum and thus it is in the public domain.
Why would it be in the public domain? Anything you create, under US copyright law, is the opposite of being in the public domain, it's yours. According to the legalese of YC, you are granting YC and YC alone a license to use the UGC you submitted to their website, but if anything, the YC agreement DEMANDS that you own the copyright to the comment you are posting.
> User Content Transmitted Through the Site: With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed. However, please review the Privacy Policy located here for more information on how we treat information included in applications submitted to us.
> You acknowledge and agree that any questions, comments, suggestions, ideas, feedback or other information about the Site (“Submissions”) provided by you to Y Combinator are non-confidential and Y Combinator will be entitled to the unrestricted use and dissemination of these Submissions for any purpose, without acknowledgment or compensation to you.
Another example of this is people putting code, intended to be shared, up on e.g. Github without a licence.
Many people seem to think that no licence = public domain, but it's still under strong copyright protection. This is the point of things like the Unlicense license.
>models are locked away in data centers as trade secrets
The architecture and the weights in a model are the secret process used to make a commercially valuable output. It makes the most sense to treat them as a trade secret, in a court of law.
The weights are a product of a mechanical process; 5 years ago it would have been generally uncontroversial that they would not be subject to copyright in the US... but 'industry' has done a tremendous job of spreading confusion.
Going a step further, weights, i.e. coefficients, aren't produced by a person at all – they're produced by machine algorithms. Because a human did not create the weights, the weights have no author. Thus they are ineligible for copyright in the first place and are in the public domain. Whether the model architecture is copyrightable is more of an open question, but a solid argument could be made that the model architecture is simply a mathematical expression (albeit a complex one), though Python or other source code is almost certainly copyrighted. But I imagine clean-room methods could avoid problems there, and with much less effort than most software.
IANAL, but I have serious doubts about the applicability of current copyright law to existing AI models. I imagine the courts will decide the same.
Each compiled executable has a one-to-one relation with its source code, which has an author (except for LLM code and/or infinite monkeys). Thus compiled executables are derivative works.
There is an argument also that LLMs are derivative works of the training data, which I'm somewhat sympathetic to, though clearly there's a difference and lots of ambiguity about which contributions to which weights correspond to any particular source work.
Again IANAL, and this is my opinion based on reading the law & precedents. Consult a real copyright attorney for real advice.
Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies. Copyright covers the right to make copies, not the right to distribute; "doing what you want with it yourself" may or may not be covered by fair use. Whether or not model weights are copyrightable remains an open question.
Actually, in terms of copyright control "The Federal Circuit went on to clarify the nature of the DMCA's anti-circumvention provisions. The DMCA established causes of action for liability and did not establish a property right. Therefore, circumvention is not infringement in itself."[1]
>Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies. Copyright covers the right to make copies, not the right to distribute; "doing what you want with it yourself" may or may not be covered by fair use. Whether or not model weights are copyrightable remains an open question.
If that is the law, it is a defect that we need to fix.
Laws do not come down from heaven in the form of commandments.
We, humans, write laws.
If there is a defect in the laws, we should fix it.
If this is the law, time shifting and format shifting are unlawful as well, which to me is unacceptable.
DMCA 1201 is written so broadly that any feature of a product or service can be construed to prevent copying, and thus gain 1201 protection.
I don't think YouTube intended regular uploads to have DRM, if only because they support Creative Commons metadata on uploads, and Creative Commons specifically forbids the use of technical protection measures on CC-licensed content[0]. On a less moralistic note, applying encryption to all YouTube videos would be prohibitively expensive because DRM vendors charge $$$ for the tech.
But the RIAA wants DRM because, well, they don't want people taking what they have rightfully stolen. So YouTube engineered a weak form of URL obfuscation that would only stop very basic scrapers[1]. DMCA 1201 doesn't care about encryption or obfuscation, though. What it does care about is if something was intended to stop copying, and if so, if the defendant's product was designed to defeat that thing.
There's an interesting wrinkle in DMCA 1201 in that merely being able to defeat DRM does not make something illegal. Defeating DRM has to be the tool's only function[2], or you have to advertise the tool as being able to defeat DRM[3], in order to actually violate DMCA 1201. DRM vendors usually resort to encryption, because it makes the circumvention tools specialized enough that they have no other purpose and thus fall afoul of DMCA 1201. But there's nothing stopping you from using really basic schemes (ROT-13 your DVDs!) and still getting to sue for 1201.
Going back to the AI ripping question, this blog post is probably not in and of itself a circumvention tool[4], but anyone implementing it is very much making circumvention tools, which are illegal to distribute. Circumvention itself is also illegal, but only when there's an underlying copyright infringement. i.e. you can't just encrypt something that's public domain or uncopyrightable and sue anyone who decrypts it.
So the next question is: is AI copyrightable? And can you sue for 1201 circumvention for something that is fundamentally composed of someone else's copyrighted work that you don't own and haven't licensed?
[0] Additionally, there is a very large repository of CC-BY music from Kevin MacLeod that is used all over YouTube that would have to be removed or relicensed if the RIAA were to prevail on this case.
I have no idea if Kevin actually intends to enforce the no-DRM clause in this way, though. Kevin actually has a fairly loose interpretation of CC-BY. For example, nobody attributes his music correctly, either the way the license requires, or with Kevin's (legally insufficient) recommended attribution strings. He does sell commercial (non-attribution) licenses but I've yet to hear of any enforcement actions from him.
[1] To be clear, without DRM encryption, any video can be ripped by hooking standard HTML5 video APIs using an extension.
[2] Things with "limited commercial purposes" beyond breaking DRM may also be construed as circumvention tools under DMCA 1201.
[3] My favorite example: someone tried selling a VGA-to-composite adapter as a way to copy movies off Netflix. That is illegal under DMCA 1201.
[4] To be clear, this is NOT settled law, this is "get sued and find out if the Supreme Court likes you that day" law.
Your comment confused me, but I'm very interested in what you're getting at.
> Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies.
Yep, this is the DMCA section 1201. Late '90s law in the US.
> Copyright covers the right to make copies, not the right to distribute
This is where I got confused. Copyright covers four rights: copying, distribution, creation of derivative works, and public performance. So I'm not sure what you were getting at with the copy/distribute dichotomy.
But here's a question I'm curious about: Can DMCA apply to a copy-protection mechanism that's being applied to non-copyrightable work? Based on my reading of https://www.copyright.gov/dmca/:
> First, it prohibits circumventing technological protection measures (or TPMs) used by copyright owners to control access to their works.
That's not the letter of the law, but an overview, but it does seem to suggest you can't bring a DMCA 1201 claim against someone circumventing copy-protection for uncopyrightable works.
> Whether or not model weights are copyrightable remains an open question.
And this is where the interaction with the wording of 1201 gets interesting, in my (non-professional) opinion!
> No person shall circumvent a technological measure that effectively controls access to a work protected under this title.
The inclusion of “work protected under this title” makes it clear in the law, though I doubt a judge would rule otherwise without that line. (Otherwise, I’d wonder if I could claim damages that Google et al. are violating the technological measures I’ve put in place to protect the specificity of my interests, because it wouldn’t matter that such is not protected by copyright law.)
> (A) to “circumvent a technological measure” means to descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner
Right, that’s what I was getting at with my parenthetical. Obviously the work has to have an owned copyright in order to be protected by copyright law.
If you mean that you might be able to decrypt a copyrighted work because you used that same encryption method on a non-copyrighted work, then definitely not. The work under protection will be considered. (Otherwise, I am unsure what you meant.)
From what I recall, it was the actual protection method that was protected by DMCA - when DVD protection was cracked it was forbidden to distribute a particular section of code so they just printed it on a Tee-shirt to troll the powers that be.
> Outside the Internet and the mass media, the key has appeared in or on T-shirts, poetry, songs and music videos, illustrations and other graphic artworks, tattoos and body art, and comic strips.
Using the encryption key to decrypt the data on a DVD is illegal “circumvention” per DMCA 1201, if it’s done without authorization from the copyright owner of the data on the DVD. If it were really illegal to simply publish the key on a website, then printing it on clothing that they sold instead would not be a viable loophole.
I’m glad it is still referred to as a controversy that they were issuing cease and desist letters for publishing information when the actual crime they had in mind, which was not alleged in the letters, is using the information to decrypt a DVD.
Publishing the key is a crime, but even “discovering” the key is a crime. My toy thought is that you could legally do key discovery using non-copyrighted media, though of course now that I think about it, why would it be ciphered in that case? LOL
Just imagine, just for a second, how it becomes illegal to train anything that does not then produce, if publicly used or distributed, a copyright token which is both in the training set (to mark it) and in the output (to recognize it).
So this is where it all goes in several years, if I were the gov.
No, copyright violation occurs at the first unauthorized copying or creation of a derivative work or exercise of any of the other exclusive rights of the copyright holder (that does not fall into an exception like that for fair use.) That distribution is required for a copyright violation is a persistent myth. Distribution is a means by which a violation becomes more likely to be detected and also more likely to involve significant liability for damages.
(OTOH, whether models, as the output of a mechanical process, are subject to copyright is a matter of some debate. The firms training models tend to treat the models as if they were protected by copyright but also tend to treat the source works as if copying for the purpose of training AI were within a copyright exception; why each of those positions is in their interest is obvious, but neither is well-established.)
It's also worth noting that there is still no legal clarity on these issues, even if a license claims to provide specific permissions.
Additionally, the debate around the sources companies use to train their models remains unresolved, raising ethical and legal questions about data ownership and consent.
I doubt the models are copyrighted; aren’t works created by machine not eligible? Otherwise you get into cases like autogenerating and claiming ownership of all possible musical note combinations.
It’s hard to say because this stuff hasn’t been definitively tested in any courts that I know of. Europe, not America.
AI models are generally regarded as a company’s asset (like a customer database would also be), and rightly so given the cost required to generate one. But that’s a different matter entirely to copyright.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
If weights and biases contained in "AI models" are proprietary, then for one model owner to detect infringement by another model owner, it may be necessary to download and extract.
This is cool, but only the first part in extracting a ML model for usage. The second part is reverse engineering the tokenizer and input transformations that are needed to before passing the data to the model, and outputting a human readable format.
Would be interesting if someone could detail the approach to decode the pre-post processing steps before it enters the model, and how to find the correct input encoding.
Boils down to "use Frida to find the arguments to the TensorFlow call beyond the model file"
Key here is, a binary model is just a bag-of-floats with primitively typed inputs and outputs.
It's ~impossible to write up more than what's here because either:
A) you understand reverse engineering and model basics, and thus the current content is clear you'd use Frida to figure out how the arguments are passed to TensorFlow
or
B) you don't understand this is a binary reverse engineering problem, even when shown Frida. If more content was provided, you'd see it as specific to a particular problem. Which it has to be. You'd also need a walkthrough by hand about batching, tokenization, so on and so forth, too much for a write up, and it'd be too confusing to follow for another model.
TL;Dr a request for more content is asking for a reverse engineering article to give you a full education on modal inference
> It's ~impossible to write up more than what's here
Except you just did - or at least you wrote an outline for it, which is 80% of the value already.
The more impolite version of this basically says "If you can't figure out you're supposed to also use Frida to check the other arguments, you have no business trying." I agree, though, wrote a more polite version.
> TL;Dr a request for more content is asking for a reverse engineering article to give you a full education on modal inference
I don't understand what you mean: I have no clue about anything related to reverse engineering, but I ported the mistral tokenizer to Rust and also wrote a basic CPU Llama training and inference implementation in Rust, so I definitely wouldn't need an intro to model inference…
You're also not the person I'm replying to, nor do you appear in any of this comment chain, so I've definitely not implied you need an intro to inference, so I'm even more confused than you :)
This is a good comment, but only in the sense it documents a model file doesn't run the model by itself.
An analogous situation is seeing a blog that purports to "show you code", and the code returns an object, and commenting "This is cool, but doesn't show you how to turn a function return value into a human readable format" More noise, than signal.
The techniques in the article are trivially understood to also apply to discovering the input tokenization format, and Netron shows you the types of inputs and outputs.
Thanks for the article OP, really fascinating.
Just having the shape of the input and output are not sufficient, the image (in this example) needs to be normalized. It's presumably not difficult to find the exact numbers, but it is a source of errors when reverse engineering a ML model.
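To make that concrete, here is roughly what such preprocessing might look like for a TFLite image model. This is a minimal sketch: the 320x320 size, the divide-by-255 normalization, and the file names are assumptions for illustration, and the real constants (range, channel order, mean/std) are exactly what you would have to recover, e.g. by hooking the call with Frida.

```python
import numpy as np
from PIL import Image
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]   # shape/dtype, also visible in Netron

# Assumed preprocessing: resize to the model's input size and scale to [0, 1].
# The real app might use [-1, 1], per-channel mean/std, BGR order, etc.
img = Image.open("photo.jpg").resize((320, 320))
x = np.asarray(img, dtype=np.float32) / 255.0
x = np.expand_dims(x, axis=0)              # add the batch dimension

interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
out = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
```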
Right, you get it: it's a Frida problem.
If you can't fix this with a little help from chatgpt or Google you shouldn't be building the models frankly let alone mucking with other people's...
Lot of comments here seem to think that there’s no novelty. I disagree. As a new ML engineer I am not very familiar with any reverse engineering techniques and this is a good starting point. Something about ML yet it’s simple enough to follow, and my 17yr old cousin who is ambitious to start cyber security would love this article. Maybe its too advanced for him!
Thanks a lot :)
My general writing style is directed mainly towards my non-technical colleagues, which I wish to inspire to learn about computers.
This is no novelty, by far, it is a pretty standard use-case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just dont have it yet".
Especially in mobile apps, many devs get sloppy on their mobile APIs because you can't just open the developer tools.
I'm a mobile developer and I'm new to using Frida and other such tools. Do you have any tips or reading material on how to use things like Frida?
I think you are starting off from the perfect direction, being a forward-engineer first, and then a reverse-engineer.
The community around Frida is a) a bit small and b) a bit unorganized/shadowy. You cannot find that many resources, at least I have not found them.
I would suggest using Objection: explore an app, enumerate the classes with "android hooking list classes" or "android hooking search classes", then dynamically watch and unwatch them. That is the quickest way to start; when you start developing your own scripts you can always check out code at https://codeshare.frida.re/.
For everything else join the Frida Telegram chat, most knowledge sits there, I am also there feel free to reach out to @altayakkus
Oh and btw, I would start with Android, even though iOS is fun too, and I would really really suggest getting a rooted phone/emulator. For the Android Studio Emulator you can use rootAVD (GitHub), then just install Magisk Frida. Installing the Frida gadget into APKs is a mess which you won't miss when you go root.
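To make the first step a bit more concrete, here is a minimal sketch of driving Frida from its Python bindings against a rooted device or emulator running frida-server. The package and class names are made up for illustration; in practice you would substitute whatever Objection's class enumeration turns up.

```python
import frida

# Hypothetical target app and class -- replace with what
# "android hooking list classes" / "search classes" shows you.
PACKAGE = "com.example.targetapp"

JS = """
Java.perform(function () {
  var Clazz = Java.use("com.example.targetapp.ml.Classifier");
  Clazz.classify.overload('[B').implementation = function (bytes) {
    console.log("classify() called with " + bytes.length + " bytes");
    return this.classify(bytes);   // call the original implementation
  };
});
"""

device = frida.get_usb_device()
session = device.attach(PACKAGE)          # app must already be running (or use device.spawn)
script = session.create_script(JS)
script.on("message", lambda msg, data: print(msg))
script.load()
input("Hook installed, press Enter to exit...")
```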
One thing I noticed in Gboard is that it uses homeomorphic encryption to do federated learning of common words used among the public, to make encrypted suggestions.
E.g. there are two common spellings of bizarre which are popular on Gboard: bizzare and bizarre.
Can something similar help in model encryption?
Had to look it up, this seems to be the paper https://research.google/pubs/federated-learning-for-mobile-k...
they have a very "interesting" definition of private data in the paper. it's so outlandish that if you buy their definition, there's zero value in the trained data. heh.
they also claim unsupervised users typing away is better than tagged training data, which explains the wild grammar suggestions in the top comment. guess the age of quantity over quality is finally peaking.
in the end it's the same as grammarly but without any verification of the interested data, and calling the collection of user data "federation"
actually letting users type whatever they want is good because there are many dialects of english: chinglish, thailish, singlish, hinglish and so on.
they have made the system so general that it can handle any quirk users throw at it.
Author here. No clue about homeomorphic (or whatever) encryption; what could certainly be done is some sort of encryption of the model tied into the inference engine.
So e.g.: Apple CoreML issues a public key, the model is encrypted with that public key, and somewhere in a trusted computing environment the model is decrypted using a private key and then used for inference.
They should of course use multiple keypairs etc. but in the end this is just another obstacle in your way. When you own the device, root it or even gain JTAG access to it, you can access and control everything.
And matrix multiplication is a computationally expensive process, so I guess they won't add some sort of encryption technique to each and every cycle.
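For illustration, here is a rough sketch of that kind of scheme in Python, using hybrid public-key encryption with the `cryptography` package. Nothing here is CoreML-specific, and it demonstrates the point above: the decrypted model still has to exist somewhere at inference time.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Device keypair -- in the real scheme the private key would live inside
# a trusted execution environment and never be exposed to the OS.
device_priv = rsa.generate_private_key(public_exponent=65537, key_size=2048)
device_pub = device_priv.public_key()

# Vendor side: encrypt the model with a fresh symmetric key,
# then wrap that key with the device's public key.
model_bytes = open("model.tflite", "rb").read()
sym_key = Fernet.generate_key()
encrypted_model = Fernet(sym_key).encrypt(model_bytes)
wrapped_key = device_pub.encrypt(sym_key, oaep)

# Device side: unwrap the key and decrypt the model just before inference.
# This decrypted blob is exactly what a hook (Frida, JTAG, ...) can grab.
plain_model = Fernet(device_priv.decrypt(wrapped_key, oaep)).decrypt(encrypted_model)
assert plain_model == model_bytes
```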
In principle, device manufacturers could make hardware DRM work for ML models.
You usually run inference for those on GPUs anyway, and GPUs usually have some kind of hardware DRM support for video already.
The way hardware DRM works is that you pass some encrypted content to the GPU and get a blob containing the content key from somewhere, encrypted in a way that only this GPU can decrypt. This way, even if the OS is fully compromised, it never sees the decrypted content.
But then you could compromise the GPU, probably :)
Look at the bootloader, can you open a console?
If not, can you desolder the flash and read the key?
If not, can you access the bootloader when the flash is not detected anymore?
...
Can you solder off the capacitors and glitch the power line, to do a [Voltage Fault Injection](https://www.synacktiv.com/en/publications/how-to-voltage-fau...)?
Can you solder a shunt resistor to the power line, observe the fluctuations and do [Power analysis](https://en.wikipedia.org/wiki/Power_analysis)?
There are a lot of doors and every time someone closes them a window remains tilted.
Any company serious about building silicon that has keys wouldn't just be storing them in flash.
Try getting a private key off a TPM. There have been novel attacks, but they are few and far between.
Try getting a key from Apple's trusted enclave (or whatever buzz-word they call it).
You're right about the TPM, I won't get the key out of it. It's a special ASIC which doesn't even have the silicon gates to give me the key.
But is the TPM doing matrix multiplication at 1.3 petaflops?
Or are you just sending the encrypted file to the TPM, getting the unencrypted file back from it, which I can intercept, be it on SPI or by gaining higher privileges on the core itself? Just like with this app but down lower?
Whatever core executes the multiplications will be vulnerable one way or another, to a motivated attacker with the proper resources. This is true for every hardware device, but the attack vector of someone jailbreaking a Nintendo Switch by using an electron microscope and an ion-beam miller is negligible.
If you are that paranoid about AI models being stolen, they are presumably worth it, so some attacker will have enough motivation to power through.
Stealing the private key out of a GPU, which allows you to steal a lot of valuable AI models, is break-once-break-everywhere.
Apple's trusted enclave is also just a TPM with different branding, or maybe an HSM, dunno.
I'll concede you are correct that whether the key is extractable or not doesn't really matter if the GPU will eventually need to store the decrypted model in memory.
However, if NVidia or similar was serious about securing these models, I'd be pretty sure they could integrate the crypto in hardware multipliers / etc such that the model doesn't need to be decrypted anywhere in memory.
But at this point there isn't much value in deploying models to the edge. Particularly the type of models they would really want to protect as they are too large.
The types of models deployed to edge devices (like the Apple ones) are generally quite small and frankly not too difficult (computationally) to reimplement.
Homomorphic, not homeomorphic
`enc(coffee cup) == enc(donut)` would be an interesting guarantee.
In theory yes, in practice right now no. Homomorphic encryption is too computationally expensive.
I’m a huge fan of ML on device. It’s a big improvement in privacy for the user. That said, there’s always a chance for the user to extract your model, so on-device models will need to be fairly generic.
Maybe someday we will build a society where standing on the shoulders of giants is encouraged, even when they haven't been dead for 100 years yet.
this would be yellow in https://en.wikipedia.org/wiki/Spiral_Dynamics but we are still a mix of orange and green.
pretty cool; that frida tool seems really nice. https://frida.re/docs/home/
(and a bunch of people seem to be interested in the "IP" note, but I took as, just trying to not get run into legal trouble for advertising "here's how you can 'steal' models!")
frida is an amazing tool - it has empowered me to do things that would have otherwise taken weeks or even months. This video is a little old, but the creator is also cracked https://www.youtube.com/watch?v=CLpW1tZCblo
It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.
EDIT: This isn't really an attack imo. If you are going to take "secrets" and shove it into a mobile app, they can't really be considered secret. I suppose it's a tradeoff - if you want to do this kind of thing client-side - the secret sauce isn't so secret.
> Keep in mind that AI models [...] are considered intellectual property
Is it ironic or missing a /s? I can't really tell here.
To be honest, that was my first thought on reading that headline as well. Given that especially the large companies (who knows how the smaller ones got their training data) got a huge amount of backlash for their unprecedented collection of data all over the web, and not just there but everywhere else, it's kinda ironic to talk about intellectual property.
If you use one of those AI models as a basis for your own AI model, the real danger could be that the owners of the originating data go after you at some point as well.
Standard corporate hypocrisy. "Rules for thee, not for me."
If you actually expected anything to be open about OpenAI's products, please get in touch, I have an incredible business opportunity for you in the form of a bridge in New York.
They got backlash, but (if I'm not mistaken) it was ruled that it's okay to use copyrighted works in your model.
So if a model is copyrighted, you should still be able to use it if you generate a different one based on it. I.e. copyright laundry. I assume this would be similar to how fonts work. You can copyright a font file, but not the actual shapes. So if you re-encode the shapes with different points, that's legal.
But, I don't think a model can be copyrighted. Isn't it the case that something created mechanically can't be copyrighted? It has to be authored by a person.
I find it weird that so many hackers go out of their way to approve of the legal claims of Big AI before it's even settled, instead of undermining Big AI. Isn't the hacker ethos all about decentralization?
Standard disclaimer. Like inserting a bunch of 'hypothetically' in a comment telling one where to find some piece of abandoned media where using an unsanctioned channel would entail infringing upon someone's intellectual property.
Hey, author here.
I understand that it's not very clear whether the neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaks GPT-4o, it isn't magically public domain and everyone can just use it. I think lawmakers would start to sue AWS if they just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.
But it is also a "cover my ass" notice as others said, I live in Germany and our law regarding "hacking" is quite ancient.
For now, it is better to assume it is the truth.
The simple fact that models are released under a license, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.
It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.
I think it's all of the above. It's:
1. the current, unproven-in-court legal understanding,
2. a standard disclaimer to cover OP's ass, and
3. a tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data.
> reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data
Prevalent or not, phrased this way it's clear how nonsensical it is. The data isn't hurt or destroyed in the process of being trained on, nor does the process deprive the data owners of their data or of the opportunity to monetize it the way they ordinarily would.
The right terms here are "learning from", "taking inspiration from", not "being a parasite".
(Now, feeling entitled to rent because someone invented something useful and your work accidentally turned out to be useful, infinitesimally, in making it happen - now that is wanting to be a parasite on society.)
I think the bad part of it is stripping consent from the original creators after they published their work. I personally see it as an unfortunate side effect of change. The artists of the future can create with AI already in mind, but this was not the privilege of the current and previous generations.
Getting back to "learning from", I think the issue is not the learning part, but the recreation part. AI can churn out content at rates orders of magnitude higher than before, even compared to the age of Fiverr and similar tools and opportunities. This changes the dynamics of the interaction, because previously it took someone tens of hours to create something; now it takes AI minutes. That is not participating in the same playing field, it's absolutely dominating it, completely changing it. That is something to have feelings about, especially if one's livelihood is also impacted. Data is not destroyed, and neither is its ownership, but people don't usually want the exact thing, they are content with a good enough thing, and this takes away a lot of power from the artists, whose work is the lifeblood of artistic AI in the first place.
So I don't think it's as nonsense as you state it. But I do understand that it's not cut and dry the other way around either. Gatekeeping culture is definitely not a humane thing to do. Culture comes and goes, intermingles, inspires and changes all the time, and people take from it and add to it all the time. Preserving copyright perfectly would neuter it, and slant the landscape even more towards the already powerful.
If I understand the position of major players in this field, downloading models in bulk and training a ML model on that corpus shouldn't violate anybody's IP.
IANAL, but this is not true; the model would be a piece of the software. If there is a copyright on the app itself, it would extend to the model. Even models have licenses; for example, LLaMA is released under this license [1]
[1] https://github.com/meta-llama/llama/blob/main/LICENSE
The fact that model creators assert that they are protected by copyright and offer licenses does not mean:
(1) That they are actually protected by copyright in the first place, or
(2) That the particular act described does not fall into an exception to copyright like fair use (exactly as many model creators assert the exact same act does when done with the materials their models are trained on), rendering the restrictions of the offered license moot for that purpose.
LLMs are trained on works -- software, graphics and text -- covered by my copyright. What's the difference?
The difference is that you pulling out a model is you potentially violating copyright, while the model itself being trained on copyrighted works is potentially them violating copyrights.
I.e. the first one concerns you, the other is none of your business.
Them potentially violating my copyrights is very much my business. But you're right, the difference is how much the respective parties have to spend on legal battles.
Simply showing up to court wearing a tshirt that says "what she said" probably wouldn't fly, but I like to imagine that any arguments made by them about their copyrights would be equally true of my copyrights.
At this point I'm mostly wondering if "you ripped me off first" is a viable legal defense to copyright battles where it's unclear if either party is distributing the works of the other. One thing is for sure though, if I were to do this as an individual, the discovery process would be much more expensive for them than me.
If I understand the position of major players in this field, copyright itself is optional (for them at least).
True, I think there has to be a case that sets precedent for this issue.
They claim “safe harbour” - if nobody complains it’s fair game
Is there a material difference between the copyright laws for software and the copyright laws for images and text?
Yeah no.
An example for legal reference might be convolution reverb. Basically it's a way to record what a fancy reverb machine does (using copyrighted complex math algorithms) and cheaply recreate the reverb on my computer. It seems like companies can do this as long as they distribute protected reverbs separately from the commercial application. So Liquidsonics (https://www.liquidsonics.com/software/) sells reverb software but offers for free download the 'protected' convolution reverbs, specifically the Bricasti ones in dispute (https://www.liquidsonics.com/fusion-ir/reverberate-3/)
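For the curious, the technique itself is just convolving the dry signal with a recorded impulse response. A minimal sketch with SciPy, where the file names are placeholders and mono WAV files are assumed:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Impulse response captured from the hardware unit, plus a dry recording.
rate, impulse_response = wavfile.read("hall_impulse.wav")   # placeholder file
_, dry = wavfile.read("vocal_take.wav")                     # placeholder file

# Convolution reverb: the wet signal is the dry signal convolved with the IR.
wet = fftconvolve(dry.astype(np.float64), impulse_response.astype(np.float64))
wet /= np.abs(wet).max()                                     # avoid clipping

wavfile.write("vocal_take_wet.wav", rate, (wet * 32767).astype(np.int16))
```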
Also, while a SQL server can be copyright protected, that does not by extension give the SQL server software's creators copyright protection/ownership over a SQL database.
There's an interesting research paper from a few years ago that extracted models from Android apps on a large scale: https://impillar.github.io/files/ccs2022advdroid.pdf
That's pretty cool! I am impressed by the Frida tool, especially to read in the binary and dump it to disk by overwriting the native method.
The author only mentions APK for Android, but what about iOS IPA? Is there an alternative method for handling that archive?
Yeah, you can basically just unzip IPA files. Getting hold of them is hard though; I have a pathway if you are interested.
But the Objective C code is actually compiled, and decompilation is a lot harder than with the JVM languages on Android.
My next article will be about CoreML on iOS, doing the same exact thing :)
Can you launder an AI model by feeding it to some other model or training process? After all, that is how it was originally created. So it cannot be any less legal...
There are a family of techniques, often called something like “distillation”. There are also various synthetic training data strategies, it’s a very active area of research.
As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.
The weights are my training data. I scraped them from the internet
That sentiment is ethically sound and logically robust and directionally consistent with any uniform application of the law as written.
But there is a group of people, growing daily in influence, who utterly reject such principles as either worthy or useful. This group of people is defined by the ego necessary to conclude that when the stakes are this high, the decisions should be made by them, that the ends justify the means on arbitrary antisocial behavior (c.f. the behavior of their scrapers) as long as this quasi-religious orgasm of singularity is steered by the firm hand that is willing and able to see it through.
That doesn’t distress me: L Ron Hubbard has that.
It distresses me that HN as a community refuses to stand up to these people.
To some extent this is how many models are being produced today.
Basically it's just a synthetic loop of using a previously developed SOTA (or once-SOTA) model like GPT-4 to train your model.
This can produce models with seemingly similar performance at a smaller size, but to some extent, less bits will be less good.
Excellent introduction to some cool tools I wasn't aware of!
For app developers considering tflite, a safer way would be to host the models on firebase and delete them when their job is done. It comes with other features like versioning for model updates, A/B tests, lower apk size etc. https://firebase.google.com/docs/ml/manage-hosted-models
That wouldn't help against the technique explained in the article, would it? Since the model makes its way onto the device, it can be intercepted in a similar fashion.
I'm not quite sure I understand the firebase feature btw. From the docs, it's pretty much file storage with a dedicated API? I suppose you can use those models for inference in the cloud, but still, the storage API seems redundant.
I think the comment author means offering inference via Firebase, with the model never leaving the backend.
This works, just like ChatGPT works, but has the downsides that:
1. You have to pay the computing for every inference.
2. Your users can't access it offline.
3. Your users will have to use a lot of data from their mobile network operator.
4. Your inference will be slower.
And since SeeingAI runs inference on the model every second, your and your customers' bills will be huge.
That's what I thought, but the link doesn't say anything about off-device inference, it's only about storing and retrieving the model. There's just one off-hand note about cloud inference.
In any case, yeah, you can avoid downloading the model to the device at all, but then you have to deal with the other angle: making sure the endpoint isn't abused.
Maybe a hybrid approach would work: infer just part of the model (layers?) in the cloud, and then carry on the inference on the device? I'm not familiar with exactly how AI models look and work, but I feel like hiding even a tiny portion of the model would make it not usable in practice.
Your second note is very interesting; having looked at the model myself, this is very plausible.
For models which use a lot of input nodes and a lot of "hidden layers", and in the end just perform a softmax, this may get infeasible because of the amount of data you would have to transfer.
You may have inspired a second article :)
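For what it's worth, a minimal sketch of that split-inference idea with a toy Keras model. The layer sizes are made up; a real model would need the cut point chosen carefully, and, as noted above, the intermediate activations can easily be larger than the raw input, which is the data-transfer problem.

```python
import tensorflow as tf

# Toy model standing in for the real one.
full = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# "Cloud" half: keep the first layer(s) on the server only.
head = tf.keras.Model(full.input, full.layers[0].output)

# "Device" half: rebuild the remaining layers on a fresh input tensor.
intermediate = tf.keras.Input(shape=head.output_shape[1:])
x = intermediate
for layer in full.layers[1:]:
    x = layer(x)
tail = tf.keras.Model(intermediate, x)

# Inference: device sends the image up, gets activations back, finishes locally.
image = tf.random.uniform((1, 224, 224, 3))
activations = head(image)       # runs server-side
prediction = tail(activations)  # runs on-device
```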
In addition to the sibling comment this would require repeatedly re-downloading models when you want to use them, which sucks.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?
See, e.g., https://news.ycombinator.com/item?id=42617889
Can anyone explain that resize_to_320.tflite file? Surely they aren't using an AI model to resize images? Right?
tflite files can contain a ResizeOp that resizes the image: https://ai.google.dev/edge/api/tflite/java/org/tensorflow/li...
The file is only 7.7kb, so it couldn't contain many weights anyways.
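Easy enough to check, by the way; the TFLite Python interpreter (or Netron) will show what such a file actually expects and produces. A quick sketch:

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="resize_to_320.tflite")
interpreter.allocate_tensors()

# For a pure resize graph there are essentially no weights to speak of,
# which matches the tiny 7.7 kB file; Netron shows the op graph itself.
print(interpreter.get_input_details())
print(interpreter.get_output_details())
```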
Exactly. Put another way, tensorflow is not an AI. You can build an AI in tensorflow. You can also resize images in tensorflow (using the traditional algorithms, not AI). I am not an expert, but as I understand, it is common for vision models to require a fixed resolution input, and it is common for that resolution to be quite low due to resource constraints.
Probably not what you're alluding to, but AI upscaling of images is definitely a thing.
This was a great article and I really appreciate it!
You wouldn't train a LLM on a corpus containing copyrighted works without ensuring you had the necessary rights to the works, would you?
LLMs are not massive archives of data. They are a tiny fraction of a fraction of a percent of the size of their training set.
And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLMs "compression algo".
Copying a single sentence verbatim from a 1000 page book is still plagiarism.
And is technically copyright infringement outside fair use exceptions.
And similarly, translating those sentences into data points is still a derivative work, like transcribing music and then making a new recording is still derivative.
derivative works still tend to be copyright violations.
Yes, that's what I'm saying. An LLM washing machine doesn't get rid of the copyright.
It doesn't matter. It's still a derived work.
Well what isn’t in this world?
Would Einstein have been possible without Newton?
I'm fine with us ditching copyright altogether.
But as things are, the megacorps are training their LLMs on the commons while asserting "intellectual property" rights on the resulting weights. So, fuck them, and cheers to those who try to do something about this state of affairs.
Newton was public domain by Einstein's time.
Indeed. Copyright was introduced in 1710, Principia was published in 1687.
and even with our current copyright laws providing for long dated protection, it would have still been in public domain
It's hard to say what the current laws actually imply. Steamboat Willie was originally meant to be in the public domain in 1955. Got there in 2024.
> LLMs are not massive archives of data.
Neither am I, yet I am still capable of reproducing copyrighted works to a level that most would describe as illegal.
> And before you knee-jerk "it's a compression algo!"
It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.
> I invite you to archive all your data with an LLMs "compression algo".
As long as we agree it is _my data_ and not yours.
> It's lossy compression, the same way a JPEG might be
Compression yes, but this is co-mingling as well. The entire corpus is compressed together, which identifies common patterns, and in the model they are essentially now overlapping.
The original document is represented statistically in the final model, but you’ve lost the ability to extract it closely. Instead you gain the ability to generate something statistically similar to a large number of original documents that are related or are structurally similar.
I’m just commenting, not disputing any argument about fair use.
You wouldn't read a book and teach others its lessons without a derived license, would you?
When I was at school, we were sometimes all sat down in front of a TV to watch some movie on VHS tape (it was the 90s).
At the start of the tape, there was a copyright notice forbidding the VHS tape from being played at, amongst other places, schools.
Copyright rules are a strange thing.
copyright refers to the act of copying the material at hand (including distribution, reproduction, performance) etc.
as an example: saying “i really like james holden’s inheritors album for the rough and dissonant sounds” isn’t covered by copyright.
if i reproduced it verbatim using my mouth, or created a derived work which is noticeably similar to the original, that’s a different question though.
in your example, a derivative work would be akin to only quoting from the book for the audience and modifying a word of each quote.
“derived” works are always a grey area, especially around generative machine learning right now.
and therefore everyone has the necessary rights to read works, the necessary rights to critique the works, including for commercial purposes, and the necessary rights to derivative works, including for commercial purposes
You’re applying a double standard to LLM’s and human creators. Any human writer or artist or filmmaker or musician will be influenced by other people’s works, even while those works are still under copyright.
as a human being, and one that does music stuff, i don’t download terabytes of other peoples works from the internet directly into my brain. i don’t have verbatim reproductions of people’s work sitting around on a hard disk in my stomach/lungs/head/feet.
LLMs are not humans. They’re essentially a probabilistic compression algorithm (encode data into model weights/decode with prompt to retrieve data).
Do you ever listen to music? Is your music ever influenced by the music that you listen to? How do you imagine that works, in an information-theoretical sense, that fundamentally differs from an LLM?
Depending on how much music you've listened to, you very well may have "downloaded terabytes" of it into your brain. Your argument is specious.
Information on how large language models are trained is not hard to come by; there are numerous articles that cover this material. Even a brief skimming of it will make it clear that the training of large language models is materially different in almost every way from how human beings "learn" and build knowledge. There are still many open questions about how humans collect, store, retrieve and synthesize information.
There is little mystery to how large language models function, and it's clear that their output is parroting back portions of their training data; the quality of the output degrades greatly when novel input is provided. Is your argument that people fundamentally function in the same way? That would be a bold and novel assertion!
> There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data
If this were true, then you would be able to identify the specific work being "parroted" and you'd have a case for copyright infringement regardless of whether it was produced by an LLM at all. This isn't how LLMs work though. For instance, if an LLM's training data includes the complete works of a given author and then you prompt the LLM to write a story in the style of that author, it will actually write an original story instead of reproducing one of the stories in its training corpus. It won't be particularly good but it will be an original work.
It also isn't obvious whether or not, or to what degree, LLM training works differently from human learning. You yourself acknowledged that there are "many open questions" about how human learning works, so how can you be so confident that it's fundamentally different? It doesn't matter anyway because you can still apply the exact same standards to LLM output to judge whether it infringes copyright that you would to something that was produced by a human being.
i do listen to music.
i listen to it on apple music.
i pay money to apple for this.
some of that money that i pay to apple goes to the rights holders of that music for the copying and performance of their work through my speakers.
that’s a pretty big difference to how most LLMs are trained right there! i actually pay original creators some money.
-
i am a human being. you cannot reduce me down to some easy information theory.
an LLM is a tool. an algorithm. with the same random seed etc etc it will get the same results. it is not human.
you put me in the same room as yesterday i’ll behave completely differently.
-
i have listened to way more than terabytes of music in my life. doesn’t mean i have the ability to regurgitate any of it verbatim though. i’m crap at that stuff.
LLMs seem to be really good at it though.
I don't see how this is a double standard. A person interacting with their culture is not comparable in any way. IMHO, it's kind of a wacky argument to make.
Can you elaborate on how it's not comparable? It seems obvious to me that it is -- they both learn and then create -- so what's the difference?
If I can hire an employee who draws on knowledge they learned from copyrighted textbooks, why can't I hire an AI which draws on knowledge it learned from copyrighted textbooks? What makes that argument "wacky" in your eyes?
you're asking why you have to treat people differently than you treat tools and machines.
Well obviously not in general. But when it comes to copyright law specifically, yes absolutely. That is the question I'm asking.
It has never been argued that copyright law should apply to information that people learn, whether that be from reading books or newspapers, watching television, or appreciating art like paintings or photographs.
Unlike a person, a large language model is a product built by a company and sold by a company. While I am not a lawyer, I believe much of the copyright argument around LLM training revolves around the idea that copyrighted content should be licensed by the company training the LLM. In much the same way that people are not allowed to scrape the content of the New York Times website and then pass it off as their own content, so should OpenAI be barred from scraping the New York Times website to train ChatGPT and then selling the service without providing some dollars back to the New York Times.
You're not going to get an answer you find agreeable, because you're hoping for an answer that allows you to continue to treat the tool as chattel, without conferring to it the excess baggage of being an individuated entity/laborer.
You're either going to get: it's a technological, infinitely scalable process, and the training data should be considered what it is, which is intellectual property that should be being licensed before being used.
...or... It actually is the same as human learning, and it's time we started loading these things up with other baggage to be attached to persons if we're going to accept it's possible for a machine to learn like a human.
There isn't a reasonable middle ground due to the magnitude of social disruption a chattel quasi-human technological human replacement would cause.
Hi. I like this post. There are some careful thoughts here.
Can you help me to understand the term "chattel" as you used it? I never heard the term before I read your post, and I needed to Google for it: <<
(in general use) a personal possession.
(in law) an item of property other than freehold land, including tangible goods ( chattels personal ) and leasehold interests ( chattels real ). >>
Chattel, as I'm using it, is in reference to the usage distinguishing an "ownable piece of property" from an "employee".
Namely, a magic, technologically reproducible box that can be applied almost as effectively as a human hireling, but isn't a human hireling, is near-infinitely more desirable in a capitalist system, since the black box is chattel and the hired human is not. The chattel has no natural rights, no claim to self-sovereignty, and is an asset that is legally extant by virtue of the fact that it is owned by the owner.
Chattel that are flexible enough to replace humans, without the legal burdens incurred by hiring a human to do the same job, will naturally be converged upon, due to the capitalistic optimization function of minimizing unit input cost for output, over dollars and potential dollars as expressed through legal exposure.
Imagine you had two human-like populations. One made of plastic, which aren't considered humans but property, i.e. chattel. Then you have a bunch of people, with all the baggage that comes with that.
Hiring people/employing people is hard. Particularly in the U.S. and other jurisdictions where a great deal of responsibility for actually implementing regulations/ taxation/immigration and such is tacked onto being an employer/being able to hire.
The more the gap between the capability of the chattel population and the human population closes, the more economic and workload sense it makes for the system to improve the chattel population under our current optimization strategy (given no pre-emptive work to cut off externality dumping). Humans are messy and complicated to work with, often unpredictable. Chattel are easy to account for, especially when combined with "technical restraints". You have to fundamentally engage in negotiation with another human being to get them on board with working for you. You buy the chattel, and that's that. The chattel has no grounds to refuse service. Socially speaking, we don't even recognize its outputs as carrying any social weight, or its resistance as anything but malfunctions.
Economics is the science around using access to resources as a means to get other people to work with you. Being chattel means you can cut out entirely all that complexity. You are resource. Not people.
Unironically, we need to have an answer to whether or not we are going to consider a sufficiently complex function imitator as something that requires a classification above "chattel" or controls around how we apply it in order to not self-destruct the economic equilibria in which we purport to exist. Because all it takes is removing or sufficiently obstructing the flow of value down from individuals who accrete the most of these wunder-chattel to render things so top heavy, most of the constraints/invariants of our socioeconomic systems as we know them become invalidated.
That does not bode well for anyone.
Aren't animals a current example of a middle ground? They are incapable of authoring copyrightable works under current US law.
No, you’re missing the point of copyright. The point of copyright is to protect an exclusive right to copy, not the right to produce original works influenced by previous works. If an LLM produces original works that are influenced by the training data, that is not a violation of copyright. If it reproduces the training data verbatim, it is.
> I'm pretty sure if an LLM creates Paul's Boutique 2.0 in 2025 using incredible number of samples, then someone cannot sell it (or use it in a YouTube video) without first licensing those samples. I doubt very much someone could just "hide behind" an LLM and claim, "Oh, it is original, but derivative, work, created by an LLM." I doubt courts would allow that.
This isn’t how LLMs work though. Samples are just that: literal samples copied from one work to another verbatim. LLMs use training data to construct a predictive model of which tokens follow each other. You probably could get an LLM to use samples deliberately if you wanted to, but this isn’t how they typically work.
Regardless, at that point you’re just evaluating the claim of copyright infringement based on the nature of the work itself, which is exactly what I’m advocating, versus presuming that all LLM output is necessarily copyright infringement if any copyrighted material was used in training.
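To make the “predictive model of which tokens follow each other” point concrete, here is a deliberately toy sketch (my own illustration with made-up function names, nothing like a real transformer): it counts which token follows which in the training text and then samples continuations from those counts.

    import random
    from collections import Counter, defaultdict

    def train(text):
        # for each token, count how often each other token follows it
        tokens = text.split()
        model = defaultdict(Counter)
        for cur, nxt in zip(tokens, tokens[1:]):
            model[cur][nxt] += 1
        return model

    def generate(model, start, n=10):
        # walk the follower counts, sampling a continuation
        out = [start]
        for _ in range(n):
            followers = model.get(out[-1])
            if not followers:
                break
            out.append(random.choices(list(followers),
                                      weights=followers.values())[0])
        return " ".join(out)

Real models are vastly more sophisticated, but this is the rough sense in which “a predictive model of which tokens follow each other” is meant.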
i weirdly agree with you, but also want to point out that “influenced by the training data” is doing some very heavy lifting there.
exactly how the new work is created is important when it comes to derivative works.
does it use a copy of the original work to create it, or a vague idea/memory of the original work’s composition?
when i make music it’s usually vague memories. i’d argue that LLMs have an encoded representation of the original work in their weights (along with all the other stuff).
but that’s the legal grey area bit. is the “mush” of model weights an encoded representation of works, or vague memories?
I don’t really think it matters because you can just compare the output to the input and apply the same standard, treating the process between the two as a black box.
did you just call me a black box? :/
not sure how i feel about being reduced down to that as a human being.
As far as I’m concerned you are a black box. Just as I’m a black box from your perspective. In principle I could come over and vivisect your brain if you’d like, but I doubt you’d be interested, and I wouldn’t really want to incur the legal liability even if you were.
Besides, “black box” just means that your internal mental life and cognitive mechanism is opaque to me. It’s not like I’m calling you a p-zombie.
Also, even if an LLM generates an original work, the weights it used may still be a derived work.
One is a collection of highly dithered data generated by machines, paid for by a business in order to gain financially from the copyrighted works and to replace any future need for copyrighted textbooks.
The other is a person learning from a copyrighted textbook in the legally protected manner, and for whom the textbook was written.
I don't think this question really makes any sense... In my opinion, it's kind of mish-mashing several things together.
"Can you elaborate on how it's not comparable?"
The process of individual people interacting with their culture is a vastly different process than that used to train large language models. In what ways do you think these processes have anything in common?
"It seems obvious to me that it is -- they both learn and then create -- so what's the difference?"
This doesn't seem obvious to me (obviously)! Maybe you can argue that an LLM "learns" during training, but that ceases once training is complete. For sure, there are work-arounds that meet certain goals (RAG, fine-tuning); maybe your already vague definition of "learning" could be stretched to include these? Still, comparing this to how people learn is pretty far-fetched. AFAICT, there's no literature supporting the view that there's any commonality here; if you have some I would be very interested to read it. :-)
Do they both create? I suspect not; an LLM is parroting back data from its training set. We've seen many studies showing that tested LLMs perform poorly on novel problem sets. This article was posted just this week:
https://news.ycombinator.com/item?id=42565606
The jury is still out on the copyright issue; from the perspective of US law we'll have to wait on this one. Still, it's clear that an LLM can't "create" in any meaningful way.
And so on and so forth. How is hiring an employee at all similar to subscribing to an OpenAI ChatGPT plan? Wacky indeed!
Obviously, on the inside, the process that a person goes through in learning and creating, and the process that an LLM goes through in learning and creating, is very different. Nobody will dispute that.
But if they're learning from the same kinds of materials, and producing the same kind of output, then obviously the comparison can be made. And your idea that LLMs don't create seems obviously false.
So I have to conclude the two seem comparable, and someone would have to show why different legal principles around copyright ought to apply, when it's a simple question of input/output. Why should it matter if it's a human or algorithm doing the processing, from a copyright perspective? Nothing "wacky" about the question at all.
most probably your employee actually 'paid' for their textbook.
Unless you are making an argument for personhood, one is a machine, the other is a human. Different standards apply, end of discussion.
That's a little simplistic. You're almost trying to say black and white, sans gray, can't be compared, which is a bit weird.
Strangely like the situation itself.
The question just comes down to: how can we guarantee a model is influenced by an input rather than memorising it?
And then, is a human who is influenced simply relying on a faulty or less-than-perfect memory?
Human creators don't store that 'influence' in a digital machine accessible format generated directly from the copyrighted content though.
Although with the 'good news everyone, we built the torment nexus' trajectory of AI, my guess is at this point AI companies would just incorporate actual human brains instead of digital storage if that were the requirement.
Does that imply that if we invent brain-upload technology, my weights carry every conflicting license and patent for everything I can quote or create? I don't like that precedent. I have complete rights over my noggin's contents. If I do quote a NYT article in its entirety, that would be infringement, but copying my brain itself would not be.
Your argument boils down to “we don’t know how brains work”, and it is a non-sequitur. It isn’t a violation of copyright law to create original works under the creative influence of works still under copyright.
Fair use.
*only available in the USA, terms and conditions apply.
most other places use fair dealing which is more restrictive https://en.m.wikipedia.org/wiki/Fair_dealing
Easy to claim, harder to justify once you start charging money for your subsequent creation.
Unless all LLMs are a ruthless parody of human intelligence, which they may be, the legal issues will continue.
The moment you earn money from it, that's not fair use anymore. When I last checked, unlimited access to said models was not free, plus it's not "research" anymore.
- Addenda -
For the interested parties, the law states the following [0].
Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
So, if you say that these factors can be flexed depending on the defendant, and can simply be waved away to protect the wealthy, then it becomes something else; but given these factors, and how damaging this "fair use" is, I can certainly say that training AI models with a copyrighted corpus is not fair use in any way.
Of course, at the end of the day, IANAL & IANAJ. However, my moral compass flatly bars the use of copyrighted corpora in publicly accessible, for-profit models which deprive many people of their livelihoods.
From my perspective, people can whitewash AI training as they see fit to sleep sound at night, but this doesn't change anything from my PoV.
[0]: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors
I really don't think it's that simple. I can read books and then earn money from applying what I learned in them. I can also study art and then make original art in the same or similar styles. If a person was doing this there would be no one claiming copyright infringement. The only difference is it's a machine doing it and not a person.
The nature of copyright and plagiarism boils down to paraphrasing, and so long as LLMs sufficiently paraphrase the content it's an open question whether it's copyright infringement and requires new law/precedent.
So the fact they are earning money is a red herring unless they are reproducing the exact same content without paraphrasing (with exception to commentary). E.g. they can quote part of a work while commenting on it.
Where they have gotten into trouble with e.g. NYT afaik is when the LLM reproduced a whole article word for word. I think they have all tried hard to prevent the LLM from ever doing that to avoid that legal risk.
> I can read books and then earn money from applying what I learned in them.
How many books can you read, understand and memorize in time T, and how many books can an AI ingest in that same time T?
If we're down to paraphrasing, watch this video [1], and think again.
Many models, given that you ask the correct questions, reproduce their training set with great accuracy, and this is only prevented with monkey patching, IIUC.
So, it's still a big mess, even if we don't add copyrighted corpus to the mix. Oh, BTW, datasets like "The Stack" are not clean as they claim. I have seen at least two non-permissively licensed code repositories inside that dataset.
[1]: https://youtu.be/LrkAORPiaEA
I agree it's a big mess, that was kind of my point.
I am curious about the video, but am not compelled to spend 24 min watching it when you haven't summarized its thesis for me. The title of the video makes it seem adjacent at best to the points I was making. (Some automated flagging system =/= actual law)
"Making money" does not immediately invalidate fair use, but it does wave a big red flag in the courts' faces.
I would be more nuanced on this matter. As I understand it, in the US, fair use allows media to write critiques of cultural artefacts (sorry, I cannot think of a better, broad term). For example, you can include small quotes from a film script when writing a critique of it without requiring permission from the copyright owner. And, until the World Wide Web reached the masses in the mid-1990s, most critiques were published by commercial media outlets, such as daily newspapers. They were certainly published by commercial, for-profit entities. That said, I think the intent of the fair use matters far more to the courts than the kind of entity doing it (newspaper, blogger, etc.).
Another weird carve-out for copyright law in the US: parody. Honestly, I don't know if other jurisdictions allow parody in the same protected manner.
So you're saying that every law is a suggestion, depending on who's being tried?
Er, what? I'm speaking directly from the law, 17 U.S.C. § 107. It's deliberately written in terms of "factors to consider", rather than absolutes.
> In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:
> * the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
> * the nature of the copyrighted work;
> * the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
> * the effect of the use upon the potential market for or value of the copyrighted work.
You can absolutely monetize works altered under fair use.
Any examples sans current AI models? I have not seen any, or have failed to find any, to be precise.
Basically any YouTube video that shows another YouTube video, song, movie, etc. as part of something else (eg a voiceover.)
Well done, you seem to have liberated an open model trained on open data for blind and visually impaired people.
Paper: https://arxiv.org/pdf/2204.03738
Code: https://github.com/microsoft/banknote-net
Training data: https://raw.githubusercontent.com/microsoft/banknote-net/ref...
Model: https://github.com/microsoft/banknote-net/blob/main/models/b...
Kinda easier to download it straight from GitHub.
It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
But lets not let that get in the way of hating on AI shall we?
> But lets not let that get in the way of hating on AI shall we?
Can you please edit this kind of thing out of your HN comments? (This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.)
It leads to a downward spiral, as one can see in the progression to https://news.ycombinator.com/item?id=42604422 and https://news.ycombinator.com/item?id=42604728. That's what we're trying to avoid here.
Your post is informative and would be just fine without the last sentence (well, plus the snarky first two words).
Can you clarify this a bit? I presume you are talking about the tone more than the implied statement.
If the last sentence were explicit rather than implied, for instance
This article seems to be serving the growing prejudice against AI
Is that better? It is still likely to be controversial and the accuracy debatable, but it is at least sincere and could be the start of a reasonable conversation, provided the responders behave accordingly.
I would like people to talk about controversial things here if they do so in a considerate manner.
I'd also like to personally acknowledge how much work you do to defuse situations on HN. You represent an excellent example of how to behave. Even when the people you are talking to assume bad faith you hold your composure.
Sure, that would be better. It isn't snarky, and it makes fewer uncharitable assumptions.
I don't seem to be able to edit it; apologies, I will try not to let this type of thing get to me in the future.
I would also like to point out that this is a fine-tuned classifier vision model based on MobileNetV2 and not an LLM.
Don't you think it's intentional, so as not to demonstrate the technique on potentially copyrighted data?
Author here, it would be nice to claim that I did this on purpose but I really did not know it was open source.
I was rather interested in the process of instrumenting TF to make this "attack" scalable to other apps.
If this is exactly the same model then what's the point of encrypting it?
... Because if he did this with a model that's not open that's sure going to keep everyone happy and not result in lawsuit(s)...
The same method/strategy applies to closed tools and models too, although you should probably be careful if you've handed over a credit card for a decryption key to a service and try this ;)
[flagged]
Please don't cross into personal attack or otherwise break the site guidelines when posting here. Your post would be fine with just the first sentence.
https://news.ycombinator.com/newsguidelines.html
Really... Some people do need to be taken down a peg here at times though.
I know it feels that way, but people's perceptions of each other online are so distorted that this is just a recipe for massive conflict. That's off topic on HN because it isn't interesting.
I'm not referring to people's perceptions. Some people write with clearly inflated self worth built into their arguments. If writing style isn't related to rules of writing then we're just welcoming chaos through the back door.
If we're at the point of defending people's literacy as a society, then we've fallen into the Orwellian trap of goodspeak.
I'm not insulting people; I'm making a demonstrable statement that most people post with a view that they are always correct online. I see it from undergrad work too, and it gets shot down there as well for being either just wrong, or pretentious and wrong.
Not allowing people's egos to get a needed correction is a bad thing. Using demonstrable right/wrong conversations as a stick to grind other axes however is unacceptable in any context.
People should always approach a topic with an "I am wrong" stance and work backwards to establish that they're not, but almost nobody does, instead wading in with "my trusted source X knows better than you", which is tantamount to "my holy book Y says you should..." Anti-intellectualism at its finest.
> Some people write with clearly inflated self worth built into their arguments.
That's the kind of perception I'm talking about. I can tell you for sure, after all the years I've been doing this job, that such perceptions are anything but clear. They feel clear because that interpretation matches your priors, but such a feeling is not reliable, and when people use it as a basis for strongly-worded comments (e.g. "taking down a peg"), the result is conflict.
[flagged]
Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.
https://news.ycombinator.com/newsguidelines.html
[flagged]
I am groot
[flagged]
Can you please edit swipes out of your HN comments? Your post would be fine with just the first sentence.
This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.
What do you mean, "swipe"? The other person agreed they'd misjudged the article and apologised several hours before you wrote this.
"Does 'AI' pay your bills" was a gratuitous personal attack.
Is it? How? In your mind, does it imply some particular humiliation or something?
It's a variant of the "shill" argument, implying that the other person isn't posting in good faith.
Sorry, I don't follow. How do you arrive at that implication? Why would someone having a pecuniary interest in something necessarily make them insincere?
Yes nothing wrong with cool software or showing people how to use it for useful things.
Sorry, I'm just kind of sick of the whole 'kool aid', 'rage against AI' thing a lot of people seem to have going on and the way it's presented in the post. I have family members with vision impairment helped by this particular app, so it's a bit personal.
Nothing against opening stuff up and understanding how it works etc. I'd just rather see people build/train useful new models and stuff with the open datasets / models already available.
I guess AI kind of does pay my bills in a roundabout way.
Sadly companies will hoard datasets and model research in the name of competitive advantage. Obviously with this specific model Microsoft chose to make it open, but this is not always the case, and it's not uncommon to read papers or technical reports saying they trained on an "internal dataset"
Companies do have a lot of data, and some of that data might be useful for training AI, but >99% isn't. When companies do release a cool model or paper that doesn't have open data (as you point out, for competitive or other reasons, privacy, etc.), people can then help build/collect similar open datasets. Unfortunately, companies generally don't owe you their data, and if they are in the business of making models they probably won't share the model either; the situation is similar to source code for proprietary LoB applications. But fortunately the best AI researchers mostly do like to share their knowledge, and because companies want to attract the best AI researchers they seem to generally allow researchers to publish if it's not too commercially sensitive. It could be worse: while the competitive situation has reduced some visibility into the cutting-edge science, lots of datasets and papers are still published.
In my view there was almost nothing like that in this article, besides the first sentence it went right into the technical stuff, which I liked. Compared to a lot of articles linked here it felt almost free from the battles between "AI" fashions.
It seems dang thinks I mistreated you somehow; if you agree, I'm sorry, it wasn't my intention.
You're welcome to check out Sam Altman's January 5, 2025 blog post, “Reflections.”
https://web.powtain.com/pow/qao631
“ Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner.”
That’s not true, is it? It would be a copyright violation to distribute an extracted model, but you can do what you want with it yourself.
I'm not even sure if even the first part is true. Has it been determined that AI models are intellectual property? Machine-generated content may not be copyrightable, and it isn't just the output of generative AI that falls under this; the models themselves do too.
Can you copyright a set of coefficients for a formula? In the sense of a JPEG it would be considered that the image being reproduced is the thing that has the copyright. Being the first to run the calculations that produces a compressed version of that data should not grant you any special rights to that compressed form.
An AI model is just a form of that writ large. When the models generalize and create new content, it seems hard to see how either the output or the model that generated it could be considered someone's property.
People possess models, I'm not sure if they own them.
There are however billions of dollars at play here and enough money can buy you whichever legal opinion you want.
> AI models are intellectual property
If companies train on data they don't own and expect to own their model weights, that's hypocritical.
Model weights shouldn't be copyrightable if the training data was pilfered.
But this hasn't been tested because models are locked away in data centers as trade secrets. There's no opportunity to observe or copy them outside of using their outputs as synthetic data.
On that subject, training on model outputs should be fair use, and an area we should use legislation to defend access to (similar to web scraping provisions).
> If companies train on data they don't own and expect to own their model weights, that's hypocritical.
It's not hypocritical to follow a line of legal analysis which holds that copying material in the course of training AI on it is outside the scope of copyright protection (as, e.g., fair use in the US), but that the model weights resulting from the training are protected by copyright.
It may be wrong, and it may be convenient for the interests of the firms involved, but it is not self-inconsistent in the way required for it to be hypocrisy.
If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.
Educated human beings are not protected by copyright, hence neither should trained AI models. Conversely, if a copyrightable work is produced based on work which itself is copyrighted, the resulting work needs the consent of the original authors of the prior work.
AI models can't have their ©ake and eat it.
> If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.
No one training (foundation) models makes that fair use argument by analogy; they make arguments that address the specific statutory and case law criteria for fair use (and frequently focus on the transformative character of the use). It’s true that the analogy to a learning human is frequently made in internet fora by AI enthusiasts who aren’t the people training models on vast scraped datasets. That argument is bunk for a number of reasons, but most critically because a human learning from material isn’t fair use: a human brain isn’t treated as a fixed medium, so learning in a human brain isn’t legally a copy or derivative work that would violate copyright without the fair use exception, so it’s not a use to which fair use analysis even applies, and you can’t argue anything is fair use by analogy to it. But it’s moot to any argument for hypocrisy by the big model makers, because they aren’t using that argument to start with.
If I take 1000 books and count the distributions of the lengths of the words, and the covariance between the lengths of one word and the next word for each book, and how much this covariance matrix tends to vary across the different books, and other things like this, and publish these summaries, it seems fairly clear to me that this should count as fair use.
(Such a model/statistical-summary, along with a dictionary, could be used to generate nonsensical texts which have similar patterns in terms of just word lengths.)
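A minimal sketch of that word-length “model” (purely illustrative; the function names, and using length-transition counts rather than an actual covariance matrix, are my own simplifications): it records the distribution of word lengths and how one length tends to follow another, then babbles nonsense with matching length patterns from a supplied dictionary.

    import random
    from collections import Counter, defaultdict

    def length_stats(text):
        lengths = [len(w) for w in text.split()]
        dist = Counter(lengths)                # distribution of word lengths
        trans = defaultdict(Counter)           # length -> counts of the next length
        for a, b in zip(lengths, lengths[1:]):
            trans[a][b] += 1
        return dist, trans

    def babble(dist, trans, dictionary, n_words=20):
        by_len = defaultdict(list)
        for w in dictionary:
            by_len[len(w)].append(w)
        cur = random.choices(list(dist), weights=dist.values())[0]
        out = []
        for _ in range(n_words):
            words = by_len.get(cur) or random.choice(list(by_len.values()))
            out.append(random.choice(words))
            nxt = trans.get(cur) or dist       # fall back to the overall distribution
            cur = random.choices(list(nxt), weights=nxt.values())[0]
        return " ".join(out)

Nothing about the resulting numbers reflects a creative decision on the summarizer's part, which is the point taken up next.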
Should the resulting work be protected by copyright? I’m not entirely sure…
I guess one thing is, the specific numbers I obtain by doing this are not a consequence of any creative decision making on my part, which I think in some jurisdictions (I don’t remember which) plays a role in whether a work is copyrightable (I will use “copyrightable” as an abbreviation for “protected by copyright”. I don’t mean to imply a requirement that someone specifically registers for copyright.). (Iirc this makes it so phone books are copyrightable in some jurisdictions but not others?)
The particular choice of statistical analysis does seem like it may involve creative decision making, but that would just be about like, what analysis I describe, and how the numbers I publish are to be interpreted, not what the numbers are? (Analogous to the source code of an ML model, not the parameters.)
Here is another question: suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, but requires a large (expensive) amount of compute to produce, and which also uses a lot of randomness so that the result would be different each time it was done (but suppose also that there isn’t much point doing it multiple times at the same scale, as having two of this kind of data artifact wouldn’t be much more valuable than having one).
Should such data artifacts be protected by copyright or something like it?
Well, if copyright requires creative human decision making, then they wouldn’t be.
It seems like it would make sense to want it to be economically incentivized to create such data artifacts of higher sizes (to a point of course. Only as much as is justified by the value that is produced by them being available.) .
If such data artifacts can always be distributed without restriction, then ones that are publicly available would be public goods, and I guess only ones that are trade secrets would be private goods? It seems to me like having some mechanism to incentivize their creation and being-eventually-freely-distributed would be beneficial?
But maybe copyright isn’t the best way to do that? Idk.
> The particular choice of statistical analysis does seem like it may involve creative decision making
The selection and structuring of the training set may involve sufficient creativity to be copyrightable (as demonstrated by the existence of “compilation copyrights”), even if it is largely or even entirely composed of existing works, the statistical analysis part doesn't have to be the source of the creativity.
'Should the resulting work be protected by copyright? I’m not entirely sure…'
This has already been settled, hasn't it? Don't companies have to introduce 'flaws' in order for data sets to be 'protected'? Merely compiled lists of facts can't be protected, which is why election-results companies have to rely on NDAs rather than copyright protections to protect their services on election night.
> This has already been settled hasn't it? Don't companies have to introduce 'flaws' in order for data sets to be 'protected'?
No, flaws are generally introduced to make it easier to detect copies; if multiple flawless reference works covering the same data (road maps of the same region, for instance) exist, each is copyrightable without flaws to the extent any would be with flaws, but you can't prove that someone else copied yours without permission if copying any of the others would give the same result. With flaws, you can more easily attribute the source that was copied, but this isn't about being legally protected; it's about the practicality of enforcing that protection.
> suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, [...] It seems like it would make sense to want it to be economically incentivized to create such data artifacts of higher sizes [...] But maybe copyright isn’t the best way to do that? Idk.
Exactly. It would be patents, not copyright.
The model weights are the result of an automated process, by definition, and thus not protected by copyright.
In my unusually well-informed on copyright but not a lawyer opinion, without any new legislation on the subject, I suspect that the most likely scenario for intellectual property rights surrounding AI is that using other people's works for training probably falls under fair use, since it's extremely transformative (an AI that makes text and a textual work are very different things) and it's extremely difficult to argue that the AI, as it exists today, directly impacts the value of the original work.
The list of which training data to use is probably protected by copyright if hand-picked; otherwise, only whatever web crawler they wrote to gather it would be.
The AI models, as in, the inference and training applications are protected by copyright, like any other application.
The architecture of a particular AI model can be protected by patents.
The weights, as the result of an automated process, are probably not protected by copyright.
> The model weights are the result of an automated process, by definition, and thus not protected by copyright.
Object code is the result of an automated process and is covered by the copyright on the source code.
Compilations are covered by copyright separate from that of the individual works, and it is arguable that a training set would be covered by a compilation copyright, and that the result of applying an automated training process to it would remain covered by that copyright.
I think it is fair to say that existing copyright law was not written to handle all of this. It was written for people who created works, and for other people who were using those works.
To substitute either party with a computer system and assume that the existing law still makes sense may be assuming too much.
There are certainly publicly available weights with restrictive licenses (eg some of the StableDiffusion stuff). I’d agree that it’d seem fairly perverse to say “our process for making this by slurping in a ton of copyright content was not copyright theft, but your use of it outside our restrictive license is”, but then I’m not a lawyer.
Now that you mention it, I'm quite surprised that none of the typical fanatical IP lawsuiters had sued arguing (reasonably I think) that the output of the LLMs is strongly suggestive that they have been trained on copyrighted materials. Get the lawsuit to discovery, and those data centers become fair game.
Perhaps 'strongly suggestive' isn't enough.
Wasn't that the goal of both the New York Times lawsuit and other class action lawsuits from authors?
https://harvardlawreview.org/blog/2024/04/nyt-v-openai-the-t...
https://www.publishersweekly.com/pw/by-topic/industry-news/p...
> Now that you mention it, I'm quite surprised that none of the typical fanatical IP lawsuiters had sued arguing (reasonably I think) that the output of the LLMs is strongly suggestive that they have been trained on copyrighted materials. Get the lawsuit to discovery, and those data centers become fair game.
No, there have been lawsuits, and the data centers have not been fair game because whether or not the models were trained on copyright-protected works is not generally in dispute. Discovery only applies to evidence relevant to facts in dispute.
> strongly suggestive that they have been trained on copyrighted materials
Given that everything -- including this comment -- is copyrighted unless it is (1) old or (2) deliberately put into the public domain, this is almost certainly true.
Isn’t this comment in the public domain? I presume that’s what I’m doing when I’m posting on a forum. If somebody copied and pasted something I wrote on here could I in theory use copyright law to restrict distribution? I think the law would say I published it on a public forum and thus it is in the public domain.
Why would it be in the public domain? Anything you create, under US copyright law, is the opposite of being in the public domain, it's yours. According to the legalese of YC, you are granting YC and YC alone a license to use the UGC you submitted to their website, but if anything, the YC agreement DEMANDS that you own the copyright to the comment you are posting.
> User Content Transmitted Through the Site: With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed. However, please review the Privacy Policy located here for more information on how we treat information included in applications submitted to us.
> You acknowledge and agree that any questions, comments, suggestions, ideas, feedback or other information about the Site (“Submissions”) provided by you to Y Combinator are non-confidential and Y Combinator will be entitled to the unrestricted use and dissemination of these Submissions for any purpose, without acknowledgment or compensation to you.
Another example of this is people putting code, intended to be shared, up on e.g. Github without a licence.
Many people seem to think that no licence = public domain, but it's still under strong copyright protection. This is the point of things like the Unlicense license.
> If somebody copied and pasted something I wrote on here could I in theory use copyright law to restrict distribution?
Yes, you could, unless you agreed to forum terms that said otherwise, fair use aside. It's the same in most jurisdictions.
>models are locked away in data centers as trade secrets
The architecture and the weights in a model are the secret process used to make a commercially valuable output. It makes the most sense to treat them as a trade secret, in a court of law.
I think you have to distinguish between a model, its implementation, and its weights/parameters. AFAIU:
- Models are processes/concepts, thus not copyrightable, but are subject to trade secret law, contract and license restrictions, patents, etc.
- Concrete implementations may be copyrighted like any code.
- Parameters are "facts", thus not copyrightable, but are similarly subject to trade secret and contract law.
IANAL, not legal advice, yadda yadda yadda.
The weights are a product of a mechanical process; 5 years ago it would have been generally uncontroversial that they are not subject to copyright in the US... but 'industry' has done a tremendous job of spreading confusion.
Going a step further, weights, i.e. coefficients, aren't produced by a person at all – they're produced by machine algorithms. Because a human did not create the weights, the weights have no author. Thus they are ineligible for copyright in the first place and are in the public domain. Whether the model architecture is copyrightable is more of an open question, but I think a solid argument could be that the model architecture is simply a mathematical expression – albeit a complex one –, though Python or other source code is almost certainly copyrighted. But I imagine clean-room methods could avoid problems there, and with much less effort than most software.
IANAL, but I have serious doubts about the applicability of current copyright law to existing AI models. I imagine the courts will decide the same.
You can say the same about compiled executable code though.
Each compiled executable has a one-to-one relation with its source code, which has an author (except for LLM code and/or infinite monkeys). Thus compiled executables are derivative works.
There is an argument also that LLMs are derivative works of the training data, which I'm somewhat sympathetic to, though clearly there's a difference and lots of ambiguity about which contributions to which weights correspond to any particular source work.
Again IANAL, and this is my opinion based on reading the law & precedents. Consult a real copyright attorney for real advice.
Datasets want to be free.
Just ask the owner of the data for their consent before adding to a dataset which wants to be free.
The main disagreement is who the "owner" is in the first place.
Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies. Copyright covers the right to make copies, not the right to distribute; "doing what you want with it yourself" may or may not be covered by fair use. Whether or not model weights are copyrightable remains an open question.
https://www.law.cornell.edu/uscode/text/17/1201
Actually, in terms of copyright control "The Federal Circuit went on to clarify the nature of the DMCA's anti-circumvention provisions. The DMCA established causes of action for liability and did not establish a property right. Therefore, circumvention is not infringement in itself."[1]
https://en.m.wikipedia.org/wiki/Chamberlain_Group,_Inc._v._S...
Circumvention is not infringement, but the DMCA makes it a separate crime punishable by up to 5 years in prison.
>Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies. Copyright covers the right to make copies, not the right to distribute; "doing what you want with it yourself" may or may not be covered by fair use. Whether or not model weights are copyrightable remains an open question.
If that is the law, it is a defect that we need to fix. Laws do not come down from heaven in the form of commandments. We, humans, write laws. If there is a defect in the laws, we should fix it.
If this is the law, time shifting and format shifting are unlawful as well, which to me is unacceptable.
Disclaimer: As usual, I anal.
Time shifting is protected by 40 years of judicial precedent establishing it as fair use.
This is being tested in the courts currently, https://torrentfreak.com/appeals-court-hears-riaa-and-yout-i...
DMCA 1201 is written so broadly that any feature of a product or service can be construed to prevent copying, and thus gain 1201 protection.
I don't think YouTube intended regular uploads to have DRM, if only because they support Creative Commons metadata on uploads, and Creative Commons specifically forbids the use of technical protection measures on CC-licensed content[0]. On a less moralistic note, applying encryption to all YouTube videos would be prohibitively expensive because DRM vendors charge $$$ for the tech.
But the RIAA wants DRM because, well, they don't want people taking what they have rightfully stolen. So YouTube engineered a weak form of URL obfuscation that would only stop very basic scrapers[1]. DMCA 1201 doesn't care about encryption or obfuscation, though. What it does care about is if something was intended to stop copying, and if so, if the defendant's product was designed to defeat that thing.
There's an interesting wrinkle in DMCA 1201 in that merely being able to defeat DRM does not make something illegal. Defeating DRM has to be the tool's only function[2], or you have to advertise the tool as being able to defeat DRM[3], in order to actually violate DMCA 1201. DRM vendors usually resort to encryption, because it makes the circumvention tools specialized enough that they have no other purpose and thus fall afoul of DMCA 1201. But there's nothing stopping you from using really basic schemes (ROT-13 your DVDs!) and still getting to sue for 1201.
Going back to the AI ripping question, this blog post is probably not in and of itself a circumvention tool[4], but anyone implementing it is very much making circumvention tools, which are illegal to distribute. Circumvention itself is also illegal, but only when there's an underlying copyright infringement. i.e. you can't just encrypt something that's public domain or uncopyrightable and sue anyone who decrypts it.
So the next question is: is AI copyrightable? And can you sue for 1201 circumvention for something that is fundamentally composed of someone else's copyrighted work that you don't own and haven't licensed?
[0] Additionally, there is a very large repository of CC-BY music from Kevin MacLeod that is used all over YouTube that would have to be removed or relicensed if the RIAA were to prevail on this case.
I have no idea if Kevin actually intends to enforce the no-DRM clause in this way, though. Kevin actually has a fairly loose interpretation of CC-BY. For example, nobody attributes his music correctly, either the way the license requires, or with Kevin's (legally insufficient) recommended attribution strings. He does sell commercial (non-attribution) licenses but I've yet to hear of any enforcement actions from him.
[1] To be clear, without DRM encryption, any video can be ripped by hooking standard HTML5 video APIs using an extension.
[2] Things with "limited commercial purposes" beyond breaking DRM may also be construed as circumvention tools under DMCA 1201.
[3] My favorite example: someone tried selling a VGA-to-composite adapter as a way to copy movies off Netflix. That is illegal under DMCA 1201.
[4] To be clear, this is NOT settled law, this is "get sued and find out if the Supreme Court likes you that day" law.
Not really. The fair use status of time shifting isn't in question there by either party.
Your comment confused me, but I'm very interested in what you're getting at.
> Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies.
Yep, this is the DMCA section 1201. Late '90s law in the US.
> Copyright covers the right to make copies, not the right to distribute
This is where I got confused. Copyright covers four rights: copying, distribution, creation of derivative works, and public performance. So I'm not sure what you were getting at with the copy/distribute dichotomy.
But here's a question I'm curious about: Can DMCA apply to a copy-protection mechanism that's being applied to non-copyrightable work? Based on my reading of https://www.copyright.gov/dmca/:
> First, it prohibits circumventing technological protection measures (or TPMs) used by copyright owners to control access to their works.
That's not the letter of the law, but an overview, but it does seem to suggest you can't bring a DMCA 1201 claim against someone circumventing copy-protection for uncopyrightable works.
> Whether or not model weights are copyrightable remains an open question.
And this is where the interaction with the wording of 1201 gets interesting, in my (non-professional) opinion!
Here is the relevant text in the law:
> No person shall circumvent a technological measure that effectively controls access to a work protected under this title.
The inclusion of “work protected under this title” makes it clear in the law, though I doubt a judge would rule otherwise without that line. (Otherwise, I’d wonder if I could claim damages that Google et al. are violating the technological measures I’ve put in place to protect the specificity of my interests, because it wouldn’t matter that such is not protected by copyright law.)
Also not an attorney, for what it’s worth.
It seems clear from this definition especially:
> (A) to “circumvent a technological measure” means to descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner
In this case there is no copyright owner.
Right, that’s what I was getting at with my parenthetical. Obviously the work has to have an owned copyright in order to be protected by copyright law.
sorry, yes, reread your comment and dirty-edited mine
This is interesting. I wonder if you could use it as a basis for "legally" circumventing a technology by applying it to non-copyrighted works.
If you mean that you might be able to decrypt a copyrighted work because you used that same encryption method on a non-copyrighted work, then definitely not. The work under protection will be considered. (Otherwise, I am unsure what you meant.)
From what I recall, it was the actual protection method that was protected by the DMCA - when DVD protection was cracked, it was forbidden to distribute a particular section of code, so they just printed it on a T-shirt to troll the powers that be.
Presuming you are referring to this: https://en.wikipedia.org/wiki/AACS_encryption_key_controvers...
> Outside the Internet and the mass media, the key has appeared in or on T-shirts, poetry, songs and music videos, illustrations and other graphic artworks, tattoos and body art, and comic strips.
Using the encryption key to decrypt the data on a DVD is illegal “circumvention” per DMCA 1201, if it’s done without authorization from the copyright owner of the data on the DVD. If it were really illegal to simply publish the key on a website, then printing it on clothing that they sold instead would not be a viable loophole.
I’m glad it is still referred to as a controversy that they were issuing cease-and-desist letters for publishing information when the actual crime they had in mind, which was not alleged in the letters, was using the information to decrypt a DVD.
Publishing the key is a crime but even “discovering” the key is a crime. My toy thought is that you could legally do key discovery using non-copyrighted media though of course now that I think about it why would it be ciphered in that case LOL
Better yet just print the colors that represent the number, see https://en.m.wikipedia.org/wiki/Illegal_number
But then again, knowing the number is a far cry from using that number to circumvent DRM
Just imagine, for a second, how it could become illegal to train anything that does not then produce, if publicly used or distributed, a copyright token which is both in the training set - to mark it - and in the output - to recognize it.
That's where I'd take all this in several years, if I were the gov.
Is using millions of copyrighted works to train your AI a valid exemption? Asking for a few billionaire friends.
No, copyright violation occurs at the first unauthorized copying or creation of a derivative work or exercise of any of the other exclusive rights of the copyright holder (that does not fall into an exception like that for fair use.) That distribution is required for a copyright violation is a persistent myth. Distribution is a means by which a violation becomes more likely to be detected and also more likely to involve significant liability for damages.
(OTOH, whether models, as the output of a mechanical process, are subject to copyright is a matter of some debate. The firms training models tend to treat the models as if they were protected by copyright but also tend to treat the source works as if copying for the purpose of training AI were within a copyright exception; why each of those positions is in their interest is obvious, but neither is well-established.)
It's also worth noting that there is still no legal clarity on these issues, even if a license claims to provide specific permissions.
Additionally, the debate around the sources companies use to train their models remains unresolved, raising ethical and legal questions about data ownership and consent.
I doubt the models are copyrighted; aren't works created by a machine not eligible? Or you get into cases of autogenerating and claiming ownership of all possible musical note combinations.
It's hard to say, because as far as I know this stuff hasn't been definitively tested in any courts. Europe, not America.
AI models are generally regarded as a company’s asset (like a customer database would also be), and rightly so given the cost required to generate one. But that’s a different matter entirely to copyright.
It's insane to state it, tbh.
[dead]
[flagged]
> hoarding data
Laundering IP. FTFY.
"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."
If the weights and biases contained in "AI models" are proprietary, then for one model owner to detect infringement by another model owner, it may be necessary to download and extract.
Where the model owner is not the owner of the training data consider also that weights may be derivative works:
https://www.arxiv.org/pdf/2407.13493