There is so much work out there for free, with no copyright
There’s actually a lot less than you’d think (since copyright lasts for so long), and even less now that online and digitized sources are being locked down and charged for by the domain owners. But even if it were abundant, it would likely not satisfy the true concern here. If there were enough data to produce an LLM of similar quality without using copyrighted data, it would still threaten the security of those writers. What is to say a user couldn’t provide a sample of Stephen King’s writing to the LLM and have it produce derivative work, even though it was never trained on copyrighted data? If the user had paid for that work, are they allowed to use the LLM in that way? If they aren’t, who is really at fault, the user or the owner of the LLM?
The law can’t address the complaints of these writers because interpreting it that strictly would set an impossible standard. The best way to address the complaint is to reform copyright law (or regulate LLMs through some other mechanism). Frankly, I do not buy that LLMs are a competing product to the copyrighted works.
The biggest cost in training is most likely the hardware
That’s right for large models like the ones owned by OpenAI and Google, but given the amount of data needed to effectively train and fine-tune these models, if that data suddenly became scarce and expensive its cost could easily overtake the hardware cost. To say nothing of small models that are run on consumer hardware.
capitalists just stealing whatever the fuck they want “move fast and break things”
I understand this sentiment, but keep in mind that copyright ownership is just another form of capital.