Databricks can't seem to shake authors' copyright claim that could result in 'extraordinary' damages
Authors say it acquired an LLM that was trained on their copyrighted data, and judge keeps asking for more info
Databricks cannot shake a class action lawsuit targeting its LLM, which several book authors contend was created with a database that contained pirated versions of some of their copyrighted books – and about 196,000 titles in all.
Databricks’ motion to dismiss the case was denied last week by Judge Charles Breyer in U.S. District Court in Northern California, who said the plaintiffs, a group of writers that includes bestsellers and a Pulitzer Prize finalist, had grounds to continue their suit against the data analytics platform.
Databricks' LLM, called DBRX, was cobbled together with parts from MosaicML, which Databricks acquired in 2023. Early MosaicML models used a dataset called RedPajama – which contained Books3 and has since been pulled from Hugging Face for copyright infringement. Databricks is essentially arguing that the authors can't prove that DBRX was trained with the Books3 data, and has testified to that effect.
Databricks closed its acquisition of MosaicML in July 2023. In a statement at the time, Databricks called Mosaic “a leading generative AI platform known for its state-of-the-art MPT large language models.” MosaicML released its first MPT model in May 2023 and announced in a blog post that it had used the RedPajama dataset in training.
Then, when Databricks released its DBRX model in March 2024, it said, “The development of DBRX was led by the Mosaic team that previously built the MPT model family.” The case hinges on how closely those two steps were tied.
Speaking of the authors, Judge Breyer wrote in his ruling, “They directly tie their infringed works to DBRX, and the employee statements provide supporting inferences when read in context, particularly when viewed alongside other more direct statements.”
While Databricks has provided fourteen depositions, thousands of pages of documents, and terabytes of discovery information in its bid to show the court it did nothing wrong, Breyer wants to see more, said Brandon Butler, a copyright lawyer and executive director of Re:Create, a coalition of groups that advocates for balanced copyright laws.
“Judge Breyer basically says, ‘We need to know more before we can say that you didn't actually engage in any infringing copying,’ ” Butler told The Register. “We don't know enough yet, about what happened. Step by step, what did they physically do?”
Butler said potential damages against Databricks are massive if the authors can convince the court that the infringements were willful.
“The damages provisions in copyright law are draconian with a capital D. I mean, they are extraordinary. They are six figures per work infringed up to $150,000,” he said. “This is bet-the-company litigation. If they win, they could get enough damages they just liquidate every asset that belongs to some of these companies, and probably especially a smaller player like Databricks.”
So far several authors have joined the suit, among them young adult best-selling author Jason Reynolds, Stewart O’Nan, Brian Keene, and Rebecca Makkai, whose book The Great Believers was a finalist for the Pulitzer Prize.
Meta won a similar lawsuit last year against book authors who sued for copyright infringement during the creation of its Llama models by arguing that its actions were covered by fair use provisions of copyright law. Anthropic also won on a similar fair use claim in a separate case (but had ingested pirated books and agreed to establish a $1.5 billion fund to compensate authors).
But Databricks has not yet made that argument.
Instead, Databricks' unsuccessful motion said the authors’ complaint was “nonsensical” and encompasses actions that predate the training of DBRX.
“By Plaintiffs' strained logic, if a car company experimented on emissions technology with and without a patented component, and later manufactured a car without that component, the patent owner could still assert infringement claims as to the non-infringing car based solely on the earlier experimentation that led to the decision not to include the component,” lawyers for Databricks wrote.
The authors argue they only need to show the court that their works were copyrighted and that those works were then copied by Databricks.
“Databricks copied Books3 multiple times in the process of developing its DBRX models and by so doing, directly infringed Plaintiffs’ copyrights in the asserted works,” the authors who brought the suit stated. “Under Defendants’ logic, as long as an AI company does not incorporate copyrighted books into the final training dataset of a model, it is free to download, store, reproduce, and indefinitely use pirated works for its own benefit. That argument gets it backwards.”
Butler said there are a couple of paths Databricks could take to succeed. First, it could argue fair use, which has been a winning argument in the same federal court that is hearing this case. Second, it could claim the authors cannot show damages and thus have no grounds to file suit.
“That may be an argument that would be useful here, which is to say, ‘Whatever happened with all those books back then, none of that ever saw the light of day. It had no impact on our model. It was a mistake, and we undid it, and it had literally no impact in the world. So, why are we here? Why are we wasting the court's time?’ But I think that's a thing they have to prove, and they haven't proven it yet,” he said. ®



