You Were On The Internet, Maybe You Are In The Class
An examination of the potentially groundbreaking class action lawsuit filed this week against OpenAI, Microsoft and others.
A list of Plaintiffs filed a class action lawsuit against a list of Defendants in a California Federal District Court yesterday.
Who Deserves To Doe?
The first issue with the complaint is the Plaintiffs’ implicit assertion of the right to proceed anonymously. The first paragraph of the complaint reads as follows:
Plaintiffs P.M., K.S., B.B., S.J., N.G., C.B., S.N., J.P., S.A., L.M., D.C., C.L., C.G, R.F., N.J., and R.R., (collectively, “Plaintiffs”),1 individually and on behalf of all others similarly situated, bring this action against Defendants OpenAI LP, OpenAI Incorporated, OpenAI GP LLC, OpenAI Startup Fund I, LP, OpenAI Startup Fund GP I, LLC, and Microsoft Corporation (collectively, “Defendants”). Plaintiffs’ allegations are based upon personal knowledge as to themselves and their own acts, and upon information and belief as to all other matters based on the investigation conducted by and through Plaintiffs’ attorneys.
It turns out that the default in federal court lawsuits is that all plaintiffs must be listed by their legal names. “[I]dentifying the parties to the proceeding is an important dimension of publicness. The people have a right to know who is using their courts.” Doe v. Blue Cross & Blue Shield United, 112 F.3d 869, 872 (7th Cir. 1997) (Posner, J.).
The public's right to know a litigant's identity derives from the United States Constitution and the common law. See Doe v. Stegall, 653 F.2d 180, 185-186 (5th Cir. 1981) (recognizing that a party's use of a pseudonym implicates the public's First Amendment rights); Does I thru XXIII v. Advanced Textile Corp., 214 F.3d 1058, 1067 (9th Cir. 2000) (“Plaintiffs' use of fictitious names runs afoul of the public's common law right of access to judicial proceedings”). The public, therefore, must have a way to challenge a court order that limits access to judicial proceedings. See Globe Newspaper Co. v. Superior Court for Norfolk Cty., 457 U.S. 596, 609, 102 S.Ct. 2613, 73 L.Ed.2d 248 (1982), fn. 25.
There are exceptions, but they are exceedingly rare. Some courts have complained that too many districts are loosening those restrictions and permitting lawsuits where the public has no idea “who is using their courts.” One of the first considerations for the Defendants in this matter will be whether to argue that these plaintiffs are not entitled to proceed pseudonymously. Since no affidavit or other document appears to have been filed contemporaneously with the complaint outlining the justification for proceeding pseudonymously, it is not yet known what basis might make the plaintiffs eligible to proceed in that way.
OTHER COURTS HAVE FOUND THE PROCEDURE HERE CRITICAL
Ordinarily, a plaintiff wishing to proceed anonymously moves for a protective order that allows him or her to proceed under a pseudonym. Failure to seek permission to proceed under a pseudonym is fatal to an anonymous plaintiff's case, because “the federal courts lack jurisdiction over the unnamed parties, as a case has not been commenced with respect to them.” Citizens for a Strong Ohio v. Marsh, 123 F. App'x 630, 636-37 (6th Cir. 2005) (quoting Nat'l Commodity & Barter Ass'n v. Gibbs, 886 F.2d 1240, 1245 (10th Cir. 1989)). “This Court similarly lacks jurisdiction over the unnamed Defendant and a case has not been commenced against him.” Malibu Media, LLC v. John Doe, 1:14CV493, S.D. Ohio Western Division (January 21, 2015).
There are several cases in which the factors making a party eligible to proceed pseudonymously are laid out. From the content of the complaint, it does not appear that the Plaintiffs here meet those requirements. We shall see.
Tell A Good Story
While the crux of the complaint is one of theft, it begins with a heavy dose of fear about the future of AI. It is not clear what legal remedies the plaintiffs would have if companies producing LLMs like OpenAI and others were somehow to “onboard[] society into a plane that over half of the surveyed AI experts believe has at least a 10% chance of crashing and killing everyone on board.” Complaint at ¶2.
Leading with the fear that some perceive is already upon us from the release of tools like ChatGPT 3.5 and now 4 is an interesting legal approach. The real legal claim, however, is one of theft. The plaintiffs are claiming that OpenAI, in building its LLM and its image-creation tool Dall-E, collected from the Internet “stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge.” Id. at ¶3. Read that again carefully. The claim is not that OpenAI or the other defendants somehow invaded the phones, encrypted files, or otherwise password-protected content of random people. It is that they collected information from the publicly available Internet. But, in so doing, they also collected information that should not have been on the Internet. And, further, that their collection of that information, whether appropriately on the Internet or not, should have been done with the consent of the person(s) who have an ownership interest in that information. Here, again, as in previous columns, it appears that the complaint misperceives the notion of what an LLM does…or more to the point, what it is.
BRIEF PRIMER ON WHAT AN LLM IS
An LLM (Large Language Model) is a model. It is a mini-brain of sorts. It has been trained, much like our brains are trained by both experience (crawling and then walking without any real instruction) and instruction (math equations, reading, etc.). It is not a collection or pile of information. The model does not have within it the memorized sum total of all the information used to train it. Think of it this way. A one-year-old who just learned to walk has put together a series of trial-and-error actions, responses, and analyses to eventually assemble the physical and mental skill to walk successfully. Once they are walking, they are not simultaneously recalling and processing all that was used to develop the walking skill. They are just walking. Those training sessions are long gone, and what is left is the “ability.” When they come across slopes, curbs, grass, ice, sand, etc., they adjust their “model” for walking to those surfaces. LLMs are similar. They have accumulated information about how words are sequenced and make trained predictions as to what word ought to appear next in a given sequence. They are not storing the entirety of the Internet on a hard drive in that brain; they are just responding with mathematical predictions based upon their training.
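The distinction can be made concrete with a toy next-word predictor. This is a deliberately simplified bigram model, nothing like OpenAI's actual architecture, but it illustrates the key point: what survives training is a table of word-follower frequencies, not a copy of the training text.

```python
from collections import Counter, defaultdict

def train(corpus: str) -> dict:
    """'Train' by counting how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(model: dict, word: str):
    """Predict the most frequently observed follower of `word`."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# The "model" below retains only counts like {"the": {"cat": 2, ...}};
# the original sentence itself is discarded after training.
model = train("the cat sat on the mat the cat ate the fish")
print(predict_next(model, "the"))  # -> "cat"
```

Once `train` has run, the corpus can be thrown away; only the learned frequencies remain, which is the sense in which a trained model does not “store” its training data.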
Whose Information Is It Anyway?
Consider this proposed legal principle: information that is publicly available on the Internet cannot be collected or otherwise used by anyone without the explicit permission of the person who owns that information. This is the concept upon which much of this complaint rests. Despite information being freely available to the public on the Internet (some of which is improperly there, having been intentionally or unintentionally placed there), persons collecting that information must identify, locate, and obtain the consent of the individuals with an ownership interest in that information before they can “use” it. What does “use” it mean? Can that information be used to target ads to those persons? Can it be used to justify advertising on a given website? Can it be used in statistical analysis by companies, government agencies, or private citizens? To what degree is publicly available information restricted in its use? Do I have to get the permission of people walking on the street who happen to walk behind my friend when I am casually taking his or her picture with my phone and later use that image to make a million-dollar viral video? After all, they are in public and I am capturing them in public, just not with their permission.
What Is Copyrighted and What Does That Mean?
The complaint makes a more interesting claim in paragraph seven, leading with this: without the Defendants’ theft of “copyrighted information belonging to real people, communicated to unique communities, for specific purposes, targeting specific audiences, the Products would not be the multi-billion-dollar business they are today.” Id. at ¶7. Here it appears the Plaintiffs are solidifying the notion that the copyright violations of which they speak have generated commercial value for the Defendants. It is certainly true that OpenAI is hugely successful, valued in the tens of billions as of this writing.
Automatic Copyright Protection
In most jurisdictions, original creative works are automatically protected by copyright from the moment they are created and fixed in a tangible form. This includes written content, images, videos, music, and other creative expressions.
The Ubiquitous Fair Use Exception
As with the First Amendment, copyright protection is not absolute. You can use anyone’s copyrighted work without permission or payment if you do so consistent with the Fair Use exception.
Fair Use or Fair Dealing: The concept of fair use or fair dealing allows limited use of copyrighted material without permission for purposes such as criticism, commentary, news reporting, education, or research. Fair use/fair dealing is a flexible doctrine, and its applicability can vary based on factors like purpose, nature of the work, amount used, and effect on the market.
What is interesting about this exception is that it implies the person operating under the Fair Use protection “used” the copyrighted material by including it in another work. For example, a journalist quoting a poem, a line from a movie, or a book excerpt when reviewing it, or a cartoonist satirizing a portion of one of those. I am not aware of a copyright case, before this one, in which the Fair Use exception, if it is invoked by the Defendants, would involve never having excerpted, printed, included, or otherwise publicly displayed the copyrighted work. It seems akin to a musician saying about a song they believed another musician had improperly used, “I know he didn’t include any of the notes/lyrics in his music, but he listened to my song over and over again and, by putting that in his brain, he was able to make a new song. So, his new song is essentially the result of him repeatedly thinking about and listening to my song.”
If the Defendants here attempt to claim a Fair Use exception, how would they actually conform to the implied “use” of the material? And, to access this exception, to what “use” did the Defendants put the copyrighted works that would qualify under the exception? I cannot see any Fair Use exception that superficially applies. Likewise for the Plaintiffs, to establish a copyright infringement, how will the “use” of their clients’ copyrighted works be shown when presumably none of the Defendants actually wrote about, excerpted, or otherwise “used” that material to make a new work?
The Plaintiffs return again and again in the complaint to the AI-powered Armageddon that, somehow, the legal system should intervene to halt.
Experts believe that without immediate legal intervention this will lead to scenarios where AI can act against human interests and values, exploit human beings without regard for their well-being or consent, and/or even decide to eliminate the human species as a threat to its goals. Id. at ¶11.
It is unclear how copyright law could be used to effectively halt the Apocalypse, but advancements of the law have historically been made in such ways. It is a funny hypothetical: a rogue, AI-powered drone hovers over some person’s house, about to fire a missile, and a lawyer runs onto the front lawn with a shield that reads “Copyright Law.” I think we all have a notion of how that Sci-Fi hypothetical ends.
Wait, But There’s More
Now, this claim in paragraph 12 is interesting. It claims that since its release of ChatGPT, OpenAI has been improperly stealing users’ “personal information from the Products’ 100+ million registered users without their full knowledge and consent.” Id. at ¶12. This claim, if true, is a well-known basis for class action certification. But is it happening? I have not researched this, but one has to assume that OpenAI had access to experienced lawyers who wrote its Terms of Service to alert users to precisely how their information could be used.
You’re Killing Me Smalls
In paragraph 24 Plaintiffs urge that the court needs to step in to avoid the “professional obsolescence of software engineers.” Id. at ¶24. This is a fascinating claim: that a court ought to stop a technology because its widespread use/adoption might render one or more of the plaintiffs’ jobs obsolete. (Obviously, the phrase “buggy whip” came into your mind when reading that.) It will be curious to see whether the court bites on that argument as justification for an injunction or other initial order based upon occupation protection. In the paragraphs just after this one, this same plaintiff asserts he “did not consent to the use of his private information by third parties in this manner. Notwithstanding, Defendants stole Plaintiff P.M.’s personal data from across this wide swath of online applications and platforms to train the Products.” Id. at ¶25. The crux of this claim is the admission by this plaintiff that he posted various text, images, and other personal information on platforms like Instagram, Twitter, Tinder, and elsewhere that he believes was used by the Defendants to train their models. And that he had a reasonable expectation that the Defendants would not use this material without his consent. Really? Imagine an Internet where people publicly post information and can later claw back anyone’s use of that public information unless they are compensated. Perhaps that is where we want to go with the Internet. It certainly protects people’s property rights in their content, images, and the like. But at what cost to the exchange of information, and what burden on platforms to somehow prevent unauthorized use of…everything?
The summarizing claim for each of the Plaintiffs, some of whom are listed as children, is that “Plaintiff [] reasonably expected that the information that he exchanged with these websites prior to 2021 would not be intercepted by any third-party looking to compile and use all his information and data for commercial purposes.” Using data for commercial purposes can be flexed many ways. CNN, Fox News, and ABC are all commercial operations. They gather information from the Internet to produce data analysis, polls, and the like for use in their broadcasts. Is that the kind of “use” the Plaintiffs are claiming here? It seems not to be what they mean, but it is unclear how wide a swath through the world of “use” of publicly available information on the Internet the Plaintiffs imagine their claims run.
The use of the term “intercepted” as the means by which the Defendants collected the Plaintiffs’ information is interesting. Is it the common understanding of that term that when a person posts information on a publicly available website, social media network, etc., a company “scraping” that data is “intercepting” it? Publicly posted information is intended for…the public. Is an entity, also in public, that gathers that information somehow intercepting it from others in public? Perhaps the Plaintiffs here are being more precise: “I publicized this information for only a subset of the public, and that subset did not include [insert names of Defendants].” In this way, a court would be asked to fashion a rule that persons posting on the Internet can retroactively decide that certain members of the public, including organizations, were not whom they intended to share their information with and, therefore, those organizations using that information should pay for copyright infringement. This type of Time Machine enforcement of a claim is not likely to succeed, but as we all know, judges and juries surprise lawyers every day.
Drum Roll Please
The complaint explains the intertwined nature of OpenAI and Microsoft, including the various exclusive licenses Microsoft obtained to use OpenAI’s technology in products like GitHub, Azure (its cloud service competing with Google Cloud and AWS), Office 365, and others. The plaintiffs reason that federal jurisdiction is appropriate since there is diversity of citizenship between the California plaintiffs and OpenAI, which is a Delaware corporation. Oh, yeah, and the $3,000,000,000 prayer for relief. That little detail.
The Switcheroo, with Elon’s Money
In paragraph 134 Plaintiffs summarize an interesting truth about the evolution of the main Defendant, OpenAI. It started as a non-profit to which Elon Musk (among other donors) donated $100,000,000. Then, somewhere along the way, OpenAI pivoted to a for-profit company controlled by Microsoft. They quote Musk saying, “[I]f soliciting non-profit contributions to then turn around and build a for-profit company is legal then why doesn’t everyone do it?” The plaintiffs respond to this question with a flat, “It isn’t.” Here again, this is not a legal claim, but apparently more information to highlight that OpenAI is pretty tightly aligned with Microsoft, making it the Darth Vader of LLMs and AI right now. Perhaps there is some case law I am unaware of where simply being Darth Vader-adjacent is sufficient to impose liability on a company. I am not up on the latest Star Wars-related iterations, but no one wants to be associated with Darth Vader, I presume. That research is for another day.
Whose Data Is It?
In paragraph 146 Plaintiffs claim that OpenAI, by collecting this data from the Internet via web scraping, was required to register as a Data Broker under California law, and that it failed to do so. In paragraph 248 Plaintiffs claim that various state and federal laws, as well as the Terms of Service of various websites where Plaintiffs posted their data, were violated by this web scraping activity. These claims are interesting because they would appear to present a question of law. There is no doubt OpenAI scraped data from the Internet to train its models.1 The only question then is an interpretation of the data broker law and these other various laws as they relate to the scope of what OpenAI did to train its model. (Review of this claim will be the subject of an upcoming edition of Legal AI.) Persuasively, the complaint notes a company called ClearView, which scraped publicly available images from the Internet for years to produce a facial recognition product sold to law enforcement. It was sued and eventually registered as a data broker in both Vermont and California. Those facts track closely with what OpenAI has acknowledged doing here. ClearView did not sell the collected images. There was no unlawful duplication. It merely trained a model to recognize faces, “using” those images as training materials.
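For the non-technical reader, the web scraping at issue is mechanically mundane: fetch a public page, strip the markup, keep the text. A minimal sketch using only Python’s standard library (the class and function names here are illustrative, not anything from the actual case record):

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def scrape(url: str) -> str:
    """Fetch a publicly available page and return its visible text."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Run against millions of URLs, a loop like this yields the kind of text corpus used for model training; whether doing so triggers data broker registration or other statutes is exactly the question of law the complaint poses.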
What Is Theft?
The Plaintiffs repeatedly use the words “stole” or “stolen” when referring to their data and how it was used by Defendants to train their LLMs. This is also an interesting semantic choice. Typically, when someone steals something, they take control of it, whether it be a physical object, a space, or some intangible like customer lists or a user’s data. Perhaps even an unauthorized duplication could be a theft of sorts. But what becomes of data that is merely read, reviewed, and evaluated for the way in which its words are sequenced, their frequency, etc.? Do I steal pieces of art from the museum by repeatedly visiting and staring at them to eventually develop a style that mimics those pieces? Likewise, do I steal a playwright’s or poet’s work when I simply read it over and over and learn the sequence, structure, frequency, and usage of the words in those texts? What if I blatantly copy the entire work with a printer, a click of a mouse, or a photocopy machine, then review all the words, evaluating them for their sequence, frequency, and such? Is it really the copying that was the harm here, or the thing I developed by looking at the text intently and repeatedly? These are precedent-setting issues to be sure, but not ones that are easily applicable to the copyright, theft, and other structures this complaint wishes to graft them onto. This does not mean the complaint is poorly drafted or doomed to fail. This is legal wilderness exploration. It is a document presented to a judiciary that is largely unprepared to handle the technological aspects of what it is facing. Both parties in such cases are going to have to rely on experts to help explain these concepts to judges who are likely not familiar with some of the fundamental technology at work. That is not a criticism of judges. Theirs is not the job to be software engineers.
But, alas, in the age of AI when even the engineers building some of these tools cannot precisely explain how they work (or why they sometimes go awry), educating the court will be a key task.
The claims in this complaint might be the ones that set the bar for future cases and herald a new era in data privacy and protection for consumers. There is no way to know at this early stage. The Defendants will certainly answer. Will they use ChatGPT to help write the answer? Did the Plaintiffs’ lawyers use ChatGPT to help draft their complaint? Wouldn’t that be an interesting coincidence.
Those Interminably Confusing Terms Of Service
A claim Plaintiffs make near the end of their complaint likely resonates with most people. In summary, they claim (with examples) that the Terms of Service (TOS) supposedly alerting users of ChatGPT to how it will use their data are purposefully confusing, vague, and legally insufficient. There is much sympathy to be gained by presenting anyone who is regularly online with the oft-confusing and long-winded TOS documents that websites seek users’ assent to. The complaint even highlights case law regarding the size of buttons, text, and other markers involved in the process websites use to inform users and obtain their consent to data use. While not the crux of their claim, this one has a significant chance of getting the attention of even the most technophobic jurist.
Conclusion, There Isn’t One…Yet
Yes, there is applicable precedent, or at least some persuasive argument for it. And, yes, the defense has yet to speak. But whether a court could or should intervene to essentially shut down the production of future LLMs (which a judgment in favor of Plaintiffs would certainly do, given the cost it would impose on creating future LLMs) is another question. Do we want our information used in this way without our consent? Do we want innovation on AI and LLMs, or image-producing tools like Dall-E, halted by the imposition of these costs? As with so many things in politics and culture these days, “there appear to be no solutions, just trade-offs.”
Any other LLM is also doing the same scraping. As I have said in earlier posts, there is no AI without data. There is no LLM the size and scale of OpenAI’s ChatGPT without scraping, basically, the entire Internet. There is no other way to train a thing to do something than to feed it a mass of data in which that something is demonstrated. You want a machine that writes poetry via AI? You have to feed it tons of examples of, you guessed it, poetry.