Multiple dispatches from Legal AI have touched on different aspects of Large Language Models (LLMs). But understanding what they are, how they are created, curated, and fine-tuned, and how they are intended to be used is helpful when making legal arguments related to their use. In real estate, finance, marketing, ad targeting, content suggestion, and more, algorithms built in much the same way as LLMs are already working to provide predictive help to humans. (What could go wrong?) It turns out, just about everything. This is not a doomscroll article meant to go all Luddite on LLMs or AI. Quite the opposite. Knowledge is power, so let's get some about how these things work, which, it turns out, is not complicated.
One LLM To Rule Them All
The likelihood that there will ever be just one, or even a small handful of, LLMs dominating the field is low. As discussed in last week's piece, the hallucination problem is just one impediment to relying on LLMs trained on broad, uncurated information. Instead, LLMs are quickly beginning to diverge into industry-, organization- and even person-specific curated datasets. That's the endgame. First, let's talk about what they are in general.
The Gold is Data
Why are Facebook or Twitter or Google so valuable? It’s simple, they have all the gold, which is data. As one of the interviewees in the recent Netflix documentary about social networks said, “whenever you are getting a product for free, you are the product.” Without data, LLMs cannot be created.
The first step in creating an LLM is to gather a vast amount of diverse text data. Keep in mind, this process itself is the first opportunity for humans to influence the eventual operation and answers that the LLM will produce. How? Well, they can point the data gathering at particular sources and away from others. They can collect "everything" and then excise selected parts of that collected data before moving on to the next step. In short, they can put a thumb on the scale right from the start. Any useful transparency rules regarding the creation and use of LLMs have to start from the start: require companies offering LLMs as part of their service to describe precisely where they collected their training data, any specific instructions used to point the collection toward or away from certain sources, and the methodology used to excise categories of data from that dataset prior to training the LLM.
Regardless of those choices, a general purpose LLM like the one powering ChatGPT can include in its dataset books, PDF documents, PowerPoint slides, articles, websites, and just about anything that can be read. Once this data is filtered or modified as noted in the paragraph above, it is then provided to a software program to ingest. By ingest, I mean to read all of that data and begin to look for patterns in how words are sequenced in all of that text. Simple example here. After consuming a bunch of data online, most LLMs trained on general internet content would be able to predict the next word in this sequence: "Four score and ____________." But how and why are they able to predict that?
Contrary to what many surmise, the model has not memorized all the text it was trained on. Instead, it evaluated all that text and looked for patterns. It has noticed which words generally appear near other words, in what sequence, and at what distance. It has derived a notion of how the sentence above should be completed, and of the ways in which the next word cannot possibly, statistically, be "donut," for example. It looked not just for patterns in famous quotes, but for patterns in everyday writing. In this way, it becomes more accurate at predicting the next word in a sequence when prompted to finish that sequence. But still, how? The model, or brain, of the LLM is not even considering words; it is considering vectors.
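To make that idea concrete, here is a toy sketch in Python of my own invention, not a description of how any production LLM is actually built: count which word tends to follow which in a small slice of text, then predict the most frequently observed follower. Real models learn vastly richer patterns using neural networks and vectors, as described below, but the core intuition of predicting the next word from observed sequences is the same.

```python
from collections import Counter, defaultdict

# Tiny "training corpus" standing in for the billions of words a real LLM sees.
corpus = (
    "four score and seven years ago our fathers brought forth on this continent "
    "a new nation conceived in liberty and dedicated to the proposition "
    "that all men are created equal"
)

# Count how often each word follows each other word (a simple "bigram" model).
followers = defaultdict(Counter)
words = corpus.lower().split()
for current_word, next_word in zip(words, words[1:]):
    followers[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequently observed follower of `word`."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

print(predict_next("four"))   # -> score
print(predict_next("score"))  # -> and
print(predict_next("and"))    # -> seven ("and seven" vs. "and dedicated" tie; first seen wins)
```

With only a few dozen words of text the predictions are crude; with billions of words, patterns like "four score and seven" become statistically unmistakable.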
Vectors Are The Thing
Computers compute. The stuff of computing is numbers, not strings of data, or text as it is commonly known. These LLMs do not compute strings; instead, they convert those words (commonly called strings by developers) into numbers. After that, they place those words as points on a plot, each described by a vector, and then use what they have ingested to decide: how close is dog to cat as compared to frog? How close is door to table to car to wood, and so on?
The graphic above is one very small example of billions of such word comparisons on vectors with many dimensions, not just the two shown in this graphic. The position of each word is a pair of numbers (an x value and a y value on that plot), and from there the computer can do the math to predict the next word in the sentence above based on the proximity of the previous word to the latest word, the word before that, and so on. You can see now why these kinds of models were not universally available even five years ago. The computing power necessary to develop them was prohibitively expensive for all except the largest companies. That is changing, and changing fast, in 2023. While open source models (i.e., freely available under various licenses) have been entering the market rapidly in the past six months, they are still not quite on par with the models that companies paid millions to produce. Just like professional sports, there are various LLM leaderboards tracking metrics to determine which LLM is the best performer at the moment. Click on the image below to see the leaderboard if you like.
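Here is a minimal sketch of that proximity idea, using 2-D coordinates I invented purely for illustration (real models use hundreds or thousands of dimensions and learn the coordinates from data). Cosine similarity is one common way to measure how close two word vectors are.

```python
import math

# Hypothetical 2-D coordinates for a few words, invented for illustration only.
vectors = {
    "dog":  (0.9, 0.8),
    "cat":  (0.85, 0.75),
    "frog": (0.4, 0.9),
    "car":  (-0.7, 0.2),
}

def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["dog"], vectors["cat"]))   # ~1.0: very similar
print(cosine_similarity(vectors["dog"], vectors["frog"]))  # lower: less similar
print(cosine_similarity(vectors["dog"], vectors["car"]))   # negative: least similar
```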
Vector Databases In Five Minutes or Less
Training and Testing
Large Language Models (LLMs) are sophisticated artificial intelligence systems designed to understand and generate human-like text. They are created through a multi-step process that involves collecting training data (as outlined above), then preprocessing, tokenization, model training, and fine-tuning. Let's break those down briefly.
Most Data is Garbage
The reality is that most of the work of data analysts and data engineers is not training models or building amazing dashboards of cogent information and analysis. Nope. Most of their work is cleaning garbage. Why? Because humans create data, or they create programs that create data. Either way, the flaws of humans propagate through data, sometimes with viral effect. When analysts set out to analyze data, their first step is to discover and account for (remove, replace, average out, etc.) missing, wrong, or badly formatted data. Imagine how many applications have input fields asking users to enter their phone number and instead they enter the word "car." It happens. All the time. If that data gets past the initial checks (which the application should have in place, but not all do), then the database storing that information now has junk. If that data is somehow copied to other tables, junk proliferates. You get the messy picture.
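A toy sketch of that cleanup step, using the hypothetical phone-number field mentioned above (the records, field name, and rules are all invented for illustration):

```python
import re

# Hypothetical raw records from a form where users typed their phone number freely.
raw_records = [
    {"name": "Ada",   "phone": "415-555-0137"},
    {"name": "Grace", "phone": "car"},           # junk entry, exactly as described above
    {"name": "Alan",  "phone": ""},              # missing value
    {"name": "Edith", "phone": "(212) 555 0199"},
]

def clean_phone(value):
    """Normalize a phone field to ###-###-#### or return None if it is unusable."""
    digits = re.sub(r"\D", "", value or "")      # strip everything that is not a digit
    if len(digits) != 10:
        return None                              # "car", blanks, and other junk end up here
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

cleaned = []
for record in raw_records:
    phone = clean_phone(record["phone"])
    if phone is None:
        continue                                 # drop (or flag) records that cannot be repaired
    cleaned.append({**record, "phone": phone})

print(cleaned)  # only Ada and Edith survive, both in a consistent format
```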
Preprocessing is the fancy term for cleaning data, removing the garbage. Once that is completed, the next move is tokenization.
Get Your Tokens Ready
Tokenization helps computers understand and process human language by providing a structured representation of textual data. It is a crucial preparatory step in building LLMs.
Imagine you have a sentence: "I love to eat pizza." During tokenization, this sentence would be divided into separate tokens, which in this case would be the individual words: "I," "love," "to," "eat," and "pizza." These tokens provide a more manageable and structured representation of the sentence, allowing the computer to analyze and understand it more effectively.
Tokenization is not limited to just words; it can also involve splitting text into tokens based on other elements like punctuation marks, numbers, or special characters. For example, the sentence "I earned $100 last night!" would be tokenized into "I," "earned," "$100," "last," and "night," considering the dollar amount as a distinct token.
By breaking down text into tokens, the algorithms creating the LLM content can process, analyze, and extract meaningful information from text data more easily.
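For illustration, here is a deliberately simple tokenizer that reproduces the examples above. This version also keeps punctuation as its own token; real LLM tokenizers typically go further and split text into sub-word pieces rather than whole words.

```python
import re

def tokenize(text):
    """Split text into word, dollar-amount/number, and punctuation tokens (simplified scheme)."""
    return re.findall(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]", text)

print(tokenize("I love to eat pizza."))
# -> ['I', 'love', 'to', 'eat', 'pizza', '.']

print(tokenize("I earned $100 last night!"))
# -> ['I', 'earned', '$100', 'last', 'night', '!']
```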
Not Practicing, But Training
LLM training involves presenting that preprocessed data to the model. The model creates the vectors (often called embeddings) and uses them to get better at predicting words in sequence. As more data is added, the designers and developers can run tests to gauge the prediction success rate, gradually improving the LLM's ability to generate coherent and contextually appropriate responses. As an example, OpenAI took months and millions of dollars to train what became ChatGPT 3.5. That is multiple computers running in parallel, constantly crunching data and numbers. Training was also augmented by legions of humans reviewing the model's output and grading it along various parameters (accuracy, tone, truthfulness, etc.). This is yet another area where humans can and do affect the performance and output of LLMs.
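Here is a toy sketch of that train-then-test loop, building on the word-follower idea from earlier. The sentences and the accuracy measure are invented for illustration; real training optimizes a neural network over billions of examples, but the principle of checking predictions against text the model has never seen is the same.

```python
from collections import Counter, defaultdict

# Learn word-follower patterns from one slice of text, then measure how often the
# model's "most likely next word" matches text it never saw during training.
training_text = "the court held that the motion was denied and the case was dismissed"
held_out_text = "the court held that the appeal was denied"

def build_model(text):
    followers = defaultdict(Counter)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers

def accuracy(model, text):
    words = text.split()
    pairs = list(zip(words, words[1:]))
    correct = 0
    for current_word, actual_next in pairs:
        if model.get(current_word) and model[current_word].most_common(1)[0][0] == actual_next:
            correct += 1
    return correct / len(pairs)

model = build_model(training_text)
print(f"Held-out prediction accuracy: {accuracy(model, held_out_text):.0%}")
```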
The final step in the process is fine-tuning and optimization. Fine-tuning involves training the model on specific tasks or domains to enhance its performance in those areas. One example might be training an LLM exclusively on verifiably published court opinions. This kind of model, devoid of the detritus of the wider Internet, would be far less prone to the hallucinations that have already tripped up one lawyer in grand fashion. (See our post of last week.) The purpose of this last step is to customize the LLM's capabilities and make it more useful for particular applications, such as natural language understanding, translation, or text generation.
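For a sense of what that last step can look like in practice, here is a hedged sketch using the open-source Hugging Face transformers library. The base model ("gpt2"), the file of court opinions ("court_opinions.txt"), and the training settings are all illustrative assumptions, not a description of any vendor's actual pipeline.

```python
# A minimal fine-tuning sketch: continue training a small base model on a
# specialized corpus. Assumes court_opinions.txt exists, one opinion per line.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

base = "gpt2"  # a small, freely available base model, used here purely for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(base)

dataset = load_dataset("text", data_files={"train": "court_opinions.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="court-opinion-llm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()  # continues training the base model on the specialized corpus
```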
So What? Lawyers Don’t Need To Know How LLMs are Created
Why should you care? You're a lawyer, and all you want is a good research resource that might outperform existing tools for a lower cost or no cost at all. You should care because these kinds of tools are already at work in finance, healthcare, construction, insurance, criminal justice, urban planning, government loan programs and more. To properly litigate issues in these realms, whether related to LLMs or not, it will be increasingly relevant to ask whether an LLM is at work, how it was designed, and to demand transparent information about each of the steps above. Civil rights cases involving poorly designed or deployed LLMs will undoubtedly emerge. Subpoenas and records requests for information will be scrutinized by the receiving companies. Those companies will have people intimately familiar with these models and how they were created. If those subpoenas are poorly worded or devoid of the right technical language, it becomes that much easier for respondents to provide something they can argue is responsive when in reality it is unhelpful. Lawyers do not need to be data engineers, but understanding these things at a high level will help avoid missing legal issues hiding in the data.