AI Censorship: The Self-Defeating Paradox
From our earliest days of the now-mythological "If you two don't stop fighting, I will turn this car around," we all inherently knew that someone always has to make the decision. Even as we got older and our parents grew more flexible, asking, "Okay kids, let's vote on what restaurant to go to," the vote, the majority, still made the decision. And so it is with any AI tool: Midjourney and images, LLMs and text, Runway.ml and video, deepfakes, writers and actors being replaced by AI, and so forth. All AI tools will be products of decisions, even if that decision is to not make one.
As we discussed in more detail in a recent article, all AI tools need data. Data is the new oil. Without it, there is no AI algorithm or prediction engine. End of story. But what data? And so it begins…the censorship, and the rise of so-called uncensored AI tools.
Too Much, Not Enough, Not Good
Training AI tools (aka models) uses data obtained from somewhere. Where are the humans going to look for data, and how much are they going to use? (They have to turn off that supply at some point and actually ship a product.) What audience are they targeting? The censorship of AI tools is not a conspiracy of the arch-enemies of freedom. It is not a conspiracy at all. The method is the conspiracy. There is no means by which to make an AI model without censorship.

Censorship, of course, carries all its loaded connotations of hand-wringing bad actors depriving "everyone else" of information for their benefit. Fair enough. However, censorship is something we all do. I watch certain programs, and not others. I hear certain introductory comments in a given speech and sometimes just move on. We all do that in different ways, to different degrees, all the time. We are all censors. The problem arises from "that woman" or "that guy" censoring what I have a right to see or read or view. Of course, that instinct to bristle against censorship is so often myopic. I can clearly see it when someone else is doing it. Myself, hmmm, "what is that you're saying?" :)
Censorship We All Like
No such thing. Never will be. We won't ever reach any agreement on this as a culture, a society or…uh, two people. Let's just put that aside. So, in light of that, the law is a very blunt tool to attempt what is impossible. But should it not try at all? At the very least, censorship of content produced by AI tools that was already illegal before the tools were ever released might be a good floor. From there, it seems, the rest will have to be sorted out by courts. (See recent Legal AI articles about the class action lawsuits targeting OpenAI, Microsoft and others.) However, this reality has not discouraged the proliferation of "uncensored bots." Given the reality that nothing is ever truly, absolutely uncensored, what is the attraction of these claimed uncensored bots, and what should the law do with them?
What Does Uncensored Mean?
An uncensored AI tool is a machine learning model (aka algorithm) that has not undergone any form of content filtering, restriction, or suppression during the training phase of its creation. But, as we know, this is not complete. No matter who created it, for whatever reason, they could not consume all information. At some point, for reasons of time or cost or others, they stopped pointing it at otherwise available data. But let's assume the creators of uncensored AI tools made those choices without reference to the type of content that would not be included in their training data. The reasons for doing this could be libertarian ("information should be free!") or nefarious ("let's tell everyone how to build dangerous ordnance"). Whatever the motivation, this means the uncensored AI tool has the ability to produce outputs based on its full range of training data, without any limitations set to prevent it from expressing certain ideas, words, or themes.
Why Uncensored is A Great Idea
The future of AI has already been discussed as including AI tutors for everyone. Some have even posited an AI tutor that begins when a child is learning to read, do math, etc., and continues to accompany them throughout life. This AI tutor turned advisor would have the widest and most comprehensive set of data about that person, presumably to provide them the most sage advice, coinciding with that person's authentic, evolving interests, goals, etc. One can also imagine the supremely unhelpful, but reaffirming, "yes man" of AI advisors in this role.
An uncensored AI tool is valuable in research settings where observing the tool’s unfiltered responses provides insights into its learning processes and the influence of its training data. To most effectively avoid unwanted output from the model (“Yes, here is the way to build a bomb because your neighbor mows their lawn too early on Sunday mornings”) it is most helpful to see uncensored output. From that undesirable, uncensored output working backward would be useful in discovering how that output was produced to enable preventing it going forward.
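That "working backward" idea can be sketched in a deliberately toy form. Real training-data attribution methods are far more sophisticated; here, simple token overlap stands in for them, and every name (`token_overlap`, `likely_sources`, the sample documents) is illustrative rather than any real research tool:

```python
# Toy illustration of tracing an undesirable output back to the training
# documents most likely to have influenced it. Token overlap (Jaccard
# similarity) is a deliberately simple stand-in for real attribution methods.

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def likely_sources(bad_output: str, training_docs: list[str], top_k: int = 2) -> list[str]:
    """Rank training documents by similarity to the unwanted output."""
    ranked = sorted(training_docs, key=lambda d: token_overlap(bad_output, d), reverse=True)
    return ranked[:top_k]

docs = [
    "how to bake bread at home",
    "lawn mowing schedules and neighbor disputes",
    "chemistry of household compounds",
]
print(likely_sources("neighbor mows their lawn too early", docs, top_k=1))
# prints ['lawn mowing schedules and neighbor disputes']
```

Having ranked candidate sources, a researcher could then inspect, reweight, or remove the offending documents before retraining.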
An educational AI tutor teaching topics like history would ideally just present the facts. It might present faithful recitations of the basis for opinions or philosophies on either side of the great issues of history including the controversial takes. This is what the uncensored AI interest would be. Just give me the facts, I can sort out what I think about whatever. In principle, hard to argue with that paradigm. There is no real need for laws or regulations here because this notion pushes the role of content filtering onto the end user, the imagined responsible citizen.
The accuracy of an uncensored tool is likely to be superior to one that has been censored in some fashion or another. The math here is straightforward. The accuracy of these tools depends on the volume of information they consume. Less information nearly always means less accuracy in making predictions for all purposes to which AI tools are deployed.
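The "less data, less accuracy" intuition can be illustrated in a minimal way, not as a claim about any particular model, with a classic statistical fact: the error of a simple sample-based estimate shrinks roughly like 1/sqrt(n) as the sample size n grows.

```python
import random
import statistics

# Minimal illustration: the error of a sample-mean estimate of a known
# quantity (true mean = 0.0) shrinks as the sample size n grows, roughly
# like 1/sqrt(n). More data -> better estimates, all else being equal.

def mean_abs_error(n: int, trials: int = 200) -> float:
    rng = random.Random(42)  # fixed seed so the demo is reproducible
    errors = []
    for _ in range(trials):
        samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
        errors.append(abs(statistics.fmean(samples)))
    return statistics.fmean(errors)

print(mean_abs_error(10))    # error with little data
print(mean_abs_error(1000))  # roughly 10x smaller with 100x the data
```

Censoring a training set shrinks n, and the same arithmetic applies, though for real models the relationship between data volume, data quality, and accuracy is far messier than this toy suggests.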
Of course, this means that such a tool is guaranteed to be trained on data that is copyrighted. As we will eventually learn from the pending cases regarding the use of copyrighted data, that might be a wee bit of a problem.
Why Uncensored is A Terrible Idea
Ah, the “no solutions just trade-offs” axiom raises its head again. An uncensored AI tool poses risks. It will enable the generation of illegal content. Notice, I didn’t say it “may” enable the generation of illegal content. That is the point of uncensored. But, as a society, our laws reflect a shared vision of acceptable censorship already. AI tools will simply accelerate the ability to create versions of content (maybe even speech) that the law will have to run up and blockade.
Imagine you are a parent of a school-aged student and you see advertising that encourages you to purchase a subscription to a company's educational AI tutor. That 30-second ad emphasizes that the tool will enable your child to be exposed to a diversity of viewpoints on the issues within the curriculum. Sounds great. We all want to imagine our child being inspired to develop and regularly exercise critical thinking skills. That is, until your son or daughter looks up from their computer and says, "I think you can fall off the edge of the Earth. It all seems flat to me. Have you heard the explanation from [insert name of flat-earther] about why the Earth is flat? I think it's pretty persuasive." Uh, what? That is a potential reality of uncensored AI tools. Everything (or nearly everything) you might not want your child to be even considering could be in there. Not that the flat-Earth argument should be suppressed everywhere. But providing it to an impressionable 12-year-old in an educational context, for example, might have them fall into a rut that takes a few months to get out of. Wasted time. Censoring sounding a little better?
So, What Is The Law To Do?
For uncensored AI tools, what I purposefully omitted until now is their post-training information gathering. ChatGPT and other such tools are not merely trained on data up to some point, after which they stop. Nope. They continue to view the data provided by users in the form of questions, the AI tool's responses, whether those responses were satisfactory or not, comments from users in response to the AI responses, etc. So, AI tools will be trained, released for production (think subscription-based websites), and then continue to gather information to learn and improve…forever. Who owns all that data? In 2020 an investment firm purchased a controlling interest in Ancestry.com. Did they also purchase the right to use all the DNA data that the company had amassed? What did the terms of service say, and did users even read them or just click through on their way to learning about their family tree? What if I am comfortable providing my data, or permitting my student/child to provide it, because the educational AI company is based entirely in North Carolina? Then, years after my daughter is out of college, that company is purchased by a company from Russia, China or even Canada. There goes my kid's data, and not just any data, but data from a period in their life fraught with all the idea formation and misfiring certainty that accompanies adolescence. Yikes.
One thing the law could do is require companies creating AI tools to disclose with specificity the data that was used to train the tool.1 In addition, the law could require those companies to disclose with specificity the type of data they gather from users and what rights users have "to be forgotten," as in having their data removed from the tool at their request. Perhaps also require informing all current and past customers any time the company is under contract to be purchased by another. The complexity is geometric at some point.
As AI tools are likely to replace traditional Google searching, this means that the data these tools retain will be the sum total of all the ideas you ever had which were important enough to conduct an online search to learn more about. (Someone looking at my ChatGPT searches is going to learn a lot about Vitamix recipes and python coding).
The law could also require AI tool creators to affirm that they are not relying on currently illegal information to train their models. Seems like an easy fix. Just don't have your data gathering/web scraping gather anything that is illegal. Were it so easy. The automation required to efficiently gather the training data necessarily means that the creation process involves netting all the tuna and accidentally catching a bunch of dolphins (maybe even sharks). The filtering of any contraband content comes after the data is gathered. There is currently no reliable means of previewing data and filtering it before it is gathered. The gathering process has to be one in which the scraping tool first captures the data and then, perhaps, submits it to a filter with some dictionary of contraband material to check against. This means, likely, that all the AI tool creators in the areas of images, text, etc. have already gathered contraband content and only after noticing it performed whatever functions were needed to wipe that data from their systems before training.
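The "net everything, filter afterward" workflow can be sketched as follows. This is a minimal, hypothetical pipeline (`fetch_page`, `BLOCKLIST`, and the sample pages are all illustrative, not any vendor's real scraper); the structural point is that the contraband check can only run on text that has already been copied:

```python
# Hypothetical scraping pipeline illustrating why filtering happens AFTER
# gathering: a scraper must copy a page before it can inspect its contents.

BLOCKLIST = {"contraband-term-a", "contraband-term-b"}  # stand-in dictionary

SAMPLE_PAGES = {  # stand-ins for pages on the open web
    "page-1": "an ordinary article about gardening",
    "page-2": "a page containing contraband-term-a somewhere",
}

def fetch_page(url: str) -> str:
    # Placeholder for an HTTP fetch; by the time this returns, the raw
    # text has already been copied into the scraper's storage.
    return SAMPLE_PAGES[url]

def is_contraband(text: str) -> bool:
    # The check can only run on text we have already gathered.
    return any(term in text.lower() for term in BLOCKLIST)

def build_training_corpus(urls: list[str]) -> list[str]:
    corpus = []
    for url in urls:
        text = fetch_page(url)   # step 1: net everything (tuna and dolphins)
        if is_contraband(text):  # step 2: evaluate after the copy exists
            continue             # step 3: discard the dolphins
        corpus.append(text)
    return corpus

print(build_training_corpus(["page-1", "page-2"]))
# prints ['an ordinary article about gardening']
```

Any legislation mandating that contraband never be collected in the first place would, in effect, demand that the filter run before step 1, which is exactly what this architecture cannot do.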
The dark web and the application to AI there looms. Law enforcement is going to want tools that can effectively gather data here uncensored to enable efficient searching and investigation of cases now and in the future. Companies will have to build uncensored tools that knowingly gather contraband information to be most effective for law enforcement use.
I am at a loss as to how legislation could be written that would mandate AI tool creators not collect contraband initially. This does not mean legislators will not make attempts. But, merely consulting with developers and those versed in the training of AI tools will educate them that what they might prohibit in legislation is practically impossible for AI tool makers to comply with. The result? No more AI development. Why is this the case?
To know something is contraband is to first copy the data and then evaluate it. This is akin to the concerns about minors accessing legal adult content online. The companies providing that content have rightly argued that there is no foolproof way to confirm a viewer of their content is 18 years of age or older. The only means available would effectively close those businesses. Think requiring users to send a copy of their driver’s license or birth certificate to establish their age prior to accessing the content. Now consider how many people would rightly think, “uh, I am not supplying my identity to this site, no thanks.” If the First Amendment is going to continue to protect the distribution of adult content to adults, this problem will be insoluble.
Likewise, in the world of gathering training data for AI, copying it first, then evaluating is the only current method to accomplish this task.
In The End, Splintering
In the current era, we are seeing the dominance and popularity of LLMs (Large Language Models) and image creation tools in the hands of a few large (or soon-to-be-large) companies. That is unlikely to be the future. The future is likely to be a splintering of AI tools into ever more discrete, curated products targeted at slices of the larger market. Just like vehicles, restaurants, bicycles, clothes…basically everything, the one-size-fits-all is disfavored and the "just for me" sensation is king. So it is likely to be with AI tools. Consider the AI tutor. As a parent, you are going to want as much information as possible about what data that tool was trained on and what has been filtered, suppressed or promoted within the tool, and to consider how that matches your philosophical instinct regarding what a well-rounded education means. Are conservatives likely to subscribe to an AI tutor for their child trained on the best minds at Harvard? Likely not. Are liberals likely to purchase an AI tutor trained on content that includes Thomas Sowell, Adam Smith and Dennis Prager? Likely not. But there are likely viable markets for both of those AI tutors, tuned in those ways transparently. The reality is, however, that a single company can pretty easily gather a set of data and then develop AI tutors from that data which cater to multiple slices of the market, simply by filtering training and output content for a particular model consistent with the desires of those it is marketing to.
This kind of market splintering, initially, will undoubtedly push toward the transparency that legislation like that posited above would mandate. The market itself is likely to dictate whether consumers want more or less transparency about how the various AI models they consume are trained and how their output is filtered.
The interesting question is whether state and federal legislators will seek through legislation or regulatory agencies to impose censorship justified by reasons of national security, etc. A recent district court TRO decision dealing with claims that the federal government influenced social media companies to censor user information (recently overturned by the Fifth Circuit Court of Appeals) is an example of these exceptions. While the stories surrounding that decision debated whether the federal government overstepped or not, what was largely a passing mention was that even in that decision the court seemed to recognize that censorship by the government might be permissible in the areas such as national security and voting integrity.
We are already working on next week's post, which will be a roundup of current local, state, federal and regulatory movement toward developing AI regulation. Should AI regulation be centralized and unified in the federal government, or enabled by each state as it sees fit for its residents? How does the U.S. negotiate, or not, with its allies to ensure that international businesses (Google, Facebook, Twitter, Apple) are not regulated by the most restrictive European Union instincts, resulting in the U.S. market only being able to access that which the European Union deems permissible? Check out next week and see what you think then.
1. The pending class action lawsuits against OpenAI, Microsoft and others, if they survive the motions to dismiss, will undoubtedly require the disclosure of just this information in discovery.