Voice cloning is a technology that allows the creation of a computer-generated replica of an individual's voice. As you might expect, AI and machine learning models do the heavy lifting in producing these clones. The models are trained to capture the unique tonal, rhythmic, and expressive qualities of a person's speech, enabling the synthesis of new audio recordings that are, in many cases, indistinguishable from the original speaker. The tool can be used to record content in your own voice and have that recording rendered in the voice of the person whose clone you have created. It can also be used in real time: connected to a phone, for example, it lets a person speak in their own voice while the voice, intonation, accent, etc. heard by the other person on the phone are those of the cloned voice, not the actual speaker.
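To make that concrete, here is a minimal sketch of zero-shot voice cloning. It assumes the open-source Coqui TTS library and its XTTS model (my choice for illustration; several other open-source models work the same way), and the file names are hypothetical:

```python
# A minimal sketch of zero-shot voice cloning, assuming the open-source
# Coqui TTS library (pip install TTS) and its XTTS voice-cloning model.
from TTS.api import TTS

# Load a multilingual voice-cloning model (weights download on first run).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# "speaker.wav" is a hypothetical reference recording of the target voice;
# a short, clean sample is enough to produce a recognizable clone.
tts.tts_to_file(
    text="This sentence was never actually spoken by the person you hear.",
    speaker_wav="speaker.wav",
    language="en",
    file_path="cloned_output.wav",
)
```

That is the entire pipeline: one reference recording in, arbitrary speech in that voice out.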
But Why?
What could possibly be useful or even utilitarian about cloning my own voice, or anyone else’s for that matter? The applications for voice cloning range from customer service automation and virtual personal assistants to more specialized uses such as dubbing films, narrating audiobooks, and even restoring speech to those who have lost the ability to speak due to illness or injury.
Apple’s upcoming iOS upgrade will allow phone owners to record just 15 minutes of their own voice to create an on-phone clone they can use.
While voice cloning opens new frontiers in human-machine interaction, particularly for users with disabilities, it also raises important ethical and legal questions, especially when considering its potential uses in criminal activity and law enforcement.
Why Now?
Surely this concept is not new in 2023, or even 2000, so why is it now having its moment? While the idea of cloning someone’s voice is as old as the impersonators who made careers in comedy, Vegas, film, and television doing just that, it is only in the last few years that machine learning models have been developed that make cloning a voice accessible to the average software developer. Not too long ago, creating a convincing voice clone required vast amounts of data and extensive time commitments, making it largely inaccessible for mainstream use. Today's state-of-the-art algorithms, however, can produce highly accurate voice clones with significantly less data and in much shorter timeframes. This accelerated pace of development is democratizing access to voice cloning technology, but it also raises the stakes in terms of its potential misuse. As the technology becomes more sophisticated and accessible, it becomes increasingly crucial to examine its ethical implications and to develop safeguards against its malicious applications. What is the theme of this blog? Trade-offs.
That Whole Sword Thing and Its Two Edges
Voice cloning technology offers benefits and risks. On one hand, it holds the promise of revolutionizing industries and improving quality of life: offering personalized customer service experiences, aiding those with speech impairments, and opening new applications in entertainment. On the other hand, the technology poses significant ethical and security concerns. Its potential for misuse is vast (just like deepfake videos), ranging from identity theft and financial fraud to disinformation campaigns and cyberbullying. The very attributes that make voice cloning groundbreaking—its accuracy and accessibility—also make it a potent tool for nefarious activities.
Who Is That, Was It You?
Voice cloning technology is a powerful tool for identity theft, enabling criminals to impersonate individuals in the most convincing manner possible: sounding exactly like them in conversation. In scenarios where voice recognition is used for authentication—such as in customer service helplines for banks or social services—a cloned voice could grant unauthorized access to sensitive personal information and financial resources. In a world increasingly reliant on multi-factor authentication, voice cloning could emerge as a stealthy technique for bypassing one of these crucial layers of security, adding another vector for identity theft that is hard to detect and counter. The “con” in the familiar terms “con man” and “con job” is an abbreviation of the word “confidence.” What could make the recipient of a phone call more confident about divulging private information than hearing the caller’s voice and recognizing it as that of a business colleague, supervisor, or longtime friend?
Scammers can employ this technology to mimic the voices of trusted figures in a person's life, such as family members or business associates, to deceive them into transferring money or divulging confidential information. Picture a scenario where an elderly individual receives a call from someone who sounds exactly like their grandchild, claiming to be in an emergency and in need of immediate financial assistance. The emotional and psychological impact of hearing a loved one's voice could overpower rational skepticism, making these kinds of scams highly effective. Who is liable, in addition to the criminals themselves? Is it the person who invented the technology enabling the deception, or who released it as open source?
The World of Complete and Plausible Deniability
The impact of voice cloning technology extends beyond financial crimes to the sphere of public information and discourse. Voice cloning can be utilized to create audio deepfakes—fraudulent audio recordings that can be paired with manipulated video to create convincingly realistic, yet entirely fake, messages from public figures. Such deepfakes can be weaponized to spread disinformation, sway public opinion, and even influence elections. Widespread awareness of the sophistication of such technology will also invite denials from people caught saying something they regret. How are we, the public, to know whether candidate A actually said the unflattering thing that is only heard on an audio file criss-crossing the Internet? Much has been written about the erosion of public trust in institutions since the Covid crisis emerged. Just imagine the acceleration of that erosion when no one admits to anything unflattering, even when their apparent voice is heard saying it.
What Evidence You Got?
Well, Evidence Rule 901 has a subsection, 901(b)(5), that is about to get a lot more attention.
(5) Opinion About a Voice. An opinion identifying a person’s voice — whether heard firsthand or through mechanical or electronic transmission or recording — based on hearing the voice at any time under circumstances that connect it with the alleged speaker.
Really? So, if a witness is persuaded that the voice on a recording belongs to someone they know and whose voice they can recognize, that evidence is admissible. Uh, well, that’s going to have to be updated. Who is working on this rule of evidence? I suspect no one, but please alert me if I am wrong on that.
Subsection (b)(1) of the rule also provides a yawning opening for just about anyone with malicious intent to roll right through.
(1) Testimony of a Witness with Knowledge. Testimony that an item is what it is claimed to be.
Using that rule, I can listen to a recording of a voice I know and declare under oath, “Yes, that is [fill in name of person whose voice I can identify]’s voice.” But with voice cloning, I am as likely to be wrong as right. It’s a coin flip, and the rules of evidence have not evolved to handle that reality. Consider the stakes in a murder case: justice for the victim, and for the defendant should he be wrongfully charged. Do any of us feel comfortable with a judge simply admitting a critical audio recording, one that could push the jury’s deliberations in either direction, based solely on a person’s testimony that they can recognize the voice?
Police Can Lie… What Other Deception Is Allowed?
In Frazier v. Cupp, 394 U.S. 731 (1969), the Supreme Court permitted the admission of evidence in a murder case that was obtained, in part, by interrogators lying to the suspect. The decision does not outline which forms of deception are inherently permitted. Later cases, however, have interpreted it as affirmatively permitting deception by interrogators and others to extract confessions or damaging admissions from suspects and witnesses.
It’s not a leap to imagine what voice cloning brings to law enforcement’s arsenal. I have described the scenario I am about to posit to several non-lawyer friends, and their universal response is, “that isn’t legal.” When pressed, however, they had no cases or statutes to support that notion. They all simply, reflexively thought this was a bridge too far: permitting law enforcement to use voice cloning to deceive suspects or witnesses. It’s coming. Without a doubt, it will be deployed, then challenged, and the Supreme Court will eventually be asked to accept a case to resolve it. Let the creative writer in me give you the scenario.
To Catch A Murderer
Police have a murder. They have a suspect. Their evidence is circumstantial but sufficient to arrest and detain the suspect. He is in the county jail awaiting trial. As with all such jails, phone calls received by inmates are recorded, and all parties to such calls are informed ahead of time. Prosecutors request and receive regular data dumps of whom the suspect is speaking with and their phone numbers, and then listen to the calls to determine each caller's relationship to the inmate. What do they have after the suspect has been in the jail for a month or two? They have data. Specifically, they have far more than 15 minutes of voice data from that suspect.
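As a rough illustration of what could be done with that data, here is a hedged sketch: it stitches hypothetical call recordings into one reference sample and generates new speech in the suspect’s voice. The file names are invented, and I am again assuming the open-source Coqui TTS library (with pydub for the audio splicing) stands in for whatever tool an investigator might actually use:

```python
# A hedged sketch, not a real investigative tool: concatenate recorded
# call audio into one reference file, then clone the voice with an
# open-source model (Coqui TTS's XTTS, as in the earlier example).
from pydub import AudioSegment  # pip install pydub
from TTS.api import TTS         # pip install TTS

# Hypothetical recordings of the suspect's side of monitored jail calls.
calls = ["call_2023_06_01.wav", "call_2023_06_08.wav", "call_2023_06_15.wav"]

# Stitch the calls into a single reference sample of the target voice.
reference = sum((AudioSegment.from_wav(c) for c in calls), AudioSegment.empty())
reference.export("suspect_reference.wav", format="wav")

# Zero-shot cloning: arbitrary new speech rendered in the suspect's voice.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hey, what's up? I don't have a lot of time.",
    speaker_wav="suspect_reference.wav",
    language="en",
    file_path="impersonation.wav",
)
```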
Some IT nerd (maybe one who writes a blog about the law and AI) is contacted to create a voice clone using that data. It’s a trivially easy task for a developer using freely available open source tools. That developer creates a tool enabling an investigator to place calls to some or all of those incoming callers, with his voice, intonation, etc. sounding exactly like the suspect, even using keywords or slang the suspect was recorded using during the calls. The recipient of one of those calls, believing they are speaking to the suspect, responds in the following way:
Investigator (sounding exactly like the suspect): “Hey, what’s up?”
Suspect’s girlfriend: “I’m good.”
Investigator: “Things are good here, I got someone’s cell phone. I don’t have a lot of time, I don’t want to get caught using it.”
Girlfriend: “Okay.”
Investigator: “I need you to do something for me.”
Girlfriend: “Okay, what do you need?”
Investigator: “Remember where I told you I hid that thing, that thing from the case?” (Referring to the murder weapon)
Girlfriend: “Yeah.”
Investigator: “I need you to go there and get it for me.”
Girlfriend: “And do what with it?”
Investigator: “Take it down by E. 9th. Walk to the edge there and throw it as hard as you can into the lake. Look, I have to go. Please do this for me. I will try and call you tomorrow.”
Girlfriend: “Okay, babe, I got you.”
Law enforcement, who are surveilling the girlfriend, simply follow her to where she knows the murder weapon was disposed of, and when she parks near the lake to throw it into the water, they grab her up. Not only is she now an accessory after the fact, but they have just solved their otherwise shaky case: they have the murder weapon, proof the suspect committed the crime with it, and proof his girlfriend knew where he disposed of it.
The defense attorney, later, at a pre-trial hearing seeking to exclude that evidence from the trial: “That is, the way they obtained this evidence, the method is just, well, it’s just not right, your honor.” That may be the most cogent argument available. The girlfriend was not the suspect. She was not represented by counsel. And if the jurisdiction requires both parties’ consent to record a call, the investigator can simply not record the call he places while posing as the suspect. This day is coming, if it is not already here and we just have not heard of the case yet.
Undercover Voice
Voice cloning technology could offer a significant advantage to law enforcement in undercover operations. Traditional undercover work comes with a host of dangers, including the risk of exposure or identity discovery. With voice cloning, agents could convincingly assume the vocal identity of individuals within a criminal organization, thereby gaining access to secured communications or confidential plans without arousing suspicion. This not only increases the likelihood of successful infiltration but also substantially mitigates the risks faced by undercover officers, enhancing both efficiency and safety.
In hostage situations, time and psychological nuance are of the essence. Voice cloning could offer an unprecedented negotiation tool by mimicking a kidnapper's voice. By using cloned voices to simulate conversations between kidnappers, or to issue fabricated demands or statements, law enforcement could confuse or mislead the criminals, gaining tactical advantages that could help safely secure the hostages' release. The use of voice cloning in such high-stakes situations could offer a unique set of possibilities for resolving crises effectively.
Training
Training exercises are crucial for preparing law enforcement personnel for the complexity and unpredictability of real-world operations. Voice cloning technology could be used to simulate various scenarios involving criminals, thereby providing a more immersive and realistic training environment. For example, a training program could incorporate voice-cloned dialogues from actual past incidents to recreate high-pressure situations like hostage negotiations or terrorist incidents. By training with these highly realistic scenarios, law enforcement agencies could significantly improve the preparedness and efficacy of their personnel.
Say 'I Do' Before 'I Clone'
While voice cloning has the potential to revolutionize law enforcement techniques, it also opens a Pandora's box of ethical concerns. The very act of cloning someone's voice without explicit consent could be seen by the courts as an infringement on personal privacy and autonomy. Additionally, the use of voice cloning in sensitive situations, such as hostage negotiations, raises questions about the ethical boundaries of deception. Could misleading a criminal through voice cloning be considered entrapment or coercion?
As noted above with Evid.R. 901, the existing legal landscape is not equipped to govern the intricate nuances of voice cloning technology, especially when deployed by law enforcement agencies. Legislators would need to craft specific laws that address the consent required for voice cloning, as well as the permissible contexts in which these cloned voices can be used. Rules would also need to be established concerning the admissibility of voice-cloned evidence in court. Given the international reach of many criminal activities, international treaties and agreements may also be required to govern cross-border usage of voice cloning in law enforcement.
There Is No Conclusion
As with so many considerations of the collision between rapidly advancing technologies like voice cloning and the law, there is no basis to say, “this is what it is going to be like,” or even, “this is how courts will interpret this.” None of us know. We just know that the current legal framework is not ready for this technology (or many others). That void of rules means that people (you, me, law enforcement, everyone) are going to be guided by their own ethics and situational imperatives and do what they do. Courts (and perhaps legislators) will eventually weigh in, but before they do, a whole host of cases will be decided, some wrongly and with the worst of consequences, amidst disputes over the admissibility, or not, of evidence that turns out to be a voice clone or to be derived from deceiving someone with a voice clone.
We are not on the brink of this technology; it is already here. As noted above, creating a voice clone is so trivial that Apple will have the technology embedded in millions of iPhones, likely within weeks of this post. That means the cost for others to implement the same technology will be close to zero as well. This is not merely a technological issue but a societal one, affecting everything from individual privacy to national security. As the lines between genuine and cloned voices blur, the challenges for both law enforcement and the public will only grow more complex.