Audio deepfake


Audio deepfake technology, also referred to as voice cloning or deepfake audio, is an application of artificial intelligence designed to generate speech that convincingly mimics specific individuals, often synthesizing phrases or sentences they have never spoken. Initially developed with the intent to enhance various aspects of human life, it has practical applications such as generating audiobooks and assisting individuals who have lost their voices due to medical conditions. Additionally, it has commercial uses, including the creation of personalized digital assistants, natural-sounding text-to-speech systems, and advanced speech translation services.

Incidents of fraud

Audio deepfakes, referred to as audio manipulations since the early 2020s, have become widely accessible using simple mobile devices or personal computers. These tools have also been used to spread misinformation via audio, raising cybersecurity concerns among the global public about their side effects, including their possible role in disseminating misinformation and disinformation on audio-based social media platforms. They can also serve as a logical-access voice-spoofing technique, used to manipulate public opinion for propaganda, defamation, or terrorism. Vast amounts of voice recordings are transmitted over the Internet daily, and spoofing detection is challenging. Audio deepfake attackers have targeted individuals and organizations, including politicians and governments.
In 2019, scammers using AI impersonated the voice of the CEO of a German energy company and directed the CEO of its UK subsidiary to transfer €220,000. In early 2020, the same technique was used to impersonate a company director as part of an elaborate scheme that convinced a bank branch manager to transfer $35 million.
According to a 2023 global McAfee survey, one person in ten reported having been targeted by an AI voice cloning scam; 77% of these targets reported losing money to the scam. Audio deepfakes could also pose a danger to voice ID systems currently used by financial institutions. In March 2023, the United States Federal Trade Commission issued a warning to consumers about the use of AI to fake the voice of a family member in distress asking for money.
In October 2023, during the start of the British Labour Party's conference in Liverpool, an audio deepfake of Labour leader Keir Starmer was released that falsely portrayed him verbally abusing his staffers and criticizing Liverpool. That same month, an audio deepfake of Slovak politician Michal Šimečka falsely claimed to capture him discussing ways to rig the upcoming election.
During the campaign for the 2024 New Hampshire Democratic presidential primary, over 20,000 voters received robocalls from an AI-generated impersonation of President Joe Biden urging them not to vote. The New Hampshire attorney general said the calls violated state election laws and alleged the involvement of Life Corporation and Lingo Telecom. In February 2024, the United States Federal Communications Commission banned the use of AI-faked voices in robocalls. That same month, political consultant Steve Kramer admitted that he had commissioned the calls for $500, saying he wanted to call attention to the need for rules governing the use of AI in political campaigns. In May, the FCC said that Kramer had violated federal law by spoofing the number of a local political figure and proposed a fine of $6 million. Grand juries in four New Hampshire counties indicted Kramer on felony counts of voter suppression and a misdemeanor count of impersonating a candidate.

Categories

Audio deepfakes can be divided into three different categories:

Replay-based

Replay-based deepfakes are malicious works that aim to replay a recording of the interlocutor's voice.
There are two types of attack: far-field and cut-and-paste. In a far-field attack, a microphone recording of the victim is replayed as a test segment over a hands-free phone. In a cut-and-paste attack, recorded fragments are spliced together to fake the sentence requested by a text-dependent verification system. Text-dependent speaker verification can therefore be used to defend against replay-based attacks. A current technique for detecting end-to-end replay attacks is the use of deep convolutional neural networks, as sketched below.
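
As a concrete illustration, the following is a minimal sketch (in Python, using PyTorch) of a small convolutional network of the kind used for end-to-end replay detection on spectrogram inputs. The architecture, layer sizes, and input shape are illustrative assumptions, not a specific published model.

    import torch
    import torch.nn as nn

    class ReplayDetector(nn.Module):
        """Toy CNN that classifies a log-spectrogram as genuine or replayed."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1),  # spectrogram as a 1-channel image
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),  # pool to one value per channel
            )
            self.classifier = nn.Linear(32, 2)  # two classes: genuine vs. replayed

        def forward(self, spec):  # spec: (batch, 1, freq_bins, time_frames)
            h = self.features(spec).flatten(1)
            return self.classifier(h)

    model = ReplayDetector()
    logits = model(torch.randn(8, 1, 128, 300))  # dummy batch of log-spectrograms
    print(logits.shape)  # torch.Size([8, 2])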

Synthetic-based

The synthetic-based category refers to speech synthesis, the artificial production of human speech using software or hardware systems. Speech synthesis includes text-to-speech, which aims to transform text into acceptable, natural-sounding speech in real time, making the speech match the text input according to the rules of the linguistic description of the text.
A classical system of this type consists of three modules: a text analysis model, an acoustic model, and a vocoder. Generation usually follows two essential steps. First, clean and well-structured raw audio must be collected together with the transcribed text of the original speech. Second, the text-to-speech model must be trained on these data to build a synthetic audio generation model.
Specifically, the transcribed text with the target speaker's voice is the input of the generation model. The text analysis module processes the input text and converts it into linguistic features. The acoustic module then extracts the parameters of the target speaker from the audio data, based on the linguistic features generated by the text analysis module. Finally, the vocoder learns to create vocal waveforms from the acoustic parameters, and the final synthetic audio file is generated in waveform format, capable of producing speech in the voice of many speakers, even those not seen in training.
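
To make the three-module structure concrete, here is a minimal structural sketch in Python. All class names are hypothetical placeholders that stub out the real text analysis, acoustic modeling, and vocoding stages; this is not a working TTS implementation.

    from dataclasses import dataclass

    @dataclass
    class LinguisticFeatures:
        phonemes: list  # placeholder for phoneme/prosody annotations

    class TextAnalysisModule:
        def process(self, text):
            # Real systems perform text normalization and grapheme-to-phoneme
            # conversion; splitting into characters stands in for that here.
            return LinguisticFeatures(phonemes=list(text.lower()))

    class AcousticModule:
        def predict(self, feats, speaker_embedding):
            # Maps linguistic features plus speaker identity to acoustic
            # parameters (e.g. a mel-spectrogram). Stubbed with zeros.
            n_frames = 5 * len(feats.phonemes)
            return [[0.0] * 80 for _ in range(n_frames)]  # 80 mel bins per frame

    class Vocoder:
        def synthesize(self, acoustic_params):
            # A neural vocoder (e.g. WaveNet) turns acoustic parameters into
            # a raw waveform; stubbed here as silence, 256 samples per frame.
            return [0.0] * (len(acoustic_params) * 256)

    def tts(text, speaker_embedding):
        feats = TextAnalysisModule().process(text)
        params = AcousticModule().predict(feats, speaker_embedding)
        return Vocoder().synthesize(params)

    waveform = tts("hello world", speaker_embedding=[0.1] * 256)
    print(f"{len(waveform)} samples generated")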
The first breakthrough in this regard was WaveNet, a neural network for generating raw audio waveforms capable of emulating the characteristics of many different speakers. Over the years, it has been surpassed by other systems that synthesize highly realistic artificial voices and put the technology within everyone's reach.
Text-to-speech depends heavily on the quality of the voice corpus used to build the system, and creating an entire voice corpus is expensive. Another disadvantage is that speech synthesis systems do not recognize periods or special characters. Ambiguity problems also persist, since two words written the same way can have different meanings and pronunciations (for example, "read" in the present versus the past tense).

Imitation-based

An imitation-based audio deepfake transforms an original speech signal from one speaker (the original) so that it sounds as if spoken by another speaker (the target). An imitation-based algorithm takes a spoken signal as input and alters it by changing its style, intonation, or prosody, trying to mimic the target voice without changing the linguistic information. This technique is also known as voice conversion.
This method is often confused with the previous synthetic-based method, as there is no clear separation between the two approaches regarding the generation process. Both methods modify the acoustic-spectral and style characteristics of the speech signal, but the imitation-based approach usually keeps the input and output text unaltered: the result is obtained by changing how the sentence is spoken to match the target speaker's characteristics.
Voices can be imitated in several ways, for example by using human impersonators with similar voices who can mimic the original speaker. In recent years, the most popular approach has involved particular neural networks called generative adversarial networks (GANs), owing to their flexibility and high-quality results.
The original audio signal is then transformed by the imitation generation method into new speech that carries the same linguistic content but is rendered in the target speaker's voice.
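
As an illustration of the GAN formulation, below is a minimal sketch (in Python, using PyTorch) of a generator/discriminator pair operating on mel-spectrogram frames. The shapes, layer sizes, and training step are illustrative assumptions and do not reproduce any specific published voice-conversion architecture.

    import torch
    import torch.nn as nn

    N_MELS = 80  # mel-spectrogram bins (assumed)

    class Generator(nn.Module):
        """Maps source-speaker spectrogram frames to target-style frames."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(N_MELS, 256, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(256, N_MELS, kernel_size=5, padding=2),
            )
        def forward(self, x):  # x: (batch, N_MELS, frames)
            return self.net(x)

    class Discriminator(nn.Module):
        """Scores whether a spectrogram sounds like the target speaker."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(N_MELS, 128, kernel_size=5, padding=2),
                nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1),
                nn.Flatten(),
                nn.Linear(128, 1),
            )
        def forward(self, x):
            return self.net(x)

    G, D = Generator(), Discriminator()
    bce = nn.BCEWithLogitsLoss()

    src = torch.randn(4, N_MELS, 100)  # dummy batch of source-speaker frames
    tgt = torch.randn(4, N_MELS, 100)  # dummy batch of target-speaker frames

    fake = G(src)
    # Discriminator learns to separate real target speech from converted speech.
    d_loss = bce(D(tgt), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
    # Generator tries to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(4, 1))
    print(d_loss.item(), g_loss.item())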

Detection methods

The audio deepfake detection task determines whether the given speech audio is real or fake.
Recently, this has become a hot topic in the forensic research community, which is trying to keep up with the rapid evolution of counterfeiting techniques.
In general, deepfake detection methods can be divided into two categories based on the aspect they leverage to perform the detection task. The first focuses on low-level aspects, looking for artifacts introduced by the generators at the sample level. The second instead focuses on higher-level features representing more complex aspects, such as the semantic content of the speech recording.
Many machine learning models have been developed using different strategies to detect fake audio. Most of the time, these algorithms follow a three-step procedure (a minimal sketch of the pipeline follows the list):
  1. Each speech audio recording must be preprocessed and transformed into appropriate audio features;
  2. The computed features are fed into the detection model, which performs the necessary operations, such as the training process, essential to discriminate between real and fake speech audio;
  3. The output is fed into the final module to produce a prediction probability for the Fake class or the Real one. Following the ASVspoof challenge nomenclature, fake audio is indicated with the term "Spoof," while real audio is called "Bonafide."
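A minimal sketch of this three-step pipeline, assuming MFCC features and a generic scikit-learn classifier (both illustrative choices, not a specific published detector), might look as follows; the file names and labels are hypothetical placeholders.

    import numpy as np
    import librosa
    from sklearn.ensemble import GradientBoostingClassifier

    def extract_features(path):
        # Step 1: preprocess the recording into a fixed-size feature vector.
        audio, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
        return mfcc.mean(axis=1)  # average over time -> 20-dim vector

    # Hypothetical labelled corpus.
    train_paths = ["bonafide_0.wav", "spoof_0.wav"]
    train_labels = [0, 1]  # 0 = Bonafide (real), 1 = Spoof (fake)

    # Step 2: train the detection model on the labelled recordings.
    X_train = np.array([extract_features(p) for p in train_paths])
    clf = GradientBoostingClassifier().fit(X_train, train_labels)

    # Step 3: produce a prediction probability for the Spoof class.
    prob_spoof = clf.predict_proba([extract_features("test.wav")])[0, 1]
    print(f"P(Spoof) = {prob_spoof:.2f}")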
Over the years, many researchers have shown that machine learning approaches are more accurate than deep learning methods, regardless of the features used. However, the scalability of machine learning methods is not confirmed, because training and manual feature extraction become excessively costly with many audio files. When deep learning algorithms are used instead, specific transformations are required on the audio files to ensure that the algorithms can handle them.
There are several open-source implementations of different detection methods, and usually many research groups release them on a public hosting service like GitHub.

Open challenges and future research direction

Audio deepfakes are a very recent field of research. For this reason, there are many possibilities for development and improvement, as well as possible threats that adopting this technology can bring to our daily lives. The most important ones are listed below.