Google’s generative video model Veo 3 has a subtitles problem

As soon as Google launched its latest video-generating AI model at the end of May, creatives rushed to put it through its paces. Released just months after its predecessor, Veo 3 allows users to generate sounds and dialogue for the first time, sparking a flurry of hyperrealistic eight-second clips stitched together into ads, ASMR videos, imagined film trailers, and humorous street interviews. Academy Award–nominated director Darren Aronofsky used the tool to create a short film called Ancestra. During a press briefing, Demis Hassabis, Google DeepMind’s CEO, likened the leap forward to “emerging from the silent era of video generation.” 

But others quickly found that in some ways the tool wasn’t behaving as expected. When it generates clips that include dialogue, Veo 3 often adds nonsensical, garbled subtitles, even when the prompts it’s been given explicitly ask for no captions or subtitles to be added.

Getting rid of them isn’t straightforward—or cheap. Users have been forced to resort to regenerating clips (which costs them more money), using external subtitle-removing tools, or cropping their videos to get rid of the subtitles altogether.

Josh Woodward, vice president of Google Labs and Gemini, posted on X on June 9 that Google had developed fixes to reduce the gibberish text. But over a month later, users are still logging issues with it in Google Labs’ Discord channel, demonstrating how difficult it can be to correct issues in major AI models.

Like its predecessors, Veo 3 is available to paying members of Google’s subscription tiers, which start at $249.99 a month. To generate an eight-second clip, users enter a text prompt describing the scene they’d like to create into Google’s AI filmmaking tool Flow, Gemini, or other Google platforms. Each Veo 3 generation costs a minimum of 20 AI credits, and the account can be topped up at a cost of $25 per 2,500 credits.

Mona Weiss, an advertising creative director, says that regenerating her scenes in a bid to get rid of the random captions is becoming expensive. “If you’re creating a scene with dialogue, up to 40% of its output has gibberish subtitles that make it unusable,” she says. “You’re burning through money trying to get a scene you like, but then you can’t even use it.”
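To put rough numbers on that, here is a back-of-the-envelope sketch using the credit prices quoted above. It assumes, purely for illustration, that every clip is generated at the 20-credit minimum, that credits are bought at the top-up rate, and that each generation independently has the same chance of being unusable. The function names are hypothetical, the subscription fee is ignored, and Weiss's "up to 40%" is treated as a worst case.

```python
# Figures quoted in the article.
CREDITS_PER_CLIP = 20      # minimum credits per Veo 3 generation
TOP_UP_PRICE_USD = 25.0    # price of one top-up
TOP_UP_CREDITS = 2_500     # credits included in one top-up


def cost_per_generation() -> float:
    """Dollar cost of a single generation at the top-up rate."""
    return CREDITS_PER_CLIP * TOP_UP_PRICE_USD / TOP_UP_CREDITS


def expected_cost_per_usable_clip(failure_rate: float) -> float:
    """Average spend per usable clip.

    Assumption: each attempt fails independently with probability
    failure_rate, so the expected number of attempts per usable clip
    is 1 / (1 - failure_rate).
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return cost_per_generation() / (1 - failure_rate)


if __name__ == "__main__":
    print(f"Per generation: ${cost_per_generation():.2f}")
    print(f"Per usable clip at 40% failure: ${expected_cost_per_usable_clip(0.4):.2f}")
```

Under those assumptions a single generation costs about 20 cents, and a 40% failure rate pushes the average cost of a usable clip to roughly 33 cents, before counting any retries made for other reasons.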

When Weiss reported the problem to Google Labs through its Discord channel in the hopes of getting a refund for her wasted credits, its team pointed her to the company’s official support team. They offered her a refund for the cost of Veo 3, but not for the credits. Weiss declined, as accepting would have meant losing access to the model altogether. Google Labs’ Discord support team has been telling users that subtitles can be triggered by speech, and that it is aware of the problem and working to fix it.

So why does Veo 3 insist on adding these subtitles, and why does it appear to be so difficult to solve the problem? It probably comes down to what the model has been trained on.  

Although Google hasn’t made this information public, that training data is likely to include YouTube videos, clips from vlogs and gaming channels, and TikTok edits, many of which come with subtitles. These embedded subtitles are part of the video frames rather than separate text tracks layered on top, meaning it’s difficult to remove them before they’re used for training, says Shuo Niu, an assistant professor at Clark University in Massachusetts who studies video sharing platforms and AI.

“The text-to-video model is trained using reinforcement learning to produce content that mimics human-created videos, and if such videos include subtitles, the model may ‘learn’ that incorporating subtitles enhances similarity with human-generated content,” he says.

“We’re continuously working to improve video creation, especially with text, speech that sounds natural, and audio that syncs perfectly,” a Google spokesperson says. “We encourage users to try their prompt again if they notice an inconsistency and give us feedback using the thumbs up/down option.”

As for why the model ignores instructions such as “No subtitles,” negative prompts (telling a generative AI model not to do something) are usually less effective than positive ones, says Tuhin Chakrabarty, an assistant professor at Stony Brook University who studies AI systems.

To fix the problem, Google would have to check every frame of each video Veo 3 has been trained on, and either get rid of or relabel those with captions before retraining the model—an endeavor that would take weeks, he says. 

Katerina Cizek, a documentary maker and artistic director at the MIT Open Documentary Lab, believes the problem exemplifies Google’s willingness to launch products before they’re fully ready. 

“Google needed a win,” she says. “They needed to be the first to pump out a tool that generates lip-synched audio. And so that was more important than fixing their subtitle issue.”  
