Google Unveils Gemini 1.5, Meta Introduces Predictive Visual Machine Learning Model V-JEPA

Google says Gemini 1.5 will have a limited version with a context window of up to 1 million tokens.

Written by Akash Dutta, Edited by David Delima | Updated: 16 February 2024 13:59 IST
Highlights
  • Google’s Gemini 1.5 model is built on Transformer and MoE architecture
  • It can process 1 hour of video or over 7,00,000 words in one go
  • Meta’s V-JEPA model helps machines learn by watching videos

Meta’s V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video

Photo Credit: Google

Google and Meta made notable artificial intelligence (AI) announcements on Thursday, unveiling new models with significant advancements. The search giant unveiled Gemini 1.5, an updated AI model that comes with long-context understanding across different modalities. Meanwhile, Meta announced the release of its Video Joint Embedding Predictive Architecture (V-JEPA) model, a non-generative method for teaching machine learning (ML) systems through visual media. Both models offer new ways of exploring AI capabilities. Notably, OpenAI also introduced its first text-to-video generation model, Sora, on Thursday.

Google Gemini 1.5 model details

Demis Hassabis, CEO of Google DeepMind, announced the release of Gemini 1.5 via a blog post. The newer model is built on the Transformer and Mixture of Experts (MoE) architecture. While it is expected to have different versions, currently, only the Gemini 1.5 Pro model has been released for early testing. Hassabis said that the mid-size multimodal model can perform tasks at a similar level to Gemini 1.0 Ultra, which is the company's largest generative model and is available via the Gemini Advanced subscription with the Google One AI Premium plan.
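The Mixture of Experts design mentioned above splits a model into several smaller "expert" subnetworks and routes each input through only the most relevant ones, so the full model's capacity grows without every parameter being active on every input. The toy sketch below (not Google's implementation; all names and shapes are illustrative) shows the core routing idea: a gate scores the experts, the top few are selected, and their outputs are blended by softmax weight.

```python
import numpy as np

def moe_layer(x, experts, gate_weights, top_k=2):
    """Illustrative MoE routing: score experts, keep the top-k, blend outputs."""
    scores = x @ gate_weights                        # one gate score per expert
    top = np.argsort(scores)[-top_k:]                # indices of the top-k experts
    weights = np.exp(scores[top])
    weights = weights / weights.sum()                # softmax over the selected experts
    # Only the selected experts run; the rest stay idle for this input
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d = 8
# Four toy "experts", each just a random linear map for demonstration
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(4)]
gate_weights = rng.standard_normal((d, 4))

x = rng.standard_normal(d)
y = moe_layer(x, experts, gate_weights)
print(y.shape)
```

The design choice this illustrates is sparsity: with `top_k=2` of four experts, only half the expert parameters do work for any given input, which is what lets MoE models scale capacity more cheaply than dense ones.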

The biggest improvement with Gemini 1.5 is its capability to process long-context information. The standard Pro version comes with a 1,28,000-token context window. In comparison, Gemini 1.0 had a context window of 32,000 tokens. Tokens can be understood as entire parts or subsections of words, images, videos, audio or code, which act as the building blocks a foundation model uses to process information. “The bigger a model's context window, the more information it can take in and process in a given prompt — making its output more consistent, relevant and useful,” Hassabis explained.
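To get a feel for what these window sizes mean in practice, a common rule of thumb (an assumption here, not a figure from Google — the real ratio depends on the tokenizer and the text) is that English prose averages roughly 0.75 words per token:

```python
def tokens_to_words(tokens, words_per_token=0.75):
    """Rough heuristic: ~0.75 English words per token (tokenizer-dependent)."""
    return int(tokens * words_per_token)

# Compare the windows discussed in the article
for window in (32_000, 128_000, 1_000_000):
    print(f"{window:>9} tokens ~ {tokens_to_words(window):,} words")
```

Under this heuristic, a 1-million-token window works out to about 7,50,000 words — consistent with Google's claim that the preview model can take in over 7,00,000 words in one go.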


Alongside the standard Pro version, Google is also releasing a special model with a context window of up to 1 million tokens. This is being offered to a limited group of developers and enterprise clients in a private preview. While there is no dedicated platform for it, it can be tried out via Google's AI Studio, a cloud console tool for testing generative AI models, and Vertex AI. Google says this version can process one hour of video, 11 hours of audio, codebases with over 30,000 lines of code, or over 7,00,000 words in one go.


Meta V-JEPA details

In a post on X (formerly known as Twitter), Meta publicly released V-JEPA. It is not a generative AI model, but a teaching method that enables ML systems to understand and model the physical world by watching videos. The company called it an important step towards advanced machine intelligence (AMI), a vision of one of the three 'Godfathers of AI', Yann LeCun.

In essence, it is a predictive analysis model that learns entirely from visual media. It not only understands what is going on in a video but can also predict what comes next. To train it, the company says it used a new masking technique in which parts of the video were masked in both time and space. This means some frames were removed entirely, while other frames had blacked-out fragments, forcing the model to predict both the current frame and the next one. As per the company, the model was able to do both efficiently. Notably, the model can predict and analyse videos of up to 10 seconds in length.
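The masking described above can be sketched in a few lines. The code below is an illustration of the general idea — dropping whole frames (time) and blacking out patches within surviving frames (space) — not Meta's actual V-JEPA pipeline; all function names, shapes and drop rates are assumptions for demonstration.

```python
import numpy as np

def mask_video(video, frame_drop=0.25, patch_drop=0.25, patch=4, seed=0):
    """Mask a (T, H, W) video in time and space; return the masked copy
    and a boolean mask marking the regions the model must predict."""
    rng = np.random.default_rng(seed)
    T, H, W = video.shape
    masked = video.astype(float).copy()
    mask = np.zeros(video.shape, dtype=bool)

    # Temporal masking: some frames are removed entirely
    dropped = rng.random(T) < frame_drop
    masked[dropped] = 0.0
    mask[dropped] = True

    # Spatial masking: black out random patches in the surviving frames
    for t in np.flatnonzero(~dropped):
        for i in range(0, H, patch):
            for j in range(0, W, patch):
                if rng.random() < patch_drop:
                    masked[t, i:i + patch, j:j + patch] = 0.0
                    mask[t, i:i + patch, j:j + patch] = True
    return masked, mask

# A tiny 16-frame, 8x8-pixel "video" of dummy values
video = np.arange(16 * 8 * 8, dtype=float).reshape(16, 8, 8)
masked, mask = mask_video(video)
print(mask.mean())  # fraction of pixels hidden from the model
```

Because the mask spans both dimensions, a model trained against it cannot rely on copying nearby pixels; it has to learn what typically happens across frames, which is the intuition behind learning "by watching" rather than by generating.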


“For example, if the model needs to be able to distinguish between someone putting down a pen, picking up a pen, and pretending to put down a pen but not actually doing it, V-JEPA is quite good compared to previous methods for that high-grade action recognition task,” Meta said in a blog post.

At present, the V-JEPA model only uses visual data, which means the videos do not contain any audio input. Meta is now planning to incorporate audio alongside video in the ML model. Another goal for the company is to extend the model's capabilities to longer videos.
