Google DeepMind introduced the Gemini Robotics-ER 1.5 and Gemini Robotics 1.5 AI models.
The Gemini Robotics-ER 1.5 model is available to developers via the Gemini API in Google AI Studio
Google DeepMind on Thursday unveiled two new artificial intelligence (AI) models in the Gemini Robotics family. Dubbed Gemini Robotics-ER 1.5 and Gemini Robotics 1.5, the two models work in tandem to power general-purpose robots. Compared to the Mountain View-based tech giant's earlier embodied AI models, they offer improved reasoning, vision, and action capabilities across a range of real-world scenarios. The ER 1.5 model is designed to act as the planner or orchestrator, whereas the 1.5 model carries out tasks based on natural language instructions.
In a blog post, DeepMind introduced and detailed the two new Gemini Robotics models, which are designed for general-purpose robots operating in the physical world. Generative AI technology has brought about a major breakthrough in robotics, replacing traditional programming interfaces with natural language instructions as the way to communicate with a robot.
However, when it comes to using AI models as the brain of a robot, many challenges remain. For instance, large language models themselves struggle to understand spatial and temporal dimensions or to make precise movements for objects of different shapes. This is largely because a single AI model has typically been responsible for both devising a plan and executing it, making the process error-prone and slow.
Google's solution to this problem is a two-model setup. Here, Gemini Robotics-ER 1.5, a vision-language model (VLM) with advanced reasoning and tool-calling capabilities, creates multi-step plans for a task. The company says the model excels at making logical decisions within physical environments and can natively call tools such as Google Search to fetch information. It is also said to achieve state-of-the-art (SOTA) performance on various spatial understanding benchmarks.
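Since the ER model is accessible to developers through the standard Gemini API (detailed later in this article), a planning request with search grounding might look roughly like the sketch below. The model identifier, the prompt, and the use of the Google Search grounding tool are illustrative assumptions based on the public Gemini API and google-genai Python SDK, not details confirmed by DeepMind.

```python
# Illustrative sketch only: the model name and prompt are assumptions,
# using the publicly documented google-genai Python SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # key from Google AI Studio

# Ask the orchestrator model for a multi-step plan, letting it ground
# its answer with Google Search (a standard Gemini API tool).
response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed identifier
    contents=(
        "Plan the steps needed to sort the items on this table into "
        "compost, recycling and trash bins, following local guidelines."
    ),
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)  # a step-by-step plan the action model could follow
```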
Once a plan has been created, Gemini Robotics 1.5 springs into action. The vision-language-action (VLA) model turns visual information and instructions into motor commands, enabling a robot to perform tasks. It first thinks through the most efficient path to completing an action and then executes it. It can also explain its thinking process in natural language, bringing more transparency to its behaviour.
Google claims this system will allow robots to better understand complex, multi-step commands and then execute them in a single flow. For instance, if a user asks a robot to sort multiple objects into the correct compost, recycling, and trash bins, the AI system can first search the Internet for local recycling guidelines, analyse the objects in front of it, make a plan to sort them, and then execute the action.
Notably, the tech giant says the AI models were designed to work in robots of various shapes and sizes, owing to their strong spatial understanding and broad adaptability. Currently, the orchestrator, Gemini Robotics-ER 1.5, is available to developers via the Gemini application programming interface (API) in Google AI Studio. The VLA model, on the other hand, is only available to select partners.
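For developers trying the ER model from Google AI Studio, a minimal spatial-understanding query over an image could look like the following sketch; the model identifier, image file, and prompt are assumed here purely for illustration.

```python
# Minimal sketch of a spatial query; the model identifier, image and
# prompt are assumptions, using the google-genai Python SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("workbench.jpg", "rb") as f:  # hypothetical scene image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed identifier
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "List the graspable objects in this scene and give the location "
        "of each one as normalised [y, x] image coordinates.",
    ],
)
print(response.text)
```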