Google DeepMind recently announced Gemini, its new AI model that will compete with OpenAI's ChatGPT. While both models are examples of "generative AI," which learns patterns from its training data in order to generate new content (images, words or other media), ChatGPT is a large language model (LLM) that focuses on producing text.
In the same way that ChatGPT is a conversational web app based on the neural network known as GPT (trained on massive amounts of text), Google has a conversational web app called Bard, which was based on a model called LaMDA (trained on dialogue). Google is now upgrading Bard to be based on Gemini.
What sets Gemini apart from previous generative AI models such as LaMDA is that it is a "multimodal model". This means that it works directly with multiple input and output modes: it supports not only text input and output, but also images, audio and video. Accordingly, a new acronym arises: LMM (large multimodal model), not to be confused with LLM.
In September, OpenAI announced a model called GPT-4 Vision (GPT-4V) that can also work with images, audio and text. However, it is not a fully multimodal model in the way that Gemini promises to be.
For example, while ChatGPT-4, which is powered by GPT-4V, can take audio input and generate speech output, OpenAI has confirmed that this is done by converting speech to text on input using a separate deep learning model called Whisper, and by converting text back to speech on output using yet another model. That means GPT-4V itself works purely with text.
Similarly, ChatGPT-4 can produce images, but it does so by generating text prompts that are passed to a separate deep learning model called DALL-E 2, which converts text descriptions into images.
Google, on the other hand, designed Gemini to be "natively multimodal." This means that the core model directly processes and can output a range of input types (audio, images, video and text).
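To make the distinction concrete, here is a minimal Python sketch of the two architectures. All of the function names are hypothetical stubs invented for illustration; they are not real OpenAI or Google APIs, and real systems would call trained models rather than these placeholders.

```python
# Hypothetical stubs standing in for trained models (not real APIs).

def speech_to_text(audio: bytes) -> str:
    """Stub for a separate transcription model (Whisper plays this role for ChatGPT)."""
    return "transcribed question"

def text_to_speech(text: str) -> bytes:
    """Stub for a separate speech-synthesis model."""
    return text.encode("utf-8")

def language_model(text: str) -> str:
    """Stub for a text-only core model such as GPT-4."""
    return f"answer to: {text}"

def multimodal_model(audio: bytes) -> bytes:
    """Stub for a natively multimodal core model that consumes and emits audio directly."""
    return audio

def pipeline_approach(audio_clip: bytes) -> bytes:
    # Separate models bridge each modality; the core model only ever sees text.
    text_in = speech_to_text(audio_clip)
    text_out = language_model(text_in)
    return text_to_speech(text_out)

def native_approach(audio_clip: bytes) -> bytes:
    # A single model handles the raw modality end to end.
    return multimodal_model(audio_clip)

if __name__ == "__main__":
    clip = b"\x00\x01"  # placeholder audio bytes
    print(pipeline_approach(clip))
    print(native_approach(clip))
```

The practical difference is that, in the pipeline approach, anything the intermediate text cannot capture never reaches the core model, which is part of why the distinction between the two designs matters.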
The verdict
The distinction between these two approaches may seem academic, but it is important. The overall conclusion from Google's technical report and other qualitative tests to date is that the current publicly available version of Gemini, called Gemini 1.0 Pro, is generally not as good as GPT-4, and is more similar in capabilities to GPT-3.5.
Google also announced a more powerful version of Gemini, called Gemini 1.0 Ultra, and presented some results showing that it is more powerful than GPT-4. However, this is difficult to assess for two reasons. The first reason is that Google has not yet released Ultra, so the results cannot currently be independently validated.
The second reason it is difficult to assess Google's claims is that Google chose to release a somewhat misleading demonstration video (see below). The video shows the Gemini model commenting interactively and fluidly on a live video stream.
However, as Bloomberg initially reported, the demonstration in the video was not conducted in real time. For example, the model had learned a number of specific tasks in advance, such as the three cups and ball trick, in which Gemini keeps track of which cup the ball is under. To do this, it had been shown a series of still images in which the presenter's hands are on the cups being swapped.
Promising future
Despite these issues, I believe that Gemini and large multimodal models are an extremely exciting step forward for generative AI, both because of their future capabilities and because of what they mean for the competitive landscape of AI tools. As I noted in a previous article, GPT-4 was trained on approximately 500 billion words, essentially all publicly available text of good quality.
The performance of deep learning models is generally driven by increasing model complexity and the amount of training data. This has raised the question of how further improvements can be achieved, since new training data for language models is running low. However, multimodal models open up vast new reserves of training data in the form of images, audio and video.
AIs like Gemini, which can be trained directly on all this data, will likely have much greater capabilities in the future. For example, I would expect models trained on video to develop sophisticated internal representations of what is called "naive physics." This is the basic understanding that humans and animals have about causality, motion, gravity, and other physical phenomena.
I'm also excited about what this means for the competitive landscape of AI. Despite the emergence of many generative AI models, OpenAI's GPT models have been dominant over the past year, demonstrating a level of performance that other models have not been able to approach.
Google's Gemini marks the emergence of a major competitor that will push the field forward. Of course, OpenAI is almost certainly working on GPT-5, and we can expect it to be multimodal as well and demonstrate notable new capabilities.
Read more: Google's Gemini AI signals the next big leap for the technology: analyzing real-time information

That said, I'm curious about the emergence of very large multimodal models that are open-source and non-commercial, which I hope will be on the way in the coming years.
I also like some of the features of Gemini's implementation. For example, Google announced a version called Gemini Nano, which is much lighter and can run directly on mobile phones.
These types of lightweight models reduce the impact of AI computing on the environment and have many advantages from a privacy perspective. I am sure that this development will lead to competitors following suit.
This article is republished from The Conversation under a Creative Commons license. Read the original article.