Wednesday, May 10th was an exciting day for the Google Research community as we watched the results of months and years of our foundational and applied work get announced on the Google I/O stage. With the quick pace of announcements on stage, it can be difficult to convey the substantial effort and unique innovations that underlie the technologies we presented. So today, we’re excited to reveal more about the research efforts behind some of the many exciting announcements at this year’s I/O.
PaLM 2
Our next-generation large language model (LLM), PaLM 2, is built on advances in compute-optimal scaling, scaled instruction-fine tuning and improved dataset mixture. By fine-tuning and instruction-tuning the model for different purposes, we have been able to integrate state-of-the-art capabilities into over 25 Google products and features, where it is already helping to inform, assist and delight users. For example:
Bard is an early experiment that lets you collaborate with generative AI and helps to boost productivity, accelerate ideas and fuel curiosity. It builds on advances in deep learning efficiency and leverages reinforcement learning from human feedback to provide more relevant responses and increase the model’s ability to follow instructions. Bard is now available in 180 countries, where users can interact with it in English, Japanese and Korean, and thanks to the multilingual capabilities afforded by PaLM 2, support for 40 languages is coming soon.
With Search Generative Experience we’re taking more of the work out of searching, so you’ll be able to understand a topic faster, uncover new viewpoints and insights, and get things done more easily. As part of this experiment, you’ll see an AI-powered snapshot of key information to consider, with links to dig deeper.
MakerSuite is an easy-to-use prototyping environment for the PaLM API, powered by PaLM 2. In fact, internal user engagement with early prototypes of MakerSuite accelerated the development of our PaLM 2 model itself. MakerSuite grew out of research focused on prompting tools, or tools explicitly designed for customizing and controlling LLMs. This line of research includes PromptMaker (precursor to MakerSuite), and AI Chains and PromptChainer (one of the first research efforts demonstrating the utility of LLM chaining).
Project Tailwind also made use of early research prototypes of MakerSuite to develop features to help writers and researchers explore ideas and improve their prose; its AI-first notebook prototype used PaLM 2 to allow users to ask questions of the model grounded in documents they define.
Med-PaLM 2 is our state-of-the-art medical LLM, built on PaLM 2. Med-PaLM 2 achieved 86.5% performance on U.S. Medical Licensing Exam–style questions, illustrating its exciting potential for health. We’re now exploring multimodal capabilities to synthesize inputs like X-rays.
Codey is a version of PaLM 2 fine-tuned on source code to function as a developer assistant. It supports a broad range of Code AI features, including code completions, code explanation, bug fixing, source code migration, error explanations, and more. Codey is available through our trusted tester program via IDEs (Colab, Android Studio, Duet AI for Cloud, Firebase) and via a 3P-facing API.
Perhaps even more exciting for developers, we have opened up the PaLM APIs & MakerSuite to provide the community opportunities to innovate using this groundbreaking technology.
PaLM 2 has advanced coding capabilities that enable it to find code errors and make suggestions in a number of different languages.
Imagen
Our Imagen family of image generation and editing models builds on advances in large Transformer-based language models and diffusion models. This family of models is being incorporated into multiple Google products, including:
Image generation in Google Slides and Android’s Generative AI wallpaper are powered by our text-to-image generation features.
Google Cloud’s Vertex AI enables image generation, image editing, image upscaling and fine-tuning to help enterprise customers meet their business needs.
I/O Flip, a digital take on a classic card game, features Google developer mascots on cards that were entirely AI generated. This game showcased a fine-tuning technique called DreamBooth for adapting pre-trained image generation models. Using just a handful of images as inputs for fine-tuning, it allows users to generate personalized images in minutes. With DreamBooth, users can synthesize a subject in diverse scenes, poses, views, and lighting conditions that don’t appear in the reference images.
I/O Flip presents custom card decks designed using DreamBooth.
Phenaki
Phenaki, Google’s Transformer-based text-to-video generation model was featured in the I/O pre-show. Phenaki is a model that can synthesize realistic videos from textual prompt sequences by leveraging two main components: an encoder-decoder model that compresses videos to discrete embeddings and a transformer model that translates text embeddings to video tokens.
ARCore and the Scene Semantic API
Among the new features of ARCore announced by the AR team at I/O, the Scene Semantic API can recognize pixel-wise semantics in an outdoor scene. This helps users create custom AR experiences based on the features in the surrounding area. This API is empowered by the outdoor semantic segmentation model, leveraging our recent works around the DeepLab architecture and an egocentric outdoor scene understanding dataset. The latest ARCore release also includes an improved monocular depth model that provides higher accuracy in outdoor scenes.
Scene Semantics API uses DeepLab-based semantic segmentation model to provide accurate pixel-wise labels in a scene outdoors.
Chirp
Chirp is Google’s family of state-of-the-art Universal Speech Models trained on 12 million hours of speech to enable automatic speech recognition (ASR) for 100+ languages. The models can perform ASR on under-resourced languages, such as Amharic, Cebuano, and Assamese, in addition to widely spoken languages like English and Mandarin. Chirp is able to cover such a wide variety of languages by leveraging self-supervised learning on unlabeled multilingual dataset with fine-tuning on a smaller set of labeled data. Chirp is now available in the Google Cloud Speech-to-Text API, allowing users to perform inference on the model through a simple interface. You can get started with Chirp here.
MusicLM
At I/O, we launched MusicLM, a text-to-music model that generates 20 seconds of music from a text prompt. You can try it yourself on AI Test Kitchen, or see it featured during the I/O preshow, where electronic musician and composer Dan Deacon used MusicLM in his performance.
MusicLM, which consists of models powered by AudioLM and MuLAN, can make music (from text, humming, images or video) and musical accompaniments to singing. AudioLM generates high quality audio with long-term consistency. It maps audio to a sequence of discrete tokens and casts audio generation as a language modeling task. To synthesize longer outputs efficiently, it used a novel approach we’ve developed called SoundStorm.
Universal Translator dubbing
Our dubbing efforts leverage dozens of ML technologies to translate the full expressive range of video content, making videos accessible to audiences across the world. These technologies have been used to dub videos across a variety of products and content types, including educational content, advertising campaigns, and creator content, with more to come. We use deep learning technology to achieve voice preservation and lip matching and enable high-quality video translation. We’ve built this product to include human review for quality, safety checks to help prevent misuse, and we make it accessible only to authorized partners.
AI for global societal good
We are applying our AI technologies to solve some of the biggest global challenges, like mitigating climate change, adapting to a warming planet and improving human health and wellbeing. For example:
Traffic engineers use our Green Light recommendations to reduce stop-and-go traffic at intersections and improve the flow of traffic in cities from Bangalore to Rio de Janeiro and Hamburg. Green Light models each intersection, analyzing traffic patterns to develop recommendations that make traffic lights more efficient — for example, by better synchronizing timing between adjacent lights, or adjusting the “green time” for a given street and direction.
We’ve also expanded global coverage on the Flood Hub to 80 countries, as part of our efforts to predict riverine floods and alert people who are about to be impacted before disaster strikes. Our flood forecasting efforts rely on hydrological models informed by satellite observations, weather forecasts and in-situ measurements.
Technologies for inclusive and fair ML applications
With our continued investment in AI technologies, we are emphasizing responsible AI development with the goal of making our models and tools useful and impactful while also ensuring fairness, safety and alignment with our AI Principles. Some of these efforts were highlighted at I/O, including:
The release of the Monk Skin Tone Examples (MST-E) Dataset to help practitioners gain a deeper understanding of the MST scale and train human annotators for more consistent, inclusive, and meaningful skin tone annotations. You can read more about this and other developments on our website. This is an advancement on the open source release of the Monk Skin Tone (MST) Scale we launched last year to enable developers to build products that are more inclusive and that better represent their diverse users.
A new Kaggle competition (open until August 10th) in which the ML community is tasked with creating a model that can quickly and accurately identify American Sign Language (ASL) fingerspelling — where each letter of a word is spelled out in ASL rapidly using a single hand, rather than using the specific signs for entire words — and translate it into written text. Learn more about the fingerspelling Kaggle competition, which features a song from Sean Forbes, a deaf musician and rapper. We also showcased at I/O the winning algorithm from the prior year’s competition powers PopSign, an ASL learning app for parents of deaf or hard of hearing children created by Georgia Tech and Rochester Institute of Technology (RIT).
Building the future of AI together
It’s inspiring to be part of a community of so many talented individuals who are leading the way in developing state-of-the-art technologies, responsible AI approaches and exciting user experiences. We are in the midst of a period of incredible and transformative change for AI. Stay tuned for more updates about the ways in which the Google Research community is boldly exploring the frontiers of these technologies and using them responsibly to benefit people’s lives around the world. We hope you’re as excited as we are about the future of AI technologies and we invite you to engage with our teams through the references, sites and tools that we’ve highlighted here.