Pixit Pulse: The Weekly Generative AI Wave

AI News #72

Written by Pix | May 27, 2024 8:05:26 AM

Google DeepMind Unveils Veo, The Next-Gen Video Creator

Story: Google DeepMind has unveiled Veo, their most capable generative video model. Veo is designed to create high-definition 1080p videos that can extend beyond a minute, capturing a wide range of cinematic and visual styles. This model can not only generate new videos but also edit existing ones.

Key Findings:

  • Editing Capabilities: Veo allows for advanced video editing, including masked editing and integrating new elements into existing videos based on user commands.

  • Enhanced Prompt Understanding: The model is designed to excel at interpreting natural language prompts and combining them with visual references (e.g., images) to generate coherent and detailed videos.

  • Video Consistency: Utilizing latent diffusion transformers (among other techniques), Veo reduces inconsistencies across video frames, maintaining visual coherence and reducing flicker or unexpected changes.

  • Responsible Design: Videos generated by Veo are watermarked using SynthID so they can be identified as AI-generated, helping mitigate privacy, copyright, and bias risks.
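
To build intuition for the "Video Consistency" point, here is a toy numpy sketch of our own (it is an illustration of the general idea, not Veo's actual architecture): denoising each frame independently leaves independent residual noise that flickers from frame to frame, while sharing information across neighboring frames, the way video models attend across time, averages that noise out.

```python
import numpy as np

rng = np.random.default_rng(0)

# A static "scene": every clean frame is identical (no real motion).
clean = np.full((16, 32, 32), 0.5)                 # 16 gray frames of 32x32
noisy = clean + rng.normal(0, 0.2, clean.shape)    # add independent noise per frame

# Per-frame denoising: a simple 3x3 spatial box blur on each frame alone.
def blur2d(frame, k=3):
    pad = k // 2
    p = np.pad(frame, pad, mode="edge")
    out = np.zeros_like(frame)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
    return out / (k * k)

per_frame = np.stack([blur2d(f) for f in noisy])

# Temporally aware denoising: additionally average each pixel over a window
# of nearby frames, so neighboring frames stay consistent with each other.
temporal = np.stack([
    per_frame[max(0, t - 2):t + 3].mean(axis=0) for t in range(len(per_frame))
])

# "Flicker" = mean change of a pixel between consecutive frames.
def flicker(video):
    return np.abs(np.diff(video, axis=0)).mean()

print(flicker(temporal) < flicker(per_frame))  # prints True
```

The same static scene comes out far steadier when frames are denoised jointly, which is the intuition behind attending across time in video generation models.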

Pixit's Two Cents: Over a year ago, we were amused and slightly disturbed by a video of Will Smith eating spaghetti, created by one of the first text-to-video models. Fast forward to today, and we can generate realistic, ultra-detailed videos thanks to advancements from Google and OpenAI. We're eager to start creating videos ourselves as soon as Google grants us access.

Meta Introduces CM3leon: A Powerful Model for Text-to-Image and Image-to-Text Generation

Story: Meta has announced CM3leon, a state-of-the-art generative AI model that excels in both text-to-image and image-to-text generation, leveraging a unique training recipe adapted from text-only language models. The model's name, pronounced like "chameleon," reflects its versatility and ability to seamlessly transition between visual and textual modalities. CM3leon achieves state-of-the-art performance in text-to-image generation while requiring five times less compute than previous transformer-based methods.

Key Findings:

  • Multimodal Capabilities: CM3leon is a single foundation model that excels in both text-to-image and image-to-text generation.

  • Unique Training Recipe: CM3leon is the first multimodal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multitask supervised fine-tuning (SFT) stage.

  • State-of-the-Art Performance: Despite being trained with five times less compute than previous transformer-based methods, CM3leon achieves state-of-the-art performance for text-to-image generation.

  • Tokenizer-Based Transformers: CM3leon's training recipe demonstrates that tokenizer-based transformers can be trained as efficiently as existing generative diffusion-based models, paving the way for further advancements in multimodal AI.

  • Addressing Bias and Transparency: As generative models like CM3leon become increasingly sophisticated, Meta acknowledges the importance of addressing potential biases present in training data and emphasizes the need for transparency in accelerating progress.
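
The retrieval-augmented pre-training idea from the bullets above can be sketched in a few lines of numpy. This is a deliberately tiny, self-contained illustration of the general concept, not Meta's actual pipeline: each training example is paired with its nearest neighbor from the corpus (here via bag-of-words cosine similarity), so the model sees relevant retrieved context alongside the example itself.

```python
import numpy as np

# A miniature "corpus" standing in for a large pre-training dataset.
docs = [
    "a red fox",
    "a red apple on a table",
    "a fox in the snow",
    "blue sky over mountains",
]

vocab = sorted({w for d in docs for w in d.split()})

def bow(text):
    # Bag-of-words vector over the shared vocabulary.
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            v[vocab.index(w)] += 1
    return v

emb = np.stack([bow(d) for d in docs])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors

def augment(i):
    # Retrieve the most similar other document and prepend it as context.
    sims = emb @ emb[i]
    sims[i] = -1  # exclude the example itself
    j = int(np.argmax(sims))
    return docs[j] + " <sep> " + docs[i]

print(augment(0))  # prints "a red apple on a table <sep> a red fox"
```

Real retrieval-augmented pre-training uses learned dense embeddings and billions of documents, but the mechanism is the same: the retrieved neighbor enriches the context the model is trained on.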

Pixit's Two Cents: Achieving state-of-the-art text-to-image performance with significantly less compute than previous methods really shows the potential for more efficient and effective generative models. The model's unique training recipe, adapted from text-only language models, gives a glimpse into the ingenuity and innovation behind its development. It's fascinating to see that all of this is possible with a single model. My personal favorites are the image-editing capabilities shown on their demo site.

Google Introduces Generative AI Features for Shopping and Marketing

Story: Google has announced a suite of generative AI features designed to greatly improve the way businesses and marketers engage with customers and create compelling ad campaigns. These new tools, powered by Google's advanced AI technology, aim to streamline product discovery, enhance ad creation, and deliver more personalized shopping experiences. By leveraging the power of generative AI, Google is empowering businesses to create high-quality, visually appealing ads that resonate with their target audience, while also making it easier for shoppers to find the products they love.

Key Findings:

  • Advanced AI Algorithms: Google's generative AI features are powered by state-of-the-art algorithms that enable background replacement and virtual try-ons of fashion items.

  • Multimodal AI Capabilities: The new tools leverage multimodal AI, combining text, images, and other data types to generate comprehensive and engaging ad campaigns that effectively showcase products and services. For example, a marketer can provide a mood image and then generate backgrounds in the same style.

  • Scalable and Efficient: Google's generative AI infrastructure is designed to be scalable and efficient, allowing businesses to generate high-quality ads and product descriptions at scale, without compromising on performance or quality.

  • Automatic 3D Generation: Google also lets businesses upload a few high-quality images of a product, such as shoes, to generate a spinning 3D animation of it.

Pixit's Two Cents: Google's generative AI features for shopping and marketing are an interesting technological advancement, leveraging state-of-the-art algorithms and multimodal AI capabilities to greatly improve the e-commerce and ad experience. They have the potential to reshape the market for product photography.

Small Bites, Big Stories: