Pixit Pulse: The Weekly Generative AI Wave

AI News #72

Written by Pix | May 27, 2024 8:05:26 AM

Google DeepMind Unveils Veo, The Next-Gen Video Creator

Story: Google DeepMind has unveiled Veo, their most capable generative video model. Veo is designed to create high-definition 1080p videos that can extend beyond a minute, capturing a wide range of cinematic and visual styles. This model can not only generate new videos but also edit existing ones.

Key Findings:

  • Editing Capabilities: Veo allows for advanced video editing, including masked editing and integrating new elements into existing videos based on user commands.

  • Enhanced Prompt Understanding: The model is designed to excel at interpreting natural language prompts and combining them with visual references (e.g., images) to generate coherent and detailed videos.

  • Video Consistency: Utilizing latent diffusion transformers (among other techniques), Veo reduces inconsistencies across video frames, maintaining visual coherence and reducing flicker or unexpected changes.

  • Responsible Design: Videos generated by Veo are watermarked using SynthID so they can be identified as AI-generated, helping mitigate privacy, copyright, and bias risks.
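
To build intuition for the "Video Consistency" point, here is a toy numpy sketch of our own (it is an illustration of the general idea, not Veo's actual architecture): denoising each frame independently leaves independent residual noise that flickers from frame to frame, while sharing information across neighboring frames, the way video models attend across time, averages that noise out.

```python
import numpy as np

rng = np.random.default_rng(0)

# A static "scene": every clean frame is identical (no real motion).
clean = np.full((16, 32, 32), 0.5)                 # 16 gray frames of 32x32
noisy = clean + rng.normal(0, 0.2, clean.shape)    # add independent noise per frame

# Per-frame denoising: a simple 3x3 spatial box blur on each frame alone.
def blur2d(frame, k=3):
    pad = k // 2
    p = np.pad(frame, pad, mode="edge")
    out = np.zeros_like(frame)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
    return out / (k * k)

per_frame = np.stack([blur2d(f) for f in noisy])

# Temporally aware denoising: additionally average each pixel over a window
# of nearby frames, so neighboring frames stay consistent with each other.
temporal = np.stack([
    per_frame[max(0, t - 2):t + 3].mean(axis=0) for t in range(len(per_frame))
])

# "Flicker" = mean change of a pixel between consecutive frames.
def flicker(video):
    return np.abs(np.diff(video, axis=0)).mean()

print(flicker(temporal) < flicker(per_frame))  # prints True
```

The same static scene comes out far steadier when frames are denoised jointly, which is the intuition behind attending across time in video generation models.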

Pixit's Two Cents: Over a year ago, we were amused and slightly disturbed by a video of Will Smith eating spaghetti, created by one of the first text-to-video models. Fast forward to today, and we can generate realistic, ultra-detailed videos thanks to advancements from Google and OpenAI. We're eager to start creating videos ourselves as soon as Google grants us access.

Meta Introduces CM3leon: A Powerful Model for Text-to-Image and Image-to-Text Generation

Story: Meta has announced CM3leon, a state-of-the-art generative AI model that excels in both text-to-image and image-to-text generation, leveraging a unique training recipe adapted from text-only language models. The model's name, pronounced like "chameleon," reflects its versatility and ability to seamlessly transition between visual and textual modalities. CM3leon achieves state-of-the-art performance in text-to-image generation while requiring five times less compute than previous transformer-based methods.

Key Findings:

  • Multimodal Capabilities: CM3leon is a single foundation model that excels in both text-to-image and image-to-text generation.

  • Unique Training Recipe: CM3leon is the first multimodal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multitask supervised fine-tuning (SFT) stage.

  • State-of-the-Art Performance: Despite being trained with five times less compute than previous transformer-based methods, CM3leon achieves state-of-the-art performance for text-to-image generation.

  • Tokenizer-Based Transformers: CM3leon's training recipe demonstrates that tokenizer-based transformers can be trained as efficiently as existing generative diffusion-based models, paving the way for further advancements in multimodal AI.

  • Addressing Bias and Transparency: As generative models like CM3leon become increasingly sophisticated, Meta acknowledges the importance of addressing potential biases present in training data and emphasizes the need for transparency in accelerating progress.
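
The retrieval-augmented pre-training idea from the bullets above can be sketched in a few lines of numpy. This is a deliberately tiny, self-contained illustration of the general concept, not Meta's actual pipeline: each training example is paired with its nearest neighbor from the corpus (here via bag-of-words cosine similarity), so the model sees relevant retrieved context alongside the example itself.

```python
import numpy as np

# A miniature "corpus" standing in for a large pre-training dataset.
docs = [
    "a red fox",
    "a red apple on a table",
    "a fox in the snow",
    "blue sky over mountains",
]

vocab = sorted({w for d in docs for w in d.split()})

def bow(text):
    # Bag-of-words vector over the shared vocabulary.
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            v[vocab.index(w)] += 1
    return v

emb = np.stack([bow(d) for d in docs])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors

def augment(i):
    # Retrieve the most similar other document and prepend it as context.
    sims = emb @ emb[i]
    sims[i] = -1  # exclude the example itself
    j = int(np.argmax(sims))
    return docs[j] + " <sep> " + docs[i]

print(augment(0))  # prints "a red apple on a table <sep> a red fox"
```

Real retrieval-augmented pre-training uses learned dense embeddings and billions of documents, but the mechanism is the same: the retrieved neighbor enriches the context the model is trained on.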

Pixit's Two Cents: Achieving state-of-the-art text-to-image performance with significantly less compute than previous methods really shows the potential for more efficient and effective generative models. The model's unique training recipe, adapted from text-only language models, gives a glimpse into the ingenuity and innovation behind its development. It's fascinating to see that all of this is possible with a single model. My personal favorites are the image-editing capabilities shown on their demo site.

Google Introduces Generative AI Features for Shopping and Marketing

Story: Google has announced a suite of generative AI features designed to greatly improve the way businesses and marketers engage with customers and create compelling ad campaigns. These new tools, powered by Google's advanced AI technology, aim to streamline product discovery, enhance ad creation, and deliver more personalized shopping experiences. By leveraging the power of generative AI, Google is empowering businesses to create high-quality, visually appealing ads that resonate with their target audience, while also making it easier for shoppers to find the products they love.

Key Findings:

  • Advanced AI Algorithms: Google's generative AI features are powered by state-of-the-art algorithms that enable background replacement and virtual try-ons of fashion items.

  • Multimodal AI Capabilities: The new tools leverage multimodal AI, combining text, images, and other data types to generate comprehensive and engaging ad campaigns that effectively showcase products and services. For example, a marketer can provide a mood image and then generate backgrounds in the same style.

  • Scalable and Efficient: Google's generative AI infrastructure is designed to be scalable and efficient, allowing businesses to generate high-quality ads and product descriptions at scale, without compromising on performance or quality.

  • Automatic 3D Generation: Google also lets businesses upload a few high-quality images of a product, such as shoes, to generate a spinning 3D animation of it.

Pixit's Two Cents: Google's generative AI features for shopping and marketing are an interesting technological advancement, leveraging state-of-the-art algorithms and multimodal AI capabilities to greatly improve the e-commerce and ad experience. They have the potential to reshape the market for product photography.

Small Bites, Big Stories: