
OpenAI Releases Text-To-Video Model Sora


Story: OpenAI releases its Text-to-Video model ‘Sora’, and the results are mind-blowing. The diffusion model generates videos up to a minute long with high fidelity and accurate details of subjects, backgrounds, complex scenes, and specific types of motion. The model can (a) generate videos all at once (1920x1080 or 1080x1920), (b) extend existing videos, and (c) create high-resolution images (2048x2048). It still has some weaknesses, but compared with other Text-to-Video models such as Runway, Pika 1.0, or Google’s Lumiere, the results are simply stunning.

Key Findings:

  • Access: For now, Sora will be available to the Red Teaming Network only, which includes domain experts in areas like misinformation, hateful content, bias, and more. These experts will assess critical areas for harms or risks

  • Capabilities: According to OpenAI, Sora is able to generate “complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background”

  • Weaknesses: The current model, however, struggles to accurately simulate the physics of many basic interactions (e.g., glass shattering) and to understand spatial details of a prompt (mixing up left and right)

  • Safety: OpenAI plans to adhere to the C2PA standard (reported here) to address provenance and authenticity. Further, prompts will be checked for violence, sexual content, and more (as is already the case in ChatGPT)

  • Technical details: You can read the technical report here. Sora is a diffusion model: starting from noise, it iteratively denoises until a realistic video emerges. It operates on patches - small spacetime chunks of the video, analogous to the image patches introduced in the Vision Transformer (see the sketch below). Further, OpenAI trained on videos in their native aspect ratios instead of resizing them to a fixed size, and the team used the same recaptioning technique as DALL-E 3 (reported here)
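
For the curious, here is a minimal sketch of how a video can be chopped into the spacetime patches that serve as Sora’s tokens. This is our own reading of the technical report, not OpenAI’s code, and the patch sizes are assumptions:

```python
# Illustrative only: patch sizes are our assumptions, not values from the Sora report.
import numpy as np

def video_to_patches(video, patch_t=2, patch_h=16, patch_w=16):
    """video: (frames, height, width, channels) array.
    Returns a (num_patches, patch_t*patch_h*patch_w*channels) token matrix."""
    T, H, W, C = video.shape
    # Crop so each axis divides evenly into patches; the video keeps its
    # native aspect ratio instead of being resized to a fixed resolution.
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    v = video[:T, :H, :W]
    v = v.reshape(T // patch_t, patch_t,
                  H // patch_h, patch_h,
                  W // patch_w, patch_w, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # move patch-internal axes to the back
    return v.reshape(-1, patch_t * patch_h * patch_w * C)

tokens = video_to_patches(np.zeros((16, 1080, 1920, 3)))
print(tokens.shape)  # (64320, 1536): one token per spacetime patch
```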

Pixit’s Two Cents: For half a year, we have been watching text-to-video models emerge, from Runway’s Gen-1 model, to Rephrase.ai, to ByteDance’s MagicVideo-V2, to Google’s Lumiere - but OpenAI did it again. The results the team published on their website (although surely cherry-picked) are mind-blowing and much better than everything we have seen so far! We can’t wait to use and evaluate Sora!


Stability AI Introduces a More Efficient Diffusion Model Stable Cascade


Story: Stability AI releases Stable Cascade, a new Text-to-Image model that is much more efficient to train while generating superior images. The model is built on the Würstchen architecture, which was partially designed by a German undergrad student. This architecture compresses images into much smaller latent representations than other latent diffusion models (like Stable Diffusion): while those models typically use 4x - 8x spatial compression, Würstchen uses a 42x spatial compression! That is why inference is twice as fast as SDXL’s.
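
To put those compression factors in perspective, here is a quick back-of-the-envelope calculation (the latent sizes below are illustrative; exact shapes and channel counts depend on the model configuration):

```python
# Rough latent grid sizes implied by the spatial compression factors above.
image_side = 1024  # pixels per side of a square input image

for name, factor in [("Stable Diffusion, ~8x", 8),
                     ("Würstchen / Stable Cascade, ~42x", 42)]:
    latent_side = image_side / factor
    positions = factor ** 2  # spatial compression acts on both axes
    print(f"{name}: {image_side}x{image_side} -> "
          f"~{latent_side:.0f}x{latent_side:.0f} latent "
          f"({positions}x fewer spatial positions)")
```

A ~24x24 latent instead of a ~128x128 one means the expensive diffusion steps run on far fewer positions, which is where the training and inference savings come from.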


Key Findings:

  • Technical details: The trick of the architecture is to condition the image generator not only on the text embedding (the prompt) but also on a compact, noise-free image embedding generated by a separate diffusion model (see the sketch after this list). For more details, see here and here

  • Capabilities: The model supports not only image generation but also image variations, image-to-image generation, 2x upscaling, inpainting, outpainting, and Canny edge conditioning

  • Open Source: Stability AI releases the model under a non-commercial license as well as scripts for finetuning, ControlNet, and LoRA on GitHub.

  • Inference speed: Although Stable Cascade contains 1.4 billion parameters more than Stable Diffusion XL, it is twice as fast during inference
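
As a sketch of that two-stage idea: a small “prior” diffusion model generates a highly compressed image embedding from the prompt, and a decoder then reconstructs pixels conditioned on both the prompt and that embedding. All function names and shapes below are our own placeholders, not the actual Stability AI API:

```python
# A toy stand-in for the Würstchen / Stable Cascade flow; real stages are
# large diffusion networks, replaced here by deterministic placeholders.
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    # Placeholder for the real text encoder.
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(768)

def prior_sample(text_emb: np.ndarray) -> np.ndarray:
    # Stage C ("prior"): diffusion in the 42x-compressed space,
    # e.g. roughly a 24x24 latent for a 1024x1024 target image.
    return np.tanh(text_emb[:24 * 24].reshape(24, 24))

def decoder_sample(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    # Stages B/A ("decoder"): expand the compact embedding back toward
    # pixel space, conditioned on both embeddings (placeholder upsampling).
    return np.kron(image_emb, np.ones((42, 42)))

text_emb = encode_text("a photo of a red fox in the snow")
image = decoder_sample(text_emb, prior_sample(text_emb))
print(image.shape)  # (1008, 1008): roughly the 1024x1024 target
```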

Pixit’s Two Cents: What if we used the findings from the Würstchen architecture in text-to-video models like Sora as well? It could significantly boost training and inference speed, since video generation is much more compute-intensive than image generation.


Introducing Google Gemini 1.5: A Leap in AI Efficiency and Performance


Story: Google has announced Gemini 1.5, a next-generation AI model that marks a significant upgrade over its predecessor, Gemini 1.0. With an emphasis on efficiency and a groundbreaking ability to understand long contexts (we’re talking up to a million tokens) across various modalities, Gemini 1.5 showcases Google’s commitment to advancing its AI capabilities in this AI arms race. The new model is optimized for a wide range of tasks and introduces an experimental feature that significantly extends its context window.


Key Findings:

  • Revolutionary Mixture-of-Experts Architecture: Gemini 1.5 integrates a new Mixture-of-Experts (MoE) architecture, which routes each token through only a subset of the model’s parameters, making training and serving more efficient (see the sketch after this list).

  • Pioneering Long-Context Processing: The model's ability to process up to 1 million tokens represents a significant leap in AI's capability to understand and synthesize information from extensive documents and multimedia content.

  • Commitment to Ethical AI Deployment: Google's rigorous ethics and safety testing for Gemini 1.5 underlines their commitment to responsible AI development, ensuring the model aligns with global ethical standards.

  • Innovative Accessibility and Pricing: By offering Gemini 1.5 to developers and enterprises with varied pricing based on context window sizes, Google is democratizing access to cutting-edge AI technology.

  • Groundbreaking Performance Metrics: Gemini 1.5 sets new standards in performance benchmarks, outpacing its predecessors in accuracy and processing speed across a range of AI tasks.
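
Google has not published Gemini 1.5’s architecture details, but a generic top-k MoE layer looks roughly like the following sketch. All sizes and the routing scheme are assumptions for illustration, not Gemini’s actual design:

```python
# Generic top-k Mixture-of-Experts layer: a router picks a few expert MLPs
# per token, so only a fraction of the parameters is active for each token.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [  # each expert is a tiny two-layer ReLU MLP
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x):  # x: (tokens, d_model)
    logits = x @ router_w                          # routing score per expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        gates = np.exp(scores - scores.max())
        gates /= gates.sum()                       # softmax over chosen experts
        for gate, e in zip(gates, top[t]):
            w1, w2 = experts[e]
            out[t] += gate * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

y = moe_layer(rng.standard_normal((5, d_model)))
print(y.shape)  # (5, 64): each token was processed by only 2 of 8 experts
```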

Pixit’s Two Cents: The introduction of Gemini 1.5 is a testament to the continuous evolution of Google’s AI technology, especially since they just had a major release a week ago with their 1.0 Ultra model. For us at Pixit, the advancements Google is making in long-context understanding and model efficiency are the most interesting, since they will enable much more powerful applications. It also shows that we are still in an era of big leaps rather than small incremental updates.


Small Bites, Big Stories:

Tags: Pix
Post by Pix, Feb 19, 2024 8:46:48 AM