Pixit Pulse: The Weekly Generative AI Wave

AI News #64

Written by Pix | Apr 1, 2024 7:27:45 AM

DeepMind's SAFE AI Outperforms Humans in Fact-Checking, Reducing Costs

 

Story: Google's DeepMind has introduced a new AI system called Search-Augmented Factuality Evaluator (SAFE) that demonstrates superior performance compared to human fact-checkers when assessing the accuracy of information generated by large language models. The system breaks down text into individual facts and uses Google Search results to verify each claim, utilizing a multi-step reasoning process. While the researchers found that using SAFE was approximately 20 times more cost-effective than human fact-checkers, some experts argue that further benchmarking against expert human fact-checkers is necessary to truly demonstrate superhuman performance.

Key Findings:

  • Superior Performance Compared to Human Fact-Checkers: SAFE consistently outperforms human fact-checkers in evaluating the accuracy of information generated by large language models, demonstrating its potential to revolutionize the fact-checking process and improve the reliability of AI-generated content.

  • Multi-Step Verification Process Ensures Accuracy: The AI system employs a sophisticated approach to fact-checking by breaking down text into individual facts and using Google Search results to verify each claim, ensuring a thorough and comprehensive evaluation of the information's accuracy.

  • Cost-Effective Solution for Fact-Checking at Scale: Using SAFE for fact-checking is approximately 20 times cheaper than employing human fact-checkers, making it a highly cost-effective solution for verifying the accuracy of large volumes of AI-generated content.

  • Expert Benchmarking Needed to Validate Superhuman Claims: While SAFE demonstrates impressive performance, critics argue that benchmarking against expert human fact-checkers, rather than crowdsourced workers, is necessary to truly validate claims of superhuman performance and ensure the system's reliability.

  • Human Rater Details Crucial for Contextualizing Results: The specific details of the human raters involved in the study, such as their qualifications and compensation, are essential for properly contextualizing the results and understanding the extent of SAFE's superior performance.

  • Growing Importance as Language Models Advance: As the volume of information generated by language models continues to grow, cost-effective and reliable fact-checking solutions like SAFE will become increasingly important for maintaining the integrity and trustworthiness of online content.
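The split-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not DeepMind's implementation: `split_into_facts`, `is_supported`, `check_response`, and `fake_search` are hypothetical names, sentence splitting stands in for the fact-extraction step, simple word overlap stands in for the multi-step reasoning, and a stub replaces real Google Search calls.

```python
import re
from typing import Callable, Dict, List


def split_into_facts(text: str) -> List[str]:
    """Naive stand-in for fact extraction: treat each sentence as one fact."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def is_supported(fact: str, snippets: List[str], threshold: float = 0.8) -> bool:
    """Toy stand-in for the reasoning step: a fact counts as supported when
    enough of its words appear in at least one search snippet."""
    words = set(re.findall(r"[a-z0-9]+", fact.lower()))
    if not words:
        return False
    for snippet in snippets:
        snippet_words = set(re.findall(r"[a-z0-9]+", snippet.lower()))
        if len(words & snippet_words) / len(words) >= threshold:
            return True
    return False


def check_response(text: str, search: Callable[[str], List[str]]) -> Dict[str, bool]:
    """Verify every extracted fact against the results returned by `search`."""
    return {fact: is_supported(fact, search(fact)) for fact in split_into_facts(text)}


def fake_search(query: str) -> List[str]:
    """Hypothetical stub used in place of a real search API."""
    return ["The Eiffel Tower is located in Paris, France."]


result = check_response(
    "The Eiffel Tower is in Paris. The Eiffel Tower is made of cheese.",
    fake_search,
)
for fact, supported in result.items():
    print(f"{'supported' if supported else 'not supported'}: {fact}")
```

Passing the search function in as a parameter keeps the sketch runnable offline and makes it clear where a real search backend and a language-model judge would plug in.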

Pixit's Two Cents: While SAFE shows promise in automating the fact-checking process and reducing costs, it's essential to take claims of "superhuman" performance with a grain of salt. As AI researcher Gary Marcus points out, the system needs to be compared against expert human fact-checkers, not just crowdsourced workers, to truly validate its capabilities. Nonetheless, as language models continue to generate vast amounts of information, cost-effective solutions like SAFE will play a crucial role in maintaining the integrity of online content. Just don't expect it to replace your favorite fact-checking website anytime soon!

Databricks Launches DBRX, Setting New Standards for Open Source AI Efficiency and Performance

Story: Databricks, a leading enterprise software company, has announced the release of DBRX, a new open source artificial intelligence model that sets a new standard for open source AI efficiency and performance. The model, containing 132 billion parameters, outperforms leading open source alternatives like Llama 2-70B and Mixtral on key benchmarks measuring language understanding, programming ability, and math skills. While not matching the raw power of OpenAI's GPT-4, Databricks executives pitched DBRX as a significantly more capable alternative to GPT-3.5 at a fraction of the cost.

Key Findings:

  • Setting New Standards: DBRX sets a new state-of-the-art for established open LLMs across a range of standard benchmarks, providing the open community and enterprises with capabilities previously limited to closed model APIs.

  • Superior Performance: DBRX outperforms leading open source alternatives like Llama 2-70B and Mixtral on key benchmarks measuring language understanding, programming ability, and math skills. It also surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro.

  • Efficiency Improvements: DBRX advances the state-of-the-art in efficiency among open models thanks to its fine-grained mixture-of-experts (MoE) architecture, with inference up to 2x faster than Llama 2-70B and about 40% of the size of Grok-1 in terms of both total and active parameter counts.

  • Accessible and Customizable: Databricks aims to drive broader adoption of its novel architecture by open-sourcing DBRX, while also supporting the company's primary business of building and hosting custom AI models trained on clients' private datasets.

  • Leveraging In-House Tools: To build DBRX, Databricks leveraged the same suite of tools available to its customers, including Unity Catalog for data management and governance, Lilac AI for data exploration, Apache Spark and Databricks notebooks for data processing and cleaning, and optimized versions of their open-source training libraries.

  • Easy Integration: DBRX can be easily integrated using the Databricks Mosaic AI Foundation Model APIs, with pay-as-you-go pricing, a chat interface, and options for production applications with performance guarantees, support for finetuned models, and additional security and compliance features.
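To make the mixture-of-experts idea mentioned above more concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy under stated assumptions, not Databricks' code: `route_top_k` and `moe_forward` are hypothetical names, the "experts" are simple scaling functions, and the router scores are made up. The choice of 16 experts with 4 active per input is meant to mirror the configuration Databricks has reported for DBRX.

```python
import math
from typing import Callable, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw router scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_forward(
    x: float,
    router_logits: List[float],
    experts: List[Callable[[float], float]],
    k: int,
) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy setup: 16 experts, 4 active per input (assumed to mirror DBRX).
NUM_EXPERTS, ACTIVE = 16, 4
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
logits = [0.1 * i for i in range(NUM_EXPERTS)]  # made-up router scores

selected = route_top_k(logits, ACTIVE)
output = moe_forward(1.0, logits, experts, ACTIVE)
print("active experts:", [i for i, _ in selected])
print("output:", round(output, 3))
```

Because only the selected experts run for a given input, the compute per token depends on the active parameters rather than the total, which is the source of the efficiency gain such architectures aim for.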

Pixit's Two Cents: Databricks' release of DBRX is a bold move that challenges the dominance of big tech companies in the AI race. By open-sourcing a state-of-the-art model that outperforms leading alternatives, Databricks is positioning itself as a leader in cutting-edge AI research while also supporting its core business. The accessibility and customizability of DBRX could be a game-changer for enterprises seeking to harness the power of AI while maintaining control over their proprietary data. However, the true test of DBRX's impact will be in its adoption and the value it creates for Databricks' customers. As the AI landscape continues to evolve rapidly, it will be exciting to see how DBRX stacks up against the competition.

Microsoft's 11-by-11 Tipping Point: The Key to Building an AI Habit at Work

Story: Microsoft has been closely studying how people are using AI at work since introducing Copilot to their earliest customers. By analyzing user behavior and identifying early adoption patterns, they have discovered the "11-by-11 tipping point" – the magic formula for unlocking the value of AI and building a habit that can transform organizations. According to a survey of 1,300 Copilot for Microsoft 365 users across various functions and industries, a time savings of just 11 minutes a day over 11 weeks of usage is the key to seeing significant improvements in productivity, work enjoyment, work-life balance, and the ability to attend fewer meetings.

Key Findings:

  • The Magic Number: A time savings of just 11 minutes a day was the threshold where users started to see value from AI, although most people actually saved more time each day, with the most efficient users saving 30 minutes or more.

  • The Breakthrough Moment: After 11 weeks of usage, most people reported that Copilot improved four key areas at work: productivity, work enjoyment, work-life balance, and the ability to attend fewer meetings.

  • Unlocking Copilot Value: The 11-by-11 tipping point suggests that in a little less than a business quarter, most Copilot users at a company can form an AI habit that can power the organization to new heights.

  • Easy Wins: To help people reach the 11-by-11 tipping point, Microsoft recommends finding easy wins that immediately save 11 minutes a day, such as using AI to recap missed meetings instead of taking notes or listening to recordings.

  • Driving AI Adoption: By sharing these findings with leaders looking to drive AI adoption within their organizations, Microsoft aims to provide valuable insights into the early behaviors and factors that influence successful AI implementation.

  • Transforming the Way We Work: As more organizations reach the 11-by-11 tipping point and build AI habits, the potential for transforming the way we work and unlocking new levels of productivity and innovation becomes increasingly apparent.
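As a rough back-of-the-envelope check on what the 11-by-11 threshold adds up to, the arithmetic can be written out as below. The five-day work week is an assumption made for illustration and does not come from Microsoft's survey; the variable names are hypothetical.

```python
MINUTES_SAVED_PER_DAY = 11   # threshold reported in the survey
WEEKS_OF_USAGE = 11          # duration reported in the survey
WORKDAYS_PER_WEEK = 5        # assumption: a standard five-day work week

total_minutes = MINUTES_SAVED_PER_DAY * WORKDAYS_PER_WEEK * WEEKS_OF_USAGE
total_hours = total_minutes / 60

print(f"Minutes saved over {WEEKS_OF_USAGE} weeks: {total_minutes}")
print(f"Roughly {total_hours:.1f} hours per person")
```

Under that assumption, the threshold corresponds to about ten hours saved per person over the 11 weeks.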

Pixit's Two Cents: Microsoft's 11-by-11 tipping point is a fascinating insight into the psychology of AI adoption and habit formation in the workplace. By quantifying the time savings and duration needed to see real benefits from AI tools like Copilot, Microsoft has provided a roadmap for organizations looking to drive successful AI implementation. The idea that just 11 minutes a day over 11 weeks can lead to significant improvements in productivity, work enjoyment, and work-life balance is both encouraging and achievable. By sharing these insights and best practices, Microsoft is helping to pave the way for a future where AI and human intelligence work together seamlessly to drive innovation and success.

Small Bites, Big Stories:

  • Introducing Stable Code Instruct 3B: Stability AI releases Stable Code Instruct 3B, an instruction-tuned code language model that outperforms larger models in various coding tasks and is available for commercial use.

  • 16 Changes to the Way Enterprises Are Building and Buying Generative AI: In 2024, the revenue opportunity for generative AI in the enterprise is expected to be multiples larger than the billion-dollar consumer spend in 2023, with enterprises building their own use cases and experimenting with novel applications.

  • Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering: Gaussian Frosting is a novel mesh-based representation that captures complex volumetric effects and flat surfaces, allowing for efficient rendering, editing, and animation of 3D graphics.

  • Hume's Empathic Voice Interface (EVI): Hume AI introduces EVI, a voice interface that aims to provide a more human-like conversational experience than ChatGPT by adding a layer of social intelligence to AI voices.

  • AI21 Labs presents Jamba: A hybrid language model that interleaves Mamba structured state space (SSM) layers with Transformer layers. AI21 claims that Jamba is the first production-grade Mamba-based model.

  • Grok 1.5: Improved Reasoning, Multilingual Support, and More: xAI announces Grok 1.5, an update to their Grok AI model that brings enhanced reasoning capabilities, multilingual support, and other improvements to better serve users' needs.

  • SambaNova Systems' promising new LLM systems: SambaNova Systems introduces AI models that it says combine high accuracy with fast inference across a range of applications.