Large Language Model Benchmarks

Advanced AI Language Model Outperforms Physicians in Reasoning Tasks

A large language model outperformed physicians in diagnostic reasoning tasks, highlighting the potential for AI in clinical care.

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

Frontier AI models corrupt 25% of document content in multi-step workflows — rewriting rather than deleting, which makes the ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

VentureBeat

Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix. However, these benchmarks often test for general ...

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

Sapient Intelligence launches HRM-Text, challenging the LLM monopoly with a brain-inspired foundation model trained on up to 1000x fewer tokens

Sapient Intelligence, an AGI research company, announces the launch of HRM-Text, an ultra-lean 1-billion-parameter reasoning language model, to deliver competitive reasoning and general performance ...

Bloomberg L.P.

Introducing BloombergGPT, Bloomberg’s 50-billion parameter large language model, purpose-built from scratch for finance

How Large Language Models Are Reshaping Health Prediction & Clinical Decision Making

Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models