GAUSS Project Blog

Introduction

Large Language Models (LLMs) have made remarkable progress in mathematics — spanning the full range from K–12 curriculum problems to Olympiad-level challenges, and even to research-level problems.

But existing benchmarks don’t tell the whole story. They mainly measure final accuracy — whether the model arrived at the right answer. What they miss is how the model got there. Did it recall the correct theorem? Apply a systematic strategy? Use intuition to check plausibility? Or did it simply memorize a pattern?

To push beyond these limits, we built GAUSS (General Assessment of Underlying Structured Skills in Mathematics). Developed at Hyperbolic Labs, GAUSS is a skill-based benchmark that evaluates not just outcomes, but the cognitive abilities underlying mathematical reasoning.


Why GAUSS?

1. Low Skill Resolution

Traditional benchmarks have low skill resolution — they only report whether the final answer is correct. But solving a math problem usually involves multiple skills: recalling knowledge, performing symbolic computations, constructing proofs, or generalizing across domains.

GAUSS disentangles these skills, tagging each problem with the exact abilities it requires. This allows researchers to see a model's skill profile, not just a single score.
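
To make this concrete, below is a minimal sketch (in Python) of what a skill-tagged problem record could look like. The field names, skill labels, and example problem are illustrative assumptions for this post, not the official GAUSS schema.

    # Hypothetical skill-tagged problem record (illustrative only,
    # not the official GAUSS schema).
    problem = {
        "id": "example-001",
        "statement": "Prove that every finite integral domain is a field.",
        "source": "graduate coursework",
        "skills": [
            "Memory of Math Knowledge",        # recall the relevant definitions
            "Logical Thinking and Reasoning",  # construct the proof
            "Writing and Presentation",        # present it rigorously
        ],
    }

A per-skill score can then be aggregated over all problems tagged with a given skill, which is what yields a skill profile rather than a single aggregate number.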

2. Saturation

Many widely used datasets (e.g. GSM8K, MATH) are already saturated — top models achieve near-perfect scores, leaving little room to measure progress. GAUSS addresses this with fresh, carefully curated problems from Olympiads, graduate coursework, and research-level sources.

3. Contamination

Existing benchmarks often contain problems already seen in training data, inflating results. GAUSS minimizes contamination by drawing from diverse and novel sources, ensuring that evaluation reflects math abilities, not memorization.


The GAUSS Framework

GAUSS organizes mathematical ability into three domains and twelve skills:

  • Mathematical Knowledge and Understanding
    • Memory of Math Knowledge
    • Understanding of Knowledge and Theories
    • Computational and Analytical Skills
  • Problem Solving and Communication
    • Problem-Solving Framework
    • Logical Thinking and Reasoning
    • Writing and Presentation
  • Learning, Meta Skills, and Creativity
    • Learning New Knowledge
    • Intuition
    • Meta Skills
    • Mathematical Modeling
    • Generalization
    • Creativity

This structure provides a comprehensive breakdown of mathematical cognition, from foundational recall to creative problem posing.
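
As a rough sketch of how this breakdown could be operationalized, the snippet below encodes the three domains and twelve skills as a plain mapping and rolls per-skill scores up into per-domain averages. The helper function and any scores passed to it are assumptions for illustration, not part of the GAUSS release.

    # Sketch: the three-domain / twelve-skill breakdown as a plain mapping,
    # plus a helper that averages per-skill scores (0-1) within each domain.
    # The structure and the helper are illustrative, not the official GAUSS code.
    TAXONOMY = {
        "Mathematical Knowledge and Understanding": [
            "Memory of Math Knowledge",
            "Understanding of Knowledge and Theories",
            "Computational and Analytical Skills",
        ],
        "Problem Solving and Communication": [
            "Problem-Solving Framework",
            "Logical Thinking and Reasoning",
            "Writing and Presentation",
        ],
        "Learning, Meta Skills, and Creativity": [
            "Learning New Knowledge",
            "Intuition",
            "Meta Skills",
            "Mathematical Modeling",
            "Generalization",
            "Creativity",
        ],
    }

    def domain_profile(skill_scores):
        """Average per-skill scores (0-1) within each of the three domains."""
        return {
            domain: sum(skill_scores.get(skill, 0.0) for skill in skills) / len(skills)
            for domain, skills in TAXONOMY.items()
        }

A model's full profile is the vector of twelve per-skill scores; the domain averages are simply a coarser view of the same data.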


What GAUSS Reveals

Applied to GPT-5 Thinking, GAUSS reveals a highly uneven skill profile:

  • Strengths: subject taxonomy, evaluating flawed arguments, plausibility checks, reading and summarizing advanced material, and posing new problems.
  • Weaknesses: understanding and applying theorems (0%), symbolic computation (0%), structured problem-solving strategies (0%), geometric intuition (0%), mathematical modeling (0%), and generalization (17%).

Compared to o4-mini-high, GPT-5 Thinking shows clear gains in taxonomy, argument evaluation, plausibility checking, and reading, yet its core theoretical understanding remains fragile.

Strengths of GPT-5 Thinking

Taxonomy of Knowledge:

GPT-5 Thinking is much stronger than o4-mini-high at identifying the correct mathematical domain of a problem. This makes it reliable for automated tagging and classification.

Evaluating Arguments:

It shows clear gains in checking whether reasoning is valid. The model spots missing assumptions, invalid inductive steps, or hidden gaps that o4-mini-high often overlooked. This makes it useful as a logical consistency checker in auto-evaluation.

Plausibility Check:

GPT-5 Thinking shows progress in using heuristic shortcuts, such as performing a plausibility check, to narrow down candidate answers before committing to a formal solution.

Reading New Material:

It demonstrates a strong ability to parse unseen definitions, new notation, and multi-page contexts; this also supports literature summarization and rapid knowledge assimilation.

Posing Problems:

GPT-5 Thinking shows a strong ability to pose new problems after reading or summarizing material. Instead of only consuming definitions and theorems, it can generate follow-up questions and exercises. This capability could be leveraged as an automatic prompting tool, helping guide learners and researchers.

Weaknesses of GPT-5 Thinking

Despite these strengths, GPT-5 Thinking performs poorly in several foundational areas:

Understanding of knowledge and theories (0%):

It struggles to comprehend and apply theorems, even when they are directly relevant.

Computational and analytical skills (0%):

It fails at symbolic manipulation and step-by-step computations.

Application of problem-solving strategies (0%):

It lacks systematic use of standard frameworks such as induction, contradiction, or constructive methods.

Geometric intuition (0%):

It fails to recognize or leverage spatial structures, visual heuristics, or geometric reformulations of problems.

Generalization (17%):

It is weak at extending solutions beyond the immediate case, indicating limited abstraction ability.

Comparison with o4-mini-high

GPT-5 Thinking passed the following categories of problems that o4-mini-high failed:

  • Taxonomy of subject
  • Checking existing arguments
  • Proof writing
  • Performing plausibility check
  • Reading advanced materials

Example Responses of o4-mini-high and GPT-5 Thinking

In the Evaluating Existing Arguments task, GPT-5 Thinking successfully identified the flaw in the proof, while o4-mini-high failed to detect it. Below we present a detailed case study, including the problem statement, both model responses, and the evaluation rubric.

Response of o4-mini-high:

Response of GPT-5 Thinking:


Looking Ahead

We plan to release a comprehensive GAUSS system that combines new features with a structured development pipeline:

  • Problem and dataset curation: define standardized problem formats and rubrics for each GAUSS skill category, and curate problem sets with canonical solutions to build a reliable evaluation corpus.
  • Radar-style skill charts: visualize the strengths and weaknesses of LLMs across the GAUSS skill breakdown (a minimal plotting sketch follows this list).
  • Community crowdsourcing: expand coverage across mathematical domains and enrich the benchmark.
  • AI evaluation system: design and implement an AI evaluator that can automatically verify the correctness and quality of model responses.
  • Exploratory model training: use the GAUSS dataset and AI evaluator as supervision signals to train next-generation math AI models.
  • Adversarial/cooperative training loop: integrate problem generation, problem solving, and evaluation into a tri-party adversarial/cooperative pipeline (e.g., GAN, GAIL style), where AI systems can co-evolve as generators, solvers, and evaluators.
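
As a preview of the radar-style charts mentioned above, here is a minimal matplotlib sketch that plots a twelve-skill profile on a polar axis. The skill abbreviations and scores are placeholders, not measured GAUSS results.

    # Minimal radar-chart sketch for a GAUSS-style skill profile (matplotlib).
    # Skill abbreviations and scores are placeholders, not measured results.
    import numpy as np
    import matplotlib.pyplot as plt

    skills = ["Memory", "Understanding", "Computation", "Framework",
              "Reasoning", "Writing", "Learning", "Intuition",
              "Meta", "Modeling", "Generalization", "Creativity"]
    scores = [0.8, 0.1, 0.2, 0.1, 0.6, 0.7, 0.7, 0.2, 0.5, 0.1, 0.2, 0.6]

    # Repeat the first point so the polygon closes.
    angles = np.linspace(0, 2 * np.pi, len(skills), endpoint=False).tolist()
    angles += angles[:1]
    scores += scores[:1]

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.plot(angles, scores, linewidth=2)
    ax.fill(angles, scores, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(skills, fontsize=8)
    ax.set_ylim(0, 1)
    ax.set_title("Skill profile (placeholder scores)")
    plt.show()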

Conclusion

Ultimately, GAUSS goes beyond a benchmark. It marks a methodological shift — from asking “Did the model solve the problem?” to “How did the model reason through it?”

By exposing both strengths and brittle edges, GAUSS provides a roadmap for building the next generation of AI systems — systems that go beyond producing answers to demonstrate real reasoning, learning, and discovery.


About the Authors