1. Low Skill Resolution
Traditional benchmarks have low skill resolution: they report only whether the final answer is correct. But solving a math problem usually involves multiple skills: recalling knowledge, performing symbolic computations, constructing proofs, or generalizing across domains.
GAUSS disentangles these skills, tagging each problem with the exact abilities it requires. This lets researchers see a model’s skill profile, not just a single score.
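As a rough illustration, skill-level tags turn pass/fail results into a per-skill profile. The sketch below assumes a hypothetical record schema and skill names; they are not GAUSS’s actual taxonomy:

```python
from collections import defaultdict

# Hypothetical evaluation records: each problem carries the skills it
# exercises (illustrative tags, not GAUSS's actual taxonomy).
results = [
    {"id": "algebra-017", "skills": ["knowledge_recall", "symbolic_computation"], "correct": True},
    {"id": "combin-042",  "skills": ["proof_construction"], "correct": False},
    {"id": "geom-103",    "skills": ["symbolic_computation", "cross_domain_generalization"], "correct": True},
]

def skill_profile(records):
    """Aggregate accuracy per skill instead of one overall score."""
    attempted = defaultdict(int)
    solved = defaultdict(int)
    for r in records:
        for skill in r["skills"]:
            attempted[skill] += 1
            solved[skill] += r["correct"]  # bool counts as 0/1
    return {s: solved[s] / attempted[s] for s in attempted}

print(skill_profile(results))
# {'knowledge_recall': 1.0, 'symbolic_computation': 1.0,
#  'proof_construction': 0.0, 'cross_domain_generalization': 1.0}
```

The same pass/fail data now yields a vector of per-skill accuracies, so two models with identical overall scores can still show very different profiles.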
2. Saturation
Many widely used datasets (e.g., GSM8K, MATH) are already saturated: top models achieve near-perfect scores, leaving little room to measure progress. GAUSS addresses this with fresh, carefully curated problems drawn from Olympiads, graduate coursework, and research-level sources.
3. Contamination
Existing benchmarks often contain problems already seen in training data, inflating results. GAUSS minimizes contamination by drawing on diverse, novel sources, ensuring that evaluation reflects mathematical ability rather than memorization.
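GAUSS’s sourcing is editorial, but contamination can also be screened mechanically. Below is a minimal sketch of one standard check, word-level n-gram overlap against the training corpus; the choice of n, the threshold, and the function names are illustrative assumptions, not GAUSS’s actual pipeline:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(problem: str, corpus_ngrams: set,
                       n: int = 8, threshold: float = 0.3) -> bool:
    """Flag a candidate problem whose n-grams overlap heavily with
    the training corpus (threshold is an arbitrary choice here)."""
    probe = ngrams(problem, n)
    if not probe:  # problem shorter than n tokens
        return False
    overlap = len(probe & corpus_ngrams) / len(probe)
    return overlap >= threshold

# Usage: build corpus_ngrams once over the training data, then screen
# every candidate problem before it enters the benchmark.
```

A check like this catches verbatim or near-verbatim leakage; paraphrased duplicates require fuzzier matching, such as embedding similarity.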