Agent evaluation platform

AnyInt Agent Benchmark

Explore agent performance across harness, base model, and task category to compare capability, efficiency, and reliability.

Tasks: 84
Agent units: 24
Categories: 11
Harnesses: 4

Quick start

Select a dimension to explore scores

Start with a harness, base model, or task category. The workspace will switch between rankings, a radar view, and detailed analysis.

Dataset composition

Task category distribution

Software Engineering: 16 · 19.0%
Office & White Collar: 14 · 16.7%
Natural Science: 12 · 14.3%
Media & Content Production: 11 · 13.1%
Cybersecurity: 8 · 9.5%
Finance: 8 · 9.5%
Robotics: 5 · 6.0%
Manufacturing: 3 · 3.6%
Energy: 3 · 3.6%
Mathematics: 2 · 2.4%
Healthcare: 2 · 2.4%
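
The percentages in the distribution above follow from the task counts. As a minimal sketch, the shares can be recomputed as shown here; the names `category_counts` and `category_shares` are illustrative and not part of the platform, and rounding to one decimal place is an assumption based on the displayed values.

```python
# Task counts per category, copied from the distribution listed above.
category_counts = {
    "Software Engineering": 16,
    "Office & White Collar": 14,
    "Natural Science": 12,
    "Media & Content Production": 11,
    "Cybersecurity": 8,
    "Finance": 8,
    "Robotics": 5,
    "Manufacturing": 3,
    "Energy": 3,
    "Mathematics": 2,
    "Healthcare": 2,
}


def category_shares(counts):
    """Return each category's share of all tasks, in percent.

    Rounding to one decimal place is an assumption that matches
    the figures shown on the page.
    """
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_tasks = sum(category_counts.values())
shares = category_shares(category_counts)

# Print the distribution in the same "count · percent" layout as the page.
for name, count in category_counts.items():
    print(f"{name}: {count} · {shares[name]}%")
```

The counts add up to the 84 tasks and 11 categories reported in the summary, which makes this a quick consistency check on the figures.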