Agent evaluation platform

AnyInt Agent Benchmark

Explore agent performance across harness, base model, and task category to compare capability, efficiency, and reliability.

Tasks: 84

Agent units: 24

Categories: 11

Harnesses: 4

Quick start

Select a dimension to explore scores

Start with a harness, base model, or task category. The workspace will switch between rankings, a radar view, and detailed analysis.

Dataset composition

Software Engineering: 16 · 19.0%

Office & White Collar: 14 · 16.7%

Natural Science: 12 · 14.3%

Media & Content Production: 11 · 13.1%

Cybersecurity: 8 · 9.5%

Finance: 8 · 9.5%

Robotics: 5 · 6.0%

Manufacturing: 3 · 3.6%

Energy: 3 · 3.6%

Mathematics: 2 · 2.4%

Healthcare: 2 · 2.4%
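
The percentages in the composition list appear to be each category's task count divided by the 84-task total, rounded to one decimal place. The Python sketch below reproduces that arithmetic as a sanity check. The names `counts`, `share`, and `shares` are illustrative only and are not part of the benchmark's own tooling.

```python
# Task counts per category, copied from the dataset composition list.
counts = {
    "Software Engineering": 16,
    "Office & White Collar": 14,
    "Natural Science": 12,
    "Media & Content Production": 11,
    "Cybersecurity": 8,
    "Finance": 8,
    "Robotics": 5,
    "Manufacturing": 3,
    "Energy": 3,
    "Mathematics": 2,
    "Healthcare": 2,
}

# The total should match the "Tasks" counter on the page.
total = sum(counts.values())


def share(n: int, total: int) -> float:
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


shares = {name: share(n, total) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} · {pct:.1f}%")
```

Because each share is rounded independently, the displayed percentages need not sum to exactly 100.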