Agent evaluation platform
AnyInt Agent Benchmark
Explore agent performance across harnesses, base models, and task categories to compare capability, efficiency, and reliability.
Quick start
Select a dimension to explore scores
Start with a harness, base model, or task category. The workspace then switches between rankings, a radar view, and detailed analysis.
Dataset composition
