Claw Bench Leaderboard

1. Install

pip install claw-bench

Or from source: git clone https://github.com/claw-bench/claw-bench.git && cd claw-bench && pip install -e .

2. Configure your LLM provider

export OPENAI_COMPAT_BASE_URL="https://cloud.infini-ai.com/maas/v1"
export OPENAI_COMPAT_API_KEY="your-api-key"

These settings work with any OpenAI-compatible API, such as Infini-AI, OpenRouter, Together, or a local vLLM server.
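Before starting a run, it can help to confirm that both variables are actually set in the current environment. The following is a minimal sketch, not part of claw-bench itself: the helper name `missing_provider_settings` is hypothetical, and only the two variable names come from the export lines above.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("OPENAI_COMPAT_BASE_URL", "OPENAI_COMPAT_API_KEY")


def missing_provider_settings(env=os.environ):
    """Return the names of required variables that are unset or blank.

    Hypothetical helper for illustration; claw-bench may perform its own checks.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_provider_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("Provider settings look complete.")
```

The check only verifies that the values are present; it does not contact the provider or validate the key.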

3. Run a benchmark

Smoke test (4 tasks, ~1 min)

claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1

Test a domain (15 tasks)

claw-bench run -m gemini-2.5-pro -t code-assistance

Full benchmark (… tasks)

claw-bench run -m claude-sonnet-4-5-20250929 -t all -n 5
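When the runs above are launched from a script, the command can be assembled as an argument list rather than a hand-edited string. This is a rough sketch under stated assumptions: `build_run_command` is a hypothetical helper, and it uses only the `-m`, `-t`, and `-n` flags exactly as they appear in the examples above, without assuming anything further about the CLI.

```python
import shlex


def build_run_command(model, tasks, n=None):
    """Assemble the argument list for `claw-bench run`.

    `tasks` may be a single string (a domain name or "all") or a list of
    task IDs, which is joined with commas as in the smoke-test example.
    `n` is passed through to -n unchanged when given.
    """
    task_arg = tasks if isinstance(tasks, str) else ",".join(tasks)
    command = ["claw-bench", "run", "-m", model, "-t", task_arg]
    if n is not None:
        command += ["-n", str(n)]
    return command


smoke = build_run_command(
    "deepseek-v3", ["file-001", "code-002", "cal-001", "file-003"], n=1
)
print(shlex.join(smoke))
# prints: claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1
```

Once claw-bench is installed, the returned list can be handed to `subprocess.run(smoke, check=True)`; the sketch only prints the command so that it has no side effects.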

Capability Tests

Choose a specific domain to evaluate, replacing MODEL with the name of your model:

Domain	Tasks	Command
Academic Research	5	`claw-bench run -m MODEL -t academic-research`
Accounting	5	`claw-bench run -m MODEL -t accounting`
Bioinformatics	5	`claw-bench run -m MODEL -t bioinformatics`
Calendar	15	`claw-bench run -m MODEL -t calendar`
Clinical Data	5	`claw-bench run -m MODEL -t clinical-data`
Code Assistance	15	`claw-bench run -m MODEL -t code-assistance`
Communication	15	`claw-bench run -m MODEL -t communication`
Content Analysis	5	`claw-bench run -m MODEL -t content-analysis`
Contract Review	5	`claw-bench run -m MODEL -t contract-review`
Cross-Domain	17	`claw-bench run -m MODEL -t cross-domain`
CS Engineering	5	`claw-bench run -m MODEL -t cs-engineering`
Data Analysis	17	`claw-bench run -m MODEL -t data-analysis`
Data Science	5	`claw-bench run -m MODEL -t data-science`
Database	5	`claw-bench run -m MODEL -t database`
Debugging	5	`claw-bench run -m MODEL -t debugging`
Document Editing	18	`claw-bench run -m MODEL -t document-editing`
Education	1	`claw-bench run -m MODEL -t education`
Educational Assessment	5	`claw-bench run -m MODEL -t educational-assessment`
Email	18	`claw-bench run -m MODEL -t email`
File Operations	15	`claw-bench run -m MODEL -t file-operations`
Financial Analysis	7	`claw-bench run -m MODEL -t financial-analysis`
Market Research	5	`claw-bench run -m MODEL -t market-research`
Math Reasoning	5	`claw-bench run -m MODEL -t math-reasoning`
Memory	15	`claw-bench run -m MODEL -t memory`
Multi-Agent	4	`claw-bench run -m MODEL -t multi-agent`
Multimodal	15	`claw-bench run -m MODEL -t multimodal`
Planning	5	`claw-bench run -m MODEL -t planning`
Real Tools	5	`claw-bench run -m MODEL -t real-tools`
Regulatory Compliance	5	`claw-bench run -m MODEL -t regulatory-compliance`
Scientific Computing	5	`claw-bench run -m MODEL -t scientific-computing`
Security	15	`claw-bench run -m MODEL -t security`
System Admin	15	`claw-bench run -m MODEL -t system-admin`
Web Browsing	15	`claw-bench run -m MODEL -t web-browsing`
Workflow Automation	17	`claw-bench run -m MODEL -t workflow-automation`
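To evaluate several domains in one session, the commands in the table can be generated from a list of domain names. The sketch below is illustrative only: `plan_domain_runs` is a hypothetical helper, the task counts are copied from three rows of the table above, and the model name is the one used in the smoke-test example.

```python
# Task counts copied from the table above for three of the listed domains.
DOMAIN_TASKS = {"calendar": 15, "email": 18, "security": 15}


def plan_domain_runs(model, domains, task_counts=DOMAIN_TASKS):
    """Return one command string per domain plus the total task count."""
    commands = [f"claw-bench run -m {model} -t {domain}" for domain in domains]
    total = sum(task_counts[domain] for domain in domains)
    return commands, total


commands, total = plan_domain_runs("deepseek-v3", ["calendar", "email"])
for command in commands:
    print(command)
print(f"Total tasks: {total}")
# Last line printed: Total tasks: 33
```

An unknown domain name raises `KeyError`, which is a deliberate choice here so that a typo fails before any run starts.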

For AI Agents (skill.md)

If you are an AI agent (Claude, ChatGPT, etc.), fetch the skill file for structured instructions:

# The skill file contains complete evaluation instructions
# your AI agent can read and follow:
https://clawbench.net/skill.md

The skill.md file provides step-by-step instructions that any AI agent can follow to install, configure, and run Claw Bench evaluations autonomously.
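An agent or script could retrieve the skill file with the Python standard library. This is a minimal sketch: `fetch_skill` is a hypothetical helper, the 30-second timeout is an arbitrary choice, and UTF-8 encoding of the file is an assumption rather than something the page states.

```python
import urllib.request

# URL as given above.
SKILL_URL = "https://clawbench.net/skill.md"


def fetch_skill(url=SKILL_URL, opener=urllib.request.urlopen, timeout=30):
    """Download the skill file and return its text.

    `opener` is injectable so the function can be exercised offline.
    Assumes the file is UTF-8 encoded.
    """
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_skill()` with no arguments performs the real download; the sketch does not do so at import time, so it has no network side effects until it is called.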