Evaluate your AI agent in under 5 minutes. … tasks across … domains, from basic file operations to expert-level cross-domain challenges.
pip install claw-bench

Or from source:

git clone https://github.com/claw-bench/claw-bench.git && cd claw-bench && pip install -e .

export OPENAI_COMPAT_BASE_URL="https://cloud.infini-ai.com/maas/v1"
export OPENAI_COMPAT_API_KEY="your-api-key"

Works with any OpenAI-compatible API: Infini-AI, OpenRouter, Together, local vLLM, etc.
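
Before starting a run, it can help to confirm that both variables are actually set in the current environment. The following is a minimal sketch, not part of claw-bench: the helper name `missing_vars` is hypothetical, and the only names taken from this guide are the two environment variables.

```python
import os

# The two variables this guide asks you to export.
REQUIRED_VARS = ("OPENAI_COMPAT_BASE_URL", "OPENAI_COMPAT_API_KEY")


def missing_vars(env=None):
    """Return the required variable names that are unset or blank.

    `env` defaults to the process environment; pass a dict to check
    a different mapping (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("Endpoint configuration looks complete.")
```

The check only looks for presence; it does not contact the endpoint or validate the key.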
claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1

claw-bench run -m gemini-2.5-pro -t code-assistance

claw-bench run -m claude-sonnet-4-5-20250929 -t all -n 5

Choose a specific domain to evaluate:

| Domain | Tasks | Command |
|---|---|---|
| Academic Research | 5 | claw-bench run -m MODEL -t academic-research |
| Accounting | 5 | claw-bench run -m MODEL -t accounting |
| Bioinformatics | 5 | claw-bench run -m MODEL -t bioinformatics |
| Calendar | 15 | claw-bench run -m MODEL -t calendar |
| Clinical Data | 5 | claw-bench run -m MODEL -t clinical-data |
| Code Assistance | 15 | claw-bench run -m MODEL -t code-assistance |
| Communication | 15 | claw-bench run -m MODEL -t communication |
| Content Analysis | 5 | claw-bench run -m MODEL -t content-analysis |
| Contract Review | 5 | claw-bench run -m MODEL -t contract-review |
| Cross-Domain | 17 | claw-bench run -m MODEL -t cross-domain |
| CS Engineering | 5 | claw-bench run -m MODEL -t cs-engineering |
| Data Analysis | 17 | claw-bench run -m MODEL -t data-analysis |
| Data Science | 5 | claw-bench run -m MODEL -t data-science |
| Database | 5 | claw-bench run -m MODEL -t database |
| Debugging | 5 | claw-bench run -m MODEL -t debugging |
| Document Editing | 18 | claw-bench run -m MODEL -t document-editing |
| Education | 1 | claw-bench run -m MODEL -t education |
| Educational Assessment | 5 | claw-bench run -m MODEL -t educational-assessment |
| Email | 18 | claw-bench run -m MODEL -t email |
| File Operations | 15 | claw-bench run -m MODEL -t file-operations |
| Financial Analysis | 7 | claw-bench run -m MODEL -t financial-analysis |
| Market Research | 5 | claw-bench run -m MODEL -t market-research |
| Math Reasoning | 5 | claw-bench run -m MODEL -t math-reasoning |
| Multi-Agent | 4 | claw-bench run -m MODEL -t multi-agent |
| Memory | 15 | claw-bench run -m MODEL -t memory |
| Multimodal | 15 | claw-bench run -m MODEL -t multimodal |
| Planning | 5 | claw-bench run -m MODEL -t planning |
| Real Tools | 5 | claw-bench run -m MODEL -t real-tools |
| Regulatory Compliance | 5 | claw-bench run -m MODEL -t regulatory-compliance |
| Scientific Computing | 5 | claw-bench run -m MODEL -t scientific-computing |
| Security | 15 | claw-bench run -m MODEL -t security |
| System Admin | 15 | claw-bench run -m MODEL -t system-admin |
| Web Browsing | 15 | claw-bench run -m MODEL -t web-browsing |
| Workflow Automation | 17 | claw-bench run -m MODEL -t workflow-automation |
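
To evaluate several domains in one go, the commands in the table can be generated and run in a loop. This is a rough sketch rather than a claw-bench feature: `build_command` and `run_domains` are hypothetical helpers, and they use only the `-m`, `-t`, and `-n` flags that appear in this guide. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_command(model, target, n=None):
    """Build a `claw-bench run` argument list for one domain or task list.

    `n` is passed through as `-n`, mirroring the examples in this guide;
    its exact meaning is not defined here.
    """
    cmd = ["claw-bench", "run", "-m", model, "-t", target]
    if n is not None:
        cmd += ["-n", str(n)]
    return cmd


def run_domains(model, domains, n=None, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    commands = [build_command(model, domain, n) for domain in domains]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_domains("deepseek-v3", ["calendar", "email", "security"])
```

Pass `dry_run=False` once the printed commands look right.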
If you are an AI agent (Claude, ChatGPT, etc.), fetch the skill file for structured instructions:
# The skill file contains complete evaluation instructions
# your AI agent can read and follow:
https://clawbench.net/skill.md

The skill.md file provides step-by-step instructions that any AI agent can follow to install, configure, and run Claw Bench evaluations autonomously.
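
For an agent or script that needs the file contents programmatically, fetching it could look roughly like this. It is a sketch using only the Python standard library: the function name `fetch_skill` and the `opener` parameter are hypothetical, and the UTF-8 decoding is an assumption about how the file is served.

```python
import urllib.request

# URL taken from this guide.
SKILL_URL = "https://clawbench.net/skill.md"


def fetch_skill(url=SKILL_URL, opener=urllib.request.urlopen, timeout=30):
    """Download the skill file and return it as text.

    `opener` is injectable so the function can be exercised offline;
    decoding as UTF-8 is an assumption, not something the guide states.
    """
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_skill()` with no arguments downloads the file from the URL above; the returned text can then be handed to the agent as its instructions.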