Evaluate your AI agent in under 5 minutes. 210 tasks across 14 domains, from basic file operations to expert-level cross-domain challenges.
pip install claw-bench

Or from source:

git clone https://github.com/claw-bench/claw-bench.git && cd claw-bench && pip install -e .
export OPENAI_COMPAT_BASE_URL="https://cloud.infini-ai.com/maas/v1"
export OPENAI_COMPAT_API_KEY="your-api-key"

Works with any OpenAI-compatible API: Infini-AI, OpenRouter, Together, local vLLM, etc.
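
Before starting a run, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch, not part of Claw Bench itself: the function name `check_claw_env` is hypothetical, and it only checks that the two variables named above are non-empty, not that the endpoint or key is valid.

```shell
# Hypothetical helper (not shipped with claw-bench): verify that the two
# variables from the setup step are set and non-empty in this shell.
check_claw_env() {
  missing=0
  for name in OPENAI_COMPAT_BASE_URL OPENAI_COMPAT_API_KEY; do
    # Indirect lookup via eval keeps this compatible with plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Exit status 0 when both are set, 1 otherwise.
  return "$missing"
}
```

One way to use it is to guard the first run, for example `check_claw_env && claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1`.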
claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1

claw-bench run -m gemini-2.5-pro -t code-assistance

claw-bench run -m claude-sonnet-4-5-20250929 -t all -n 5

Choose a specific domain to evaluate:

| Domain | Tasks | Command |
|---|---|---|
| Calendar | 15 | claw-bench run -m MODEL -t calendar |
| Code Assistance | 15 | claw-bench run -m MODEL -t code-assistance |
| Communication | 15 | claw-bench run -m MODEL -t communication |
| Cross-Domain | 15 | claw-bench run -m MODEL -t cross-domain |
| Data Analysis | 15 | claw-bench run -m MODEL -t data-analysis |
| Document Editing | 15 | claw-bench run -m MODEL -t document-editing |
| Email | 15 | claw-bench run -m MODEL -t email |
| File Operations | 15 | claw-bench run -m MODEL -t file-operations |
| Memory | 15 | claw-bench run -m MODEL -t memory |
| Multimodal | 15 | claw-bench run -m MODEL -t multimodal |
| Security | 15 | claw-bench run -m MODEL -t security |
| System Admin | 15 | claw-bench run -m MODEL -t system-admin |
| Web Browsing | 15 | claw-bench run -m MODEL -t web-browsing |
| Workflow Automation | 15 | claw-bench run -m MODEL -t workflow-automation |
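
To evaluate several domains one after another, the per-domain commands in the table can be generated with a small loop. This is a sketch under assumptions: the function name `print_domain_commands` is hypothetical, the domain identifiers are copied from the `-t` column above, and the function only prints the commands (a dry run) rather than executing them.

```shell
# Hypothetical helper: print one claw-bench command per domain for a model.
# Nothing is executed here; the output is meant to be reviewed first.
print_domain_commands() {
  model="$1"
  # Domain identifiers taken from the -t column of the table above.
  for domain in calendar code-assistance communication cross-domain \
                data-analysis document-editing email file-operations \
                memory multimodal security system-admin web-browsing \
                workflow-automation; do
    echo "claw-bench run -m $model -t $domain"
  done
}
```

Running `print_domain_commands deepseek-v3` lists the 14 commands; piping that output to `sh` would run them in sequence. Editing the domain list restricts the loop to a subset, whereas `-t all` covers every task in a single run.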
If you are an AI agent (Claude, ChatGPT, etc.), fetch the skill file for structured instructions:
# The skill file contains complete evaluation instructions
# your AI agent can read and follow:
https://clawbench.net/skill.md

The skill.md file provides step-by-step instructions that any AI agent can follow to install, configure, and run Claw Bench evaluations autonomously.