1. Install

pip install claw-bench

Or from source: git clone https://github.com/claw-bench/claw-bench.git && cd claw-bench && pip install -e .

2. Configure your LLM provider

export OPENAI_COMPAT_BASE_URL="https://cloud.infini-ai.com/maas/v1"
export OPENAI_COMPAT_API_KEY="your-api-key"

Works with any OpenAI-compatible API: Infini-AI, OpenRouter, Together, local vLLM, etc.
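Before starting a run, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch, not part of claw-bench: `check_provider_env` is a hypothetical helper name, and it only checks that the two variables shown above are non-empty, not that the endpoint or key is valid.

```shell
# Hypothetical helper (not shipped with claw-bench): report which of the
# two provider variables from the step above are unset or empty.
check_provider_env() {
  missing=""
  for name in OPENAI_COMPAT_BASE_URL OPENAI_COMPAT_API_KEY; do
    # Indirect lookup in POSIX sh; the names are the fixed literals above.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

Calling `check_provider_env || exit 1` at the top of a wrapper script stops it early when the configuration is incomplete.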

3. Run a benchmark

Smoke test (4 tasks, ~1 min)

claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1

Test a domain (15 tasks)

claw-bench run -m gemini-2.5-pro -t code-assistance

Full benchmark (… tasks)

claw-bench run -m claude-sonnet-4-5-20250929 -t all -n 5
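The smoke test passes several task IDs to `-t` as one comma-separated value. If the IDs live in a script as separate words, a small helper can build that value. This is a sketch under that assumption: `join_tasks` is a hypothetical name, and the block only prints the resulting command instead of running it.

```shell
# Hypothetical helper: join task IDs with commas, the form -t takes in the
# smoke test above. The function body runs in a subshell so the IFS change
# does not leak into the caller.
join_tasks() (
  IFS=,
  printf '%s\n' "$*"
)

TASKS=$(join_tasks file-001 code-002 cal-001 file-003)

# Print the command rather than run it, so the sketch works without
# claw-bench installed.
echo "claw-bench run -m deepseek-v3 -t $TASKS -n 1"
```

Dropping the `echo` and its quotes turns the last line into the actual smoke-test invocation.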

Capability Tests

Choose a specific domain to evaluate, replacing MODEL with the name of your model:

| Domain | Tasks | Command |
| --- | --- | --- |
| Academic Research | 5 | claw-bench run -m MODEL -t academic-research |
| Accounting | 5 | claw-bench run -m MODEL -t accounting |
| Bioinformatics | 5 | claw-bench run -m MODEL -t bioinformatics |
| Calendar | 15 | claw-bench run -m MODEL -t calendar |
| Clinical Data | 5 | claw-bench run -m MODEL -t clinical-data |
| Code Assistance | 15 | claw-bench run -m MODEL -t code-assistance |
| Communication | 15 | claw-bench run -m MODEL -t communication |
| Content Analysis | 5 | claw-bench run -m MODEL -t content-analysis |
| Contract Review | 5 | claw-bench run -m MODEL -t contract-review |
| Cross-Domain | 17 | claw-bench run -m MODEL -t cross-domain |
| CS Engineering | 5 | claw-bench run -m MODEL -t cs-engineering |
| Data Analysis | 17 | claw-bench run -m MODEL -t data-analysis |
| Data Science | 5 | claw-bench run -m MODEL -t data-science |
| Database | 5 | claw-bench run -m MODEL -t database |
| Debugging | 5 | claw-bench run -m MODEL -t debugging |
| Document Editing | 18 | claw-bench run -m MODEL -t document-editing |
| Education | 1 | claw-bench run -m MODEL -t education |
| Educational Assessment | 5 | claw-bench run -m MODEL -t educational-assessment |
| Email | 18 | claw-bench run -m MODEL -t email |
| File Operations | 15 | claw-bench run -m MODEL -t file-operations |
| Financial Analysis | 7 | claw-bench run -m MODEL -t financial-analysis |
| Market Research | 5 | claw-bench run -m MODEL -t market-research |
| Math Reasoning | 5 | claw-bench run -m MODEL -t math-reasoning |
| Multi-Agent | 4 | claw-bench run -m MODEL -t multi-agent |
| Memory | 15 | claw-bench run -m MODEL -t memory |
| Multimodal | 15 | claw-bench run -m MODEL -t multimodal |
| Planning | 5 | claw-bench run -m MODEL -t planning |
| Real Tools | 5 | claw-bench run -m MODEL -t real-tools |
| Regulatory Compliance | 5 | claw-bench run -m MODEL -t regulatory-compliance |
| Scientific Computing | 5 | claw-bench run -m MODEL -t scientific-computing |
| Security | 15 | claw-bench run -m MODEL -t security |
| System Admin | 15 | claw-bench run -m MODEL -t system-admin |
| Web Browsing | 15 | claw-bench run -m MODEL -t web-browsing |
| Workflow Automation | 17 | claw-bench run -m MODEL -t workflow-automation |
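To evaluate a handful of domains from the table in one go, the per-domain command can be wrapped in a loop. The sketch below is illustrative only: `run_domains` and the `DRY_RUN` switch are hypothetical (not claw-bench features), the default model name is borrowed from the smoke test, and dry-run mode is on by default so the script only prints what it would execute.

```shell
# Model to evaluate; the default is taken from the smoke-test example.
MODEL="${MODEL:-deepseek-v3}"
# Hypothetical switch, not a claw-bench option: 1 prints commands, 0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# Run (or print) one claw-bench invocation per domain name given.
run_domains() {
  for domain in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "claw-bench run -m $MODEL -t $domain"
    else
      claw-bench run -m "$MODEL" -t "$domain"
    fi
  done
}

run_domains calendar email file-operations
```

Setting `DRY_RUN=0` runs the domains one after another with the same model.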

For AI Agents (skill.md)

If you are an AI agent (Claude, ChatGPT, etc.), fetch the skill file for structured instructions:

# The skill file contains complete evaluation instructions
# your AI agent can read and follow:
https://clawbench.net/skill.md

The skill.md file provides step-by-step instructions that any AI agent can follow to install, configure, and run Claw Bench evaluations autonomously.