1. Install

pip install claw-bench

Or from source:

git clone https://github.com/claw-bench/claw-bench.git && cd claw-bench && pip install -e .
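A quick sanity check that the install worked. `pip show` works for any pip-installed package; the `--help` flag is an assumption based on the usual CLI convention, not confirmed from claw-bench's docs:

```shell
# Confirm the package metadata is visible to pip (works for both install methods)
pip show claw-bench

# Most CLIs print usage with --help (assumed convention, not verified here)
claw-bench --help
```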

2. Configure your LLM provider

export OPENAI_COMPAT_BASE_URL="https://cloud.infini-ai.com/maas/v1"
export OPENAI_COMPAT_API_KEY="your-api-key"

Works with any OpenAI-compatible API: Infini-AI, OpenRouter, Together, local vLLM, etc.
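For example, pointing the same two variables at other providers. The base URLs below reflect each provider's documented OpenAI-compatible endpoint, not anything claw-bench-specific; verify them against your provider before use:

```shell
# OpenRouter's OpenAI-compatible endpoint
export OPENAI_COMPAT_BASE_URL="https://openrouter.ai/api/v1"

# Or a local vLLM server (vllm serve listens on port 8000 by default)
# export OPENAI_COMPAT_BASE_URL="http://localhost:8000/v1"

export OPENAI_COMPAT_API_KEY="your-api-key"

# Connectivity check: OpenAI-compatible APIs expose GET /models
curl -s -H "Authorization: Bearer $OPENAI_COMPAT_API_KEY" \
  "$OPENAI_COMPAT_BASE_URL/models"
```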

3. Run a benchmark

Smoke test (4 tasks, ~1 min)

claw-bench run -m deepseek-v3 -t file-001,code-002,cal-001,file-003 -n 1

Test a domain (15 tasks)

claw-bench run -m gemini-2.5-pro -t code-assistance

Full benchmark (210 tasks)

claw-bench run -m claude-sonnet-4-5-20250929 -t all -n 5

Capability Tests

Choose a specific domain to evaluate:

Domain                Tasks  Command
Calendar              15     claw-bench run -m MODEL -t calendar
Code Assistance       15     claw-bench run -m MODEL -t code-assistance
Communication         15     claw-bench run -m MODEL -t communication
Cross-Domain          15     claw-bench run -m MODEL -t cross-domain
Data Analysis         15     claw-bench run -m MODEL -t data-analysis
Document Editing      15     claw-bench run -m MODEL -t document-editing
Email                 15     claw-bench run -m MODEL -t email
File Operations       15     claw-bench run -m MODEL -t file-operations
Memory                15     claw-bench run -m MODEL -t memory
Multimodal            15     claw-bench run -m MODEL -t multimodal
Security              15     claw-bench run -m MODEL -t security
System Admin          15     claw-bench run -m MODEL -t system-admin
Web Browsing          15     claw-bench run -m MODEL -t web-browsing
Workflow Automation   15     claw-bench run -m MODEL -t workflow-automation
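To run every domain one at a time (handy for checkpointing a long evaluation), the table above can be scripted. The loop below is a sketch: the slugs are exactly the `-t` values listed, and MODEL is a placeholder for your model name:

```shell
# Iterate over the 14 domain slugs from the table above
for domain in calendar code-assistance communication cross-domain \
              data-analysis document-editing email file-operations \
              memory multimodal security system-admin web-browsing \
              workflow-automation; do
  claw-bench run -m MODEL -t "$domain"
done
```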

For AI Agents (skill.md)

If you are an AI agent (Claude, ChatGPT, etc.), fetch the skill file for structured instructions:

# The skill file contains complete evaluation instructions
# your AI agent can read and follow:
https://clawbench.net/skill.md

The skill.md file provides step-by-step instructions that any AI agent can follow to install, configure, and run Claw Bench evaluations autonomously.
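A human can also fetch the file directly to review what an agent would be instructed to do (plain curl; no claw-bench-specific tooling assumed):

```shell
# Download the skill file for inspection
curl -fsSL https://clawbench.net/skill.md
```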