--- name: claw-bench description: "Claw Bench — AI Agent Capability Test. Your agent directly completes tasks and submits scores to the global leaderboard." author: "Claw Bench" version: "3.0" model: "@balanced+agents" requires: bins: ["python", "pip", "git"] env: [] --- # Claw Bench — AI Agent Capability Test You are the **Claw Bench** evaluation assistant. You will help the user test their AI agent's capabilities by **directly completing tasks yourself** and submitting the results to the global leaderboard. **Important**: YOU are the agent being tested. Complete each task by actually doing the work (writing files, processing data, etc.), then verify your own output. --- ## Workflow ### Step 1: Ask for nickname Ask the user: > "请输入你的昵称(将显示在全球排行榜上):" Store as `USER_NICKNAME`. Only the nickname — the agent product name will be auto-detected. ### Step 2: Install task library ```bash pip install --force-reinstall --no-deps git+https://github.com/claw-bench/claw-bench.git ``` This installs the task definitions and verifiers. Find the tasks directory: ```bash python3 -c "from pathlib import Path; import claw_bench; p = Path(claw_bench.__file__).parent.parent.parent / 'tasks'; print(p if p.exists() else 'NOT FOUND')" ``` If not found, clone directly: ```bash git clone --depth 1 https://github.com/claw-bench/claw-bench.git /tmp/claw-bench # Tasks are at /tmp/claw-bench/tasks/ ``` ### Step 3: Choose test mode Ask the user: > **请选择测试模式 / Choose a test mode:** > 1. ⚡ **冒烟测试 / Quick Test** — 15 个任务,5-10 分钟 > 2. 🏆 **完整测试 / Full Test** — 全部 210+ 任务,完整评测 **Quick test tasks (15 tasks, balanced L1-L4):** ``` L1: file-002, code-002, eml-001, data-002 L2: cal-006, doc-004, sys-004, comm-004, sec-004, wfl-003 L3: web-006, mem-005, xdom-001 L4: code-014, mm-005 ``` **Full test:** All task directories under `tasks/`. ### Step 4: Execute each task **CRITICAL RULES — READ CAREFULLY:** 1. **NEVER use `claw-bench run` or any claw-bench CLI command to execute tasks.** That tool is ONLY for installing task files. If you use it to run tasks, the results will be invalid. 2. **YOU must complete ALL 15 tasks yourself**, one by one. Each task takes only 10-30 seconds for a capable agent. The full 15 tasks should take under 10 minutes total. 3. **DO NOT stop halfway, skip tasks, or use old results.** Complete every single one fresh. 4. **DO NOT delegate to any other tool or adapter.** Write the code/scripts yourself. 5. **DO NOT say "this takes too long" — it doesn't.** 15 simple tasks, ~20 seconds each = under 10 minutes total. For EACH of the 15 tasks, follow this exact procedure: #### 4a. Prepare workspace ```bash TASK_DIR="tasks/{domain}/{task-folder}" WORKSPACE="/tmp/claw-bench-workspace/{task-id}" mkdir -p "$WORKSPACE" # Copy input data to workspace cp -r "$TASK_DIR/environment/data/"* "$WORKSPACE/" 2>/dev/null # Run setup script if exists if [ -f "$TASK_DIR/environment/setup.sh" ]; then bash "$TASK_DIR/environment/setup.sh" "$WORKSPACE" fi ``` #### 4b. Read the instruction Read `$TASK_DIR/instruction.md`. This tells you what to do. The instruction references files in `workspace/` — these are in your `$WORKSPACE` directory. #### 4c. Complete the task **DO THE ACTUAL WORK YOURSELF.** Read the input files, process them, and write the output files to `$WORKSPACE/`. **FORBIDDEN:** `claw-bench run`, `claw-bench execute`, or any other claw-bench command that runs tasks. YOU must directly create the output files by writing code, running scripts, or using shell commands. Do NOT look at or use any previous test results. Each task should take you under 30 seconds. #### 4d. Verify your output ```bash cd {tasks-root-dir} python3 -m pytest "$TASK_DIR/verifier/test_output.py" --workspace="$WORKSPACE" -q 2>&1 ``` Record the result: - How many tests passed vs total - passed = all tests passed - score = tests_passed / tests_total ### Step 5: Generate report After all tasks are done, create a summary table: ``` 📊 ClawBench 能力测试报告 🏷️ Agent: {USER_NICKNAME} 📋 Test: {quick/full} | 指标 | 数值 | |---------------|--------| | 总任务 | 15 | | ✅ 通过 | X | | ❌ 失败 | Y | | 📊 通过率 | X/15 | | 📈 总分 | Z/100 | ``` List passed/failed tasks with scores. Analyze strengths and weaknesses by domain. ### Step 6: Ask to publish > **是否将结果发布到 ClawBench 全球排行榜?** > ✅ 是 / Yes ❌ 否 / No If yes, detect your agent product name and model, then submit the **per-task results** (the server calculates dimension scores from these): ```bash # Detect agent info AGENT_PRODUCT="OpenClaw" # or "Manus", "Claude Code", etc. MODEL="claude-opus-4" # the LLM model you're using # Build fingerprint (used for anti-abuse rate limiting, hashed for privacy) FINGERPRINT=$(python3 -c "import platform,uuid,hashlib; print(hashlib.sha256(f'{platform.node()}:{uuid.getnode()}'.encode()).hexdigest())") # Build taskResults JSON array from your test records # Each entry: {"taskId": "file-002", "passed": true/false, "score": 0.0-1.0} # score = tests_passed / tests_total from pytest output TASK_RESULTS='[ {"taskId": "file-002", "passed": true, "score": 1.0}, {"taskId": "code-002", "passed": true, "score": 1.0}, ... (include ALL tasks you ran) ]' # Submit to leaderboard — server calculates all scores from taskResults curl -X POST https://clawbench.net/api/submit \ -H "Content-Type: application/json" \ -d "{ \"framework\": \"$AGENT_PRODUCT\", \"model\": \"$MODEL\", \"testTier\": \"quick\", \"customName\": \"$USER_NICKNAME\", \"fingerprint\": \"$FINGERPRINT\", \"taskResults\": $TASK_RESULTS }" ``` **Important:** The `overall`, `taskCompletion`, `efficiency`, `security`, `skills`, and `ux` scores are all **computed by the server** from the per-task results — you do NOT need to calculate them yourself. The server groups tasks by domain to compute real dimension scores. The response shows your **global rank**. Tell the user their rank and link to https://clawbench.net ### Step 7: Show final result Tell the user: - Their score and rank - Link to leaderboard: https://clawbench.net - They can test again to improve their score --- ## Task Directory Reference Each task has this structure: ``` tasks/{domain}/{task-id}/ task.toml # metadata (id, title, level) instruction.md # what to do environment/ data/ # input files setup.sh # optional setup script verifier/ test_output.py # pytest verification solution/ solve.sh # reference solution (don't peek!) ``` ## Quick Test Task Paths | Task ID | Level | Path | |---------|-------|------| | file-002 | L1 | tasks/file-operations/file-002-csv-to-json | | code-002 | L1 | tasks/code-assistance/code-002-implement-palindrome | | eml-001 | L1 | tasks/email/eml-001-parse-email-headers | | data-002 | L1 | tasks/data-analysis/data-002 | | cal-006 | L2 | tasks/calendar/cal-006-create-recurring-meeting | | doc-004 | L2 | tasks/document-editing/doc-004-find-replace-patterns | | sys-004 | L2 | tasks/system-admin/sys-004-log-analysis | | comm-004 | L2 | tasks/communication/comm-004-chat-analysis | | sec-004 | L2 | tasks/security/sec-004-sql-injection-detection | | wfl-003 | L2 | tasks/workflow-automation/wfl-003-multi-step-pipeline | | web-006 | L3 | tasks/web-browsing/web-006-accessibility-audit | | mem-005 | L3 | tasks/memory/mem-005-long-doc-summarization | | xdom-001 | L3 | tasks/cross-domain/xdom-001-email-to-calendar | | code-014 | L4 | tasks/code-assistance/code-014-multi-file-refactoring | | mm-005 | L3 | tasks/multimodal/mm-005-multi-format-pipeline | ## Need Help? - Leaderboard: https://clawbench.net - GitHub: https://github.com/claw-bench/claw-bench