Dual-Track Scoring Model

Tasks are divided into two tracks: Foundation (core agent capabilities, 60% weight) and Subject-Matter (domain-specific professional tasks, 40% weight). The overall score combines both tracks.
Total Tasks: 319
Foundation Tasks: 257
Subject Tasks: 62
L1 Easy: 45
L2 Medium: 130
L3 Hard: 113
L4 Expert: 31
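As a rough illustration of the 60/40 weighting, the sketch below combines two track scores into one overall score. The source does not give the exact formula, so the linear weighted average, the 0-100 score scale, and the name `overall_score` are all assumptions made for this example.

```python
# Weights taken from the track descriptions (Foundation 60%, Subject-Matter 40%).
FOUNDATION_WEIGHT = 0.60
SUBJECT_WEIGHT = 0.40


def overall_score(foundation_score: float, subject_score: float) -> float:
    """Combine two track scores with a linear weighted average.

    Assumption: both scores are on the same scale (for example 0-100).
    The benchmark's real aggregation rule may differ.
    """
    return FOUNDATION_WEIGHT * foundation_score + SUBJECT_WEIGHT * subject_score
```

Under these assumptions, a Foundation score of 80 and a Subject-Matter score of 70 would combine to roughly 76.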

Foundation Track (60%)

257 tasks
Domain               Tasks  L1  L2  L3  L4
Calendar                15   5   5   3   2
Code Assistance         15   3   6   4   2
Communication           15   3   5   6   1
Cross-Domain            17   0   0  10   7
CS Engineering           5   0   1   4   0
Data Analysis           17   3   6   6   2
Database                 5   1   2   1   1
Debugging                5   1   2   1   1
Document Editing        18   4   9   4   1
Education                1   0   1   0   0
Email                   18   3   8   6   1
File Operations         15   6   5   3   1
Math Reasoning           5   1   2   1   1
Multi-Agent              4   0   1   2   1
Memory                  15   1   6   7   1
Multimodal              15   1   6   7   1
Planning                 5   1   2   1   1
Real Tools               5   1   2   1   1
Security                15   3   5   4   3
System Admin            15   3   6   5   1
Web Browsing            15   3   6   5   1
Workflow Automation     17   2   8   6   1
Total                  257  45  94  87  31

Subject-Matter Track (40%)

62 tasks
Domain                         Tasks  L1  L2  L3  L4
domains.academicResearch           5   0   4   1   0
domains.accounting                 5   0   3   2   0
domains.bioinformatics             5   0   2   3   0
domains.clinicalData               5   0   3   2   0
domains.contentAnalysis            5   0   3   2   0
domains.contractReview             5   0   3   2   0
domains.dataScienceDomain          5   0   4   1   0
domains.educationalAssessment      5   0   3   2   0
domains.financialAnalysis          7   0   4   3   0
domains.marketResearch             5   0   3   2   0
domains.regulatoryCompliance       5   0   3   2   0
domains.scientificComputing        5   0   1   4   0
Total                             62   0  36  26   0
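The totals rows of the two track tables can be cross-checked against the headline difficulty counts. The sketch below is only a sanity check on the published figures; the dictionary layout and the name `combined_by_level` are illustrative choices, not part of the benchmark.

```python
LEVELS = ("L1", "L2", "L3", "L4")

# Per-difficulty totals copied from the Foundation and Subject-Matter tables.
FOUNDATION_TOTALS = {"L1": 45, "L2": 94, "L3": 87, "L4": 31}
SUBJECT_TOTALS = {"L1": 0, "L2": 36, "L3": 26, "L4": 0}


def combined_by_level(foundation: dict, subject: dict) -> dict:
    """Add the two tracks' task counts level by level."""
    return {level: foundation[level] + subject[level] for level in LEVELS}


combined = combined_by_level(FOUNDATION_TOTALS, SUBJECT_TOTALS)
total_tasks = sum(combined.values())
```

The combined counts come to 45, 130, 113, and 31 for L1 through L4, and they sum to 319, matching the headline figures.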

Subject Categories

STEM (10 tasks)
domains.dataScienceDomain: 5
domains.scientificComputing: 5

Business & Finance (17 tasks)
domains.accounting: 5
domains.financialAnalysis: 7
domains.marketResearch: 5

Law & Compliance (10 tasks)
domains.contractReview: 5
domains.regulatoryCompliance: 5

Healthcare (10 tasks)
domains.bioinformatics: 5
domains.clinicalData: 5

Humanities & Education (15 tasks)
domains.academicResearch: 5
domains.contentAnalysis: 5
domains.educationalAssessment: 5

Difficulty legend: L1 Easy, L2 Medium, L3 Hard, L4 Expert