ICML Submission

AgentSelectBench

A unified-supervision benchmark for narrative query-to-agent recommendation: given a free-form natural-language request, rank deployable agent configurations.

AgentSelectBench systematically converts heterogeneous evaluation artifacts (LLM leaderboards, tool-use benchmarks) into query-conditioned, positive-only interactions for training and evaluating agent recommenders at scale.

40+ Data Sources · Top-10 Evaluation · Multiple Baselines
111,179 Queries · 107,721 Agents · 251,103 Interactions · 40+ Data Sources

Benchmark Overview

AgentSelectBench comprises three complementary dataset parts, systematically covering LLM-only, toolkit-only, and compositional agent configurations.

Overview of AgentSelect - Three benchmark parts showing LLM-only, toolkit-only, and compositional agents

Figure 1: Overview of AgentSelect. Arrows show the flow; icons indicate backbone LLMs, tools, and composed agents.

45.9%
Part I: LLM-only Agents

Query-conditioned supervision derived from LLM evaluations/leaderboards (tools absent). Positives are constructed as top-k preferred backbones per query.
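The top-k positive construction described above can be sketched in a few lines of Python; the function name and data layout are illustrative, not the released pipeline:

```python
def topk_positives(scores, k=10):
    """Turn per-query backbone scores into positive (query, agent) pairs.

    scores: dict mapping query_id -> {backbone_name: benchmark_score}
    Returns a dict mapping query_id -> the k highest-scoring backbones,
    which become the query's positive agents.
    """
    positives = {}
    for qid, by_backbone in scores.items():
        ranked = sorted(by_backbone, key=by_backbone.get, reverse=True)
        positives[qid] = ranked[:k]
    return positives
```

In Part I the per-query scores come from LLM leaderboards, so ties and missing entries would need benchmark-specific handling this sketch omits.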

23,073
Queries
231
Agents
MMLU · BBH · MATH · GPQA · MUSR · HumanEval
30.3%
Part II: Toolkit-only Agents

Tool-use benchmarks provide the required/reference toolkit for each query; we treat each query's toolkit as the positive target (backbone fixed to a placeholder).
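A minimal sketch of this positive construction, assuming a sentinel string for the fixed backbone (the actual placeholder token in the released data may differ):

```python
PLACEHOLDER_LLM = "__placeholder__"  # assumed sentinel; the dataset may use a different token

def toolkit_positive(query_id, reference_toolkit):
    """Map a tool-use benchmark query to its single positive agent:
    a (backbone, toolkit) pair whose backbone is fixed to a placeholder
    and whose toolkit is the query's required/reference tool set."""
    agent = (PLACEHOLDER_LLM, tuple(sorted(reference_toolkit)))
    return (query_id, agent)
```

Sorting the toolkit makes agents with the same tool set compare equal regardless of the order the source benchmark lists them in.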

76,197
Queries
47,949
Agents
ToolGen · Arena · UltraTool · ToolHop · APIBank
23.7%
Part III: Compositional Agents

Constructs realistic (M, T) configurations by retrieving query-relevant backbone and toolkit components and composing them into candidate agents, yielding pseudo-positive interactions.
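The retrieve-and-compose step can be illustrated as follows; the dot-product similarity and the component counts are stand-ins for whatever retriever the actual pipeline uses:

```python
def compose_agents(query_vec, llm_vecs, tool_vecs, n_llms=2, n_tools=3):
    """Retrieve the most query-relevant backbones and tools (plain
    dot-product similarity here, purely illustrative) and pair them
    into candidate (M, T) agents treated as pseudo-positives."""
    def top(vecs, n):
        sim = lambda v: sum(a * b for a, b in zip(query_vec, v))
        return sorted(vecs, key=lambda name: sim(vecs[name]), reverse=True)[:n]

    llms = top(llm_vecs, n_llms)
    tools = tuple(top(tool_vecs, n_tools))
    # Each retrieved backbone is composed with the retrieved toolkit.
    return [(m, tools) for m in llms]
```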

11,909
Queries
59,541
Agents
GAIA · MTU_MSTS · DeepResearch · MTU_MMTN · MTU_SMTN

Benchmark Characteristics

Interactive visualizations of the benchmark distribution across queries, agents, tools, and interactions.

  • Query Distribution: Part I = 21,606 + 1,467 queries
  • Agent Distribution: Part I = 173 + 58 agents
  • Tool Distribution: number of unique tools used by agents
  • Positive Interaction Distribution: 251,103 interactions in total

Figure 3: Complete distribution overview of benchmark characteristics.

Agent Configuration

Each agent is represented as a capability profile with backbone LLM, toolkit, and configuration settings.

Backbone LLM (M)
Name: Qwen2.5 Instruct-72B
Description: Released in September 2024, Qwen2.5 pushes the context window to 128k and can generate passages up to 8k tokens. Relative to Qwen2 it shows significant improvements...
Toolkit (T)
Name: family_relation_finder
Description: A tool designed to find and analyze familial relationships...
Name: genealogy_query
Description: Focusing on identifying various familial connections of historical or contemporary figures.
Name: extract_last_name
Description: Extracting the last name from a full name string...
Name: advanced_character_counter
Description: Counting occurrences of specified characters in given strings...
Configuration (C)

Session History:

add_history_to_messages: True
read_chat_history: True
read_tool_call_history: True

Memory Management:

enable_agentic_memory: True
enable_user_memories: True
enable_session_summaries: True

Knowledge Integration:

knowledge_base: Some Built Vector Databases

Table 1: Example of Agent Configuration - Stored as YAML configuration to keep agents deployable.
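A capability profile like the one in Table 1 can be modeled, for illustration, as a plain dataclass whose dict form serializes directly to YAML; the field names here are assumptions mirroring the table, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class AgentProfile:
    """Capability profile mirroring Table 1 (illustrative field names)."""
    backbone_llm: str
    toolkit: list
    session_history: dict = field(default_factory=lambda: {
        "add_history_to_messages": True,
        "read_chat_history": True,
        "read_tool_call_history": True,
    })
    memory: dict = field(default_factory=lambda: {
        "enable_agentic_memory": True,
        "enable_user_memories": True,
        "enable_session_summaries": True,
    })
    knowledge_base: str = ""

agent = AgentProfile(
    backbone_llm="Qwen2.5 Instruct-72B",
    toolkit=["family_relation_finder", "genealogy_query"],
)
# asdict(agent) yields a plain dict ready to dump as the deployable YAML config.
```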

Leaderboard Results

Query-to-agent recommendation results on Parts I-III. Methods marked with * use language embedding models fine-tuned with in-domain supervision.

All metrics are calculated over the top-10 recommendations.

Part I: LLM-only Agents (Top-10 positives)

| # | Method | Category | Prec. | Rec. | F1 | NDCG | MRR |
|---|------------------|------------|--------|--------|--------|--------|--------|
| 1 | GenRec | Generative | 0.9215 | 0.9255 | 0.9230 | 0.9404 | 0.9925 |
| 2 | NGCF | GNN | 0.8864 | 0.8959 | 0.8899 | 0.9125 | 0.9669 |
| 3 | MF | CF | 0.9200 | 0.9298 | 0.9237 | 0.9339 | 0.9631 |
| | LightGCN | GNN | 0.8642 | 0.8730 | 0.8675 | 0.8820 | 0.9244 |
| | KGAT | GNN | 0.8595 | 0.8688 | 0.8630 | 0.8733 | 0.9234 |
| | DNN (Bert*) | DNN | 0.7336 | 0.7424 | 0.7369 | 0.7550 | 0.8889 |
| | SimGCL | GNN | 0.8050 | 0.8137 | 0.8083 | 0.8290 | 0.8868 |
| | DNN (Bert) | DNN | 0.7257 | 0.7345 | 0.7290 | 0.7469 | 0.8787 |
| | LightFM | CF | 0.4679 | 0.4731 | 0.4698 | 0.5269 | 0.8010 |
| | TwoTower (TFIDF) | TwoTower | 0.6831 | 0.6926 | 0.6867 | 0.7111 | 0.8003 |
| | TwoTower (BGEM3) | TwoTower | 0.7065 | 0.7163 | 0.7102 | 0.7071 | 0.7820 |
| | DNN (TFIDF) | DNN | 0.2971 | 0.3029 | 0.2993 | 0.3193 | 0.5743 |
| | EasyRec* | LLM | 0.2565 | 0.2632 | 0.2590 | 0.2708 | 0.4969 |
| | KaLM-v2.5* | LLM | 0.2850 | 0.2950 | 0.2888 | 0.2787 | 0.4052 |
| | BGE-Rerank* | Rerank | 0.1370 | 0.1373 | 0.1371 | 0.1468 | 0.3560 |
| | BGE-Rerank | Rerank | 0.0265 | 0.0275 | 0.0269 | 0.0283 | 0.0689 |
| | EasyRec | LLM | 0.0150 | 0.0150 | 0.0150 | 0.0155 | 0.0353 |
| | KaLM-v2.5 | LLM | 0.0170 | 0.0170 | 0.0170 | 0.0164 | 0.0321 |
Best method per part:

  • Part I: GenRec (MRR: 0.9925)
  • Part II: TwoTower (BGEM3) (nDCG: 0.9665)
  • Part III: TwoTower (BGEM3) (nDCG: 0.8501)

Getting Started

Quick setup guide to run AgentSelectBench baselines and experiments.

1. Clone Repository
# Clone from anonymous repository
# https://anonymous.4open.science/r/AgentMatch-F950
cd AgentSelectBench
2. Install Dependencies
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

pip install -r requirements.txt
3. Run Baselines
python run_bpr_mf_knn.py \
  --data_root /path/to/dataset_root \
  --device cuda:0 \
  --epochs 5 --batch_size 4096 --factors 128 --neg_per_pos 1 \
  --knn_N 3 --eval_cand_size 100 --score_mode dot
Evaluation

Evaluation Protocol

  • Part I: Top-10 positives
  • Part II: Top-1 positives
  • Part III: Top-5 positives
  • Ranking cutoff: Top-10
Metrics

Reported Metrics

  • Precision@10
  • Recall@10
  • F1@10
  • nDCG@10
  • MRR@10
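Under binary relevance, the reported metrics at cutoff 10 can be computed per query as sketched below; this uses one common convention (precision divides by the cutoff k), and the repository's evaluation scripts may handle edge cases differently:

```python
import math

def rank_metrics(ranked, positives, k=10):
    """Precision/Recall/F1/nDCG/MRR at cutoff k for a single query.

    ranked: recommended agent ids, best first.
    positives: set of relevant agent ids for the query.
    Binary relevance; the ideal DCG assumes min(len(positives), k) hits.
    """
    topk = ranked[:k]
    hits = [1 if a in positives else 0 for a in topk]
    prec = sum(hits) / k
    rec = sum(hits) / len(positives)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(positives), k)))
    ndcg = dcg / idcg if idcg else 0.0
    mrr = next((1 / (i + 1) for i, h in enumerate(hits) if h), 0.0)
    return {"P": prec, "R": rec, "F1": f1, "nDCG": ndcg, "MRR": mrr}
```

Per-query values would then be macro-averaged over all test queries in a part.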
Project Structure

Key Directories

  • agent_rec/data/
  • agent_rec/features/
  • agent_rec/models/
  • agent_rec/eval/
  • scripts/
Citation

If you find AgentSelectBench useful, please cite our work:

@article{agentselect2025,
  title={AgentSelect: A Unified Benchmark for Query-to-Agent Recommendation},
  author={Anonymous},
  journal={ICML 2025 Submission},
  year={2025}
}