AgentTuning: Enabling Generalized Agent Abilities For LLMs

Tsinghua University, Zhipu.AI
*Equal contribution

1Work done during the internships of Mingdao Liu, Rui Lu, and Bowen Wang at Zhipu.AI.

Abstract

Open large language models (LLMs) with strong performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is a lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs' agent capabilities without compromising general abilities. AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open-source the AgentInstruct dataset and the AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning, serving as open and powerful alternatives to commercial LLMs for agent tasks.

Overall Results


AgentTuning represents the first attempt to instruction-tune LLMs using interaction trajectories across multiple agent tasks. Evaluation results indicate that AgentTuning enables the agent capabilities of LLMs, with robust generalization to unseen agent tasks, while preserving their general language abilities. We have open-sourced the AgentInstruct dataset and the AgentLM models.

Method



An overview of AgentInstruct and AgentTuning. The construction of AgentInstruct consists of instruction generation, trajectory interaction, and trajectory filtering. AgentLM is fine-tuned on a mixture of AgentInstruct and general-domain instructions.
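To make the hybrid instruction-tuning step concrete, here is a minimal sketch of example-level data mixing in Python. The file names, record format, and the mixture ratio `ETA` are assumptions for illustration; the released AgentInstruct data and training pipeline on GitHub are the authoritative reference.

```python
import json
import random

# Hypothetical file names and record format, for illustration only.
AGENT_DATA = "agentinstruct.jsonl"
GENERAL_DATA = "general_instructions.jsonl"
ETA = 0.2  # assumed weight of agent data in the mixture


def load_jsonl(path):
    """Load one JSON record per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def build_mixture(agent_samples, general_samples, eta, size, seed=0):
    """Draw a training set in which each example comes from the agent
    data with probability eta and from the general data otherwise."""
    rng = random.Random(seed)
    return [
        rng.choice(agent_samples if rng.random() < eta else general_samples)
        for _ in range(size)
    ]


if __name__ == "__main__":
    train_set = build_mixture(
        load_jsonl(AGENT_DATA), load_jsonl(GENERAL_DATA), ETA, size=100_000
    )
    print(f"built {len(train_set)} training examples")
```

Sampling at the example level keeps the expected share of agent data at `eta` in every batch; the actual training recipe may mix at a different granularity.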

Our Dataset: AgentInstruct


Overview of our AgentInstruct dataset, which includes 1,866 trajectories from six agent tasks.
| Task | Inst. From | # Inst. | # Filt. Traj. | Avg # Turns per Filt. Traj. | Filtered Ratio |
|------|-----------|---------|---------------|-----------------------------|----------------|
| ALFWorld | Train split | 954 | 336 | 13.52 | 35.2% |
| WebShop | Train split | 1,485 | 351 | 3.68 | 23.6% |
| Mind2Web | Train split | 23,378 | 122 | 1.00 | 0.52% |
| Knowledge Graph | Train split | 2,501 | 324 | 6.04 | 13.0% |
| Operating System | Self-Instruct | 647 | 195 | 3.85 | 30.1% |
| Database | Self-Instruct | 1,074 | 178 | 2.13 | 16.6% |
| Database | Task Derivation | 5,302 | 360 | 2.03 | 6.79% |
| **AgentInstruct (total)** | – | 35,341 | 1,866 | 5.24 | 5.28% |
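To relate the "# Inst.", "# Filt. Traj.", and ratio columns, below is a minimal sketch of reward-based trajectory filtering. The `Trajectory` record and the strict threshold of 1.0 (keeping only fully successful episodes) are assumptions for illustration; the paper filters interaction trajectories by their final reward.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str      # e.g., "ALFWorld"
    turns: list    # alternating environment/model messages
    reward: float  # final task reward in [0, 1]


def filter_trajectories(trajectories, threshold=1.0):
    """Keep only high-reward interactions; a strict threshold of 1.0
    (fully successful episodes) is assumed here."""
    return [t for t in trajectories if t.reward >= threshold]


def report(trajectories, kept):
    """Summarize filtering with the same statistics as the table above."""
    ratio = len(kept) / len(trajectories) if trajectories else 0.0
    avg_turns = sum(len(t.turns) for t in kept) / len(kept) if kept else 0.0
    print(f"# Inst.: {len(trajectories)}  # Filt. Traj.: {len(kept)}  "
          f"Avg turns: {avg_turns:.2f}  Ratio: {ratio:.2%}")
```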

Detailed Results


Main results of AgentTuning. AgentLM significantly outperforms Llama 2 across different scales, excelling in both held-in and held-out tasks without compromising performance on general tasks. Overall denotes a score computed as a weighted average of all tasks within the same category. API-based and open-source models are compared separately. **Bold**: the best among API-based models and among open-source models; <u>underline</u>: the second best among open-source models.

| Type | Task | GPT-3.5 | GPT-4 | Llama-2-chat 7B | Llama-2-chat 13B | Llama-2-chat 70B | AgentLM 7B | AgentLM 13B | AgentLM 70B |
|------|------|---------|-------|-----------------|------------------|------------------|------------|-------------|-------------|
| Held-in | ALFWorld | 14.0 | **78.0** | 2.0 | 2.0 | 6.0 | <u>84.0</u> | 76.0 | **86.0** |
| Held-in | WebShop | **67.2** | 58.6 | 4.4 | 7.2 | 1.5 | 63.6 | **70.8** | <u>64.9</u> |
| Held-in | Mind2Web | 15.7 | **22.6** | 3.7 | 2.3 | 0.2 | 6.4 | <u>8.4</u> | **13.5** |
| Held-in | KG | 27.2 | **52.1** | 0.0 | 0.0 | 0.0 | 18.1 | <u>26.8</u> | **47.0** |
| Held-in | OS | 32.6 | **36.8** | 8.3 | 9.0 | 9.0 | 17.4 | <u>18.1</u> | **21.5** |
| Held-in | Database | 15.0 | **33.7** | 0.3 | 1.3 | 9.3 | 30.6 | <u>33.7</u> | **37.7** |
| Held-in | Overall | 1.59 | **2.75** | 0.19 | 0.20 | 0.27 | 1.96 | <u>2.11</u> | **2.55** |
| Held-out | SciWorld | 21.2 | **36.4** | 5.9 | 6.4 | 7.9 | 13.7 | <u>18.0</u> | **20.8** |
| Held-out | MiniWoB++ | 66.7 | **69.4** | 0.0 | 19.6 | 0.7 | 28.9 | <u>31.1</u> | **60.7** |
| Held-out | WebArena | 4.56 | **6.28** | 1.23 | 1.11 | 0.62 | 0.74 | <u>1.60</u> | **3.81** |
| Held-out | HotpotQA | 37.4 | **52.1** | 22.6 | 25.2 | <u>37.5</u> | 22.3 | 29.6 | **41.6** |
| Held-out | ReWOO | 71.0 | **79.7** | 48.3 | 48.7 | 55.1 | 50.9 | <u>55.7</u> | **66.0** |
| Held-out | DCG | 24.5 | **50.0** | 0.0 | 0.0 | 5.0 | <u>7.0</u> | 2.5 | **23.5** |
| Held-out | Overall | 1.49 | **2.13** | 0.38 | 0.49 | 0.51 | 0.67 (+76%) | <u>0.78 (+57%)</u> | **1.40 (+176%)** |
| General | MMLU | 70.0 | **86.4** | 48.0 | 54.3 | **62.1** | 48.7 | 53.6 | <u>59.5</u> |
| General | HumanEval | 48.1 | **67.0** | 13.9 | 18.4 | **30.8** | 15.4 | 14.8 | <u>28.7</u> |
| General | GSM8K | 57.1 | **87.1** | 27.7 | 37.5 | <u>54.7</u> | 24.6 | 32.4 | **59.7** |
| General | MT-Bench | 7.94 | **8.99** | 6.26 | 6.65 | <u>6.85</u> | 6.11 | 6.57 | **7.26** |
| General | Overall | 1.15 | **1.53** | 0.63 | 0.74 | <u>0.95</u> | 0.62 (-1%) | 0.69 (-7%) | **0.96 (+1%)** |

Percentages in parentheses denote the relative change from the Llama 2 chat model of the same size.
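As a concrete reading of the Overall rows, the sketch below aggregates task scores within one category as a weighted average. The weights are placeholders: the paper defines its own per-task weights to put scores on a comparable scale, so this illustrates only the shape of the computation.

```python
def overall_score(task_scores, task_weights):
    """Weighted average of task scores within one category.

    task_weights is assumed to rescale each task onto a comparable
    range; the paper's actual weights differ from the placeholders below.
    """
    total = sum(task_weights[task] * s for task, s in task_scores.items())
    return total / len(task_scores)


# Toy usage with AgentLM-70B's held-in scores and placeholder weights.
held_in = {"ALFWorld": 86.0, "WebShop": 64.9, "Mind2Web": 13.5,
           "KG": 47.0, "OS": 21.5, "Database": 37.7}
weights = {task: 1 / 30.0 for task in held_in}  # NOT the paper's weights
print(f"overall ≈ {overall_score(held_in, weights):.2f}")
```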

Error Analysis


[Figure: error analysis results]

For error analysis, we selected three tasks from the held-in set (ALFWorld, WebShop, Knowledge Graph) and identified common error types, such as invalid actions and repeated generations, using a rule-based approach. The results are shown above.

Overall, the original Llama 2 exhibited more elementary mistakes, such as repetition or invalid actions, whereas GPT-3.5 and especially GPT-4 made far fewer such errors. AgentLM noticeably reduced these basic errors. We speculate that while Llama 2 chat inherently possesses agent capabilities, its poor performance stems from a lack of aligned training on agent data; AgentTuning effectively activates its agent potential.
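As an illustration of the rule-based tagging described above, the sketch below counts two of the error types mentioned. The output format (`Action: <verb> ...`), the valid-action vocabulary, and the repetition window are assumptions; the paper's actual rules are not published on this page.

```python
import re
from collections import Counter

# Hypothetical action vocabulary and output format, for illustration only.
VALID_ACTIONS = {"goto", "take", "put", "open", "close", "search", "click"}
ACTION_RE = re.compile(r"Action:\s*(\w+)", re.IGNORECASE)


def tag_errors(model_outputs, repeat_window=3):
    """Count rule-based error types over one interaction trajectory."""
    errors = Counter()
    for i, step in enumerate(model_outputs):
        match = ACTION_RE.search(step)
        if match is None or match.group(1).lower() not in VALID_ACTIONS:
            errors["invalid_action"] += 1  # unparsable or unknown action
        # Repeated generation: identical outputs several times in a row.
        window = model_outputs[max(0, i - repeat_window + 1): i + 1]
        if len(window) == repeat_window and len(set(window)) == 1:
            errors["repeated_generation"] += 1
    return errors
```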

Case Study


Comparison case study on ALFWorld and Knowledge Graph between Llama-2-70b-chat and AgentLM-70B. (a) On the ALFWorld task, Llama-2-70b-chat repeated the same action and ultimately failed to complete the task, while AgentLM-70B adjusted its actions after a failure. (b) On the Knowledge Graph task, Llama-2-70b-chat refused to fix the function call and instead demanded that the user implement the function upon encountering an error. In contrast, AgentLM-70B provided the correct function call.

Reference

Please cite our paper if you use our model, data, code, or results:

@misc{zeng2023agenttuning,
      title={AgentTuning: Enabling Generalized Agent Abilities for LLMs}, 
      author={Aohan Zeng and Mingdao Liu and Rui Lu and Bowen Wang and Xiao Liu and Yuxiao Dong and Jie Tang},
      year={2023},
      eprint={2310.12823},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}