AgentTuning: Enabling Generalized Agent Abilities For LLMs

Tsinghua University, Zhipu.AI
*Equal contribution

1Work done during the internships of Mingdao Liu, Rui Lu, and Bowen Wang at Zhipu.AI.

Abstract

Open large language models (LLMs) with strong performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is a lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs' agent capabilities without compromising general abilities. AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open-source the AgentInstruct dataset and the AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning, serving as open and powerful alternatives to commercial LLMs for agent tasks.

Overall Results


AgentTuning represents the first attempt to instruction-tune LLMs using interaction trajectories across multiple agent tasks. Evaluation results indicate that AgentTuning enables the agent capabilities of LLMs, with robust generalization to unseen agent tasks, while preserving their general language abilities. We have open-sourced the AgentInstruct dataset and the AgentLM models.

Method



An overview of AgentInstruct and AgentTuning. The construction of AgentInstruct consists of instruction generation, trajectory interaction, and trajectory filtering. AgentLM is fine-tuned on a mixture of AgentInstruct and general-domain instructions.
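To make the hybrid instruction-tuning step concrete, here is a minimal sketch of example-level data mixing in Python. The file names, record format, and the mixture ratio `ETA` are assumptions for illustration; the released AgentInstruct data and training pipeline on GitHub are the authoritative reference.

```python
import json
import random

# Hypothetical file names and record format, for illustration only.
AGENT_DATA = "agentinstruct.jsonl"
GENERAL_DATA = "general_instructions.jsonl"
ETA = 0.2  # assumed weight of agent data in the mixture


def load_jsonl(path):
    """Load one JSON record per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def build_mixture(agent_samples, general_samples, eta, size, seed=0):
    """Draw a training set in which each example comes from the agent
    data with probability eta and from the general data otherwise."""
    rng = random.Random(seed)
    return [
        rng.choice(agent_samples if rng.random() < eta else general_samples)
        for _ in range(size)
    ]


if __name__ == "__main__":
    train_set = build_mixture(
        load_jsonl(AGENT_DATA), load_jsonl(GENERAL_DATA), ETA, size=100_000
    )
    print(f"built {len(train_set)} training examples")
```

Sampling at the example level keeps the expected share of agent data at `eta` in every batch; the actual training recipe may mix at a different granularity.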

Our Dataset: AgentInstruct


Overview of our AgentInstruct dataset, which includes 1,866 trajectories from six agent tasks.
| Task | Inst. From | # Inst. | # Filt. Traj. | Avg # Turns per Filt. Traj. | Filtered Ratio |
|------|-----------|---------|---------------|-----------------------------|----------------|
| ALFWorld | Train split | 954 | 336 | 13.52 | 35.2% |
| WebShop | Train split | 1,485 | 351 | 3.68 | 23.6% |
| Mind2Web | Train split | 23,378 | 122 | 1.00 | 0.52% |
| Knowledge Graph | Train split | 2,501 | 324 | 6.04 | 13.0% |
| Operating System | Self-Instruct | 647 | 195 | 3.85 | 30.1% |
| Database | Self-Instruct | 1,074 | 178 | 2.13 | 16.6% |
| Database | Task Derivation | 5,302 | 360 | 2.03 | 6.79% |
| **AgentInstruct (total)** | – | 35,341 | 1,866 | 5.24 | 5.28% |
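To relate the "# Inst.", "# Filt. Traj.", and ratio columns, below is a minimal sketch of reward-based trajectory filtering. The `Trajectory` record and the strict threshold of 1.0 (keeping only fully successful episodes) are assumptions for illustration; the paper filters interaction trajectories by their final reward.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str      # e.g., "ALFWorld"
    turns: list    # alternating environment/model messages
    reward: float  # final task reward in [0, 1]


def filter_trajectories(trajectories, threshold=1.0):
    """Keep only high-reward interactions; a strict threshold of 1.0
    (fully successful episodes) is assumed here."""
    return [t for t in trajectories if t.reward >= threshold]


def report(trajectories, kept):
    """Summarize filtering with the same statistics as the table above."""
    ratio = len(kept) / len(trajectories) if trajectories else 0.0
    avg_turns = sum(len(t.turns) for t in kept) / len(kept) if kept else 0.0
    print(f"# Inst.: {len(trajectories)}  # Filt. Traj.: {len(kept)}  "
          f"Avg turns: {avg_turns:.2f}  Ratio: {ratio:.2%}")
```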

Detailed Results


Main results of AgentTuning. AgentLM significantly outperforms Llama 2 across different scales, excelling in both held-in and held-out tasks without compromising performance on general tasks. Overall denotes a score computed as a weighted average of all tasks within the same category. API-based and open-source models are compared separately. **Bold**: the best among API-based models and among open-source models; <u>underline</u>: the second best among open-source models.

| Type | Task | GPT-3.5 | GPT-4 | Llama-2-chat 7B | Llama-2-chat 13B | Llama-2-chat 70B | AgentLM 7B | AgentLM 13B | AgentLM 70B |
|------|------|---------|-------|-----------------|------------------|------------------|------------|-------------|-------------|
| Held-in | ALFWorld | 14.0 | **78.0** | 2.0 | 2.0 | 6.0 | <u>84.0</u> | 76.0 | **86.0** |
| Held-in | WebShop | **67.2** | 58.6 | 4.4 | 7.2 | 1.5 | 63.6 | **70.8** | <u>64.9</u> |
| Held-in | Mind2Web | 15.7 | **22.6** | 3.7 | 2.3 | 0.2 | 6.4 | <u>8.4</u> | **13.5** |
| Held-in | KG | 27.2 | **52.1** | 0.0 | 0.0 | 0.0 | 18.1 | <u>26.8</u> | **47.0** |
| Held-in | OS | 32.6 | **36.8** | 8.3 | 9.0 | 9.0 | 17.4 | <u>18.1</u> | **21.5** |
| Held-in | Database | 15.0 | **33.7** | 0.3 | 1.3 | 9.3 | 30.6 | <u>33.7</u> | **37.7** |
| Held-in | Overall | 1.59 | **2.75** | 0.19 | 0.20 | 0.27 | 1.96 | <u>2.11</u> | **2.55** |
| Held-out | SciWorld | 21.2 | **36.4** | 5.9 | 6.4 | 7.9 | 13.7 | <u>18.0</u> | **20.8** |
| Held-out | MiniWoB++ | 66.7 | **69.4** | 0.0 | 19.6 | 0.7 | 28.9 | <u>31.1</u> | **60.7** |
| Held-out | WebArena | 4.56 | **6.28** | 1.23 | 1.11 | 0.62 | 0.74 | <u>1.60</u> | **3.81** |
| Held-out | HotpotQA | 37.4 | **52.1** | 22.6 | 25.2 | <u>37.5</u> | 22.3 | 29.6 | **41.6** |
| Held-out | ReWOO | 71.0 | **79.7** | 48.3 | 48.7 | 55.1 | 50.9 | <u>55.7</u> | **66.0** |
| Held-out | DCG | 24.5 | **50.0** | 0.0 | 0.0 | 5.0 | <u>7.0</u> | 2.5 | **23.5** |
| Held-out | Overall | 1.49 | **2.13** | 0.38 | 0.49 | 0.51 | 0.67 (+76%) | <u>0.78 (+57%)</u> | **1.40 (+176%)** |
| General | MMLU | 70.0 | **86.4** | 48.0 | 54.3 | **62.1** | 48.7 | 53.6 | <u>59.5</u> |
| General | HumanEval | 48.1 | **67.0** | 13.9 | 18.4 | **30.8** | 15.4 | 14.8 | <u>28.7</u> |
| General | GSM8K | 57.1 | **87.1** | 27.7 | 37.5 | <u>54.7</u> | 24.6 | 32.4 | **59.7** |
| General | MT-Bench | 7.94 | **8.99** | 6.26 | 6.65 | <u>6.85</u> | 6.11 | 6.57 | **7.26** |
| General | Overall | 1.15 | **1.53** | 0.63 | 0.74 | <u>0.95</u> | 0.62 (-1%) | 0.69 (-7%) | **0.96 (+1%)** |

Percentages in parentheses denote the relative change from the Llama 2 chat model of the same size.
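As a concrete reading of the Overall rows, the sketch below aggregates task scores within one category as a weighted average. The weights are placeholders: the paper defines its own per-task weights to put scores on a comparable scale, so this illustrates only the shape of the computation.

```python
def overall_score(task_scores, task_weights):
    """Weighted average of task scores within one category.

    task_weights is assumed to rescale each task onto a comparable
    range; the paper's actual weights differ from the placeholders below.
    """
    total = sum(task_weights[task] * s for task, s in task_scores.items())
    return total / len(task_scores)


# Toy usage with AgentLM-70B's held-in scores and placeholder weights.
held_in = {"ALFWorld": 86.0, "WebShop": 64.9, "Mind2Web": 13.5,
           "KG": 47.0, "OS": 21.5, "Database": 37.7}
weights = {task: 1 / 30.0 for task in held_in}  # NOT the paper's weights
print(f"overall ≈ {overall_score(held_in, weights):.2f}")
```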

Error Analysis


[Figure: error analysis results]

For error analysis, we selected three tasks from the held-in set (ALFWorld, WebShop, Knowledge Graph) and identified common error types, such as invalid actions and repeated generations, using a rule-based approach. The results are shown above.

Overall, the original Llama 2 exhibited more elementary mistakes, such as repetition or invalid actions, whereas GPT-3.5 and especially GPT-4 made far fewer such errors. AgentLM noticeably reduced these basic errors. We speculate that while Llama 2 chat inherently possesses agent capabilities, its poor performance stems from a lack of aligned training on agent data; AgentTuning effectively activates its agent potential.
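As an illustration of the rule-based tagging described above, the sketch below counts two of the error types mentioned. The output format (`Action: <verb> ...`), the valid-action vocabulary, and the repetition window are assumptions; the paper's actual rules are not published on this page.

```python
import re
from collections import Counter

# Hypothetical action vocabulary and output format, for illustration only.
VALID_ACTIONS = {"goto", "take", "put", "open", "close", "search", "click"}
ACTION_RE = re.compile(r"Action:\s*(\w+)", re.IGNORECASE)


def tag_errors(model_outputs, repeat_window=3):
    """Count rule-based error types over one interaction trajectory."""
    errors = Counter()
    for i, step in enumerate(model_outputs):
        match = ACTION_RE.search(step)
        if match is None or match.group(1).lower() not in VALID_ACTIONS:
            errors["invalid_action"] += 1  # unparsable or unknown action
        # Repeated generation: identical outputs several times in a row.
        window = model_outputs[max(0, i - repeat_window + 1): i + 1]
        if len(window) == repeat_window and len(set(window)) == 1:
            errors["repeated_generation"] += 1
    return errors
```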

Case Study


Comparison case study on ALFWorld and Knowledge Graph between Llama-2-70b-chat and AgentLM-70B. (a) On the ALFWorld task, Llama-2-70b-chat repeated the same action and ultimately failed to complete the task, while AgentLM-70B adjusted its actions after a failure. (b) On the Knowledge Graph task, Llama-2-70b-chat refused to fix the function call and instead demanded that the user implement the function upon encountering an error. In contrast, AgentLM-70B provided the correct function call.

Reference

Please cite our paper if you use our model, data, code, or results:

@misc{zeng2023agenttuning,
      title={AgentTuning: Enabling Generalized Agent Abilities for LLMs}, 
      author={Aohan Zeng and Mingdao Liu and Rui Lu and Bowen Wang and Xiao Liu and Yuxiao Dong and Jie Tang},
      year={2023},
      eprint={2310.12823},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}