Using GPT Function Calls to Build LLM Agents

At a glance, the new function call feature for GPT promises to greatly simplify building LLM agents and plugins, over using existing frameworks like Langchain Agents. It also has some glaring issues that require workarounds.

But overall, the function calls feature has numerous benefits over the current paradigms of agent frameworks / JSON / Yaml parsing - and will most likely become the new best practice for building agents.

In this post, I will try to provide a balanced review of the new feature, specifically:

motivate why a fine tuned LLM for tool-use is needed for building LLM agents
explain the main benefits of using function calls and when to use them, especially over Langchain agents, etc.
highlight some shortcomings and limitations of function calls. Notably, I want to stress that tool parameter hallucinations problem is NOT solved with Function Calls feature, even by OpenAI’s own admission in their documentation.

Motivating the function calls feature

First, let’s motivate the Function Call feature.

By function, we mean any arbitrary program - but in this context, we mean tools and APIs that LLMs can utilize to enhance its capabilities (like a Python function to calculate a number, or a NodeJS wrapper that calls Google Calendar API).

To get LLMs (e.g. GPT-4) to use tools / functions / APIs (they are interchangeable concepts), there are typically two approaches:

The first approach uses a pretraine LLM with no task-specific finetuning to figure out how to use tools (tool selection and detection), as well as parse input arguments to the tools (tool learning)
- this approach is used by Langchain Agents (React, etc), HuggingGPT, etc.
The second approach uses a LLM finetuned for tool-use, namely to directly decode which tool to use, and with what request parameters
- used by Toolformer, Gorilla, as well as the new Function Call feature from OpenAI

The first approach has some serious reliability issues. How it works is this: we first stick a list of tool names into the prompt, and ask LLM to “figure out” which tool to use (or if at all) purely via prompt engineering. Then we may use a separate LLM call to “extract” input parameters to the tool - often in json format. This two-phase approach is taken by Langchain / React agents, HuggingGPT, etc.

Here’s an example - we stick a list of tool names (in orange) into the prompt, and let LLM decide which tool to use or if at all. If LLM decides tool use is appropriate, it’s asked to return a tool name.

prompt = '''
	You are an AI assistant with the following tools at your disposal:
   - calendarAPI: update calendar with a new appointment
   - calculator: do math
   ...

  If a tool is needed to answer the user input, respond with tool name.
  Otherwise, respond directly.

  Input: **Book a tennis lesson for Wednesday.**
  Response:
'''

tool_token = LLM(prompt) # returns 'calendarAPI`

The tool name returned by LLM (tool_token) represents an API call that needs to be made e.g. calendarAPI This output is sent to a separate dispatcher function (below) that 1) parses API call parameters, and 2) actually makes the API call and returns the response.

tool_token = LLM(prompt)
params = **parse_args**(tool_token, userinput) # parse API call params

print(params) # { "dayOfWeek": "Monday", "name": "tennis lesson" }

if "calendarAPI" in tool_token:
	return dispatch(CalendarAPI, **params)  # actually call the API
...

Why do we need a separate step to parse tool / function arguments? Because LLMs struggle with doing both in a single step:

1. figuring out which tool to use, and
1. parsing out the request parameters or function inputs - e.g. calendarAPI('tennis lesson', 'Wednesday')

Generally speaking, LLMs (including GPT-4) struggle with reliability, which necessitates the extra LLM call:

hallucinating tool parameters, or
being overly eager to use tools even when unnecessary

But this two-stage approach has its downsides:

requirement of another LLM call to parse out request parameters perfectly in a separate pass
- added latency
feeling of general hackiness (having to stick a list of tool names into the prompt feels not kosher)

Wouldn’t it be nice if LLM could - in one shot - 1) figure out which tool to use, and 2) parse what parameters to pass to the tool?

That’s what OpenAI’s function call feature does, as well as approach taken by ToolFormer and Gorilla projects.

How Function Call Works

Instead of using a basic LLM to figure out how to use tools, we fine-tune a tool-using LLM that can reliably spit out both 1) the tool name, and 2) request parameters to the tool… in one-go.

Here’s how it works: we introduce two new parameters to the ChatCompletion endpoint - functions and function_call.

The functions parameter takes a description of the tool and a JSON schema that describes what the parameters are, and their types (e.g. string). This schema is used by the tool-specific finetuned model (behind GPT) to perform a constrained decoding of request parameters and tool name, according to the provided schema. This special decoding only happens if the LLM deems tool use to be appropriate. (to disable tool decoding, set function_call param to ‘none’)

Because we are now outputting JSON via explicitly decoding to meet that schema, we are far more likely to get desired output compared to using LLMs to output a JSON without any constraints.

Below example shows how we use JSON schema to tell GPT about our calendarAPI and how to extract request parameters (in orange) - that we need calendarItem and dayOfWeek, and expect both to be strings.

response = openai.ChatCompletion.create(
        ...
        functions=[
            {
                "name": "calendarApi",
                "description": "Read and update the latest Google Calendar state to make bookings",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "calendarItem": {
                            "type": "string",
                            "description": "Calendar item, e.g. 'Hair Appointment'",
                        },
                        "dayOfWeek": {"type": "string", "enum": [
														"Monday", "Tuesday", "Wednesday", "Thursday", "Friday"
													]
												},
                    },
                    "required": ["calendarItem", "dayOfWeek"],
                },
            }
        ],
    )

If we print the response, we see the following - function_call field contains both the invoked tool name, as well as the request parameters - all in one shot!

  {
	  'index': 0,
      'function_call': {
				  'name': 'calendarAPI',
			    'arguments': '{
						'calendarItem': 'tennis lesson',
		        'dayOfWeek': 'Monday'
					}
       },
	    'finish_reason': 'function_call'
   }

With function calls, we can remove the parse_args step, and just go straight to dispatch which actually does the API call. This is a marginal benefit because it saves us a LLM call, and also simplifies our code.

Main Benefits of Using Function Calls for GPT

The new function calls feature may drastically simplify building reliable agents. And this could establish the new best practice for building LLM agents.

First, because the new GPT models are fine tuned to detect tool use, it is by definition better at tool discrimination (determining when and when not to use tools), and tool selection(picking which specific tool to use). This greatly increases reliability and reduces the likelihood of GPT being overly eager to use ANY tool even if it’s inappropriate.

Second, request parameter parsing is natively supported, and we no longer need a separate LLM call(s) to reliably extract request parameters. The ability to provide a JSON Schema reduces the likelihood of hallucinating request parameters, i.e. LLMs misunderstanding the structure of the request and/or conjure non-existent request body parameter names and their types. This leads to unnecessary HTTP request failures, which improves worst-case latency and overall improves UX.

Third, this makes LLM agents feel less “hacky” and much “cleaner”. Having to inject tool descriptions and tool outputs into the prompt bloated the context, and muddled conversation history with unnecessary mechanisms like “observation”, “thought”, etc (see React prompting from Langchain). The new function calls feature provides an explicit function role that separates all function call artifacts from the rest of conversation history.

Concerns and Issues Remain

That said, there are still some issues and concerns - some major.

I have some doubts as to whether object decoding works 100% perfectly (spoiler alert: it’s not), because even by OpenAI’s admission - tool parameter hallucinations can STILL occur (see the screen shot from the documentation below, which admit that hallucinations can still happen). And this is not surprising, given ChatGPT plugins which presumably use this finetuned tool using LLM still makes many errors when extracting request bodies.

That said, I’m not sure whether the new releases of GPT3.5 and GPT4 use different function dispatching models
Luckily, the GPT-4 version of ChatCompletion API also supports function call - so maybe we can

2. Cost. The tool specification info goes into the system message and counts towards the total context - so costs can add up if you have a lot of tools. Of course, OpenAI still slashed API cost by 25% which probably offsets any cost inflation - but this could be problematic if you are using many tools with lengthy tool descriptions.

Assuming each function description is about 300 tokens, a LLM call with 5 candidate functions will contain 1500 (5 * 300) tokens of extra baggage in the system prompt. This obviously adds to latency, which is not great.

That said, you can strategically decide when to not pass function calls to ChatCompletion

3. Can it scale to 100’s of tools? Naturally, this paradigm of passing function specifications to system prompt does not scale when you have 100’s of tools. In this case, you still need to do a filter of your tools via semantic search, etc, then pass a more targeted functions list to the prompt.

4. Why not pass OpenAPI specs, as opposed to JSONSchemas. As it stands, the function call feature expects us to pass a manifest with JSONSchema for parameters. This can be inconvenient if the tools we want to use are REST APIs with multiple paths and very complex parameters, etc.

Additional tooling is required then to translate our REST API specification files (e.g. swagger) to these manifests - but this isn’t a big deal. It’d be nice, however, if users were allowed to directly pass in swagger specs (just like ChatGPT plugins) - effectively providing feature-parity with ChatGPT Plugins.