“Just Add Tools” - GPT Functions Now Natively Support Task Planning

GPT Function Calls got a lot of hype for its native support of JSON and tool selection. But the real innovation is that GPT-4 can now natively plan tasks and understand task-dependencies, without any prompt engineering. This sets the new standard for building agents and significantly decreases value prop of using many popular libraries like Langchain or AutoGPT.

What do we mean by native task planning?

Basically, it means that GPT-4 now “just knows” which actions to take in what order - without prompting. It does this completely as a side effect of being finetuned to pick tools. Picking tools in the presence of N-number of tools - by definition - requires prioritization and understanding dependencies.

For example, say GPT is handling the user input - “can you order eggs if we are out?”. In this case, GPT will reliably output check_fridge function call, followed by order_groceries (assuming these functions are provided).

This behavior of sequential invocation of multiple tools (in appropriate order) is also shown in ChatGPT plugins, which presumably shares the same tool-using LLM. Overall, this paints an exciting future for GPT to drastically simplify building complex agents.

Why this is cool and new?

Until now, the common method of getting LLMs to plan actions was through explicit instruction and providing demonstrations. Frameworks like React relied on using the context as a “scratch pad” to nudge the LLM to planning tasks, using tools, etc.

We no longer need such hacks, because the new fine tuned model behind GPT just “gets it”. Not to mention very reliable, since it can almost perfectly output JSONs.

Let’s dive deeper with some sample code.

Grocery Ordering Bot

Let’s make the above grocery ordering bot example concrete through code.

Suppose we had two tools: order_groceries, check_fridge. Also, let’s assume we have a function grocery_chatbot which basically yields GPT’s response via a Python generator, like this:

# Create a generator object
conversation = grocery_chatbot(
	query="Can you order eggs if we are out",
  tools=[order_groceries, check_fridge]
)
print(next(conversation))  # check_fridge JSON
print(next(conversation))  # order_groceries JSON (depending on check_fridge() result)
...

In this case, where tools=[order_groceries, check_fridge], GPT will output check_fridge followed by order_groceries if needed.

However, if we gave GPT another tool called check_wallet, GPT will magically understand that money is needed to order groceries, so the call order becomes: check_fridge > check_wallet > order_groceries

# Create a generator object
conversation = grocery_chatbot(
	query="Can you order eggs if we are out",
  tools=[order_groceries, check_fridge, check_wallet]
)
print(next(conversation))  # check_fridge JSON
print(next(conversation))  # check_wallet JSON (depending on check_fridge() result)
print(next(conversation))  # order_groceries JSON (depending on check_wallet() result)
...

In other words, without any explicit mention of requiring money to buy groceries, or instructing to check wallet, GPT will “just do it”. Sometimes this might be desirable behavior, and sometimes not. This could probably be controlled through tweaking tool descriptions.

But this is impressive nonetheless, and may open the door to building extremely complex agents with very little “dialog tree” specified up front.

Also, this shows that the intelligence and robustness of task planning then is constrained by the set of functions provided to GPT (e.g. if it can’t check the fridge, then it won’t).

It remains to be seen how this model performs in the presence of 10+ tools, but since tool descriptions count against system prompt, some type of a tool-retrieval process via semantic search may be needed.

As an aside - grocery_chatbot function has a while loop that’s very readable and self-contained to pass functions to GPT, and handle function_call outputs. Having this loop explicitly exposed (and not hidden in some massive class definition like Langchain) is a breath of fresh air.

def grocery_chatbot(
        query="Can you order eggs?",
        tools=functions_list
):
		...
    while True:  # agent loop. we break if no function calls are emitted,
				# indicating that GPT is ready to generate a response.
        response = openai.ChatCompletion.create(
                model="gpt-4-0613",
                messages= messages,
                functions=functions_list, # this is from above
                function_call="auto"
        )

				...

        yield assistant_message  # this

      if assistant_message.get("function_call"):
        ...  # call function here (order_groceries, check_fridge, etc)
      else:
				break # if no function call sensed, break out of while loop

Also, tool specification itself is a delight with Function Calls. There’s no need to define separate Tool class's like with Langchain, which adds extreme amounts of boilerplate. Using json schemas feels like a natural abstraction, given that most functions will be calling REST APIs anyways.

functions_list = [
    {
        "name": "order_groceries",
        "description": "Function for ordering groceries.",
        "parameters": ... # json schema omitted
    }, {
        "name": "check_fridge",
        "description": "Function for checking the fridge for stock.",
        "parameters": ...
    }, {
        "name": "check_wallet",
        "description": "Function for checking whether there's any money to buy things",
        "parameters": ...
    }
]

Implications

The obvious implication is that Function Calls drastically simplify building agents. It was fairly simple before, but I believe function calls / toolformer style finetuned models will be the new best practice for building agents.

“Just Add Tools”: Now basically anyone can build simple agents by just passing a list of functions / tools. That’s because GPT-4 handles 1) plan tasks, 2) select tools, 3) invoke tools, and 4) generate response all in a single LLM call. You don’t need multiple LLM calls and prompt engineering to achieve agentic behavior.
- It’s a task planner + intent Classifier + tool selector + json outputter … in one.
- GPT now intuits “implicit tasks” and “task dependencies” as well.
- It can detect termination conditions on its own… without needing to instruct it to explicitly “observe” or “think” (RIP React)
Feels less hacky: Frameworks like React / Langchain used the prompt and context as a “scratch pad” to steer LLM output. But this always felt too hacky. With function calls, such hacks are unnecessary.
Less prompt engineering needed: Less prompting means fewer tokens used, and ultimately less $ wasted. RIP prompt engineers.
Much better latency and end-user UX: Consolidating all those steps in a single LLM call means much better UX and speed & responsiveness. More reliability in outputting JSON also helps, because fewer LLM calls are needed to correct errors.

In sum, it greatly lowers the barrier for building agents by eliminating much of the prompt engineering hacks.