
elliottburris/Composio-Function-Calling-Benchmark

 
 


Function Calling Benchmark by Composio

Welcome to the official GitHub repository for Composio's Function Calling Benchmark. This repository contains a benchmark of 50 function calling problems, each designed to be solved using one of the 8 function schemas provided. The schemas are inspired by some of ClickUp's integration endpoints.

Overview

The benchmark tests how well various models call functions correctly from a given prompt. Each problem presents a scenario in a ClickUp workspace that must be resolved by calling one specific function from those provided. The function schemas define the structure and parameters of the functions that can be used.

Note that the problems in this benchmark are specifically designed to test how well models handle real-world API structures, and how their performance changes under different optimizations.
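To make "solving a problem" concrete, here is a minimal sketch of scoring a model's function call by exact match against the expected call. The record layout (`name`, `arguments`) and the example function and argument names are assumptions made for illustration. They are not taken from clickup_space_benchmark.json, and the notebooks in this repository may score answers differently.

```python
from typing import Any


def call_matches(predicted: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Return True when the function name and all arguments agree exactly."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments") == expected.get("arguments")
    )


def accuracy(predictions: list[dict[str, Any]], solutions: list[dict[str, Any]]) -> float:
    """Fraction of problems whose predicted call matches the expected call."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have the same length")
    if not solutions:
        return 0.0
    hits = sum(call_matches(p, s) for p, s in zip(predictions, solutions))
    return hits / len(solutions)


# Hypothetical data: function and argument names are invented for illustration.
solutions = [
    {"name": "create_space", "arguments": {"team_id": 123, "name": "Marketing"}},
    {"name": "delete_space", "arguments": {"space_id": 456}},
]
predictions = [
    {"name": "create_space", "arguments": {"team_id": 123, "name": "Marketing"}},
    {"name": "delete_space", "arguments": {"space_id": 999}},  # wrong argument value
]

score = accuracy(predictions, solutions)
print(score)
```

Exact matching is the strictest choice: a call with the right function but one wrong argument counts as a miss, as the second prediction shows.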

Publications

Repository Structure

  • prompts/: Prompts used to check and modify the problems and schema.
  • clickup_space_benchmark.json: The problems and their correct solutions.
  • clickup_space_schema.json: Function schemas that the LLMs use to solve the benchmark problems.
  • *.ipynb (in relevant branches): Different optimization techniques applied to the LLMs to check their performance against the benchmark.

We currently run all experiments in notebooks, as this makes it easier to keep track of the results.
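For a first look at the two JSON files listed above, a small sketch like the following can help. It only reports the top-level shape of each file and assumes nothing about their internal layout, which this README does not describe. The helper names `load_json` and `describe` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Any


def load_json(path: Path) -> Any:
    """Read and parse a UTF-8 encoded JSON file."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(data: Any) -> str:
    """Summarise the top-level shape without assuming a particular layout."""
    if isinstance(data, list):
        return f"list with {len(data)} items"
    if isinstance(data, dict):
        return f"object with keys {sorted(data)}"
    return type(data).__name__


# File names come from the repository structure above; run from the repo root.
for filename in ("clickup_space_benchmark.json", "clickup_space_schema.json"):
    path = Path(filename)
    if path.exists():
        print(filename, "->", describe(load_json(path)))
    else:
        print(filename, "not found; run this from the repository root")
```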

Running the Benchmark

We have tested different function calling models; the result notebooks for each model are stored in a separate branch.

Currently we have experimented with:

  • gpt-4o - OpenAI - branch
  • gpt-4-turbo-preview - OpenAI - branch
  • gpt-4-turbo - OpenAI - branch
  • gpt-4-0125-preview - OpenAI - branch
  • claude-3-haiku-20240307 - Anthropic - branch
  • claude-3-sonnet-20240229 - Anthropic - branch
  • claude-3-opus-20240229 - Anthropic - branch

We are planning to add these models in future:

Experiments

All of these optimizations have been tested with the models, and each technique is explained here.


All previous Models:

| # | Optimization Approach | gpt-4-turbo-preview | gpt-4-turbo | gpt-4-0125-preview | claude-3-haiku-20240307 | claude-3-sonnet-20240229 | claude-3-opus-20240229 |
|---|---|---|---|---|---|---|---|
| 1 | No System Prompt | 0.36 | 0.36 | 0.353 | 0.48 | 0.6 | 0.42 |
| 2 | Flattening Schema | 0.527 | 0.487 | 0.533 | 0.5 | 0.58 | 0.5 |
| 3 | Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 | 0.54 | 0.6 | 0.54 |
| 4 | Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 | 0.54 | 0.54 | 0.54 |
| 5 | Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 | 0.52 | 0.62 | 0.52 |
| 6 | Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 | 0.52 | 0.6 | 0.52 |
| 7 | Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 | 0.46 | 0.62 | 0.46 |
| 8 | Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 | 0.5 | 0.64 | 0.46 |
| 9 | Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 | 0.5 | 0.6 | 0.6 |
| 10 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 | 0.58 | 0.74 | 0.58 |
| 11 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 | 0.6 | 0.76 | 0.64 |
| 12 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 | 0.68 | 0.76 | 0.66 |
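Most rows in the table build on a flattened schema. As a rough illustration of that idea, the sketch below turns nested object parameters in a JSON-Schema-style definition into top-level parameters with joined names. The `__` separator, the helper names, and the example schema are all assumptions made for illustration. The flattening actually used in the experiments is in the notebooks and may differ.

```python
from typing import Any


def flatten_properties(
    properties: dict[str, Any],
    required: list[str],
    prefix: str = "",
    sep: str = "__",
) -> tuple[dict[str, Any], list[str]]:
    """Recursively lift nested object properties to the top level."""
    flat_props: dict[str, Any] = {}
    flat_required: list[str] = []
    for name, spec in properties.items():
        flat_name = f"{prefix}{sep}{name}" if prefix else name
        is_required = name in required
        if spec.get("type") == "object" and "properties" in spec:
            child_props, child_required = flatten_properties(
                spec["properties"], spec.get("required", []), flat_name, sep
            )
            flat_props.update(child_props)
            # A nested field stays required only if its parent is required too.
            if is_required:
                flat_required.extend(child_required)
        else:
            flat_props[flat_name] = spec
            if is_required:
                flat_required.append(flat_name)
    return flat_props, flat_required


def flatten_schema(parameters: dict[str, Any]) -> dict[str, Any]:
    """Return a one-level parameter schema equivalent to the nested input."""
    props, required = flatten_properties(
        parameters.get("properties", {}), parameters.get("required", [])
    )
    return {"type": "object", "properties": props, "required": required}


# Hypothetical nested parameters, loosely modelled on a workspace API.
nested = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "features": {
            "type": "object",
            "properties": {
                "due_dates": {
                    "type": "object",
                    "properties": {"enabled": {"type": "boolean"}},
                    "required": ["enabled"],
                },
            },
            "required": ["due_dates"],
        },
    },
    "required": ["name"],
}

flat = flatten_schema(nested)
print(sorted(flat["properties"]))  # ['features__due_dates__enabled', 'name']
print(flat["required"])            # ['name']
```

With a flat schema the model fills in a single level of named arguments instead of building nested objects. The caller then has to rebuild the nested structure before sending the request to the real API.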

Contributing

We welcome contributions to this repository. If you have a model that you would like to test against the benchmark, feel free to open a pull request. If you encounter any issues while using the benchmark, please open an issue.

License

This project is licensed under the terms of the MIT license.

About Composio

Composio is an organization dedicated to advancing the field of artificial intelligence. We create benchmarks, develop models, and build tools to push the boundaries of what is possible in AI. Follow us on Twitter for updates on our latest projects.


© 2024 Composio, All Rights Reserved.
