Merge branch 'All-Hands-AI:main' into main
RainRat authored Dec 4, 2024
2 parents 2cc537e + 8f47547 commit b142e26
Showing 32 changed files with 91 additions and 149 deletions.
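To inspect this merge commit locally, here is a generic git sketch; it assumes a clone of the All-Hands-AI/OpenHands repository with commit b142e26 fetched, and the commands are illustrative rather than taken from this page:

```bash
# Show the merge commit's summary: author, parents, and per-file stats.
git fetch origin                 # make sure b142e26 is available locally
git show --stat b142e26          # files changed, additions/deletions
git log -1 --format=%P b142e26   # prints the two parent hashes (2cc537e, 8f47547)
```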
2 changes: 1 addition & 1 deletion Development.md
@@ -100,7 +100,7 @@ poetry run pytest ./tests/unit/test_*.py
To reduce build time (e.g., if no changes were made to the client-runtime component), you can use an existing Docker container image by
setting the SANDBOX_RUNTIME_CONTAINER_IMAGE environment variable to the desired Docker image.
- Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:0.14-nikolaik`
+ Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:0.15-nikolaik`
## Develop inside Docker container
6 changes: 3 additions & 3 deletions README.md
@@ -38,16 +38,16 @@ See the [Installation](https://docs.all-hands.dev/modules/usage/installation) gu
system requirements and more information.

```bash
- docker pull docker.all-hands.dev/all-hands-ai/runtime:0.14-nikolaik
+ docker pull docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik

docker run -it --pull=always \
- -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.14-nikolaik \
+ -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
- docker.all-hands.dev/all-hands-ai/openhands:0.14
+ docker.all-hands.dev/all-hands-ai/openhands:0.15
```

You'll find OpenHands running at [http://localhost:3000](http://localhost:3000)!
2 changes: 1 addition & 1 deletion compose.yml
@@ -7,7 +7,7 @@ services:
image: openhands:latest
container_name: openhands-app-${DATE:-}
environment:
- - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.14-nikolaik}
+ - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.15-nikolaik}
- SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234}
- WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
ports:
2 changes: 1 addition & 1 deletion containers/dev/compose.yml
@@ -11,7 +11,7 @@ services:
- BACKEND_HOST=${BACKEND_HOST:-"0.0.0.0"}
- SANDBOX_API_HOSTNAME=host.docker.internal
#
- - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.14-nikolaik}
+ - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.15-nikolaik}
- SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234}
- WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
ports:
4 changes: 2 additions & 2 deletions docs/modules/usage/how-to/cli-mode.md
@@ -50,7 +50,7 @@ LLM_API_KEY="sk_test_12345"
```bash
docker run -it \
--pull=always \
- -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.14-nikolaik \
+ -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik \
-e SANDBOX_USER_ID=$(id -u) \
-e WORKSPACE_MOUNT_PATH=$WORKSPACE_BASE \
-e LLM_API_KEY=$LLM_API_KEY \
@@ -59,7 +59,7 @@ docker run -it \
-v /var/run/docker.sock:/var/run/docker.sock \
--add-host host.docker.internal:host-gateway \
--name openhands-app-$(date +%Y%m%d%H%M%S) \
- docker.all-hands.dev/all-hands-ai/openhands:0.14 \
+ docker.all-hands.dev/all-hands-ai/openhands:0.15 \
python -m openhands.core.cli
```

4 changes: 2 additions & 2 deletions docs/modules/usage/how-to/headless-mode.md
@@ -44,7 +44,7 @@ LLM_API_KEY="sk_test_12345"
```bash
docker run -it \
--pull=always \
- -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.14-nikolaik \
+ -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik \
-e SANDBOX_USER_ID=$(id -u) \
-e WORKSPACE_MOUNT_PATH=$WORKSPACE_BASE \
-e LLM_API_KEY=$LLM_API_KEY \
@@ -54,6 +54,6 @@ docker run -it \
-v /var/run/docker.sock:/var/run/docker.sock \
--add-host host.docker.internal:host-gateway \
--name openhands-app-$(date +%Y%m%d%H%M%S) \
- docker.all-hands.dev/all-hands-ai/openhands:0.14 \
+ docker.all-hands.dev/all-hands-ai/openhands:0.15 \
python -m openhands.core.main -t "write a bash script that prints hi"
```
6 changes: 3 additions & 3 deletions docs/modules/usage/installation.mdx
@@ -11,16 +11,16 @@
The easiest way to run OpenHands is in Docker.

```bash
- docker pull docker.all-hands.dev/all-hands-ai/runtime:0.14-nikolaik
+ docker pull docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik

docker run -it --rm --pull=always \
- -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.14-nikolaik \
+ -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
- docker.all-hands.dev/all-hands-ai/openhands:0.14
+ docker.all-hands.dev/all-hands-ai/openhands:0.15
```

You can also run OpenHands in a scriptable [headless mode](https://docs.all-hands.dev/modules/usage/how-to/headless-mode), as an [interactive CLI](https://docs.all-hands.dev/modules/usage/how-to/cli-mode), or using the [OpenHands GitHub Action](https://docs.all-hands.dev/modules/usage/how-to/github-action).
2 changes: 1 addition & 1 deletion docs/modules/usage/runtimes.md
@@ -16,7 +16,7 @@ some flags being passed to `docker run` that make this possible:

```
docker run # ...
- -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.11-nikolaik \
+ -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik \
-v /var/run/docker.sock:/var/run/docker.sock \
# ...
```
7 changes: 3 additions & 4 deletions evaluation/benchmarks/EDA/README.md
@@ -4,12 +4,10 @@ This folder contains evaluation harness for evaluating agents on the Entity-dedu

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Start the evaluation


```bash
export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate another party of conversation)
./evaluation/benchmarks/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
@@ -37,7 +35,8 @@ For example,
```

## Reference
- ```

+ ```bibtex
@inproceedings{zhang2023entity,
title={Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games},
author={Zhang, Yizhe and Lu, Jiarui and Jaitly, Navdeep},
2 changes: 1 addition & 1 deletion evaluation/benchmarks/agent_bench/README.md
@@ -4,7 +4,7 @@ This folder contains evaluation harness for evaluating agents on the [AgentBench

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Start the evaluation

2 changes: 1 addition & 1 deletion evaluation/benchmarks/aider_bench/README.md
@@ -10,7 +10,7 @@ Hugging Face dataset based on the

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local
+ Please follow instruction [here](../../README.md#setup) to setup your local
development environment and LLM.

## Start the evaluation
7 changes: 4 additions & 3 deletions evaluation/benchmarks/biocoder/README.md
@@ -4,13 +4,14 @@ Implements evaluation of agents on BioCoder from the BioCoder benchmark introduc

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## BioCoder Docker Image

In the openhands branch of the Biocoder repository, we have slightly modified our original Docker image to work with the OpenHands environment. In the Docker image are testing scripts (`/testing/start_test_openhands.py` and aux files in `/testing_files/`) to assist with evaluation. Additionally, we have installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Notably, we have **not** packaged all repositories into the image, so they are downloaded at runtime.

**Before first execution, pull our Docker image with the following command**

```bash
docker pull public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0
```
@@ -19,7 +20,6 @@ To reproduce this image, please see the Dockerfile_Openopenhands in the `biocode

## Start the evaluation


```bash
./evaluation/benchmarks/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```
Expand Down Expand Up @@ -47,7 +47,8 @@ with current OpenHands version, then your command would be:
```

## Reference
- ```

+ ```bibtex
@misc{tang2024biocoder,
title={BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models},
author={Xiangru Tang and Bill Qian and Rick Gao and Jiakang Chen and Xinyun Chen and Mark Gerstein},
5 changes: 2 additions & 3 deletions evaluation/benchmarks/bird/README.md
@@ -4,7 +4,7 @@ Implements evaluation of agents on BIRD introduced in [Can LLM Already Serve as

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Run Inference on Bird

@@ -22,8 +22,7 @@ like to evaluate. It could also be a release tag like `0.6.2`.

For each problem, OpenHands is given a set number of iterations to fix the failing code. The history field shows each iteration's response as it corrects code that fails any test case.


- ```
+ ```json
{
"task_id": "0",
"instruction": "You are a SQL expert and need to complete the following text-to-SQL tasks.\n\nCREATE TABLE frpm\n(\n CDSCode TEXT not null\n primary key,\n `Academic Year` TEXT null,\n `County Code` TEXT null,\n `District Code` INTEGER null,\n `School Code` TEXT null,\n `County Name` TEXT null,\n `District Name` TEXT null,\n `School Name` TEXT null,\n `District Type` TEXT null,\n `School Type` TEXT null,\n `Educational Option Type` TEXT null,\n `NSLP Provision Status` TEXT null,\n `Charter School (Y/N)` INTEGER null,\n `Charter School Number` TEXT null,\n `Charter Funding Type` TEXT null,\n IRC INTEGER null,\n `Low Grade` TEXT null,\n `High Grade` TEXT null,\n `Enrollment (K-12)` REAL null,\n `Free Meal Count (K-12)` REAL null,\n `Percent (%) Eligible Free (K-12)` REAL null,\n `FRPM Count (K-12)` REAL null,\n `Percent (%) Eligible FRPM (K-12)` REAL null,\n `Enrollment (Ages 5-17)` REAL null,\n `Free Meal Count (Ages 5-17)` REAL null,\n `Percent (%) Eligible Free (Ages 5-17)` REAL null,\n `FRPM Count (Ages 5-17)` REAL null,\n `Percent (%) Eligible FRPM (Ages 5-17)` REAL null,\n `2013-14 CALPADS Fall 1 Certification Status` INTEGER null,\n foreign key (CDSCode) references schools (CDSCode)\n);\n\nCREATE TABLE satscores\n(\n cds TEXT not null\n primary key,\n rtype TEXT not null,\n sname TEXT null,\n dname TEXT null,\n cname TEXT null,\n enroll12 INTEGER not null,\n NumTstTakr INTEGER not null,\n AvgScrRead INTEGER null,\n AvgScrMath INTEGER null,\n AvgScrWrite INTEGER null,\n NumGE1500 INTEGER null,\n-- PctGE1500 double null,\n foreign key (cds) references schools (CDSCode)\n);\n\nCREATE TABLE schools\n(\n CDSCode TEXT not null\n primary key,\n NCESDist TEXT null,\n NCESSchool TEXT null,\n StatusType TEXT not null,\n County TEXT not null,\n District TEXT not null,\n School TEXT null,\n Street TEXT null,\n StreetAbr TEXT null,\n City TEXT null,\n Zip TEXT null,\n State TEXT null,\n MailStreet TEXT null,\n MailStrAbr TEXT null,\n MailCity TEXT null,\n MailZip TEXT null,\n MailState TEXT null,\n Phone TEXT null,\n Ext TEXT null,\n Website TEXT null,\n OpenDate DATE null,\n ClosedDate DATE null,\n Charter INTEGER null,\n CharterNum TEXT null,\n FundingType TEXT null,\n DOC TEXT not null,\n DOCType TEXT not null,\n SOC TEXT null,\n SOCType TEXT null,\n EdOpsCode TEXT null,\n EdOpsName TEXT null,\n EILCode TEXT null,\n EILName TEXT null,\n GSoffered TEXT null,\n GSserved TEXT null,\n Virtual TEXT null,\n Magnet INTEGER null,\n Latitude REAL null,\n Longitude REAL null,\n AdmFName1 TEXT null,\n AdmLName1 TEXT null,\n AdmEmail1 TEXT null,\n AdmFName2 TEXT null,\n AdmLName2 TEXT null,\n AdmEmail2 TEXT null,\n AdmFName3 TEXT null,\n AdmLName3 TEXT null,\n AdmEmail3 TEXT null,\n LastUpdate DATE not null\n);\n\n-- External Knowledge: Eligible free rate for K-12 = `Free Meal Count (K-12)` / `Enrollment (K-12)`\n\n-- Using valid SQLite and understanding External Knowledge, answer the following questions for the tables provided above.\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\nQuestion: What is the highest eligible free rate for K-12 students in the schools in Alameda County?\n\n\nPlease write the SQL in one line without line breaks.And write a new python file named 0.py to call the SQL you wrote.You need to follow the code template below:\n\n\n import sqlite3\n def execute_sql(db_path, sql):\n with sqlite3.connect(db_path) as conn:\n cursor = conn.cursor()\n cursor.execute(sql)\n result = cursor.fetchall()\n return result\n\n if __name__ == '__main__':\n sql = 
\"\" # filling your SQL here\n db_path = \"california_schools/california_schools.sqlite\"\n print(db_path)\n result = execute_sql(db_path, sql)\n print(result)\n \n\nEnvironment has been set up for you to start working.You may assume all necessary tools are installed.\n\nIMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\nYou SHOULD INCLUDE PROPER INDENTATION in your edit commands.\nWhen you think you have fixed the issue through code changes, please finish the interaction using the "finish" tool.\n",
2 changes: 1 addition & 1 deletion evaluation/benchmarks/browsing_delegation/README.md
@@ -7,7 +7,7 @@ If so, the browsing performance upper-bound of CodeActAgent will be the performa

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Run Inference

5 changes: 2 additions & 3 deletions evaluation/benchmarks/commit0_bench/README.md
@@ -4,19 +4,18 @@ This folder contains the evaluation harness that we built on top of the original

The evaluation consists of three steps:

- 1. Environment setup: [install python environment](../README.md#development-environment), [configure LLM config](../README.md#configure-openhands-and-your-llm).
+ 1. Environment setup: [install python environment](../../README.md#development-environment), [configure LLM config](../../README.md#configure-openhands-and-your-llm).
2. [Run Evaluation](#run-inference-on-commit0-instances): Generate an edit patch for each Commit0 Repo, and get the evaluation results

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## OpenHands Commit0 Instance-level Docker Support

OpenHands supports using the Commit0 Docker for **[inference](#run-inference-on-commit0-instances)**.
This is now the default behavior.


## Run Inference on Commit0 Instances

Make sure your Docker daemon is running, and you have ample disk space (at least 200-500GB, depends on the Commit0 set you are running on) for the [instance-level docker image](#openhands-commit0-instance-level-docker-support).
4 changes: 3 additions & 1 deletion evaluation/benchmarks/gaia/README.md
@@ -4,9 +4,10 @@ This folder contains evaluation harness for evaluating agents on the [GAIA bench

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Run the evaluation

We are using the GAIA dataset hosted on [Hugging Face](https://huggingface.co/datasets/gaia-benchmark/GAIA).
Please accept the terms and make sure to have logged in on your computer by `huggingface-cli login` before running the evaluation.

@@ -41,6 +42,7 @@ For example,
## Get score

Then you can get stats by running the following command:

```bash
python ./evaluation/benchmarks/gaia/get_score.py \
--file <path_to/output.json>
2 changes: 1 addition & 1 deletion evaluation/benchmarks/gorilla/README.md
@@ -4,7 +4,7 @@ This folder contains evaluation harness we built on top of the original [Gorilla

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Run Inference on APIBench Instances

13 changes: 9 additions & 4 deletions evaluation/benchmarks/gpqa/README.md
@@ -3,6 +3,7 @@
Implements the evaluation of agents on the GPQA benchmark introduced in [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](https://arxiv.org/abs/2308.07124).

This code implements the evaluation of agents on the GPQA Benchmark with Open Book setting.

- The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
- Even experts in the corresponding domains achieve only 65% accuracy.
- State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
@@ -11,20 +12,24 @@ This code implements the evaluation of agents on the GPQA Benchmark with Open Bo
Accurately solving the above graduate-level questions requires both tool use (e.g., Python for calculations) and web search for finding related facts, as the information required for the questions might not be part of the LLM's knowledge / training data.

Further references:
- - https://arxiv.org/pdf/2311.12022
- - https://paperswithcode.com/dataset/gpqa
- - https://github.com/idavidrein/gpqa

+ - <https://arxiv.org/pdf/2311.12022>
+ - <https://paperswithcode.com/dataset/gpqa>
+ - <https://github.com/idavidrein/gpqa>

## Setup Environment and LLM Configuration

- Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+ Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Run Inference on GPQA Benchmark

'gpqa_main', 'gpqa_diamond', 'gpqa_experts', 'gpqa_extended' -- data split options
From the root of the OpenHands repo, run the following command:

```bash
./evaluation/benchmarks/gpqa/scripts/run_infer.sh [model_config_name] [git-version] [num_samples_eval] [data_split] [AgentClass]
```

You can replace `model_config_name` with any model you set up in `config.toml`.

- `model_config_name`: The model configuration name from `config.toml` that you want to evaluate.
(Diffs for the remaining changed files are not shown.)
