Reproducibility #43

Merged
merged 67 commits into from
Oct 2, 2024
Changes from 62 commits
d5d067e
core functions
recursix Sep 4, 2024
df2aaeb
switch to dask
recursix Sep 4, 2024
edb162c
removing joblib dependency and adding dask
recursix Sep 4, 2024
82ff348
fixing imports
ThibaultLSDC Sep 4, 2024
0dbdd98
handles multiple backends
recursix Sep 11, 2024
7da5cac
ensure asyncio loop creation
recursix Sep 11, 2024
25e241a
more tests
recursix Sep 11, 2024
01c8652
setting dashboard address to None
recursix Sep 11, 2024
c6370bd
minor
recursix Sep 11, 2024
775f135
Merge branch 'main' into switch-to-dask
recursix Sep 11, 2024
0198811
Merge branch 'main' into switch-to-dask
recursix Sep 12, 2024
7ad0e67
Finally found a way to make it work
recursix Sep 16, 2024
a396d9a
initial reproducibility files
recursix Sep 16, 2024
3db84f7
Seems to be superfluous
recursix Sep 19, 2024
ed9e568
adding a reproducibility journal
recursix Sep 19, 2024
85ac6fa
minor update
recursix Sep 19, 2024
ad5110e
more robust
recursix Sep 19, 2024
baf9afa
adding reproducibility tools
recursix Sep 19, 2024
b0268b6
fix white listing
recursix Sep 20, 2024
bb7ddb0
minor
recursix Sep 20, 2024
8b4884f
minor
recursix Sep 20, 2024
e685f10
minor
recursix Sep 20, 2024
c99bdf7
Merge branch 'main' into reproducibility
recursix Sep 20, 2024
ac8b7f8
minor
recursix Sep 20, 2024
295f010
minor fix
recursix Sep 20, 2024
5ac4a7c
more tests
recursix Sep 20, 2024
d4cf969
more results yay
recursix Sep 20, 2024
1dc720b
disabling this test
recursix Sep 20, 2024
82f6181
update
recursix Sep 20, 2024
eb871ac
update
recursix Sep 20, 2024
fa0c489
black
recursix Sep 20, 2024
abd3212
maybe fixing github workflow ?
ThibaultLSDC Sep 20, 2024
4ebee28
make get_git_username great again
recursix Sep 20, 2024
58f5ec7
trigger change
recursix Sep 20, 2024
37bbb5f
Merge branch 'reproducibility' of github.com:ServiceNow/AgentLab into…
recursix Sep 20, 2024
f621648
new browsergym
recursix Sep 20, 2024
60a1b22
GPT-4o result (and new comment column)
recursix Sep 21, 2024
dd9aa0d
Seems like there was a change to 4o flags, trying these
recursix Sep 21, 2024
54ea0af
minor comment
recursix Sep 21, 2024
24214e5
better xray
recursix Sep 21, 2024
b8da07b
minor fix
recursix Sep 21, 2024
1ecaf9b
adding a comment field
recursix Sep 21, 2024
5aba9bc
new agent
recursix Sep 21, 2024
fe561b9
Merge branch 'main' into reproducibility
recursix Sep 21, 2024
7bf424e
another test with GPT-4o
recursix Sep 21, 2024
7e0ab03
adding llama3 from openrouter
recursix Sep 21, 2024
03eae32
fix naming
recursix Sep 21, 2024
796c37e
unused import
recursix Sep 23, 2024
8fc49e9
new summary tools and remove "_args" from columns in results
recursix Sep 23, 2024
7e2afd3
add Llama
recursix Sep 23, 2024
f08e47b
initial code for reproducibility agent
recursix Sep 23, 2024
326710a
Merge branch 'main' into reproducibility
recursix Sep 23, 2024
f7494cb
adjust inspect results
recursix Sep 25, 2024
37d8961
Merge branch 'main' into reproducibility
recursix Sep 25, 2024
4066da3
infer from benchmark
recursix Sep 26, 2024
ef204d3
fix reproducibility agent
recursix Sep 26, 2024
5112abe
prevent the repro_dir to be an index variable
recursix Sep 26, 2024
5325c69
updating repro agent stats
recursix Sep 27, 2024
02e028f
Merge branch 'main' into reproducibility
recursix Sep 27, 2024
d8ad4bd
Reproducibility agent
recursix Oct 1, 2024
fe27819
instructions to setup workarena
recursix Oct 1, 2024
4a8f078
fixing tests
ThibaultLSDC Oct 1, 2024
6474558
handles better a few edge cases
recursix Oct 1, 2024
42fdcf1
Merge branch 'reproducibility' of github.com:ServiceNow/AgentLab into…
recursix Oct 1, 2024
628d1c8
default progress function to None
recursix Oct 2, 2024
69f147a
minor formatting
recursix Oct 2, 2024
146ad62
minor
recursix Oct 2, 2024
37 changes: 27 additions & 10 deletions README.md
Expand Up @@ -54,6 +54,33 @@ export MINIWOB_URL="file://$HOME/dev/miniwob-plusplus/miniwob/html/miniwob/"
```
</details>

<details>

<summary>WorkArena</summary>

See [detailed instructions on workarena github](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started)

At a glance:
* [Sign in](https://developer.servicenow.com/) and request a `washington` instance.
* Once the instance is ready, you should see `<your instance URL>` and `<your-instance-password>`
* Add these to your `.bashrc` (or `.zshrc`) and `source` it. Note: wrap each
value in single quotes, unless your password itself contains a single quote,
in which case it needs different quoting.
```bash
export SNOW_INSTANCE_URL='https://<your-instance-number>.service-now.com/'
export SNOW_INSTANCE_UNAME='admin'
export SNOW_INSTANCE_PWD='<your-instance-password>'
```

```bash
pip install browsergym-workarena
playwright install
workarena-install
```


</details>
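
Before launching a WorkArena run, it can help to confirm that the three variables from the snippet above are actually visible to Python. The following is a minimal sketch; `missing_workarena_vars` is a hypothetical helper written for illustration and is not part of AgentLab or WorkArena.

```python
import os

# Variable names taken from the README snippet above.
REQUIRED_VARS = ("SNOW_INSTANCE_URL", "SNOW_INSTANCE_UNAME", "SNOW_INSTANCE_PWD")


def missing_workarena_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not an AgentLab function.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_workarena_vars()
    if missing:
        print("Missing WorkArena variables: " + ", ".join(missing))
    else:
        print("All WorkArena variables are set.")
```

If a variable is reported missing after editing `.bashrc`, the usual cause is that the file was not re-sourced in the current shell.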

<details>
<summary>WebArena on AWS</summary>
TODO
Expand All @@ -65,17 +92,7 @@ TODO
</details>


<details>

<summary>WorkArena</summary>

```bash
export SNOW_INSTANCE_URL="https://<your-instance-number>.service-now.com/"
export SNOW_INSTANCE_UNAME="admin"
export SNOW_INSTANCE_PWD=<your-instance-password>
```

</details>


## Launch experiments
10 changes: 6 additions & 4 deletions reproducibility_journal.csv
@@ -1,5 +1,7 @@
git_user,agent_name,benchmark,benchmark_version,date,avg_reward,std_err,n_err,n_completed,os,python_version,playwright_version,agentlab_version,agentlab_git_hash,agentlab__local_modifications,browsergym_version,browsergym_git_hash,browsergym__local_modifications
recursix,GenericAgent-gpt-4o-mini-2024-07-18,miniwob_tiny_test,0.6.3,2024-09-19_21-07-34,0.75,0.217,0,4/4,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,c99bdf74c98f323cc6a646467ba5f21154b6fd18,,0.6.4,b73531271d2ce688c104eb4dfba2819583f1ba36,
recursix,GenericAgent-gpt-4o-mini-2024-07-18,miniwob_tiny_test,0.6.3,2024-09-19_21-28-58,1.0,0.0,0,4/4,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,c99bdf74c98f323cc6a646467ba5f21154b6fd18," M: reproducibility_journal.csv
git_user,agent_name,benchmark,benchmark_version,date,avg_reward,std_err,n_err,n_completed,comment,os,python_version,playwright_version,agentlab_version,agentlab_git_hash,agentlab__local_modifications,browsergym_version,browsergym_git_hash,browsergym__local_modifications
recursix,GenericAgent-gpt-4o-mini-2024-07-18,miniwob_tiny_test,0.6.3,2024-09-19_21-07-34,0.75,0.217,0,4/4,,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,c99bdf74c98f323cc6a646467ba5f21154b6fd18,,0.6.4,b73531271d2ce688c104eb4dfba2819583f1ba36,
recursix,GenericAgent-gpt-4o-mini-2024-07-18,miniwob_tiny_test,0.6.3,2024-09-19_21-28-58,1.0,0.0,0,4/4,,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,c99bdf74c98f323cc6a646467ba5f21154b6fd18," M: reproducibility_journal.csv
M: src/agentlab/experiments/task_collections.py",0.6.4,b73531271d2ce688c104eb4dfba2819583f1ba36,
recursix,GenericAgent-gpt-4o-mini-2024-07-18,miniwob,0.6.3,2024-09-20_07-16-21,0.546,0.02,0,625/625,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,295f01005faf8f2c73a31be6a18cec19d563b54b,,0.6.4,b73531271d2ce688c104eb4dfba2819583f1ba36,
recursix,GenericAgent-gpt-4o-mini-2024-07-18,miniwob,0.6.3,2024-09-20_07-16-21,0.546,0.02,0,625/625,,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,295f01005faf8f2c73a31be6a18cec19d563b54b,,0.6.4,b73531271d2ce688c104eb4dfba2819583f1ba36,
recursix,GenericAgent-gpt-4o-2024-05-13,miniwob,0.6.3,2024-09-20_22-09-43,0.656,0.019,0,625/625,,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,f6216486d5faac2c8b3fb0a63e114e5a4bafde47,,0.6.4,8cef8fe34940ff490d0cc06b0c8f100180d09d43,
recursix,GenericAgent-gpt-4o-2024-05-13,miniwob,0.6.3,2024-09-21_12-04-39,0.656,0.019,0,625/625,None,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.2,1.39.0,0.2.1,fe561b93c5f053e9f9625358862f542523b5e14a,,0.7.0,ed6d6992ef64bfb91aca7002d33cb6ed5ec031ef,
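
The journal contains quoted fields that span several lines (the `agentlab__local_modifications` column), so it should be read with a real CSV parser rather than split line by line. The sketch below uses only the standard library. `SAMPLE` is an abbreviated subset of the columns with values copied from two rows of the diff, and `load_journal` is a hypothetical helper, not an AgentLab function.

```python
import csv
import io

# Abbreviated sample: a subset of the journal's columns, with values copied
# from two rows shown in the diff. The first data row has a quoted field
# that contains a line break.
SAMPLE = '''agent_name,benchmark,avg_reward,std_err,n_completed,agentlab__local_modifications
GenericAgent-gpt-4o-mini-2024-07-18,miniwob_tiny_test,1.0,0.0,4/4," M: reproducibility_journal.csv
M: src/agentlab/experiments/task_collections.py"
GenericAgent-gpt-4o-2024-05-13,miniwob,0.656,0.019,625/625,
'''


def load_journal(text):
    """Parse journal text into a list of dicts, converting numeric columns."""
    rows = []
    # newline="" leaves line endings untouched so the csv module can handle
    # line breaks inside quoted fields itself.
    for row in csv.DictReader(io.StringIO(text, newline="")):
        done, total = row["n_completed"].split("/")
        row["avg_reward"] = float(row["avg_reward"])
        row["std_err"] = float(row["std_err"])
        row["n_completed"] = (int(done), int(total))
        rows.append(row)
    return rows


rows = load_journal(SAMPLE)
for r in rows:
    print(r["agent_name"], r["benchmark"], r["avg_reward"], r["n_completed"])
```

For the real file, the same reader can be fed `open("reproducibility_journal.csv", newline="")` in place of the `StringIO` object.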
11 changes: 8 additions & 3 deletions src/agentlab/agents/dynamic_prompting.py
Expand Up @@ -577,9 +577,14 @@ def _parse_answer(self, text_answer):
ans_dict = {"action": code, "parse_error": str(e)}

try:
# just check if action can be mapped to python code but keep action as is
# the environment will be responsible for mapping it to python
self.action_set.to_python_code(ans_dict["action"])
if ans_dict["action"] == "None":
# Used by reproducibility agent for backward compatibility of
# traces missing LLM's response in chat messages.
ans_dict["action"] = None
else:
# just check if action can be mapped to python code but keep action as is
# the environment will be responsible for mapping it to python
self.action_set.to_python_code(ans_dict["action"])
except Exception as e:
raise ParseError(
f"Error while parsing action\n: {e}\n"
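
The change in this hunk can be read as: the literal string `"None"` is converted to Python `None` and skips validation, while any other action is still checked against the action set. Below is a self-contained sketch of that control flow. `StubActionSet`, `validate_action`, and the local `ParseError` are stand-ins written for illustration; only the method name `to_python_code` and the `"None"` sentinel come from the diff.

```python
class ParseError(Exception):
    """Stand-in for the ParseError raised in the diff (assumed shape)."""


class StubActionSet:
    """Hypothetical action set; only the method name mirrors the diff."""

    def to_python_code(self, action):
        # Toy validation: accept anything that looks like a call, e.g. click('12').
        if not (isinstance(action, str) and "(" in action and action.endswith(")")):
            raise ValueError(f"cannot map {action!r} to python code")
        return action


def validate_action(ans_dict, action_set):
    """Validate ans_dict['action'], treating the string "None" as no action."""
    try:
        if ans_dict["action"] == "None":
            # Sentinel for traces that lack the LLM's response.
            ans_dict["action"] = None
        else:
            # Only check that the action can be mapped; keep it unchanged.
            action_set.to_python_code(ans_dict["action"])
    except Exception as e:
        raise ParseError(f"Error while parsing action\n: {e}\n") from e
    return ans_dict


print(validate_action({"action": "None"}, StubActionSet()))
print(validate_action({"action": "click('12')"}, StubActionSet()))
```

One consequence of this design is that downstream code must accept `None` as an action value, since the sentinel bypasses the action-set check entirely.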
4 changes: 2 additions & 2 deletions src/agentlab/agents/generic_agent/__init__.py
@@ -1,7 +1,7 @@
from .agent_configs import (
AGENT_3_5,
AGENT_8B,
AGENT_70B,
AGENT_LLAMA3_70B,
AGENT_CUSTOM,
RANDOM_SEARCH_AGENT,
AGENT_4o,
Expand All @@ -14,7 +14,7 @@
"AGENT_4o",
"AGENT_4o_MINI",
"AGENT_4o_VISION",
"AGENT_70B",
"AGENT_LLAMA3_70B",
"AGENT_8B",
"RANDOM_SEARCH_AGENT",
"AGENT_CUSTOM",
16 changes: 10 additions & 6 deletions src/agentlab/agents/generic_agent/agent_configs.py
Expand Up @@ -96,7 +96,7 @@
)

# llama3-70b default config
FLAGS_70B = GenericPromptFlags(
FLAGS_LLAMA3_70B = GenericPromptFlags(
obs=dp.ObsFlags(
use_html=False,
use_ax_tree=True,
Expand Down Expand Up @@ -135,9 +135,13 @@
add_missparsed_messages=True,
)

AGENT_70B = GenericAgentArgs(
chat_model_args=CHAT_MODEL_ARGS_DICT["meta-llama/Meta-Llama-3-70B-Instruct"],
flags=FLAGS_70B,
AGENT_LLAMA3_70B = GenericAgentArgs(
chat_model_args=CHAT_MODEL_ARGS_DICT["openrouter/meta-llama/llama-3-70b-instruct"],
flags=FLAGS_LLAMA3_70B,
)
AGENT_LLAMA31_70B = GenericAgentArgs(
chat_model_args=CHAT_MODEL_ARGS_DICT["openrouter/meta-llama/llama-3.1-70b-instruct"],
flags=FLAGS_LLAMA3_70B,
)

FLAGS_8B = GenericPromptFlags(
Expand Down Expand Up @@ -208,8 +212,8 @@
action=dp.ActionFlags(
multi_actions=False,
action_set="bid",
long_description=True,
individual_examples=True,
long_description=False,
individual_examples=False,
),
use_plan=False,
use_criticise=False,
1 change: 1 addition & 0 deletions src/agentlab/agents/generic_agent/generic_agent.py
Expand Up @@ -27,6 +27,7 @@ def __post_init__(self):
pass

def set_benchmark(self, benchmark):
"""Override some flags based on the benchmark."""
if benchmark == "miniwob":
self.flags.obs.use_html = True
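
To illustrate what `set_benchmark` does, here is a minimal, self-contained sketch using plain dataclasses. `ObsFlags`, `PromptFlags`, and `AgentArgs` are simplified stand-ins for the AgentLab classes and model only the two observation flags visible in this diff; the real classes have many more fields.

```python
from dataclasses import dataclass, field


@dataclass
class ObsFlags:
    # Only the two observation flags visible in the diff are modelled here.
    use_html: bool = False
    use_ax_tree: bool = True


@dataclass
class PromptFlags:
    obs: ObsFlags = field(default_factory=ObsFlags)


@dataclass
class AgentArgs:
    """Hypothetical stand-in for GenericAgentArgs."""

    flags: PromptFlags = field(default_factory=PromptFlags)

    def set_benchmark(self, benchmark):
        """Override some flags based on the benchmark."""
        if benchmark == "miniwob":
            self.flags.obs.use_html = True


args = AgentArgs()
args.set_benchmark("miniwob")
print(args.flags.obs.use_html)
```

Note that the override mutates the flags object in place. In this sketch, two `AgentArgs` built from the same `PromptFlags` instance would both see the change; whether AgentLab copies flags before applying the override is not shown in this diff.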
