Benchmarking Performance Different Than Reported #3

Open
jiadingfang opened this issue Jul 20, 2023 · 4 comments
@jiadingfang

Hi, thanks for such an all-around repo for working with 3DSG planning!

I would like to reproduce the benchmarking results in the benchmark folder of your repo to make sure everything runs properly before testing my own planners. However, in my testing the planners behave quite differently from what is reported.

As of 07/20/2023, I ran all available planners in pddlgym_planners/__init__.py with pddl_domain taskographyv2tiny1, using the command python scripts/benchmark/plan.py --domain-name $DOMAIN_NAME --planner $PLANNER. The results are as follows:

  1. FF: error while running

gcc -o ff main.o memory.o output.o parse.o inst_pre.o inst_easy.o inst_hard.o inst_final.o orderings.o relax.o search.o scan-fct_pddl.tab.o scan-ops_pddl.tab.o -Wall -g -std=gnu99 -O6 -lm
/usr/bin/ld: search.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/search.c:110: multiple definition of `lcurrent_goals'; relax.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/relax.c:111: first defined here
/usr/bin/ld: scan-fct_pddl.tab.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/lex-fct_pddl.l:9: multiple definition of `gbracket_count'; main.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/main.c:147: first defined here
collect2: error: ld returned 1 exit status
make: *** [makefile:74: ff] Error 1

  2. FF-X: the same error as FF
  3. FD-lama-first: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}

  4. Cerberus-seq-sat: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}

  5. Cerberus-seq-agl: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}

  6. DecStar-agl-decoupled: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}

  7. lapkt-bfws: slightly different results from benchmark/taskographyv2tiny1_bfws. My results:

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [03:21<00:00, 5.04s/it]
{'failure_rate': 0.0,
'num_node_expansions': 468.48387096774195,
'num_node_expansions_std': 192.6469059835003,
'plan_length': 14.709677419354838,
'plan_length_std': 3.828530825661262,
'search_time': 0.4536315483870968,
'search_time_std': 0.3696494008728636,
'success_rate': 0.775,
'timeout_rate': 0.225,
'total_time': 0.4536315483870968,
'total_time_std': 0.3696494008728636}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55/55 [05:57<00:00, 6.51s/it]
{'failure_rate': 0.0,
'num_node_expansions': 573.3225806451613,
'num_node_expansions_std': 338.3147405651472,
'plan_length': 15.32258064516129,
'plan_length_std': 4.394917128465223,
'search_time': 0.5754497419354839,
'search_time_std': 0.8765903350261305,
'success_rate': 0.5636363636363636,
'timeout_rate': 0.43636363636363634,
'total_time': 0.5754497419354839,
'total_time_std': 0.8765903350261305}

reported in benchmark/taskographyv2tiny1_bfws/taskographyv2tiny1_bfws_test.json:

{
"failure_rate": 0.0,
"num_node_expansions": 609.6279069767442,
"num_node_expansions_std": 339.64208406455214,
"plan_length": 15.55813953488372,
"plan_length_std": 4.15570398469826,
"search_time": 0.8969197023255813,
"search_time_std": 1.3382104019851668,
"success_rate": 0.7818181818181819,
"timeout_rate": 0.21818181818181817,
"total_time": 0.8969197023255813,
"total_time_std": 1.3382104019851668
}

  8. FD-seq-opt-lmcut: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}

  9. Delfi: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}

  10. DecStar-opt-decoupled: plan failure

{'failure_rate': 1.0,
'num_node_expansions': nan,
'num_node_expansions_std': nan,
'plan_length': nan,
'plan_length_std': nan,
'search_time': nan,
'search_time_std': nan,
'success_rate': 0.0,
'timeout_rate': 0.0,
'total_time': nan,
'total_time_std': nan}
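
For anyone repeating this sweep, the per-planner runs above could be scripted roughly as follows. This is a minimal sketch, not part of the repo: the planner names and the two flags are taken from this issue, while `build_command` and `run_all` are hypothetical helper names.

```python
import subprocess
import sys

# Planner names copied from the list in this issue.
PLANNERS = [
    "FF", "FF-X", "FD-lama-first", "Cerberus-seq-sat", "Cerberus-seq-agl",
    "DecStar-agl-decoupled", "lapkt-bfws", "FD-seq-opt-lmcut", "Delfi",
    "DecStar-opt-decoupled",
]


def build_command(domain_name, planner, python=sys.executable):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    return [python, "scripts/benchmark/plan.py",
            "--domain-name", domain_name, "--planner", planner]


def run_all(domain_name, planners=PLANNERS):
    """Run every planner in turn and collect the process exit codes."""
    exit_codes = {}
    for planner in planners:
        # No check=True: a failing planner should not stop the sweep.
        result = subprocess.run(build_command(domain_name, planner))
        exit_codes[planner] = result.returncode
    return exit_codes
```

Calling `run_all("taskographyv2tiny1")` from the repository root would run the ten planners one after another. Note that a zero exit code only means the script finished; the printed statistics still have to be inspected for nan values.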

I followed the installation instructions at https://github.com/taskography/taskography-api#installation, with only a few changes to fix some errors:

  0. Ubuntu 22.04.
  1. Create an empty conda env with python=3.7.
  2. Add the missing comma , at the end of the line
    "tqdm"
    to separate it from the line that follows.
  3. Run pip install -e . and pip install -r requirements.txt.
  4. Downgrade importlib-metadata from 6.7.0 to 4.12.0 to avoid the error 'EntryPoints' object has no attribute 'get'. Source: https://stackoverflow.com/questions/73929564/entrypoints-object-has-no-attribute-get-digital-ocean
  5. Move the statement "from __future__ import annotations" to the first line of the file to avoid the error "from __future__ imports must occur at the beginning of the file". Source: https://stackoverflow.com/questions/38688504/from-future-imports-must-occur-at-the-beginning-of-the-file-what-defines
  6. Run scripts/validate/loader.py and scripts/validate/taskography_env.py; both pass.
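
The error in step 4 suggests that some code calls the dict-style `get` on the object returned by `entry_points()`, which newer importlib-metadata releases no longer provide. As an alternative to pinning the package, a compatibility helper along these lines might work. It is only a sketch: `entry_points_for_group` is a hypothetical name, and it relies on duck typing rather than on any particular package version.

```python
def entry_points_for_group(entry_points, group):
    """Return the entry points registered under `group` for either API style.

    `entry_points` is whatever `entry_points()` returned (hypothetical helper).
    """
    if hasattr(entry_points, "select"):
        # Newer API: entry points are selected by keyword.
        return list(entry_points.select(group=group))
    # Older API: a dict mapping group name to a list of entry points.
    return list(entry_points.get(group, []))
```

The downgrade described in step 4 remains the fix that was actually tested here.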

I'm happy to provide more details if needed. I would highly appreciate any help, as a solid benchmark is a prerequisite for any future research. Thanks in advance!

@krrish94
Contributor

I think the planners did not compile at all, which is why you get nans nearly everywhere.

Tagging this as an install issue, and looping in @agiachris @Khodeir

@krrish94
Contributor

I looked into this further -- things seem fine on my end.

Install dependencies + version info

  1. I started out with a fresh conda environment (python=3.10), on Ubuntu 20.04
  2. I installed all dependencies from the current requirements.txt file in taskography-api -- with two changes. I had to comment out the gym==0.21.0 requirement here and here
  3. Added the missing comma here -- thanks for this catch!
  4. I ran the same set of commands raised in the issue above, but they executed correctly on my end.

C, CXX compiler versions used

-- The C compiler identification is GNU 8.4.0
-- The CXX compiler identification is GNU 8.4.0

Run planners

python scripts/benchmark/plan.py --domain-name taskographyv2tiny1 --planner FF

{'failure_rate': 0.0,
 'num_node_expansions': 18.65,
 'num_node_expansions_std': 4.514144437210666,
 'plan_length': 15.225,
 'plan_length_std': 4.09565318355937,
 'search_time': 0.059750000000000004,
 'search_time_std': 0.05884248040319171,
 'success_rate': 1.0,
 'timeout_rate': 0.0,
 'total_time': 0.059750000000000004,
 'total_time_std': 0.05884248040319171}

FF-X

{'failure_rate': 0.0,
 'num_node_expansions': 18.65,
 'num_node_expansions_std': 4.514144437210666,
 'plan_length': 15.225,
 'plan_length_std': 4.09565318355937,
 'search_time': 0.04825,
 'search_time_std': 0.0485225463058154,
 'success_rate': 1.0,
 'timeout_rate': 0.0,
 'total_time': 0.04825,
 'total_time_std': 0.0485225463058154}

FD-lama-first

{'failure_rate': 0.0,
 'num_node_expansions': 19.5625,
 'num_node_expansions_std': 4.022883760438524,
 'plan_length': 14.84375,
 'plan_length_std': 3.841462734102727,
 'search_time': 0.0116954025,
 'search_time_std': 0.00940676205541192,
 'success_rate': 0.8,
 'timeout_rate': 0.2,
 'total_time': 0.49729570312500004,
 'total_time_std': 0.3234760420509391}

Cerberus-seq-sat

{'failure_rate': 0.7,
 'num_node_expansions': 0.0,
 'num_node_expansions_std': 0.0,
 'plan_length': 13.0,
 'plan_length_std': 0.0,
 'search_time': 0.708582,
 'search_time_std': 0.0,
 'success_rate': 0.025,
 'timeout_rate': 0.275,
 'total_time': 0.841963,
 'total_time_std': 0.0}

Cerberus-seq-agl (on the train split -- all problems time out; on the test split, a couple valid plans are found)

{'failure_rate': 0.9272727272727272,
 'num_node_expansions': 24.666666666666668,
 'num_node_expansions_std': 0.9428090415820634,
 'plan_length': 13.0,
 'plan_length_std': 0.0,
 'search_time': 0.3534086666666667,
 'search_time_std': 0.01469978123042048,
 'success_rate': 0.05454545454545454,
 'timeout_rate': 0.01818181818181818,
 'total_time': 0.6266036666666667,
 'total_time_std': 0.028384116501232768}

DecStar-agl-decoupled

{'failure_rate': 0.925,
 'num_node_expansions': 18.0,
 'num_node_expansions_std': 0.0,
 'plan_length': 15.0,
 'plan_length_std': 0.0,
 'search_time': 0.09,
 'search_time_std': 0.0,
 'success_rate': 0.025,
 'timeout_rate': 0.05,
 'total_time': 0.26,
 'total_time_std': 0.0}

FD-seq-opt-lmcut

{'failure_rate': 0.0,
 'num_node_expansions': 109.76666666666667,
 'num_node_expansions_std': 208.1750358605843,
 'plan_length': 14.366666666666667,
 'plan_length_std': 3.3910011632096047,
 'search_time': 0.44586308,
 'search_time_std': 0.463168532250191,
 'success_rate': 0.75,
 'timeout_rate': 0.25,
 'total_time': 0.58704412,
 'total_time_std': 0.49714965928707583}

Delfi

{'failure_rate': 0.0,
 'num_node_expansions': 55.166666666666664,
 'num_node_expansions_std': 33.69182228507222,
 'plan_length': 14.416666666666666,
 'plan_length_std': 3.882832585740581,
 'search_time': 0.03140590499999999,
 'search_time_std': 0.020406450112085275,
 'success_rate': 0.3,
 'timeout_rate': 0.7,
 'total_time': 0.27425427499999994,
 'total_time_std': 0.5906703315447319}

Compilation issue

Returning to the issue in question, I think it'd be worth looking into compiler changes across Ubuntu 20.04 and 22.04, and whether they are causing the FF and FD variants to not compile on your end. One other difference I notice is that I used python 3.10 (although I'm not sure that's what's causing the FF and FD planners to not compile on your end).
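
One concrete compiler change that may be relevant: GCC 10 and later default to -fno-common, which turns duplicate global variable definitions across object files into "multiple definition" linker errors like the ones in the FF build log. Whether this is the actual cause here is an assumption. The sketch below only checks which side of that change the local compiler is on; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_gcc_major(version_text):
    """Extract the major version from text such as '11' or '8.4.0'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognized gcc version: {version_text!r}")
    return int(match.group(1))


def defaults_to_no_common(major):
    """GCC 10 and later treat -fno-common as the default."""
    return major >= 10


def local_gcc_major():
    """Return the major version of the gcc on PATH, or None if gcc is missing."""
    if shutil.which("gcc") is None:
        return None
    output = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=True
    ).stdout
    return parse_gcc_major(output)
```

If `defaults_to_no_common(local_gcc_major())` is true, one thing to try is rebuilding FF with -fcommon added to the compiler flags in its makefile. That this resolves the build has not been verified in this thread.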

@jiadingfang
Author

Thanks for the fast and detailed response, really appreciate it!

I followed your advice, starting with a fresh ubuntu:20.04 env and performing the installation. This time I get more planners to work (yay!), but a few still report nan failures. The detailed test results are listed at the bottom.

To facilitate reproducibility, I used Docker to produce the results. I made a Dockerfile and a run script, as in my PR. You can run all my tests with python scripts/benchmark/test.py once you are inside the interactive Docker container started by the docker.sh script.

I hope we can together work on a good Dockerfile so that other people can use it more easily because this is such a good benchmark. I will also contribute a docker image once we fix the dependency issues and pass all tests.

FF-X

{
    "failure_rate": 0.0,
    "num_node_expansions": 20.472727272727273,
    "num_node_expansions_std": 6.452214688009482,
    "plan_length": 16.70909090909091,
    "plan_length_std": 5.826811781513976,
    "search_time": 0.08345454545454543,
    "search_time_std": 0.0829066985300767,
    "success_rate": 1.0,
    "timeout_rate": 0.0,
    "total_time": 0.08345454545454543,
    "total_time_std": 0.0829066985300767
}

FF

{
    "failure_rate": 0.0,
    "num_node_expansions": 20.472727272727273,
    "num_node_expansions_std": 6.452214688009482,
    "plan_length": 16.70909090909091,
    "plan_length_std": 5.826811781513976,
    "search_time": 0.09854545454545455,
    "search_time_std": 0.09634064905924718,
    "success_rate": 1.0,
    "timeout_rate": 0.0,
    "total_time": 0.09854545454545455,
    "total_time_std": 0.09634064905924718
}

FD-seq-opt-lmcut

{
    "failure_rate": 0.0,
    "num_node_expansions": 111.58333333333333,
    "num_node_expansions_std": 112.63988217126095,
    "plan_length": 14.555555555555555,
    "plan_length_std": 3.345348714961022,
    "search_time": 0.7273536333333332,
    "search_time_std": 0.8763859731543729,
    "success_rate": 0.6545454545454545,
    "timeout_rate": 0.34545454545454546,
    "total_time": 0.8770647888888887,
    "total_time_std": 0.9133111112508425
}

FD-lama-first

{
    "failure_rate": 0.0,
    "num_node_expansions": 20.25,
    "num_node_expansions_std": 4.247057805116384,
    "plan_length": 15.375,
    "plan_length_std": 4.090767042988393,
    "search_time": 0.01569877825,
    "search_time_std": 0.01445009488748809,
    "success_rate": 0.7272727272727273,
    "timeout_rate": 0.2727272727272727,
    "total_time": 0.5665527499999999,
    "total_time_std": 0.43317077262043835
}

Delfi

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

DecStar-opt-decoupled

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

DecStar-agl-decoupled

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

Cerberus-seq-sat

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

Cerberus-seq-agl

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

lapkt-bfws

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

I notice that the cases with 'nan' failures are always accompanied by the following warning:

[PlannerHandler] Fetching planner Cerberus-seq-sat
Instantiating Cerberus with --alias seq-sat-cerberus2018
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:21<00:00,  1.87it/s]
scripts/benchmark/plan.py:70: RuntimeWarning: Mean of empty slice.
  stat_mean = float(planner_stats[stat].mean().item())
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:269: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:261: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
{'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
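
The warnings are consistent with the statistics being averaged over successful runs only: when no plan is found, the mean and standard deviation are taken over an empty array, which yields nan. A guarded aggregation could make that case explicit. The following is a standard-library sketch with a hypothetical function name, not the code in scripts/benchmark/plan.py.

```python
import math
import statistics


def summarize(values):
    """Mean and population std of `values`; (nan, nan) when no run succeeded."""
    if len(values) == 0:
        # Nothing to average: report nan explicitly instead of dividing by zero.
        return math.nan, math.nan
    return statistics.mean(values), statistics.pstdev(values)
```

With this shape, a planner that fails on every problem still reports nan for its statistics, but without the chain of RuntimeWarning messages, so a failure_rate of 1.0 is the signal to look at the planner's build or invocation.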

I'm glad to provide more details if necessary!

@emoLeader

Hello, I have the same problem. Have you managed to solve it?
