Benchmarking Performance Different Than Reported #3

jiadingfang · 2023-07-20T22:33:16Z

Hi, thanks for such an all-around repo for working with 3DSG planning!

I would like to reproduce the benchmarking results under the benchmark folder of your repo to make sure everything runs properly before testing my own planners. However, in my testing, the planners behave quite differently from what is reported.

As of 07/20/2023, I ran all available planners in pddlgym_planners/__init__.py with pddl_domain taskographyv2tiny1 with the command python scripts/benchmark/plan.py --domain-name $DOMAIN_NAME --planner $PLANNER. The results are the following:

FF: error while running

FF-X: the same error as FF
FD-lama-first: plan failure

Cerberus-seq-sat: plan failure

Cerberus-seq-agl: plan failure

DecStar-agl-decoupled: plan failure

lapkt-bfws: slightly different behavior than benchmark/taskographyv2tiny1_bfws. My result:

reported in benchmark/taskographyv2tiny1_bfws/taskographyv2tiny1_bfws_test.json:

FD-seq-opt-lmcut: plan failure

Delfi: plan failure

DecStar-opt-decoupled: plan failure
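For anyone repeating these runs, a minimal sketch of how the per-planner invocations above could be scripted. The planner names and the command line are taken from this comment; the helper names (build_command, run_all) are hypothetical and not part of the repo, and the script path assumes the working directory is the repository root.

```python
import subprocess
import sys

# Planner names exactly as listed in this comment.
PLANNERS = [
    "FF", "FF-X", "FD-lama-first", "Cerberus-seq-sat", "Cerberus-seq-agl",
    "DecStar-agl-decoupled", "lapkt-bfws", "FD-seq-opt-lmcut", "Delfi",
    "DecStar-opt-decoupled",
]


def build_command(domain_name, planner):
    """Build the plan.py command line quoted above (hypothetical helper)."""
    return [
        sys.executable, "scripts/benchmark/plan.py",
        "--domain-name", domain_name,
        "--planner", planner,
    ]


def run_all(domain_name, planners=PLANNERS):
    """Run every planner in turn and return {planner: exit code}."""
    exit_codes = {}
    for planner in planners:
        completed = subprocess.run(build_command(domain_name, planner))
        exit_codes[planner] = completed.returncode
    return exit_codes


if __name__ == "__main__":
    # Dry run: only print the commands. Call run_all(...) to execute them.
    for planner in PLANNERS:
        print(" ".join(build_command("taskographyv2tiny1", planner)))
```

Note that an exit code of 0 only means the script finished; whether plans were found still has to be read from the printed statistics.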

I followed the installation stated in the https://github.com/taskography/taskography-api#installation with only a few changes to fix some errors:
0. Ubuntu 22.04.
1. Create an empty conda env with python=3.7.
2. Add a comma at the end of line 26 of taskography-api/setup.py (at commit bcb47fc), after "tqdm", so that it is separated from the next line.
3. Run pip install -e . and pip install -r requirements.txt.
4. Downgrade importlib-metadata from 6.7.0 to 4.12.0 to avoid the error 'EntryPoints' object has no attribute 'get'. Source: https://stackoverflow.com/questions/73929564/entrypoints-object-has-no-attribute-get-digital-ocean
5. Move from __future__ import annotations to the first line of the affected file to avoid the error "from __future__ imports must occur at the beginning of the file". Source: https://stackoverflow.com/questions/38688504/from-future-imports-must-occur-at-the-beginning-of-the-file-what-defines
6. Run scripts/validate/loader.py and scripts/validate/taskography_env.py; both pass.
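As a quick check for the importlib-metadata step, here is a small sketch that reports whether the installed version is newer than the 4.12.0 that worked for me. The helper names are made up for illustration, and the version parsing assumes plain numeric versions such as 6.7.0.

```python
# Python 3.8+; on 3.7 the importlib_metadata backport offers the same calls.
from importlib import metadata

KNOWN_GOOD = "4.12.0"  # version reported as working in the steps above


def parse_version(text):
    """Turn '6.7.0' into (6, 7, 0); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer_than_known_good(version, known_good=KNOWN_GOOD):
    """True when `version` is strictly newer than the known-good pin."""
    return parse_version(version) > parse_version(known_good)


if __name__ == "__main__":
    found = installed_version("importlib-metadata")
    if found is None:
        print("importlib-metadata is not installed as a separate package")
    elif is_newer_than_known_good(found):
        print(f"importlib-metadata {found} is newer than {KNOWN_GOOD}; "
              "the EntryPoints error may appear")
    else:
        print(f"importlib-metadata {found} should be fine")
```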

I'm happy to provide more details if needed. I would highly appreciate any help, since a solid benchmark is a prerequisite for any future research. Thanks in advance!

krrish94 · 2023-07-21T01:53:08Z

I think the planners did not compile at all, which is why you get nans nearly everywhere.

Tagging this as an install issue, and looping in @agiachris @Khodeir

krrish94 · 2023-07-21T02:27:55Z

I looked into this further -- things seem fine on my end.

Install dependencies + version info

I started out with a fresh conda environment (python=3.10), on Ubuntu 20.04
I installed all dependencies from the current requirements.txt file in taskography-api -- with two changes. I had to comment out the gym==0.21.0 requirement here and here
Added the missing comma here -- thanks for this catch!
I ran the same set of commands raised in the issue above, but they executed correctly on my end.

C, CXX compiler versions used

-- The C compiler identification is GNU 8.4.0
-- The CXX compiler identification is GNU 8.4.0
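To compare toolchains between machines, a small sketch like the following prints the version line of each build tool found on PATH. The tool names (gcc, g++, cmake, make) are an assumption about what the planner builds use, and tool_version is a hypothetical helper, not something in the repo.

```python
import shutil
import subprocess


def tool_version(tool):
    """Return the first line of `<tool> --version`, or None if it is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Most tools print the version on stdout; fall back to stderr otherwise.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    # Assumed tool names; adjust to whatever the planner builds invoke.
    for tool in ("gcc", "g++", "cmake", "make"):
        print(f"{tool}: {tool_version(tool) or 'not found'}")
```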

Run planners

python scripts/benchmark/plan.py --domain-name taskographyv2tiny1 --planner FF

{'failure_rate': 0.0,
 'num_node_expansions': 18.65,
 'num_node_expansions_std': 4.514144437210666,
 'plan_length': 15.225,
 'plan_length_std': 4.09565318355937,
 'search_time': 0.059750000000000004,
 'search_time_std': 0.05884248040319171,
 'success_rate': 1.0,
 'timeout_rate': 0.0,
 'total_time': 0.059750000000000004,
 'total_time_std': 0.05884248040319171}

FF-X

{'failure_rate': 0.0,
 'num_node_expansions': 18.65,
 'num_node_expansions_std': 4.514144437210666,
 'plan_length': 15.225,
 'plan_length_std': 4.09565318355937,
 'search_time': 0.04825,
 'search_time_std': 0.0485225463058154,
 'success_rate': 1.0,
 'timeout_rate': 0.0,
 'total_time': 0.04825,
 'total_time_std': 0.0485225463058154}

FD-lama-first

{'failure_rate': 0.0,
 'num_node_expansions': 19.5625,
 'num_node_expansions_std': 4.022883760438524,
 'plan_length': 14.84375,
 'plan_length_std': 3.841462734102727,
 'search_time': 0.0116954025,
 'search_time_std': 0.00940676205541192,
 'success_rate': 0.8,
 'timeout_rate': 0.2,
 'total_time': 0.49729570312500004,
 'total_time_std': 0.3234760420509391}

Cerberus-seq-sat

{'failure_rate': 0.7,
 'num_node_expansions': 0.0,
 'num_node_expansions_std': 0.0,
 'plan_length': 13.0,
 'plan_length_std': 0.0,
 'search_time': 0.708582,
 'search_time_std': 0.0,
 'success_rate': 0.025,
 'timeout_rate': 0.275,
 'total_time': 0.841963,
 'total_time_std': 0.0}

Cerberus-seq-agl (on the train split -- all problems time out; on the test split, a couple valid plans are found)

{'failure_rate': 0.9272727272727272,
 'num_node_expansions': 24.666666666666668,
 'num_node_expansions_std': 0.9428090415820634,
 'plan_length': 13.0,
 'plan_length_std': 0.0,
 'search_time': 0.3534086666666667,
 'search_time_std': 0.01469978123042048,
 'success_rate': 0.05454545454545454,
 'timeout_rate': 0.01818181818181818,
 'total_time': 0.6266036666666667,
 'total_time_std': 0.028384116501232768}

DecStar-agl-decoupled

{'failure_rate': 0.925,
 'num_node_expansions': 18.0,
 'num_node_expansions_std': 0.0,
 'plan_length': 15.0,
 'plan_length_std': 0.0,
 'search_time': 0.09,
 'search_time_std': 0.0,
 'success_rate': 0.025,
 'timeout_rate': 0.05,
 'total_time': 0.26,
 'total_time_std': 0.0}

FD-seq-opt-lmcut

{'failure_rate': 0.0,
 'num_node_expansions': 109.76666666666667,
 'num_node_expansions_std': 208.1750358605843,
 'plan_length': 14.366666666666667,
 'plan_length_std': 3.3910011632096047,
 'search_time': 0.44586308,
 'search_time_std': 0.463168532250191,
 'success_rate': 0.75,
 'timeout_rate': 0.25,
 'total_time': 0.58704412,
 'total_time_std': 0.49714965928707583}

Delfi

{'failure_rate': 0.0,
 'num_node_expansions': 55.166666666666664,
 'num_node_expansions_std': 33.69182228507222,
 'plan_length': 14.416666666666666,
 'plan_length_std': 3.882832585740581,
 'search_time': 0.03140590499999999,
 'search_time_std': 0.020406450112085275,
 'success_rate': 0.3,
 'timeout_rate': 0.7,
 'total_time': 0.27425427499999994,
 'total_time_std': 0.5906703315447319}

Compilation issue

Returning to the issue in question, I think it'd be worth looking into compiler changes across Ubuntu 20.04 and 22.04, and whether they are causing the FF and FD variants to not compile on your end. One other difference I notice is that I used python 3.10 (although I'm not sure that's what's causing the FF, FD planners to not compile on your end).

jiadingfang · 2023-07-26T23:20:43Z

Thanks for the fast and detailed response, really appreciate it!

I followed your advice, starting from a fresh ubuntu:20.04 env and performing the installation again. This time, more planners work (yay!), but a few still report nan failures. The detailed test results are listed at the bottom.

To facilitate reproducibility, I used docker to produce the results. I made a Dockerfile and a run script, as in my PR. You can run all my tests with python scripts/benchmark/test.py once you are inside the interactive docker container started by the docker.sh script.

I hope we can work together on a good Dockerfile so that other people can use this more easily, because it is such a good benchmark. I will also contribute a docker image once we fix the dependency issues and pass all tests.

FF-X

{
    "failure_rate": 0.0,
    "num_node_expansions": 20.472727272727273,
    "num_node_expansions_std": 6.452214688009482,
    "plan_length": 16.70909090909091,
    "plan_length_std": 5.826811781513976,
    "search_time": 0.08345454545454543,
    "search_time_std": 0.0829066985300767,
    "success_rate": 1.0,
    "timeout_rate": 0.0,
    "total_time": 0.08345454545454543,
    "total_time_std": 0.0829066985300767
}

FF

{
    "failure_rate": 0.0,
    "num_node_expansions": 20.472727272727273,
    "num_node_expansions_std": 6.452214688009482,
    "plan_length": 16.70909090909091,
    "plan_length_std": 5.826811781513976,
    "search_time": 0.09854545454545455,
    "search_time_std": 0.09634064905924718,
    "success_rate": 1.0,
    "timeout_rate": 0.0,
    "total_time": 0.09854545454545455,
    "total_time_std": 0.09634064905924718
}

FD-seq-opt-lmcut

{
    "failure_rate": 0.0,
    "num_node_expansions": 111.58333333333333,
    "num_node_expansions_std": 112.63988217126095,
    "plan_length": 14.555555555555555,
    "plan_length_std": 3.345348714961022,
    "search_time": 0.7273536333333332,
    "search_time_std": 0.8763859731543729,
    "success_rate": 0.6545454545454545,
    "timeout_rate": 0.34545454545454546,
    "total_time": 0.8770647888888887,
    "total_time_std": 0.9133111112508425
}

FD-lama-first

{
    "failure_rate": 0.0,
    "num_node_expansions": 20.25,
    "num_node_expansions_std": 4.247057805116384,
    "plan_length": 15.375,
    "plan_length_std": 4.090767042988393,
    "search_time": 0.01569877825,
    "search_time_std": 0.01445009488748809,
    "success_rate": 0.7272727272727273,
    "timeout_rate": 0.2727272727272727,
    "total_time": 0.5665527499999999,
    "total_time_std": 0.43317077262043835
}

Delfi

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

DecStar-opt-decoupled

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

DecStar-agl-decoupled

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

Cerberus-seq-sat

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

Cerberus-seq-agl

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

lapkt-bfws

{
    "failure_rate": 1.0,
    "num_node_expansions": NaN,
    "num_node_expansions_std": NaN,
    "plan_length": NaN,
    "plan_length_std": NaN,
    "search_time": NaN,
    "search_time_std": NaN,
    "success_rate": 0.0,
    "timeout_rate": 0.0,
    "total_time": NaN,
    "total_time_std": NaN
}

I notice that the cases with 'nan' failures are always accompanied by the following warning:

[PlannerHandler] Fetching planner Cerberus-seq-sat
Instantiating Cerberus with --alias seq-sat-cerberus2018
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:21<00:00,  1.87it/s]
scripts/benchmark/plan.py:70: RuntimeWarning: Mean of empty slice.
  stat_mean = float(planner_stats[stat].mean().item())
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:269: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:261: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
{'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
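The "Mean of empty slice" warning together with success_rate 0.0 is consistent with the statistics being averaged only over solved problems, so that no successes leave nothing to average. I have not confirmed that this is how plan.py computes them; the sketch below only illustrates the mechanism with the standard library, using a hypothetical summarize helper and made-up per-problem outcomes (None marks a problem with no plan).

```python
import math
import statistics


def summarize(plan_lengths):
    """Summarize per-problem outcomes; assumes a non-empty list.

    Each entry is a plan length, or None when no plan was found.
    """
    solved = [length for length in plan_lengths if length is not None]
    success_rate = len(solved) / len(plan_lengths)
    if not solved:
        # Nothing to average: report nan explicitly instead of dividing by zero.
        return {
            "success_rate": success_rate,
            "plan_length": math.nan,
            "plan_length_std": math.nan,
        }
    return {
        "success_rate": success_rate,
        "plan_length": statistics.mean(solved),
        # Population standard deviation, so a single success yields 0.0.
        "plan_length_std": statistics.pstdev(solved),
    }


if __name__ == "__main__":
    print(summarize([None] * 40))         # no problem solved
    print(summarize([13] + [None] * 39))  # exactly one problem solved
```

Under this reading, the nan entries do not carry information of their own: the thing to debug is why every run of these planners counts as a failure.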

I'm glad to provide more details if necessary!

emoLeader · 2024-05-05T13:37:39Z

Hello, I have the same problem. Have you managed to solve it?

krrish94 assigned krrish94, Khodeir and agiachris and unassigned krrish94 Jul 21, 2023

krrish94 added the installation label Jul 21, 2023