Some errors when testing Fluid on MNIST tuning #3

Open
xyzustc opened this issue Jan 8, 2022 · 0 comments
xyzustc commented Jan 8, 2022

I am trying to write code that tunes MNIST to observe the performance boost provided by Fluid, but I got some errors.

My env

An 8-GPU local server. Python 3.7.11. Ray 0.8.5.
I am not using the pip package. I am using the Fluid implementation from the latest repo, i.e. the repo state after this commit:

commit bc59400c61da7e6fde3cac29ddfe40a718795a58
Author: Peifeng Yu <[email protected]>
Date:   Fri Jan 7 19:55:44 2022 -0500

    Log for debugging CI

RUN and ERROR info

I am in Fluid/workloads now. I run cp -r ../fluid ./rfluid to avoid ambiguity when importing.
I run tune_fluid_mnist.py with python tune_fluid_mnist.py -l
(This tune_fluid_mnist.py file is based on Fluid/workloads/tune_fluid_mnist.py from this repo, but with the imports changed and a different Executor used.)

I got errors like this: all_error_output_info.txt
Here are the traceback parts of the output:

[2022-01-08 21:32:01,130][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44eb4392, currently running ones are []
[2022-01-08 21:32:01,138][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44eb4392: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
    runner = self._setup_remote_runner(trial, res, reuse_allowed)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
    return cls.remote(**kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
    extension_data=str(actor_method_cpu))
  File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:03,241][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ea46f4, currently running ones are []
[2022-01-08 21:32:03,251][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ea46f4: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
    runner = self._setup_remote_runner(trial, res, reuse_allowed)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
    return cls.remote(**kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
    extension_data=str(actor_method_cpu))
  File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:05,259][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ebbf8e, currently running ones are []
[2022-01-08 21:32:05,265][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ebbf8e: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
    runner = self._setup_remote_runner(trial, res, reuse_allowed)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
    return cls.remote(**kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
    extension_data=str(actor_method_cpu))
  File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
Traceback (most recent call last):
  File "tune_fluid_mnist.py", line 80, in <module>
    main()
  File "tune_fluid_mnist.py", line 71, in main
    analysis = tune.run(MyTrainable, **params)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/tune.py", line 326, in run
    runner.step()
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 333, in step
    self.trial_executor.on_step_begin(self)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 693, in on_step_begin
    self._update_avail_resources()
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 753, in _update_avail_resources
    ), "Cluster removed resources from running trials!"
AssertionError: Cluster removed resources from running trials!

Thank you very much for your reply!
I am also trying to understand these errors.
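
For reference, here is a minimal sketch of the rule stated in the ValueError message above ("Resource quantities >1 must be whole numbers."). It is plain Python and does not call Ray. The helper name is_valid_quantity and the example request values are hypothetical, since the logs do not show which resource values were actually passed to cls.remote(**kwargs). It models only this one rule, not any other checks Ray may perform.

```python
def is_valid_quantity(value):
    """Hypothetical helper mirroring the rule in the error message:
    fractional amounts are accepted up to 1, larger amounts must be whole."""
    return value <= 1 or float(value).is_integer()


# Hypothetical resource requests, for illustration only.
requests = {"GPU": 0.5, "CPU": 2, "GPU_split": 1.5}

# Collect the entries that would violate the rule.
rejected = {name: v for name, v in requests.items() if not is_valid_quantity(v)}
print(rejected)  # {'GPU_split': 1.5}
```

Under that assumption, a request such as 1.5 GPUs for a single runner would be rejected, while 0.5 or 2 would pass this particular check.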
