I am trying to write code that tunes MNIST to observe the performance boost provided by Fluid, but I got some errors.
My env
An 8-GPU local server. Python 3.7.11. ray 0.8.5.
I am not using the pip package, but the Fluid implementation from the newest repo, i.e. the repo state after this commit:
commit bc59400c61da7e6fde3cac29ddfe40a718795a58
Author: Peifeng Yu <[email protected]>
Date: Fri Jan 7 19:55:44 2022 -0500
Log for debugging CI
RUN and ERROR info
I am in Fluid/workloads now. I run cp -r ../fluid ./rfluid to avoid ambiguity when importing.
I run this tune_fluid_minist.py using python tune_fluid_minist.py -l
(This tune_fluid_minist.py file is based on Fluid/workloads/tune_fluid_minist.py of this repo, but with the imports changed and a different Executor used.)
[2022-01-08 21:32:01,130][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44eb4392, currently running ones are []
[2022-01-08 21:32:01,138][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44eb4392: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
runner = self._setup_remote_runner(trial, res, reuse_allowed)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
return cls.remote(**kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
return self._remote(args=args, kwargs=kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
extension_data=str(actor_method_cpu))
File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:03,241][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ea46f4, currently running ones are []
[2022-01-08 21:32:03,251][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ea46f4: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
runner = self._setup_remote_runner(trial, res, reuse_allowed)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
return cls.remote(**kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
return self._remote(args=args, kwargs=kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
extension_data=str(actor_method_cpu))
File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:05,259][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ebbf8e, currently running ones are []
[2022-01-08 21:32:05,265][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ebbf8e: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
runner = self._setup_remote_runner(trial, res, reuse_allowed)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
return cls.remote(**kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
return self._remote(args=args, kwargs=kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
extension_data=str(actor_method_cpu))
File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
Traceback (most recent call last):
File "tune_fluid_mnist.py", line 80, in <module>
main()
File "tune_fluid_mnist.py", line 71, in main
analysis = tune.run(MyTrainable, **params)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/tune.py", line 326, in run
runner.step()
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 333, in step
self.trial_executor.on_step_begin(self)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 693, in on_step_begin
self._update_avail_resources()
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 753, in _update_avail_resources
), "Cluster removed resources from running trials!"
AssertionError: Cluster removed resources from running trials!
Thank you very much for your reply!
I am also trying to understand these errors.
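Going only by the text of the ValueError above, the rule seems to be that a fractional resource quantity is accepted only when it is at most 1, and anything larger must be a whole number. Below is a minimal sketch of that rule as a standalone check. The helper names (check_resource_quantity, find_invalid) are hypothetical and this is not Ray's actual code; it is only meant to help spot which entry in a resources dict would trip the error before it reaches cls.remote(**kwargs).

```python
from typing import Dict, List


def check_resource_quantity(name: str, quantity: float) -> None:
    """Raise ValueError if `quantity` breaks the rule stated in the error message.

    Assumption: fractional values are fine up to 1 (e.g. 0.5 GPU),
    but values above 1 must be whole numbers (e.g. 2, not 1.5).
    """
    if quantity < 0:
        raise ValueError(f"{name}: resource quantity must be non-negative")
    if quantity > 1 and quantity != int(quantity):
        raise ValueError(f"{name}: resource quantities >1 must be whole numbers")


def find_invalid(resources: Dict[str, float]) -> List[str]:
    """Return the names of resources whose quantity fails the check."""
    invalid = []
    for name, quantity in resources.items():
        try:
            check_resource_quantity(name, quantity)
        except ValueError:
            invalid.append(name)
    return invalid


if __name__ == "__main__":
    # Hypothetical request: 1.5 GPUs is fractional and above 1.
    print(find_invalid({"CPU": 1, "GPU": 1.5}))
```

Running a check like this on the resources computed for each trial might show whether a fractional GPU share above 1 is being requested.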
The full error output is attached as all_error_output_info.txt; the traceback parts of that output are the ones reproduced above.