[HPU] llm-full-mp-gpus #299

Open
Delaunay opened this issue Oct 4, 2024 · 2 comments

Comments
Delaunay (Collaborator) commented on Oct 4, 2024

PyTorch version too old for the fused optimizer:

llm-full-mp-gpus.0 [stderr] [rank0]: Traceback (most recent call last):
llm-full-mp-gpus.0 [stderr] [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 645, in <module>
llm-full-mp-gpus.0 [stderr] [rank0]:     sys.exit(recipe_main())
llm-full-mp-gpus.0 [stderr] [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchtune/config/_parse.py", line 50, in wrapper
llm-full-mp-gpus.0 [stderr] [rank0]:     sys.exit(recipe_main(conf))
llm-full-mp-gpus.0 [stderr] [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 632, in recipe_main
llm-full-mp-gpus.0 [stderr] [rank0]:     recipe = FullFinetuneRecipeDistributed(cfg=cfg)
llm-full-mp-gpus.0 [stderr] [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 116, in __init__
llm-full-mp-gpus.0 [stderr] [rank0]:     raise RuntimeError(
llm-full-mp-gpus.0 [stderr] [rank0]: RuntimeError: Using fused optimizer on CPU is only supported in PyTorch nightly.
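The check that raises here appears to guard against requesting a fused optimizer step on CPU (which fsdp_cpu_offload implies) with a PyTorch build that does not support it. Below is a minimal version-gated fallback sketch; the 2.4.0 cutoff, the helper name, and the AdamW hyperparameters are illustrative assumptions, not taken from the recipe.

# Sketch: only request a fused AdamW when the installed PyTorch build
# supports fused optimizer steps on CPU (relevant with fsdp_cpu_offload).
# The 2.4.0 threshold is an assumption; check your PyTorch release notes.
import torch
from packaging.version import parse


def fused_cpu_optimizer_supported() -> bool:
    # Strip any local build suffix, e.g. "2.3.1+hpu" -> "2.3.1".
    version = parse(torch.__version__.split("+")[0])
    return version.is_devrelease or version >= parse("2.4.0")


model = torch.nn.Linear(8, 8)  # stand-in for the fine-tuned model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,
    fused=fused_cpu_optimizer_supported(),  # fall back to the non-fused path otherwise
)
print(type(optimizer), optimizer.defaults["fused"])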
Delaunay added the HPU label on Oct 4, 2024
Delaunay (Collaborator, Author) commented on Oct 4, 2024

--- a/benchmarks/llm/configs/llama3_70B_full.yaml
+++ b/benchmarks/llm/configs/llama3_70B_full.yaml
@@ -94,9 +94,9 @@ gradient_accumulation_steps: 1
 device: cuda
 
 # Memory management
-enable_activation_checkpointing: True
-memory_efficient_fsdp_wrap: True
-fsdp_cpu_offload: True
+enable_activation_checkpointing: false
+memory_efficient_fsdp_wrap: false
+fsdp_cpu_offload: false
 
 # Reduced precision
 dtype: bf16
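
For completeness, the same three overrides can be applied programmatically through OmegaConf (the loader torchtune uses for its YAML configs) instead of editing the file by hand. This is an illustrative sketch only; the relative path assumes it is run from a milabench checkout.

# Sketch: apply the overrides from the diff above without hand-editing the YAML.
from omegaconf import OmegaConf

cfg_path = "benchmarks/llm/configs/llama3_70B_full.yaml"
cfg = OmegaConf.load(cfg_path)

cfg.enable_activation_checkpointing = False
cfg.memory_efficient_fsdp_wrap = False
cfg.fsdp_cpu_offload = False  # avoids the fused-optimizer-on-CPU path entirely

OmegaConf.save(cfg, cfg_path)
print(OmegaConf.to_yaml(cfg))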
	* 1 x [rank0]: ValueError: Inconsistent compute device and `device_id` on rank 0: hpu:0 vs hpu
    	| [rank0]: Traceback (most recent call last):
    	| [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 645, in <module>
    	| [rank0]: 	sys.exit(recipe_main())
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    	| [rank0]: 	sys.exit(recipe_main(conf))
    	| [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 633, in recipe_main
    	| [rank0]: 	recipe.setup(cfg=cfg)
    	| [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 213, in setup
    	| [rank0]: 	self._model = self._setup_model(
    	| [rank0]:   File "/homes/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 323, in _setup_model
    	| [rank0]: 	model = FSDP(
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/gpu_migration/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 102, in __init__
    	| [rank0]: 	return FullyShardedDataParallel.call_parent_func(
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/gpu_migration/core/register.py", line 158, in call_parent_func
    	| [rank0]: 	return func(*args, **kwargs)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
    	| [rank0]: 	_auto_wrap(
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 72, in _auto_wrap
    	| [rank0]: 	_post_order_apply(root_module, wrap_fn)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 79, in _post_order_apply
    	| [rank0]: 	_post_order_apply_inner(root_module, "", None)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 63, in _post_order_apply_inner
    	| [rank0]: 	_post_order_apply_inner(child_module, child_module_name, module)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 63, in _post_order_apply_inner
    	| [rank0]: 	_post_order_apply_inner(child_module, child_module_name, module)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 64, in _post_order_apply_inner
    	| [rank0]: 	optional_module = fn(module)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 98, in fn
    	| [rank0]: 	return fsdp_fn(module, **kwargs)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/gpu_migration/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 102, in __init__
    	| [rank0]: 	return FullyShardedDataParallel.call_parent_func(
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/gpu_migration/core/register.py", line 158, in call_parent_func
    	| [rank0]: 	return func(*args, **kwargs)
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
    	| [rank0]: 	_init_param_handle_from_module(
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 582, in _init_param_handle_from_module
    	| [rank0]: 	state.compute_device = _get_compute_device(
    	| [rank0]:   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1045, in _get_compute_device
    	| [rank0]: 	raise ValueError(
    	| [rank0]: ValueError: Inconsistent compute device and `device_id` on rank 0: hpu:0 vs hpu
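
The ValueError comes from FSDP comparing the compute device it derives from the module parameters (hpu:0) with the device_id it was handed (a bare hpu). The sketch below only illustrates that mismatch; it is not taken from the recipe, and the environment-variable and keyword names are the usual torchrun/FSDP conventions rather than anything verified against this benchmark.

# Sketch of the mismatch behind the ValueError: a bare device type ("hpu")
# does not compare equal to the indexed compute device ("hpu:0") that FSDP
# derives from the module parameters. Indexing with the local rank is the
# usual remedy.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

bare_device = torch.device("hpu")                 # what was passed: "hpu"
indexed_device = torch.device("hpu", local_rank)  # what FSDP computes: "hpu:0"

print(bare_device == indexed_device)  # False -> FSDP raises the ValueError
# Inside the recipe, the indexed device would be the one handed to FSDP, e.g.
#   FSDP(model, device_id=indexed_device, ...)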

Delaunay changed the title from "llm-full-mp-gpus on HPU" to "[HPU] llm-full-mp-gpus" on Oct 15, 2024
Delaunay (Collaborator, Author) commented

The run dies after 15/30 (50% of the bench):


llm-full-mp-gpus.0 [data] {'cpudata': {'load': 6.8, 'memory': [217340166144, 1081801142272]}, 'task': 'main', 'time': 1729186143.3196068}
llm-full-mp-gpus.0 [stderr] Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 4) 10:28:55 [No progress error]
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.898000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687945 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687946 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687947 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687948 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687949 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687951 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:15.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3687952 closing signal SIGTERM
llm-full-mp-gpus.0 [stderr] W1017 10:29:45.899000 140525722707968 torch/distributed/elastic/multiprocessing/api.py:868] Unable to shutdown process 3687948 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
