Enable Intel xpu as a new backend of PyTorch-Lightning #17700
Conversation
```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
else:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
```
How about the XPU case? I think the logic should be:

```python
# https://pytorch.org/docs/stable/notes/cuda.html#id5
ctx = nullcontext()
if self.root_device.type == "cuda" and device_ids is not None:
    ctx = torch.cuda.stream(torch.cuda.Stream())
elif self.root_device.type == "xpu" and device_ids is not None:
    ctx = torch.xpu.stream(torch.xpu.Stream())
with ctx:
    ...
```
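Expanding on that suggestion, here is a minimal, self-contained sketch of device-aware stream selection. The helper name `_stream_context` is illustrative and not part of the PR, and the `torch.xpu` branch assumes a PyTorch build that exposes the XPU module:

```python
from contextlib import nullcontext
from typing import List, Optional

import torch


def _stream_context(root_device: torch.device, device_ids: Optional[List[int]]):
    """Return a side-stream context for the active accelerator, or a no-op.

    Constructing DDP inside a side stream avoids blocking the default stream
    while gradient buckets are allocated (see the CUDA semantics notes above).
    """
    if device_ids is None:
        return nullcontext()
    if root_device.type == "cuda":
        return torch.cuda.stream(torch.cuda.Stream())
    if root_device.type == "xpu" and hasattr(torch, "xpu"):  # assumes XPU-enabled build
        return torch.xpu.stream(torch.xpu.Stream())
    return nullcontext()
```

With such a helper, each of the DDP setup sites flagged in this review could share the same context selection: `with _stream_context(self.root_device, device_ids): return DistributedDataParallel(...)`.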
```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
else:
```
It seems this block is missing the XPU case?
```python
torch.cuda.empty_cache()
with suppress(AttributeError):
    torch.xpu.empty_cache()
```
Before calling empty_cache, how about detecting whether the device is CUDA or XPU first?
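As a sketch of that suggestion, an explicit device check could replace the `suppress(AttributeError)` pattern. The helper name `_empty_device_cache` and its `root_device` parameter are hypothetical:

```python
import torch


def _empty_device_cache(root_device: torch.device) -> None:
    """Release cached allocator memory only on the backend actually in use."""
    if root_device.type == "cuda":
        torch.cuda.empty_cache()
    elif root_device.type == "xpu" and hasattr(torch, "xpu"):
        # Assumes a PyTorch build whose torch.xpu module provides empty_cache().
        torch.xpu.empty_cache()
```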
Will address this in a PR and merge.
```python
torch.cuda.manual_seed_all(seed)
if XPUAccelerator.is_available():
    XPUAccelerator.manual_seed_all(seed)
```
I think it would be better to check the device (CUDA or XPU) first before calling the corresponding manual_seed_all API.
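For reference, a device-checked variant along these lines might look as follows; `root_device` is a hypothetical parameter, and the PR itself relies on `is_available()` guards instead:

```python
import torch


def _seed_accelerator(root_device: torch.device, seed: int) -> None:
    """Seed only the RNG of the backend that is actually in use (sketch)."""
    if root_device.type == "cuda":
        torch.cuda.manual_seed_all(seed)
    elif root_device.type == "xpu" and hasattr(torch, "xpu"):
        # Assumes torch.xpu exposes manual_seed_all(), as recent builds do.
        torch.xpu.manual_seed_all(seed)
```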
Not needed in this case, as the `is_available()` guard already checks.
```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
else:
```
Missing the XPU case here as well.
```python
_check_bad_cuda_fork()
if XPUAccelerator.is_available():
    _check_bad_xpu_fork()
```
It would be better to check the device first before calling the corresponding fork-check API.
Checks are already present; an additional one is not needed.
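For context, a hedged sketch of what a combined fork-safety guard could look like; the internals below are assumptions, not the PR's actual `_check_bad_cuda_fork` / `_check_bad_xpu_fork` implementations:

```python
import torch


def _check_bad_accelerator_fork() -> None:
    """Raise if an accelerator was initialized before forking (sketch only).

    Forking a process after CUDA/XPU initialization leaves the child with a
    broken device context, so fork-based multiprocessing must be rejected.
    """
    if torch.cuda.is_initialized():
        raise RuntimeError("CUDA was initialized before forking; use the 'spawn' start method.")
    if hasattr(torch, "xpu") and torch.xpu.is_initialized():
        raise RuntimeError("XPU was initialized before forking; use the 'spawn' start method.")
```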
```python
else:
    num_xpus = 0
    xpu_available = False
rank_zero_info(f"XPU available: {xpu_available}, using: {num_xpus} XPUs")
```
It seems that num_xpus and xpu_available are local vars; line 193 should report an error here.
They are set in either the if or the else scope.
> They are set in either the if or the else scope.

Yes, but that still relies on branch-local assignment; it might pass, but it is not good programming style.
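A small sketch of the style being asked for: initialize the names before branching so they are defined on every path. The `XPUAccelerator` calls are assumptions based on Lightning's accelerator interface:

```python
from lightning.pytorch.utilities import rank_zero_info

num_xpus = 0
xpu_available = False
if XPUAccelerator.is_available():  # XPUAccelerator as introduced by this PR
    xpu_available = True
    num_xpus = XPUAccelerator.auto_device_count()  # assumed interface method
rank_zero_info(f"XPU available: {xpu_available}, using: {num_xpus} XPUs")
```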
```python
sys.modules["lightning.pytorch.accelerators.xpu"] = self
```

```python
class XPUAccelerator:
```
I don't think it is needed, as XPU has never been part of the codebase yet.
Oh, sure. Let me remove it.
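For background, the `sys.modules` assignment above follows Lightning's `_graveyard` pattern, which keeps imports of relocated or removed modules resolvable. A minimal illustration of the mechanism (not the PR's exact code): since `lightning.pytorch.accelerators.xpu` never shipped, there are no stale imports to redirect, which is why the shim was dropped.

```python
import sys
import types

# Register a stand-in module so that stale imports such as
# `import lightning.pytorch.accelerators.xpu` keep resolving.
def _patch_removed_module(name: str) -> None:
    sys.modules[name] = types.ModuleType(name)
```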
@jingxu10 @abhilash1910 I just wanted to check on the status of XPU integration with Lightning. Is this effort still on the horizon, or has it been abandoned?
Yes, it is still in progress while we work out upstreaming the lightning-xpu module. Since there is a lot of demand, we are trying to expedite it.
Do you have a beta of the lightning-xpu module? Google points to a repo which no longer exists; is it internal to Lightning AI only?
Any progress on this? It would be great to see this implemented. Appreciate the effort!
Yes, there is an ongoing discussion with Intel on this; expect further upstreaming effort soon.
Commit messages:

- update typos and bug fixes
- xpu seeding PR1
- add seeding for pytorch utilities
- mp_fabric xpu forking
- xpu multiprocess pytorch
- add header for xpu
- rename change to lightning.pytorch
- Teardown from lightning-xpu (from PR #3)
- add torch.xpu.stream to ddp
- update docs
- update _LIGHTNING_XPU_AVAILABLE to _lightning_xpu_available
- correct fabric imports.py
- remove xpu.py from _graveyard; correct _lightning_xpu_available() usage
- fix _try_import function not defined issue in fabric
- add docs
- fix circular import issue
- update pytorch trainer connector
- correct usage in multiprocessing
- fix precision device
- update warning format
This PR and the examples in the conversation enable Intel XPU via the official torch.xpu, and I think it will work for IPEX as well.
What does this PR do?
Enable Intel XPU as a new backend of PyTorch Lightning.
Follow-up PR to #16834.
Contributed by [email protected] and [email protected].
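A hedged usage sketch of what the new backend would enable; the `accelerator="xpu"` flag follows Lightning's existing accelerator conventions, but the exact public API is subject to this PR's review:

```python
import lightning.pytorch as pl

# Select the Intel XPU backend the same way as existing accelerators
# (sketch only; assumes this PR has landed).
trainer = pl.Trainer(accelerator="xpu", devices=1)

# model = MyLightningModule()  # your LightningModule
# trainer.fit(model)
```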
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.