Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GdbServer: Implement new netstream that can be interrupted #4223

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Sonicadvance1
Copy link
Member

@Sonicadvance1 Sonicadvance1 commented Dec 18, 2024

A major limitation of iostream is that you can't have reads or writes
with a safe interrupt. Instead rewrite the interface with Linux ppoll so
that these can be safely interrupted with a signal and return early.

@neobrain
Copy link
Member

Not sure timeouts are the way to go here, but I may be swayed by seeing a PR for the corresponding gdb stub changes.

We briefly talked about this but it would still be good to write down the actual scenario that can trigger problems otherwise.

@Sonicadvance1 Sonicadvance1 changed the title GdbServer: Implement new netstream that can have timeouts GdbServer: Implement new netstream that can be interrupted Jan 8, 2025
@Sonicadvance1
Copy link
Member Author

Not sure timeouts are the way to go here, but I may be swayed by seeing a PR for the corresponding gdb stub changes.

We briefly talked about this but it would still be good to write down the actual scenario that can trigger problems otherwise.

Instead of using a timeout, I have removed the timeout capability and left the remaining ppoll implementation. This allows us to safely get interrupted with a signal, returning EINTR, and passing that up as no data.
This will let me instead signal to the gdbserver thread that a thread has some messages to be sent.

Copy link
Member

@neobrain neobrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some preliminary comments.

This allows us to safely get interrupted with a signal, returning EINTR, and passing that up as no data.

Is the occurrence of a signal here a future implementation choice or something that somehow happens during regular GDB use? Do we need any special code to handle this signal or do send/recv simply return an error when a signal fires?

As before, could you sketch out the actual scenario that triggers problems if we don't support interruptible netstreams?

Comment on lines 17 to 32
struct ReturnGet {
char data;
bool Hangup;
};
std::optional<ReturnGet> get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better modeled using an std::variant with a monostate:

struct ReturnGet : std::variant<...> {
    bool ShouldHangup() { ... }
    ...
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to excuse me for not being as good at C++ as you, I don't know how to do what you want here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess something like I pushed? I've never used std::variant like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks good.

Now that the code is easier to follow, I'm noticing this however: Is there any difference between the {true} and {false} configurations? {false} indicates no a Hangup was received but no data can be received either, which effectively means we can't progress. Is there any benefit in handling this differently than a hangup?

If there's no difference, the entire return value can actually be condensed down to just a std::optional<char>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false means there was no data and no hangup, meaning it was interrupted.

Source/Tools/LinuxEmulation/LinuxSyscalls/NetStream.cpp Outdated Show resolved Hide resolved
Source/Tools/LinuxEmulation/LinuxSyscalls/GdbServer.h Outdated Show resolved Hide resolved
Source/Tools/LinuxEmulation/LinuxSyscalls/GdbServer.cpp Outdated Show resolved Hide resolved
A major limitation of iostream is that you can't have reads or writes
with a safe interrupt. Instead rewrite the interface with Linux ppoll so
that these can be safely interrupted with a signal and return early.
@Sonicadvance1
Copy link
Member Author

Added some preliminary comments.

This allows us to safely get interrupted with a signal, returning EINTR, and passing that up as no data.

Is the occurrence of a signal here a future implementation choice or something that somehow happens during regular GDB use? Do we need any special code to handle this signal or do send/recv simply return an error when a signal fires?

As before, could you sketch out the actual scenario that triggers problems if we don't support interruptible netstreams?

The signal is a future implementation choice, because we need to send packets from different threads in a sorted fashion and only when gdbserver is in a "running" state. So threads will queue packets in fifo, and signal the gdbserver thread.

Netstream doesn't need anything special to handle the signal in this case, because this will cause ppoll to early exit with EINTR, and then we can start consuming the packet queue.

If we can't interrupt the netstream then the gdbserver just...can't handle queued packets because it will forever be waiting to receive with an infinite timeout? And we can't be sending the packets from other threads because of strict ordering guarantees that gdb actually requires, which is one of the reasons why gdbserver in FEX has historically been a buggy mess.

char escaped;
stream >> escaped;
packet.push_back(escaped ^ 0x20);
auto escaped = CommsStream.get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a potential race condition here? An interrupt exactly after } would be caught here but silently be skipped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it will just permanently hang trying to receive more data, won't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless a hangup occurs, similar to previous behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix it? Not sure it makes sense to add the capability for interruptions and then only use it half of the time...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no fix in this case. We need to wait until gdb has finished sending the message, otherwise the command stream is borked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a scenario that can happen in practice or must other FEX code somehow ensure no interruptions in this specific part?

How do you currently plan to deal with it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gdb is likely to have already sent the message and its in our socket's buffer. The extreme case will be gdb has crashed which we notice as a hangup and early terminate.

When interrupted, it will continue waiting for its next message from gdb and handle whatever the interruption was about afterwards. There's nothing more to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing it, which I think relates back to my unanswered questions at the end of #4223 (comment) . I need to see an actual rundown of the problem scenario(s) you're addressing. It's unclear where interruptions happen and why, and which elements of this PR are mandated by GDB and which ones are pending on implementation choices.

Normally I just wouldn't expect async-aware code to be able to enter a permanently hanging state. It looks like a design flaw in the current implementation, but I can't pin it down more specifically without more context.

break;
}
case '#': // end of packet
{
char hexString[3] = {0, 0, 0};
stream.read(hexString, 2);
CommsStream.read(hexString, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the above, this doesn't seem to handle hangup events properly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Source/Tools/LinuxEmulation/LinuxSyscalls/NetStream.cpp Outdated Show resolved Hide resolved
Comment on lines 17 to 32
struct ReturnGet {
char data;
bool Hangup;
};
std::optional<ReturnGet> get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks good.

Now that the code is easier to follow, I'm noticing this however: Is there any difference between the {true} and {false} configurations? {false} indicates no a Hangup was received but no data can be received either, which effectively means we can't progress. Is there any benefit in handling this differently than a hangup?

If there's no difference, the entire return value can actually be condensed down to just a std::optional<char>.

@neobrain
Copy link
Member

The signal is a future implementation choice, because we need to send packets from different threads in a sorted fashion and only when gdbserver is in a "running" state. So threads will queue packets in fifo, and signal the gdbserver thread.

Considering we're already using poll, what's the benefit of using a signal over adding e.g. an eventfd to the fd list?

Netstream doesn't need anything special to handle the signal in this case, because this will cause ppoll to early exit with EINTR, and then we can start consuming the packet queue.

What happens if the interrupt is sent while outside of poll?

If we can't interrupt the netstream then the gdbserver just...can't handle queued packets because it will forever be waiting to receive with an infinite timeout? And we can't be sending the packets from other threads because of strict ordering guarantees that gdb actually requires, which is one of the reasons why gdbserver in FEX has historically been a buggy mess.

There seem to be a lot of assumptions about common knowledge here. What kind of packet queuing would leave us waiting indefinitely? What ordering guarantees does gdb require? I know the answer is probably "lots of guarantees", but it would be good to see literally any concrete example of this. What I'm looking for is something like "guest app is running 2 threads; GDB user presses ctrl+c in gdb shell while simultaneously X is happening on the guest; Y must happen now, but it can't because FEX's GDB code is sending data that the GDB client does not expect".

Copy link
Member

@neobrain neobrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be easier to review if the actual interruption logic were illustrated in a followup PR already. It's difficult to see if signals are indeed the way forward here without that, but at least probably most of this code should apply when using other wakeup mechanisms are used...

}
}

if (c.HasHangup()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The control flow has become weird here (maybe since the latest changes?).

Why do we skip the InvalidateSocket call when we're receiving a Hangup?

Why is there still an outer while (!CoreShuttingDown.load()) { loop? I don't see how that outer loop would ever run more than one iteration, since we reach this line only when a Hangup is received or CoreShuttingDown==true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that was a bug introduced due to the refactoring request to delete an outer loop. Added the loop back because it's required to allow gdb connections to disconnect and reconnect at a later time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's three levels of nesting loops now, two of them on the same variable...

  while (!CoreShuttingDown.load()) {
    // ...

    while (!CoreShuttingDown.load()) {
      // ...
      while ((c = CommsStream.get()).HasData()) {

@Sonicadvance1
Copy link
Member Author

This would be easier to review if the actual interruption logic were illustrated in a followup PR already. It's difficult to see if signals are indeed the way forward here without that, but at least probably most of this code should apply when using other wakeup mechanisms are used...

#4275 There you go.

@neobrain
Copy link
Member

#4275 There you go.

What am I looking at? It's hard to make out the relevant change among 14 commits titled "More gdbserver" and a 1000 lines diff. Is it just the last patch that's relevant or is there more?

Also I was hoping for something that explains your design decisions on using signals (as opposed to other interruption mechanisms), not just a dump of your WIP branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants