CI often fails with "Could not resolve host: github.com" #549
I'm asking around other projects to see if they're seeing this as well.
Haven't seen reports of this elsewhere. @eu9ene - have you seen this on GPU workers only? Or also on the CPU workers?
I'm not sure but I feel like I've been seeing this in random places.
Here's an instance on a GPU worker: https://firefox-ci-tc.services.mozilla.com/tasks/XRYrb2BxTOyJa209jrLQ7A/runs/0/logs/public/logs/live.log
Thanks; so it seems very unlikely to be related to specific worker images. @aerickson - I don't suppose you have any idea what's going on here?
@bhearsum I'm not sure what's going on. Translations GPU workers on GCP should use GCP's DNS servers (provided by DHCP) and the Snakepit workers use our internal Infoblox servers (configured in dnsmasq). It seems like a network blip, or perhaps the DNS server was overloaded for a second? I haven't heard about any GitHub outages around DNS. I've never really heard of DNS outages (it's a pretty resilient service/protocol). If we find a concentrated event or location, let me know and I'll dig in some more.
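In case it helps narrow things down, a quick check on an affected worker could confirm which resolver path it is actually using. This is only a generic sketch, assuming the stock Ubuntu systemd-resolved setup on the GCP workers and dnsmasq on Snakepit; the dnsmasq config paths are the default ones and may differ here:

```bash
# Sketch: confirm which DNS path a worker is actually using.
resolvectl status                 # per-link DNS servers systemd-resolved forwards to
cat /etc/resolv.conf              # on stock Ubuntu this points at the 127.0.0.53 stub
# On the dnsmasq-based workers, check which upstreams dnsmasq is configured with
# (stock config locations assumed):
systemctl is-active --quiet dnsmasq && grep -Rh '^server=' /etc/dnsmasq.conf /etc/dnsmasq.d/ 2>/dev/null
```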
We do see these fairly often - I would say maybe on 5-10% of the tasks run. I'll try to collect some data to help us analyze this better.
Here are the failures by worker group:
And here are the timestamps when we hit the failures:
And by worker image:
Clearly the most notable part here is that we're seeing more issues on GPU images. And within that, we seem to have gotten more beginning in late April/early May. We added a test pool with a new image on April 22nd (that I was running a lot to test things) in https://phabricator.services.mozilla.com/D208202. That image went to production on May 8th in https://phabricator.services.mozilla.com/D209840. mozilla-platform-ops/monopacker#140 was the PR related to this image, but I don't know how deterministic the other parts are. E.g.: could we have picked up a change to a system package that is now causing problems? @aerickson - do you have any thoughts? If we still have the old image, maybe we could poke around and compare the new image to the old one? (I'd be happy to do this if you want.)
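If an instance on the old image is still around, one low-tech comparison might be a straight diff of the installed package lists between an old-image and a new-image worker. This is only a sketch; "old-worker" and "new-worker" are placeholder hostnames:

```bash
# Sketch: diff installed packages between an old-image and a new-image worker.
# "old-worker" and "new-worker" are placeholder hostnames.
ssh old-worker dpkg-query -W | sort > old-packages.txt
ssh new-worker dpkg-query -W | sort > new-packages.txt
# Show only changed lines, then narrow to anything network/DNS adjacent.
diff -u old-packages.txt new-packages.txt | grep -E '^[+-][^+-]' | grep -Ei 'resolv|dns|network|systemd|libc'
```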
Ok, it happens every single pipeline run for me now and does not restart. Marking as blocker. |
…DNS failures This should help make mozilla#549 less painful. I suggest we back it out once we get to the bottom of that.
Poking at this a bit on an interactive instance, too. I sort of repro'd it (it seems to have retried with success though):
Looking at the machine configuration, I see that it uses the standard systemd-resolved that we expect on Ubuntu:
It claims to be timing out talking to 127.0.0.53 - but I don't know if that means it really couldn't talk to the local resolver, or if that's just the local resolver passing along a failure from the upstream. It seems more likely that it's the latter, but I can't say that with any certainty. The upstream server is a reserved address, and I'm guessing it's something internal to GCP? I'm really not sure, to be honest - that's quite out of my depth. I looked through syslogs and found nothing of note, just messages like this every time I performed a lookup:
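One way to separate those two cases might be to query the stub and the upstream directly while the failure is happening. This is a sketch, and 169.254.169.254 (GCP's internal metadata DNS) is my assumption about what that reserved upstream address is:

```bash
# Sketch: when a lookup fails, ask the local stub and the upstream separately
# to see which hop is actually timing out.
dig +tries=1 +time=2 github.com @127.0.0.53        # systemd-resolved stub listener
dig +tries=1 +time=2 github.com @169.254.169.254   # assumed GCP-internal upstream resolver
resolvectl statistics                               # resolved's cache and transaction failure counters
```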
…DNS failures This should help make mozilla#549 less painful. I suggest we back it out once we get to the bottom of that.
I was looking through worker logs of a worker that had a dns issue in production and found other things of interest. In the task we had:
And in the syslogs I found:
(For whatever reason, there seems to be a timestamp discrepancy between the task log and the system logs.) Within a very short time we see:
Maybe that second lookup is expected, maybe it's not - I'm really not sure. The github.com one succeeds. I'm still not sure what to make of this, just dropping more info at the moment.
Another interesting thing is that we made 12 attempts to look up the A record for
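For anyone else poking at worker logs, turning up systemd-resolved's log level should make individual queries and their upstream responses visible in the journal, which might help line up the task-log and syslog timelines. A sketch (the time window is a placeholder):

```bash
# Sketch: capture per-query detail from systemd-resolved around a failure.
sudo resolvectl log-level debug
# Watch live while reproducing a failed lookup:
sudo journalctl -u systemd-resolved -f -o short-precise
# Or pull a window after the fact (placeholder window; adjust for the clock skew):
sudo journalctl -u systemd-resolved --since "-1 hour" -o short-precise | grep -i github.com
```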
Did this go away by itself?
I'm not aware of anything we did to fix it. So if it's not happening anymore...yes!
Clearly we spoke too soon, given #855. Some suggestions for further investigation from the Taskcluster weekly meeting:
It's possible. We don't pin the base image (we accept patch-level updates like 22.04.x) and the scripts run
I had a chance to compare the most recent GPU image to the previous one today. As far as installed packages go, there's a bunch of minor upgrades but nothing that really stands out. The closest things to a network-related package that changed are
Doing some more research on this. systemd/systemd#21123 is one of the first things that comes up. The solution that most people use is to switch to dnsmasq. @aerickson - is that something we can try here? (I can work up a monopacker PR for it - just want to check that it's a viable thing to try before doing so.)
Nice find. Yeah, definitely.
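For reference while that PR gets put together, the shape of the change might look roughly like this. It's only a sketch, not the actual monopacker change, and the upstream server value (GCP's 169.254.169.254 metadata DNS) and config paths are assumptions:

```bash
# Sketch: swap the systemd-resolved stub for a local dnsmasq cache on an Ubuntu image.
mkdir -p /etc/dnsmasq.d
cat > /etc/dnsmasq.d/local-cache.conf <<'EOF'
listen-address=127.0.0.1
bind-interfaces
cache-size=1000
# Upstream resolver; on GCP this is assumed to be the internal metadata DNS.
server=169.254.169.254
EOF
# Install with the config already in place so dnsmasq only binds 127.0.0.1
# and does not collide with resolved's 127.0.0.53 stub.
apt-get update && apt-get install -y dnsmasq
# Hand /etc/resolv.conf over from the resolved stub to dnsmasq.
systemctl disable --now systemd-resolved
rm -f /etc/resolv.conf                      # normally a symlink to stub-resolv.conf
echo "nameserver 127.0.0.1" > /etc/resolv.conf
systemctl enable --now dnsmasq
```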
Hopefully a fix for mozilla/translations#549
#949 is switching CPU tasks to a new image running Ubuntu 24.04. We'll be switching GPU tasks to that in the near future as well. I'm hoping that taking newer versions of systemd-resolved and/or other packages will help with this, but we'll see.
If it's expected that it can fail, we should add retries to all our steps:
Here is an example task: https://firefox-ci-tc.services.mozilla.com/tasks/JdDA-zYDQnG4166Zbsqq6w
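As a stopgap along those lines, a small retry wrapper around the network-touching commands in each step could absorb the transient failures. This is a generic sketch, not tied to how our task definitions actually structure their commands:

```bash
# Sketch: retry a flaky network command instead of failing the task on a
# transient "Could not resolve host" error.
retry() {
  local attempts=5 delay=10 i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    (( i < attempts )) && sleep "$delay"
  done
  echo "giving up after $attempts attempts: $*" >&2
  return 1
}

# Example usage around the kind of command that has been failing:
retry git clone https://github.com/mozilla/translations.git
```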