b.collector down for 8.5 hours #157
Comments
Once again. Timeline UTC: … The cause is the same: malformed raw reports from TransportCanary/0.0.10-beta. The trigger is the same: letsencrypt cert update. Another batch is stored in …
FTR, once again. Timeline UTC: …
FTR, once again. Timeline UTC: … The stacktrace was …
That was probably caused by libnettest2 (testing? version 0.0.0 does not sound like a release version); xref: ooni/backend#115
FTR, once again. Timeline UTC: … The stacktrace was …; it was libnettest2 once again.
That was decided as WONTFIX for the moment: #158
Impact: TBD, it's the primary collector for the mobile app
Detection: email & IRC alert
Timeline UTC, Sep 10:
00:00:01: b systemd[1]: Starting Certbot...
00:00:03: certbot: Should renew, less than 30 days before certificate expiry 2017-10-09 23:01:00 UTC. Running pre-hook command: docker stop ooni-backend-b.collector.ooni.io
00:00:15: certbot: Running post-hook command: ... && docker start ooni-backend-b.collector.ooni.io
00:00:32: checkForStaleReports() ValueError: time data '2017-50-20 18:7:3' does not match format '%Y-%m-%d %H:%M:%S' (see the parsing sketch after this timeline)
00:05:55: AlertManager FIRING InstanceDown
07:32:41: @channel can somebody look at this?
07:41:41: good morning
08:30:55: AlertManager RESOLVED InstanceDown
10:14:95: incident published
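
For context, here is a minimal reproduction of the 00:00:32 failure. It assumes only what the error above shows: a timestamp string '2017-50-20 18:7:3' being parsed against '%Y-%m-%d %H:%M:%S', presumably via Python's datetime.strptime. The surrounding script is illustrative, not the actual checkForStaleReports() code.

```python
# Minimal reproduction sketch: month "50" cannot match %m, so strptime raises
# exactly the ValueError quoted in the timeline above.
from datetime import datetime

FORMAT = '%Y-%m-%d %H:%M:%S'

try:
    datetime.strptime('2017-50-20 18:7:3', FORMAT)
except ValueError as exc:
    print(exc)
    # time data '2017-50-20 18:7:3' does not match format '%Y-%m-%d %H:%M:%S'
```

If that exception is not caught inside the startup check, a single malformed raw report is enough to keep the collector from coming back after the certbot post-hook restarts it, which is consistent with the 8.5 hour outage.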
What went well:
- files from /data/b.collector.ooni.io/raw_reports were moved to /home/darkk/20170910/ to hotfix the issue

What went wrong:
What is still unclear:
- /home/darkk/20170910/archive.ls-ltr: these files were moved to main.archive_dir after the successful daemon restart. Is this some case to be monitored? Seems the spice was flowing from b.collector.ooni.io according to ooni-pipeline-cron.log at chameleon.infra.ooni.io; rsync was taking tens of minutes.
- /data/b.collector.ooni.io/var/log/ooni has lines like 404 POST /report/20170714T164513Z_AS47589_Lw5RfUUfj5kHbr1MGn7WmnmxKQX3WmqZM3gmrykqRuSTpZUt10, do these lines mean we're dropping data on the floor? Seems the client thinks so and retries.

What could be done to prevent relapse and decrease impact:
- TransportCanary/0.0.10-beta, what is it?
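
Purely as a sketch of one possible mitigation, in the spirit of the manual hotfix (moving the offending files out of /data/b.collector.ooni.io/raw_reports): quarantine reports whose timestamps fail to parse instead of letting the whole startup check crash. This is not the actual oonib code; the function signature, the quarantine directory and the get_timestamp helper below are hypothetical.

```python
# Hedged sketch, not the real checkForStaleReports(): set aside raw reports
# with unparseable timestamps so one bad file cannot keep the daemon down.
import shutil
from datetime import datetime
from pathlib import Path

FORMAT = '%Y-%m-%d %H:%M:%S'

def check_for_stale_reports(raw_reports_dir, quarantine_dir, get_timestamp):
    """get_timestamp(path) is a hypothetical helper returning the report's
    timestamp string (e.g. the one that was '2017-50-20 18:7:3')."""
    quarantine = Path(quarantine_dir)
    quarantine.mkdir(parents=True, exist_ok=True)
    for report in Path(raw_reports_dir).iterdir():
        try:
            datetime.strptime(get_timestamp(report), FORMAT)
        except ValueError:
            # Malformed report: move it aside for inspection instead of dying,
            # mirroring what was done by hand during the incident.
            shutil.move(str(report), str(quarantine / report.name))
```

Whether silently setting reports aside is acceptable is a separate policy question; the sketch only shows the shape of a guard around the parse that failed at 00:00:32.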