Crawl data from new PHE dashboard. Fixes #38
tomwhite committed Apr 16, 2020
1 parent 38dfced commit 16196f7
Showing 4 changed files with 86 additions and 8 deletions.
16 changes: 9 additions & 7 deletions README.md
@@ -31,10 +31,11 @@ There is an *experimental* [Datasette instance](https://covid-19-uk-datasette-65

## News

* 15 April 2020. A new [dashboard][PHE-dashboard] for UK and England was launched, replacing the ArcGIS one. As a part of this change the XLSX/CSV files for daily indicators, and case counts by region and UTLA (in England) are no longer being produced. They have been replaced by CSV files, or - for programmatic access - a JSON feed.
* 14 April 2020. No per-area case numbers produced for NI, even though it is a weekday (Tuesday). Yesterday was a bank holiday, and no case numbers were produced either.
* 9 April 2020. The reporting period for case numbers in Wales changed. "For operational reasons, we are moving the point at which we count new cases of Novel Coronavirus (Covid-19) back from 7pm to 1pm. Case numbers on Thursday [9 April] will therefore be lower than usual, and will return to normal on Friday [10 April]."
* 8 April 2020. Scotland started publishing numbers for people in hospital and intensive care, by health board. They also started reporting numbers that were less than 5 as "*".
* 6 April 2020. Wales published a new interactive [dashboard](https://public.tableau.com/profile/public.health.wales.health.protection#!/vizhome/RapidCOVID-19virology-Public/Headlinesummary), which gives data for confirmed cases, and testing episodes, broken down by local authority and health board. There is historical data too. Unfortunately there is currently no way of exporting the raw data from the dashboard.
* 6 April 2020. Wales published a new interactive [dashboard][PHW-dashboard], which gives data for confirmed cases, and testing episodes, broken down by local authority and health board. There is historical data too. Unfortunately there is currently no way of exporting the raw data from the dashboard.
* 2 April 2020. Scotland [reported a more timely process for counting deaths](https://www.gov.scot/news/new-process-for-reporting-covid-19-deaths/).
* 29 March 2020. There's a [new spreadsheet](https://fingertips.phe.org.uk/documents/Historic%20COVID-19%20Dashboard%20Data.xlsx) that includes historical data for the dashboard. This includes cases (by country, English UTLA, English NHS region), deaths (by country), and recovered patients (although this isn't being updated at the time of writing).
* 27 March 2020. UK daily indicators now include number of deaths for UK, England, Scotland, Wales, and Northern Ireland.
@@ -57,10 +58,10 @@ Department of Health and Social Care, and Public Health England
3. Publish deaths by hospital every day.

Public Health Wales
1. ~~Publish the number of tests being performed every day.~~ _The new [dashboard](https://public.tableau.com/profile/public.health.wales.health.protection#!/vizhome/RapidCOVID-19virology-Public/Headlinesummary) includes number of new testing episodes every day._
1. ~~Publish the number of tests being performed every day.~~ _The new [dashboard][PHW-dashboard] includes number of new testing episodes every day._
2. Publish daily totals (tests, confirmed cases, deaths) in machine readable form (CSV).
3. Publish confirmed cases by local authority/health board in machine readable form (CSV).
4. ~~Publish historical data, not just the current day's data.~~ _The new [dashboard](https://public.tableau.com/profile/public.health.wales.health.protection#!/vizhome/RapidCOVID-19virology-Public/Headlinesummary) includes historical data._
4. ~~Publish historical data, not just the current day's data.~~ _The new [dashboard][PHW-dashboard] includes historical data._
5. Publish deaths by hospital every day.

Public Health Scotland
@@ -143,6 +144,7 @@ Note that the arcgis.com links are direct links to the data.
* Another PHE dashboard: [Coronavirus (COVID-19) in the UK](https://covid19static.azurewebsites.net/), this one is [open source](https://github.com/PublicHealthEngland/coronavirus-dashboard), and provides a download of data in CSV format.
* Ian Watt's [COVID-19 Scotland dataset](https://github.com/watty62/Scot_covid19)
* Emma Doughty's [UK COVID-19 data](https://github.com/emmadoughty/Daily_COVID-19)
* ODI Leeds mirror of PHE dashboard data: https://github.com/odileeds/coronavirus-data

## Tools

@@ -172,10 +174,7 @@ The **updates** tool runs **crawl** then **convert_sqlite_to_csvs**, and issues
./tools/update.sh Scotland
./tools/update.sh NI
./tools/update.sh UK
./tools/update.sh UK-daily-indicators
./tools/update.sh England
DATE=$(date +'%Y-%m-%d')
curl -L https://www.arcgis.com/sharing/rest/content/items/ca796627a2294c51926865748c4a56e8/data -o data/raw/NHSR_Cases_table-$DATE.csv
./tools/update.sh UK-cases-and-deaths
```

The equivalent done manually (just for Wales):
@@ -222,3 +221,6 @@ The following will compare the data in this repository, with the data published
curl -L https://fingertips.phe.org.uk/documents/Historic%20COVID-19%20Dashboard%20Data.xlsx -o "data/raw/Historic COVID-19 Dashboard Data.xlsx"
tools/compare_phe_historical.py
```

[PHE-dashboard]: https://coronavirus.data.gov.uk/
[PHW-dashboard]: https://public.tableau.com/profile/public.health.wales.health.protection#!/vizhome/RapidCOVID-19virology-Public/Headlinesummary
1 change: 1 addition & 0 deletions requirements.txt
@@ -8,3 +8,4 @@ requests
titlecase
word2number
xlrd
xmltodict
61 changes: 60 additions & 1 deletion tools/crawl.py
@@ -12,10 +12,12 @@
import requests
import sqlite3
import sys
import xmltodict

from parsers import (
    get_text_from_pdf,
    parse_daily_areas,
    parse_daily_areas_json,
    parse_daily_areas_pdf,
    parse_totals,
    parse_totals_pdf_text,
@@ -47,6 +49,8 @@ def crawl(date, dataset, check_only=False):
        return crawl_arcgis(date, "England", check_only)
    elif dataset.lower() == "uk-daily-indicators":
        return crawl_arcgis(date, "UK", check_only)
    elif dataset.lower() == "uk-cases-and-deaths":
        return crawl_json(date, "UK", check_only)


def get_html_url(date, country):
@@ -109,6 +113,61 @@ def crawl_html(date, country, check_only):
f.write(html)


def crawl_json(date, country, check_only):
    if country == "UK":
        # See https://github.com/PublicHealthEngland/coronavirus-dashboard
        blobs_url = "https://publicdashacc.blob.core.windows.net/publicdata?restype=container&comp=list"
        local_data_file = "data/raw/phe/coronavirus-covid-19-number-of-cases-in-{}-{}.json".format(
            format_country(country), date
        )

        if not os.path.exists(local_data_file):
            r = requests.get(blobs_url)
            blobs_xml = r.text
            blobs_dict = xmltodict.parse(blobs_xml)
            blob_names = sorted([o["Name"] for o in blobs_dict["EnumerationResults"]["Blobs"]["Blob"] if o["Name"]])
            dt = dateparser.parse(date, date_formats=['%Y-%m-%d'], locales=["en-GB"])
            blob_names_for_date = [name for name in blob_names if name.startswith("data_{}".format(dt.strftime('%Y%m%d')))]

            if len(blob_names_for_date) == 0:
                if check_only:
                    return DatasetUpdate.UPDATE_NOT_AVAILABLE
                sys.stderr.write("No data available for {}\n".format(date))
                sys.exit(1)

            if check_only:
                return DatasetUpdate.UPDATE_AVAILABLE

            # Use most recent date
            data_url = "https://c19pub.azureedge.net/{}".format(blob_names_for_date[-1])
            r = requests.get(data_url)
            with open(local_data_file, "w") as f:
                f.write(r.text)

        if check_only:
            return DatasetUpdate.ALREADY_UPDATED

        with open(local_data_file) as f:
            json_data = json.load(f)

        totalUKCases = json_data["overview"]["K02000001"]["totalCases"]["value"]
        totalUKDeaths = json_data["overview"]["K02000001"]["deaths"]["value"]
        englandCases = json_data["countries"]["E92000001"]["totalCases"]["value"]
        englandDeaths = json_data["countries"]["E92000001"]["deaths"]["value"]

        with sqlite3.connect('data/covid-19-uk.db') as conn:
            c = conn.cursor()
            c.execute(f"INSERT OR REPLACE INTO indicators VALUES ('{date}', 'UK', 'ConfirmedCases', {totalUKCases})")
            c.execute(f"INSERT OR REPLACE INTO indicators VALUES ('{date}', 'UK', 'Deaths', {totalUKDeaths})")
            c.execute(f"INSERT OR REPLACE INTO indicators VALUES ('{date}', 'England', 'ConfirmedCases', {englandCases})")
            c.execute(f"INSERT OR REPLACE INTO indicators VALUES ('{date}', 'England', 'Deaths', {englandDeaths})")

        # get area data for England
        daily_areas = parse_daily_areas_json(date, "England", json_data)
        if daily_areas is not None:
            save_daily_areas(date, "England", daily_areas)
            save_daily_areas_to_sqlite(date, "England", daily_areas)
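
The blob-selection step in `crawl_json` lists the container, keeps the names that start with `data_YYYYMMDD`, and takes the last one in sorted order. A minimal sketch of that step follows, assuming the listing has the `EnumerationResults/Blobs/Blob/Name` shape that the commit indexes into. It swaps `xmltodict` for the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies. The function name `latest_blob_for_date` and the sample listing are hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical sample of a container listing; real listings carry more fields.
SAMPLE_LISTING = """<EnumerationResults>
  <Blobs>
    <Blob><Name>data_202004151400.json</Name></Blob>
    <Blob><Name>data_202004161300.json</Name></Blob>
    <Blob><Name>data_202004161500.json</Name></Blob>
    <Blob><Name>other.json</Name></Blob>
  </Blobs>
</EnumerationResults>
"""


def latest_blob_for_date(listing_xml, date):
    """Return the last blob name (in sorted order) for `date`, or None if absent.

    `date` is a YYYY-MM-DD string, matching the format used by the crawler.
    """
    day = datetime.datetime.strptime(date, "%Y-%m-%d").strftime("%Y%m%d")
    prefix = "data_" + day
    root = ET.fromstring(listing_xml)
    names = sorted(
        el.text for el in root.iter("Name") if el.text and el.text.startswith(prefix)
    )
    return names[-1] if names else None


latest = latest_blob_for_date(SAMPLE_LISTING, "2020-04-16")
missing = latest_blob_for_date(SAMPLE_LISTING, "2020-04-17")
```

One reason to consider iterating over elements: `xmltodict` returns a single dict rather than a list when an element has exactly one child of a given name, so indexing `["Blobs"]["Blob"]` would behave differently if the container ever held a single blob.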

def crawl_pdf(date, country, check_only):
    if country == "Northern Ireland":

@@ -257,7 +316,7 @@ def crawl_arcgis(date, country, check_only):
print("There are no updates before 14:00")
sys.exit(0)
date = now.strftime('%Y-%m-%d')
datasets = ["Wales", "Wales-daily-cases", "Scotland", "NI", "UK"]
datasets = ["Wales", "Wales-daily-cases", "Scotland", "NI", "UK", "UK-cases-and-deaths"]
new_updates_available = False
for dataset in datasets:
updated = crawl(date, dataset, check_only=True)
16 changes: 16 additions & 0 deletions tools/parsers.py
@@ -255,6 +255,22 @@ def parse_daily_areas(date, country, html):
    return None


def parse_daily_areas_json(date, country, json_data):
    if country == "England":
        output_rows = [["Date", "Country", "AreaCode", "Area", "TotalCases"]]
        for area_code, o in json_data["utlas"].items():
            area = o["name"]["value"]
            cases = normalize_int(o["totalCases"]["value"])
            if area_code != lookup_local_authority_code(area):
                print("Area code mismatch for {}, JSON file gave {}, but lookup was {}".format(area, area_code, lookup_local_authority_code(area)))
                return None
            output_row = [date, country, area_code, area, cases]
            output_rows.append(output_row)
        return output_rows

    return None
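
The access paths in `crawl_json` and `parse_daily_areas_json` imply a feed in which every leaf is wrapped as `{"value": ...}`. The sketch below builds a tiny document of that shape and walks it the same way. All figures are made-up placeholders, and `rows_from_utlas` is a hypothetical stand-in that omits the `lookup_local_authority_code` check and replaces `normalize_int` with a plain `int` conversion.

```python
# Hypothetical miniature of the feed, shaped after the keys the commit reads.
# All figures are placeholders, not real data.
sample = {
    "overview": {"K02000001": {"totalCases": {"value": 100}, "deaths": {"value": 10}}},
    "countries": {"E92000001": {"totalCases": {"value": 80}, "deaths": {"value": 8}}},
    "utlas": {
        "E09000002": {"name": {"value": "Barking and Dagenham"}, "totalCases": {"value": 5}},
        "E09000003": {"name": {"value": "Barnet"}, "totalCases": {"value": 7}},
    },
}


def rows_from_utlas(date, country, json_data):
    """Build the header row plus one row per upper-tier local authority."""
    rows = [["Date", "Country", "AreaCode", "Area", "TotalCases"]]
    for area_code, o in json_data["utlas"].items():
        area = o["name"]["value"]
        cases = int(o["totalCases"]["value"])
        rows.append([date, country, area_code, area, cases])
    return rows


rows = rows_from_utlas("2020-04-16", "England", sample)
uk_cases = sample["overview"]["K02000001"]["totalCases"]["value"]
```

Note that the parser in the commit returns `None` for the whole dataset on the first area code mismatch, so a single unrecognised area name means no area rows are saved for that day.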


def parse_daily_areas_pdf(date, country, local_pdf_file):
    if country == "Northern Ireland":
        pdf = pdfplumber.open(local_pdf_file)
Expand Down
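
The four `INSERT OR REPLACE` statements in `crawl_json` interpolate values into the SQL text with f-strings. A possible alternative is to bind the values as parameters with `executemany`. The sketch below runs against an in-memory database. The `indicators` schema (column names and primary key) is an assumption made for illustration, since the commit does not show the table definition, and `save_indicators` is a hypothetical helper.

```python
import sqlite3


def save_indicators(conn, date, values):
    """Upsert (country, indicator, value) triples for one date using bound parameters."""
    rows = [(date, country, indicator, value) for country, indicator, value in values]
    conn.executemany("INSERT OR REPLACE INTO indicators VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Assumed schema: the real table definition is not part of this commit.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators ("
    "Date TEXT, Country TEXT, Indicator TEXT, Value INTEGER, "
    "PRIMARY KEY (Date, Country, Indicator))"
)

# Placeholder figures, not real data.
save_indicators(conn, "2020-04-16", [("UK", "ConfirmedCases", 100), ("UK", "Deaths", 10)])
# Re-running with a changed value replaces the existing row instead of duplicating it.
save_indicators(conn, "2020-04-16", [("UK", "ConfirmedCases", 101)])

stored = conn.execute(
    "SELECT Country, Indicator, Value FROM indicators ORDER BY Indicator"
).fetchall()
```

Binding parameters keeps quoting out of the SQL string, which matters if a value such as the date ever arrives from the command line in an unexpected form.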