
Merge pull request #33 from sensiblecodeio/peter/geography-file-header-case

The code to process the geography lookup file expects lowercase file suffixes cd, nm and nmw
phynes-sensiblecode authored Aug 19, 2022
2 parents ae6e697 + 6a547a3 commit 39559c5
Showing 10 changed files with 53 additions and 48 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -7,7 +7,7 @@ and converts them to hierarchical JSON that can be loaded into `cantabular-metadata`
It is compatible with version `1.2` of the metadata schema and versions `10.1.0`/`10.0.0`/`9.3.0`/`9.2.0` of
`cantabular-metadata`. `10.1.0` format is used by default and is identical to the `10.0.0` and `9.3.0` formats.

This is version `1.2.delta` of the CSV to JSON processing software and is subject to change.
This is version `1.2.epsilon` of the CSV to JSON processing software and is subject to change.

The applications only use packages in the Python standard library.

@@ -35,7 +35,7 @@ Basic logging will be displayed by default, including the number of high-level CSV
objects loaded and the name of the output files.
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/
t=2022-07-14 16:07:50,859 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:07:50,859 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:07:50,859 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:07:50,859 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:07:50,865 lvl=INFO msg=Reading test/testdata/geography/geography.csv: found Welsh labels for unknown classification: OTHER
@@ -55,7 +55,7 @@ t=2022-07-14 16:07:50,869 lvl=INFO msg=Written service metadata file to: ctb_met
More detailed information can be obtained by running with a `-l DEBUG` flag e.g.:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/ -l DEBUG
t=2022-07-14 16:08:32,792 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:08:32,792 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:08:32,792 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:08:32,792 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:08:32,793 lvl=DEBUG msg=Creating classification for geographic variable: GEO1
@@ -118,7 +118,7 @@ arguments as described in the help text for `ons_csv_to_ctb_json_main.py`:
For example:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/ -p t -m test -b 42
t=2022-07-14 16:09:09,794 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:09:09,794 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:09:09,794 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:09:09,794 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:09:09,796 lvl=INFO msg=Reading test/testdata/geography/geography.csv: found Welsh labels for unknown classification: OTHER
@@ -147,7 +147,7 @@ This repository contains some test data that is full of errors. It can be used to demonstrate the use
of the `--best-effort` flag as shown below:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/best_effort -o ctb_metadata_files/ -m best-effort --best-effort
t=2022-07-14 22:50:38,931 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 22:50:38,931 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 22:50:38,932 lvl=INFO msg=CSV source directory: test/testdata/best_effort
t=2022-07-14 22:50:38,934 lvl=WARNING msg=Reading test/testdata/best_effort/Classification.csv:3 no value supplied for required field Variable_Mnemonic
t=2022-07-14 22:50:38,934 lvl=WARNING msg=Reading test/testdata/best_effort/Classification.csv:3 dropping record
@@ -215,7 +215,7 @@ datasets with a `Dataset_Mnemonic` beginning with **TS** are processed.

```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/dataset_filter/ -o ctb_metadata_files/ --dataset-filter TS
t=2022-08-18 16:06:26,780 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-08-18 16:06:26,780 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-08-18 16:06:26,780 lvl=INFO msg=CSV source directory: test/testdata/dataset_filter/
t=2022-08-18 16:06:26,780 lvl=INFO msg=Dataset filter: TS
t=2022-08-18 16:06:26,781 lvl=INFO msg=No geography file specified
@@ -244,7 +244,7 @@ can be found in the `sample_2011` directory.
Use this command to convert the files to JSON (with debugging enabled):
```
> python3 bin/ons_csv_to_ctb_json_main.py -i sample_2011/ -g sample_2011/geography.csv -o ctb_metadata_files/ -m 2001-sample -l DEBUG
t=2022-07-14 16:10:18,924 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:10:18,924 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:10:18,924 lvl=INFO msg=CSV source directory: sample_2011/
t=2022-07-14 16:10:18,924 lvl=INFO msg=Geography file: sample_2011/geography.csv
t=2022-07-14 16:10:18,927 lvl=DEBUG msg=Creating classification for geographic variable: Region
@@ -329,7 +329,7 @@ will be reflected in the output filenames, but `10.1.0` format will be used.
To generate version 9.2.0 compatible files from the test data use the following command:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/ -v 9.2.0
t=2022-07-14 16:14:01,895 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:14:01,895 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:14:01,895 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:14:01,895 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:14:01,897 lvl=INFO msg=Reading test/testdata/geography/geography.csv: found Welsh labels for unknown classification: OTHER
5 changes: 5 additions & 0 deletions RELEASE_NOTES.md
@@ -1,6 +1,11 @@
Release Notes
=============

1.2.epsilon
-----------
- The code to process the geography lookup file expects lowercase column-name suffixes `cd`, `nm` and `nmw`.
Previously it expected uppercase suffixes.

1.2.delta
-----------
- Added a new `--dataset-filter` option that is used to filter the datasets which are processed
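The lowercase-suffix change described in the `1.2.epsilon` note can be illustrated with a short sketch. This is illustrative only: `is_valid_column` is a hypothetical helper, not a function in this repository, though the pattern itself is the one documented in `bin/ons_csv_to_ctb_json_geo.py`.

```python
import re

# Pattern from the read_geo_cats docstring: a variable name, a two-digit
# year, then a lowercase column-type suffix (cd, nm or nmw).
COLUMN_RE = re.compile(r'^[a-zA-Z0-9_-]+[0-9][0-9](cd|nm|nmw)$')

def is_valid_column(name):
    """Return True if a geography lookup column name uses a recognised suffix."""
    return COLUMN_RE.match(name) is not None

print(is_valid_column('LAD22cd'))   # True: lowercase suffix accepted
print(is_valid_column('LAD22CD'))   # False: uppercase suffix no longer matches
```

Uppercase headers such as `LAD22CD` now fail validation, which is why the sample and test CSV headers in this commit were changed to lowercase.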
2 changes: 1 addition & 1 deletion bin/fixup.py
@@ -11,7 +11,7 @@
import csv
from argparse import ArgumentParser

VERSION = '1.2.delta'
VERSION = '1.2.epsilon'


def main():
18 changes: 9 additions & 9 deletions bin/ons_csv_to_ctb_json_geo.py
@@ -7,29 +7,29 @@
AreaName = namedtuple('AreaName', 'name welsh_name')


CODE_SUFFIX = 'CD'
NAME_SUFFIX = 'NM'
WELSH_NAME_SUFFIX = 'NMW'
CODE_SUFFIX = 'cd'
NAME_SUFFIX = 'nm'
WELSH_NAME_SUFFIX = 'nmw'


def read_geo_cats(filename):
"""
Read a lookup file containing variable category codes, labels and Welsh labels.
Each variable will have a CD (code) column. It may also have NM (name) and NMW (Welsh name)
Each variable will have a cd (code) column. It may also have nm (name) and nmw (Welsh name)
columns. The column names are expected to have the format (as a regular expression):
<variable name><2 numerical digits for year><column type>
And to match the regular expression:
^[a-zA-Z0-9_-]+[0-9][0-9](CD|NM|NMW)$
^[a-zA-Z0-9_-]+[0-9][0-9](cd|nm|nmw)$
Category names are returned for all variables with a NM column. Welsh category names are
returned for all variables with a NMW column. The names are returned as a dict of dicts keyed
Category names are returned for all variables with a nm column. Welsh category names are
returned for all variables with a nmw column. The names are returned as a dict of dicts keyed
on the variable name. Each sub-dict is keyed on the category code and each item is of type
AreaName.
- All fields have leading/trailing whitespace removed.
- There must not be entries for a single variable with different years e.g. LAD11CD
and LAD22CD.
- There must not be entries for a single variable with different years e.g. LAD11cd
and LAD22cd.
- If multiple lines in the file refer to the same category then the names must be consistent
on all lines.
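The column-name format described in the `read_geo_cats` docstring can be sketched as a small parser. This is a sketch under stated assumptions: `split_header` and `HEADER_RE` are illustrative names, not part of this repository, and the grouping regex is an assumption derived from the documented pattern.

```python
import re

# Decompose a geography lookup column name into variable name, two-digit
# year and column type, following the documented format:
#   <variable name><2 numerical digits for year><column type>
# The non-greedy prefix lets trailing digits in the variable name (for
# example GEO2 in GEO222cd) resolve correctly via backtracking.
HEADER_RE = re.compile(r'^([a-zA-Z0-9_-]+?)([0-9][0-9])(cd|nm|nmw)$')

def split_header(column):
    """Return (variable, year, kind) for a valid column name."""
    match = HEADER_RE.match(column)
    if not match:
        raise ValueError(f'invalid code column name: {column}')
    return match.groups()

print(split_header('LAD22nm'))    # ('LAD', '22', 'nm')
print(split_header('GEO222cd'))   # ('GEO2', '22', 'cd')
```

Headers without a two-digit year directly before the suffix, such as `LA1Dcd` in the tests below, fail to match and raise `ValueError`, mirroring the "invalid code column name" errors the test suite expects.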
2 changes: 1 addition & 1 deletion bin/ons_csv_to_ctb_json_main.py
@@ -9,7 +9,7 @@
from ons_csv_to_ctb_json_load import Loader, PUBLIC_SECURITY_MNEMONIC
from ons_csv_to_ctb_json_bilingual import BilingualDict, Bilingual

VERSION = '1.2.delta'
VERSION = '1.2.epsilon'

SYSTEM = 'cantabm'
DEFAULT_CANTABULAR_VERSION = '10.1.0'
2 changes: 1 addition & 1 deletion bin/remove_empty_rows_and_columns.py
@@ -13,7 +13,7 @@
import csv
from argparse import ArgumentParser

VERSION = '1.2.delta'
VERSION = '1.2.epsilon'


def main():
2 changes: 1 addition & 1 deletion sample_2011/geography.csv
@@ -1,4 +1,4 @@
Region11CD,Region11NM,Region11NMW,Country11CD,Country11NM,Country11NMW
Region11cd,Region11nm,Region11nmw,Country11cd,Country11nm,Country11nmw
E12000001,North East,Gogledd Ddwyrain,E,England,Lloegr
E12000002,North West,Gogledd Orllewin,E,England,Lloegr
E12000003,Yorkshire and the Humber,Swydd Efrog a'r Humber,E,England,Lloegr
4 changes: 2 additions & 2 deletions test/test_category.py
@@ -59,8 +59,8 @@ def test_cats_for_geo_classification(self):
self.run_test([row], f'^Reading {FILENAME}:2 found category for geographic classification GEO1: all categories for geographic classifications must be in a separate lookup file$')

def test_cats_for_non_geo_var(self):
read_data = """CLASS122CD,CLASS122NM,CLASS122NMW
CD1,NM1,NMW1
read_data = """CLASS122cd,CLASS122nm,CLASS122nmw
cd1,nm1,nmw1
"""
expected_error = f'^Reading {GEO_FILENAME}: found Welsh labels for non geographic classification: CLASS1$'
with unittest.mock.patch('builtins.open', conditional_mock_open('geography.csv', read_data = read_data)):
48 changes: 24 additions & 24 deletions test/test_geo_read.py
@@ -9,7 +9,7 @@ def mock_open(*args, **kargs):


class TestGeoRead(unittest.TestCase):
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11CD,LAD22CD,LAD22NM,LAD22NMW,COUNTRY22CD,COUNTRY22NM
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11cd,LAD22cd,LAD22nm,LAD22nmw,COUNTRY22cd,COUNTRY22nm
OA1,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1,COUNTRY1 Name
OA2,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1,COUNTRY1 Name
OA3,LAD2,LAD2 Name,LAD2 Name (Welsh),COUNTRY1,COUNTRY1 Name
@@ -34,10 +34,10 @@ def test_read_file(self, m):
},
})

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data=""" LAD22CD , LAD22NM,LAD22NMW
LAD1 , LAD1 Name,LAD1 Name (Welsh)
LAD2 , LAD2 Name,LAD2 Name (Welsh)
LAD3 , LAD3 Name,LAD3 Name (Welsh)
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data=""" LAD22cd , LAD22nm,LAD22nmw
LAD1 , LAD1 Name,LAD1 Name (Welsh)
LAD2 , LAD2 Name,LAD2 Name (Welsh)
LAD3 , LAD3 Name,LAD3 Name (Welsh)
""")
def test_whitespace_stripping(self, m):
data = read_geo_cats('file.csv')
@@ -49,7 +49,7 @@ }
}
})

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""AbyZ_-422CD,AbyZ_-422NM,AbyZ_-422NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""AbyZ_-422cd,AbyZ_-422nm,AbyZ_-422nmw
1,Name,Name (Welsh)
""")
def test_valid_varname_characters(self, m):
@@ -60,21 +60,21 @@ }
}
})

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11CD,LAD22CD,LAD22NM,LAD22NMW,COUNTRY22CD,COUNTRY22NM
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11cd,LAD22cd,LAD22nm,LAD22nmw,COUNTRY22cd,COUNTRY22nm
OA1,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1,COUNTRY1 Name,extra
""")
def test_too_many_columns(self, m):
with self.assertRaisesRegex(ValueError, 'Reading file.csv: too many fields on row 2'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11CD,LAD22CD,LAD22NM,LAD22NMW,COUNTRY22CD,COUNTRY22NM
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11cd,LAD22cd,LAD22nm,LAD22nmw,COUNTRY22cd,COUNTRY22nm
OA1,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1
""")
def test_too_few_columns(self, m):
with self.assertRaisesRegex(ValueError, 'Reading file.csv: too few fields on row 2'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22CD,LAD22NM,LAD22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22cd,LAD22nm,LAD22nmw
LAD1,LAD1 Name,LAD1 Name (Welsh)
LAD2,LAD2 Name,LAD2 Name (Welsh)
LAD2,LAD2 Name,LAD2 Name (Welsh)
@@ -85,7 +85,7 @@ def test_different_welsh_names(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: different Welsh name for code LAD3 of LAD: "Other Name \(Welsh\)" and "LAD3 Name \(Welsh\)"$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22CD,LAD22NM,LAD22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22cd,LAD22nm,LAD22nmw
LAD1,LAD1 Name,LAD1 Name (Welsh)
LAD2,LAD2 Name,LAD2 Name (Welsh)
LAD2,Other Name,LAD2 Name (Welsh)
@@ -94,46 +94,46 @@ def test_different_names(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: different name for code LAD2 of LAD: "Other Name" and "LAD2 Name"$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22CD,LAD22NM,LAD22NMW,LAD22NM,LAD22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22cd,LAD22nm,LAD22nmw,LAD22nm,LAD22nmw
""")
def test_duplicate_column_names(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: duplicate column names: LAD22NM, LAD22NMW$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: duplicate column names: LAD22nm, LAD22nmw$'):
read_geo_cats('file.csv')


@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22CD,LA1DCD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22cd,LA1Dcd
""")
def test_invalid_varname_missing_year(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1DCD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1Dcd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22CD,LA^22CD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22cd,LA^22cd
""")
def test_invalid_varname_character(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA\^22CD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA\^22cd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22CD,LA1DCD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22cd,LA1Dcd
""")
def test_invalid_code_column_name(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1DCD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1Dcd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11CD,LAD22CD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11cd,LAD22cd
""")
def test_multiple_code_columns(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: multiple code columns found for LAD: LAD22CD and LAD11CD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: multiple code columns found for LAD: LAD22cd and LAD11cd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11CD,LAD11NM,LAD11NMW,OA11NM,COUNTRY11NMW,DISTRICT22CD,DISTRICT22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11cd,LAD11nm,LAD11nmw,OA11nm,COUNTRY11nmw,DISTRICT22cd,DISTRICT22nmw
""")
def test_unexpected_fields(self, m):
with self.assertRaisesRegex(ValueError, '^Unexpected fieldnames: COUNTRY11NMW, DISTRICT22NMW, OA11NM'):
with self.assertRaisesRegex(ValueError, '^Unexpected fieldnames: COUNTRY11nmw, DISTRICT22nmw, OA11nm'):
read_geo_cats('file.csv')


2 changes: 1 addition & 1 deletion test/testdata/geography/geography.csv
@@ -1,4 +1,4 @@
GEO222CD,GEO222NM,GEO222NMW,OTHER21CD,OTHER21NM,OTHER21NMW,GEO111CD,GEO111NM,GEO111NMW
GEO222cd,GEO222nm,GEO222nmw,OTHER21cd,OTHER21nm,OTHER21nmw,GEO111cd,GEO111nm,GEO111nmw
CD1,NM1,NM1 (Welsh),O1,,O1NM (Welsh),G1 CD1,G1 NM1,
CD2,NM2,NM2 (Welsh),O2,,,G1 CD2,G1 NM2,
CD3,NM3,,O3,,,G1 CD2,G1 NM2,
