
Merge pull request #33 from sensiblecodeio/peter/geography-file-header-case

The code to process the geography lookup file expects lowercase file suffixes cd, nm and nmw
phynes-sensiblecode authored Aug 19, 2022
2 parents ae6e697 + 6a547a3 commit 39559c5
Showing 10 changed files with 53 additions and 48 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -7,7 +7,7 @@ and converts them to hierarchical JSON that can be loaded into `cantabular-metadata`
It is compatible with version `1.2` of the metadata schema and versions `10.1.0`/`10.0.0`/`9.3.0`/`9.2.0` of
`cantabular-metadata`. `10.1.0` format is used by default and is identical to the `10.0.0` and `9.3.0` formats.

This is version `1.2.delta` of the CSV to JSON processing software and is subject to change.
This is version `1.2.epsilon` of the CSV to JSON processing software and is subject to change.

The applications only use packages in the Python standard library.

@@ -35,7 +35,7 @@ Basic logging will be displayed by default, including the number of high-level CSV
objects loaded and the name of the output files.
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/
t=2022-07-14 16:07:50,859 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:07:50,859 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:07:50,859 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:07:50,859 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:07:50,865 lvl=INFO msg=Reading test/testdata/geography/geography.csv: found Welsh labels for unknown classification: OTHER
@@ -55,7 +55,7 @@ t=2022-07-14 16:07:50,869 lvl=INFO msg=Written service metadata file to: ctb_met
More detailed information can be obtained by running with a `-l DEBUG` flag e.g.:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/ -l DEBUG
t=2022-07-14 16:08:32,792 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:08:32,792 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:08:32,792 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:08:32,792 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:08:32,793 lvl=DEBUG msg=Creating classification for geographic variable: GEO1
@@ -118,7 +118,7 @@ arguments as described in the help text for `ons_csv_to_ctb_json_main.py`:
For example:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/ -p t -m test -b 42
t=2022-07-14 16:09:09,794 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:09:09,794 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:09:09,794 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:09:09,794 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:09:09,796 lvl=INFO msg=Reading test/testdata/geography/geography.csv: found Welsh labels for unknown classification: OTHER
@@ -147,7 +147,7 @@ This repository contains some test data that is full of errors. It can be used to demonstrate the use
of the `--best-effort` flag as shown below:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/best_effort -o ctb_metadata_files/ -m best-effort --best-effort
t=2022-07-14 22:50:38,931 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 22:50:38,931 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 22:50:38,932 lvl=INFO msg=CSV source directory: test/testdata/best_effort
t=2022-07-14 22:50:38,934 lvl=WARNING msg=Reading test/testdata/best_effort/Classification.csv:3 no value supplied for required field Variable_Mnemonic
t=2022-07-14 22:50:38,934 lvl=WARNING msg=Reading test/testdata/best_effort/Classification.csv:3 dropping record
@@ -215,7 +215,7 @@ datasets with a `Dataset_Mnemonic` beginning with **TS** are processed.

```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/dataset_filter/ -o ctb_metadata_files/ --dataset-filter TS
t=2022-08-18 16:06:26,780 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-08-18 16:06:26,780 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-08-18 16:06:26,780 lvl=INFO msg=CSV source directory: test/testdata/dataset_filter/
t=2022-08-18 16:06:26,780 lvl=INFO msg=Dataset filter: TS
t=2022-08-18 16:06:26,781 lvl=INFO msg=No geography file specified
@@ -244,7 +244,7 @@ can be found in the `sample_2011` directory.
Use this command to convert the files to JSON (with debugging enabled):
```
> python3 bin/ons_csv_to_ctb_json_main.py -i sample_2011/ -g sample_2011/geography.csv -o ctb_metadata_files/ -m 2001-sample -l DEBUG
t=2022-07-14 16:10:18,924 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:10:18,924 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:10:18,924 lvl=INFO msg=CSV source directory: sample_2011/
t=2022-07-14 16:10:18,924 lvl=INFO msg=Geography file: sample_2011/geography.csv
t=2022-07-14 16:10:18,927 lvl=DEBUG msg=Creating classification for geographic variable: Region
@@ -329,7 +329,7 @@ will be reflected in the output filenames, but `10.1.0` format will be used.
To generate version 9.2.0 compatible files from the test data use the following command:
```
> python3 bin/ons_csv_to_ctb_json_main.py -i test/testdata/ -g test/testdata/geography/geography.csv -o ctb_metadata_files/ -v 9.2.0
t=2022-07-14 16:14:01,895 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.delta
t=2022-07-14 16:14:01,895 lvl=INFO msg=ons_csv_to_ctb_json_main.py version 1.2.epsilon
t=2022-07-14 16:14:01,895 lvl=INFO msg=CSV source directory: test/testdata/
t=2022-07-14 16:14:01,895 lvl=INFO msg=Geography file: test/testdata/geography/geography.csv
t=2022-07-14 16:14:01,897 lvl=INFO msg=Reading test/testdata/geography/geography.csv: found Welsh labels for unknown classification: OTHER
5 changes: 5 additions & 0 deletions RELEASE_NOTES.md
@@ -1,6 +1,11 @@
Release Notes
=============

1.2.epsilon
-----------
- The code to process the geography lookup file expects lowercase column-name suffixes `cd`, `nm` and `nmw`.
Previously it expected uppercase suffixes.

1.2.delta
-----------
- Added a new `--dataset-filter` option that is used to filter the datasets which are processed
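The lowercase-suffix change described in the `1.2.epsilon` note can be illustrated with a short sketch. This is illustrative only: `is_valid_column` is a hypothetical helper, not a function in this repository, though the pattern itself is the one documented in `bin/ons_csv_to_ctb_json_geo.py`.

```python
import re

# Pattern from the read_geo_cats docstring: a variable name, a two-digit
# year, then a lowercase column-type suffix (cd, nm or nmw).
COLUMN_RE = re.compile(r'^[a-zA-Z0-9_-]+[0-9][0-9](cd|nm|nmw)$')

def is_valid_column(name):
    """Return True if a geography lookup column name uses a recognised suffix."""
    return COLUMN_RE.match(name) is not None

print(is_valid_column('LAD22cd'))   # True: lowercase suffix accepted
print(is_valid_column('LAD22CD'))   # False: uppercase suffix no longer matches
```

Uppercase headers such as `LAD22CD` now fail validation, which is why the sample and test CSV headers in this commit were changed to lowercase.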
2 changes: 1 addition & 1 deletion bin/fixup.py
@@ -11,7 +11,7 @@
import csv
from argparse import ArgumentParser

VERSION = '1.2.delta'
VERSION = '1.2.epsilon'


def main():
18 changes: 9 additions & 9 deletions bin/ons_csv_to_ctb_json_geo.py
@@ -7,29 +7,29 @@
AreaName = namedtuple('AreaName', 'name welsh_name')


CODE_SUFFIX = 'CD'
NAME_SUFFIX = 'NM'
WELSH_NAME_SUFFIX = 'NMW'
CODE_SUFFIX = 'cd'
NAME_SUFFIX = 'nm'
WELSH_NAME_SUFFIX = 'nmw'


def read_geo_cats(filename):
"""
Read a lookup file containing variable category codes, labels and Welsh labels.
Each variable will have a CD (code) column. It may also have NM (name) and NMW (Welsh name)
Each variable will have a cd (code) column. It may also have nm (name) and nmw (Welsh name)
columns. The column names are expected to have the format (as a regular expression):
<variable name><2 numerical digits for year><column type>
And to match the regular expression:
^[a-zA-Z0-9_-]+[0-9][0-9](CD|NM|NMW)$
^[a-zA-Z0-9_-]+[0-9][0-9](cd|nm|nmw)$
Category names are returned for all variables with a NM column. Welsh category names are
returned for all variables with a NMW column. The names are returned as a dict of dicts keyed
Category names are returned for all variables with a nm column. Welsh category names are
returned for all variables with a nmw column. The names are returned as a dict of dicts keyed
on the variable name. Each sub-dict is keyed on the category code and each item is of type
AreaName.
- All fields have leading/trailing whitespace removed.
- There must not be entries for a single variable with different years e.g. LAD11CD
and LAD22CD.
- There must not be entries for a single variable with different years e.g. LAD11cd
and LAD22cd.
- If multiple lines in the file refer to the same category then the names must be consistent
on all lines.
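The column-name format described in the `read_geo_cats` docstring can be sketched as a small parser. This is a sketch under stated assumptions: `split_header` and `HEADER_RE` are illustrative names, not part of this repository, and the grouping regex is an assumption derived from the documented pattern.

```python
import re

# Decompose a geography lookup column name into variable name, two-digit
# year and column type, following the documented format:
#   <variable name><2 numerical digits for year><column type>
# The non-greedy prefix lets trailing digits in the variable name (for
# example GEO2 in GEO222cd) resolve correctly via backtracking.
HEADER_RE = re.compile(r'^([a-zA-Z0-9_-]+?)([0-9][0-9])(cd|nm|nmw)$')

def split_header(column):
    """Return (variable, year, kind) for a valid column name."""
    match = HEADER_RE.match(column)
    if not match:
        raise ValueError(f'invalid code column name: {column}')
    return match.groups()

print(split_header('LAD22nm'))    # ('LAD', '22', 'nm')
print(split_header('GEO222cd'))   # ('GEO2', '22', 'cd')
```

Headers without a two-digit year directly before the suffix, such as `LA1Dcd` in the tests below, fail to match and raise `ValueError`, mirroring the "invalid code column name" errors the test suite expects.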
2 changes: 1 addition & 1 deletion bin/ons_csv_to_ctb_json_main.py
@@ -9,7 +9,7 @@
from ons_csv_to_ctb_json_load import Loader, PUBLIC_SECURITY_MNEMONIC
from ons_csv_to_ctb_json_bilingual import BilingualDict, Bilingual

VERSION = '1.2.delta'
VERSION = '1.2.epsilon'

SYSTEM = 'cantabm'
DEFAULT_CANTABULAR_VERSION = '10.1.0'
2 changes: 1 addition & 1 deletion bin/remove_empty_rows_and_columns.py
@@ -13,7 +13,7 @@
import csv
from argparse import ArgumentParser

VERSION = '1.2.delta'
VERSION = '1.2.epsilon'


def main():
2 changes: 1 addition & 1 deletion sample_2011/geography.csv
@@ -1,4 +1,4 @@
Region11CD,Region11NM,Region11NMW,Country11CD,Country11NM,Country11NMW
Region11cd,Region11nm,Region11nmw,Country11cd,Country11nm,Country11nmw
E12000001,North East,Gogledd Ddwyrain,E,England,Lloegr
E12000002,North West,Gogledd Orllewin,E,England,Lloegr
E12000003,Yorkshire and the Humber,Swydd Efrog a'r Humber,E,England,Lloegr
4 changes: 2 additions & 2 deletions test/test_category.py
@@ -59,8 +59,8 @@ def test_cats_for_geo_classification(self):
self.run_test([row], f'^Reading {FILENAME}:2 found category for geographic classification GEO1: all categories for geographic classifications must be in a separate lookup file$')

def test_cats_for_non_geo_var(self):
read_data = """CLASS122CD,CLASS122NM,CLASS122NMW
CD1,NM1,NMW1
read_data = """CLASS122cd,CLASS122nm,CLASS122nmw
cd1,nm1,nmw1
"""
expected_error = f'^Reading {GEO_FILENAME}: found Welsh labels for non geographic classification: CLASS1$'
with unittest.mock.patch('builtins.open', conditional_mock_open('geography.csv', read_data = read_data)):
48 changes: 24 additions & 24 deletions test/test_geo_read.py
@@ -9,7 +9,7 @@ def mock_open(*args, **kargs):


class TestGeoRead(unittest.TestCase):
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11CD,LAD22CD,LAD22NM,LAD22NMW,COUNTRY22CD,COUNTRY22NM
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11cd,LAD22cd,LAD22nm,LAD22nmw,COUNTRY22cd,COUNTRY22nm
OA1,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1,COUNTRY1 Name
OA2,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1,COUNTRY1 Name
OA3,LAD2,LAD2 Name,LAD2 Name (Welsh),COUNTRY1,COUNTRY1 Name
@@ -34,10 +34,10 @@ def test_read_file(self, m):
},
})

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data=""" LAD22CD , LAD22NM,LAD22NMW
LAD1 , LAD1 Name,LAD1 Name (Welsh)
LAD2 , LAD2 Name,LAD2 Name (Welsh)
LAD3 , LAD3 Name,LAD3 Name (Welsh)
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data=""" LAD22cd , LAD22nm,LAD22nmw
LAD1 , LAD1 Name,LAD1 Name (Welsh)
LAD2 , LAD2 Name,LAD2 Name (Welsh)
LAD3 , LAD3 Name,LAD3 Name (Welsh)
""")
def test_whitespace_stripping(self, m):
data = read_geo_cats('file.csv')
@@ -49,7 +49,7 @@ }
}
})

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""AbyZ_-422CD,AbyZ_-422NM,AbyZ_-422NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""AbyZ_-422cd,AbyZ_-422nm,AbyZ_-422nmw
1,Name,Name (Welsh)
""")
def test_valid_varname_characters(self, m):
@@ -60,21 +60,21 @@ }
}
})

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11CD,LAD22CD,LAD22NM,LAD22NMW,COUNTRY22CD,COUNTRY22NM
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11cd,LAD22cd,LAD22nm,LAD22nmw,COUNTRY22cd,COUNTRY22nm
OA1,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1,COUNTRY1 Name,extra
""")
def test_too_many_columns(self, m):
with self.assertRaisesRegex(ValueError, 'Reading file.csv: too many fields on row 2'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11CD,LAD22CD,LAD22NM,LAD22NMW,COUNTRY22CD,COUNTRY22NM
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA11cd,LAD22cd,LAD22nm,LAD22nmw,COUNTRY22cd,COUNTRY22nm
OA1,LAD1,LAD1 Name,LAD1 Name (Welsh),COUNTRY1
""")
def test_too_few_columns(self, m):
with self.assertRaisesRegex(ValueError, 'Reading file.csv: too few fields on row 2'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22CD,LAD22NM,LAD22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22cd,LAD22nm,LAD22nmw
LAD1,LAD1 Name,LAD1 Name (Welsh)
LAD2,LAD2 Name,LAD2 Name (Welsh)
LAD2,LAD2 Name,LAD2 Name (Welsh)
@@ -85,7 +85,7 @@ def test_different_welsh_names(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: different Welsh name for code LAD3 of LAD: "Other Name \(Welsh\)" and "LAD3 Name \(Welsh\)"$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22CD,LAD22NM,LAD22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22cd,LAD22nm,LAD22nmw
LAD1,LAD1 Name,LAD1 Name (Welsh)
LAD2,LAD2 Name,LAD2 Name (Welsh)
LAD2,Other Name,LAD2 Name (Welsh)
@@ -94,46 +94,46 @@ def test_different_names(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: different name for code LAD2 of LAD: "Other Name" and "LAD2 Name"$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22CD,LAD22NM,LAD22NMW,LAD22NM,LAD22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD22cd,LAD22nm,LAD22nmw,LAD22nm,LAD22nmw
""")
def test_duplicate_column_names(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: duplicate column names: LAD22NM, LAD22NMW$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: duplicate column names: LAD22nm, LAD22nmw$'):
read_geo_cats('file.csv')


@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22CD,LA1DCD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22cd,LA1Dcd
""")
def test_invalid_varname_missing_year(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1DCD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1Dcd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22CD,LA^22CD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22cd,LA^22cd
""")
def test_invalid_varname_character(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA\^22CD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA\^22cd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22CD,LA1DCD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""OA22cd,LA1Dcd
""")
def test_invalid_code_column_name(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1DCD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: invalid code column name: LA1Dcd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11CD,LAD22CD
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11cd,LAD22cd
""")
def test_multiple_code_columns(self, m):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: multiple code columns found for LAD: LAD22CD and LAD11CD$'):
with self.assertRaisesRegex(ValueError, '^Reading file.csv: multiple code columns found for LAD: LAD22cd and LAD11cd$'):
read_geo_cats('file.csv')

@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11CD,LAD11NM,LAD11NMW,OA11NM,COUNTRY11NMW,DISTRICT22CD,DISTRICT22NMW
@unittest.mock.patch('builtins.open', new_callable=mock_open, read_data="""LAD11cd,LAD11nm,LAD11nmw,OA11nm,COUNTRY11nmw,DISTRICT22cd,DISTRICT22nmw
""")
def test_unexpected_fields(self, m):
with self.assertRaisesRegex(ValueError, '^Unexpected fieldnames: COUNTRY11NMW, DISTRICT22NMW, OA11NM'):
with self.assertRaisesRegex(ValueError, '^Unexpected fieldnames: COUNTRY11nmw, DISTRICT22nmw, OA11nm'):
read_geo_cats('file.csv')


2 changes: 1 addition & 1 deletion test/testdata/geography/geography.csv
@@ -1,4 +1,4 @@
GEO222CD,GEO222NM,GEO222NMW,OTHER21CD,OTHER21NM,OTHER21NMW,GEO111CD,GEO111NM,GEO111NMW
GEO222cd,GEO222nm,GEO222nmw,OTHER21cd,OTHER21nm,OTHER21nmw,GEO111cd,GEO111nm,GEO111nmw
CD1,NM1,NM1 (Welsh),O1,,O1NM (Welsh),G1 CD1,G1 NM1,
CD2,NM2,NM2 (Welsh),O2,,,G1 CD2,G1 NM2,
CD3,NM3,,O3,,,G1 CD2,G1 NM2,
