
Add geography lookup data and functions #81

Merged
48 commits merged into main, Sep 16, 2024

Conversation

cjrace
Contributor

@cjrace cjrace commented Sep 5, 2024

Before you start reading, go make a brew ☕ no PR with this many file changes will be a quick read...

Brief overview of changes

Added the Ward-PCon-LAD-LA hierarchy lookup file into the package (currently at https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/ward_lad_la_pcon_hierarchy.csv), joined on region and country columns, and added some helper functions to pull unique locations from that file at Ward, PCon, LAD, LA and Region level.

Added a countries dataset too, to replace https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/country.csv.

Also moved some of the code around in a way that I'm hoping will future proof the package a bit. I had some internal debates on this that I've detailed at the end of the PR, as well as some specific questions I'd like a review to feedback on.

Why are these changes being made?

We want to make this easier for everyone in DfE: moving the lookup file into the package lets us start moving away from relying on the screener repo, and puts the lookups at the fingertips of R users.

It will also be a really nice benefit for us, for use in the screener and the edustats briefing tool (which is currently running off this branch).

Detailed description of changes

  1. ONS API query

Added a wrapper function for getting lookup data from the ONS API rather than downloading the files manually. Hoping this will prove useful for future lookups too. It has its quirks...

Exported it in case it's helpful for anyone, though strongly suspect it'll only be us who use it.
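The wrapper's internals aren't shown in this thread, but as a rough sketch of what querying the Open Geography Portal's ArcGIS REST API involves, here's a minimal URL builder. The function name, service name and field names below are my own illustrative assumptions, not the actual dfeR API:

```r
# Illustrative only: build a query URL for the Open Geography Portal's
# ArcGIS REST API. Function, service and field names are assumptions,
# not the actual dfeR wrapper.
build_ons_query <- function(service, fields = "*", where = "1=1") {
  base <- "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services"
  paste0(
    base, "/", service, "/FeatureServer/0/query",
    "?where=", utils::URLencode(where, reserved = TRUE),
    "&outFields=", paste(fields, collapse = ","),
    "&f=json"
  )
}

# Example: a ward / PCon / LAD lookup layer (layer name assumed)
url <- build_ons_query("WD17_PCON17_LAD17_UK_LU", fields = c("WD17CD", "WD17NM"))
```

A wrapper along these lines would then download and parse the JSON (e.g. with jsonlite) and page through results, which is where most of the quirks live.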

  2. Creating the data sets

This is documented in the data-raw/ folder, following the package guidance at https://r-pkgs.org/data.html.

Created functions in the internal_functions.R script to help neaten the code a bit so hopefully it's easier to use moving forwards.

  3. Lookup data

I've recreated the hierarchy file from the data screener using the ONS API so we can have a more reproducible method when we next come to update it.

I did various bits of QA, comparing the data added here against the existing data in the data screener repo and against what was downloadable from the Open Geography Portal, to dual run with the API. I'm happy that the data we have matches the screener data.

Have added some unit tests to make sure the expected numbers of locations are returned.

Using waldo::compare(), the only difference I found was the one below, which suggests the new lookup I've made here is a slight improvement on what we had before:

[screenshot: waldo::compare() output]

library(dplyr)

screener <- read.csv("https://raw.githubusercontent.com/dfe-analytical-services/dfe-published-data-qa/main/data/ward_lad_la_pcon_hierarchy.csv")

# Sort both data sets identically before comparing
waldo::compare(
  screener %>%
    arrange(ward_code, pcon_code, lad_code, new_la_code, ward_name, pcon_name, lad_name, la_name),
  dfeR::wd_pcon_lad_la %>%
    arrange(ward_code, pcon_code, lad_code, new_la_code, ward_name, pcon_name, lad_name, la_name)
)

One location in over 24,000 has a code that fails the usual pattern. I think it's safe to assume this was a typo in the ONS 2017 file, where they missed a 0 from the PCon code. I've fixed this in a commented section for manual fixes and noted it in the documentation for the data set so users are aware.

The change is to make Glasgow East's PCon code S14000030 instead of S1400030

[screenshot: corrected Glasgow East PCon code]
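As a rough illustration of the kind of pattern check and manual fix involved (the regex and helper name below are mine for illustration, not the actual package code):

```r
# Illustrative check: GSS geography codes are one letter followed by
# eight digits. Helper name and regex are assumptions, not dfeR code.
is_valid_gss <- function(codes) {
  grepl("^[A-Z][0-9]{8}$", codes)
}

pcon_codes <- c("E14000530", "S1400030") # second code is missing a zero
is_valid_gss(pcon_codes)                 # TRUE FALSE

# The kind of manual fix applied for the known typo in the ONS 2017 file
pcon_codes[pcon_codes == "S1400030"] <- "S14000030"
all(is_valid_gss(pcon_codes))            # TRUE
```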

  4. Fetch functions
    I'm hoping I've made these self-explanatory in the documentation - so rather than make this description any longer I'll use this as the first test of that!

Making new 'fetch' functions from lookups should be pretty speedy moving forwards too.
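Conceptually, a fetch function is just a distinct selection at one level of the hierarchy lookup. A toy sketch (the data and function name here are made up for illustration; the real fetch functions live in the package documentation):

```r
# Toy lookup standing in for the full ward / PCon / LAD / LA hierarchy
lookup <- data.frame(
  ward_name = c("Abbey", "Abbey", "Central"),
  ward_code = c("E05000026", "E05000026", "E05000031"),
  lad_name  = c("Barking and Dagenham", "Barking and Dagenham", "Camden"),
  lad_code  = c("E09000002", "E09000002", "E09000007")
)

# Illustrative fetch: unique locations at a given level of the hierarchy
fetch_level <- function(data, level) {
  cols <- paste0(level, c("_name", "_code"))
  unique(data[, cols])
}

fetch_level(lookup, "ward") # returns the two unique wards
```

Adding a new fetch function for a future lookup is then mostly a matter of pointing the same pattern at a new data set.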

  5. Package code reshuffle
  • Moved the code of conduct into the .github folder
  • Moved comma_sep() and round_five_up() into helper_functions.R (as we already list these under helper functions in the pkgdown site)
  • Made a new family for formatting years, combining the format_ay and format_fy scripts
  • Made a new prettying family for the pretty_num, file size and time taken functions
  6. create_project()

I haven't changed anything about this function, though I did a quick refactor of the validation code, as the way it was written before wasn't being styled properly by styler. You'd run styler and then lintr would complain; if you then addressed lintr's issues, styler would undo them the next time you reran it.

The refactor now plays more nicely with styler / lintr, and I think this has accidentally resolved #77 too. All the tests are still passing, so I don't think this needs much reviewing, though I'm tagging @JT-39 for sight as the original author of that function.

Additional information for reviewers

Future plans

We want to add more lookup files that are currently in the data screener, and associated 'fetch' functions:

  • LSIP / LAD
  • LEPs
  • EDAs

We should also make an extra table / options for the custom locations we've added, like 'Unknown' and 'Outside of England'.

Questions I'm interested in thoughts on

  1. Splitting of code / scripts
    I've tried to combine our scripts a bit more and lay out a template / standard for how we approach it moving forwards, as I'm wary that we can't have subdirectories in the R folder of a package, meaning we could end up with lots of scripts and could easily lose time to having code in inconsistent places and struggling to find it.

I've added some notes to the contributing file describing this, though I'm interested in thoughts on the approach. At a high level it is:

  • internal_functions script for all helper functions that are NOT exported (internal only)
  • helper functions script for all helper functions that are exported
  • all_datasets script for describing / giving metadata for all the data sets exported in the package

Then

  • One script per family (e.g. cookies in dfeshiny)
  • One script per function if not in a family

For the creation of data sets, they get their own scripts in data-raw/, I expect these will use some of the helper functions (both exported and internal only) to make them more readable and easier to update.

Question is, does that make sense as an approach?

  2. Testing
    There are some tests on the data sets and the exported functions.

There's nothing on the ONS API wrapper, as there didn't seem to be anything obvious to test.

There are also currently no tests on internal non-exported functions.

Question is, is there more we should be checking here, or are we fine with this?

  3. Regions
    Currently only England has regions. Which makes me wonder if:
    a) we stick as we are now
    a1) and therefore should we allow Scotland / Wales / NI as regions? (We know some want it sometimes)
    or
    b) we should include an alternative level for Scotland / Wales / NI as regions, and in the big lookup file join with those instead for those countries? (I thought we might be able to do counties, but ONS only seem to publish them for England too)

  4. To fetch or not to fetch
    Currently regions and countries have fetch functions that just pull in their data. Maybe they don't need them? Maybe it's helpful? Maybe we need to expand to have a flag for custom DfE locations? Thoughts?

Issue ticket number/s and link

#77 is the only issue linked to this

cjrace added 30 commits July 27, 2024 16:59
Merge branch 'create-geog-helpers' of https://github.com/dfe-analytical-services/dfeR into create-geog-helpers

# Conflicts:
#	DESCRIPTION
#	NAMESPACE
#	_pkgdown.yml
#	inst/WORDLIST
@cjrace
Contributor Author

cjrace commented Sep 6, 2024

Copying in @jen-machin for sight, though I'm expecting @rmbielby will have the most opinions on reviewing, as it's pulling across data from the data screener.

@cjrace cjrace marked this pull request as ready for review September 6, 2024 07:49
@JT-39
Collaborator

JT-39 commented Sep 6, 2024

I think the only thing I can comment on really is the splitting of scripts.

I'm in favour of your updated method and condensing the scripts.

Perhaps make clear that a family of functions always has a family prefix, e.g. "pretty_"? (I know this is touched upon in the contributing guide, so this is just me being picky.)

@cjrace
Contributor Author

cjrace commented Sep 9, 2024

@rmbielby
Contributor

rmbielby commented Sep 9, 2024

For the re-organisation, I'm wary it's just going to get messy the way you've got it. I agree with the lines:

  • One script per family (e.g. cookies in dfeshiny)
  • One script per function if not in a family

But I'm not massively into how you're using internal_functions and helper_functions.

Thoughts are:

  • internal_functions.R seems too arbitrary a grouping. At the moment, if I want to edit how things in fetch_geographies work, I don't quickly know where all the related functions are. As a contributor, I'm not searching for something based on whether it's imported or exported, I'm looking for it based on context. I'd be keener on extending the two principles above to be:
    • One script per family for primarily user-focussed functions (e.g. fetch_geographies.R)
    • An extra script per family where needed for supporting functions (e.g. fetch_geographies_utils.R)
    • One script per function if not in a family
    • utils.R script for any functions used widely across different families and functions within the package, primarily for internal purposes (sorry, I still don't like helper_functions.R!).
  • I think we should set a high bar for things getting into any generic hold all script like utils/helper_functions/internal_functions - i.e. they should really be used extensively across different scripts within the package, whilst being of minimal use to end users. For example, a validation rule that gets used across multiple individual functions / families of functions or a common error message that we're relying on a lot.
  • Re-hashing an old conversation, but I think the filename should be a recognisable commonly used term for this kind of collective script, so that experienced package developers can find it where they'd expect. I'm wary that helper_functions.R and internal_functions.R aren't recognisable terms within the wider R programming community and I'd rather play to that audience if we're going down the CRAN route, rather than the internal DfE analysts audience when it comes to structuring the package. And that's why I think I'll generally be in favour of utils.R as opposed to anything we create as a name ourselves.
  • round_five_up() and comma_sep() are primarily user-facing functions that also happen to be used internally in specific places; I'd stick with the one-script-per-function principle for these.

@cjrace
Contributor Author

cjrace commented Sep 9, 2024

Agree with you on the proposed reshuffling @rmbielby - I specifically asked about it as I wasn't fully happy with what I had; I'd already been through a couple of variations and found myself tripping over the code while trying to keep it easy to find as I went.

I like your solution the best of anything I've tried so far. Here are the updated contributing rules I've followed in my latest commit:

Folder and script structure conventions

In an R package you are not allowed to have any subdirectories in the R/ folder. Where possible we should have:

  • one script per function
    or
  • one script per function family if a function belongs to a family

The script should share the name of the function or family. If needed, a _utils.R script should be used for internal-only functions that relate to a specific family.

Documentation for all data shipped with the package is kept in R/datasets_documentation.R. Scripts used for preparing data used in the package are not in the R folder; they are in the data-raw/ folder. Helper functions for this can be found in the R/datasets_utils.R script.

utils.R should be used to hold any cross-package helpers that aren't exported as functions or specific to a family.

Every exported function or data set gets its own test script, called test-<function-name>.R, or test-data-<data-name>.R for a data set.

@rmbielby
Contributor

This may be counter-intuitive, but given you've taken my suggestion and gone with it, I feel like I need to post-justify it with some more informed opinions than just what seemed like a good idea to me at the time... I've done a bit of scanning through blogs, and there's some interesting stuff in the following articles that's relevant for anyone reading back through this chain (and for us going forward with other packages). I think it broadly fits in with our thinking.

Contributor

@rmbielby rmbielby left a comment


Finishing for the day, so sending comments for the bits I've looked through so far.

@cjrace cjrace requested a review from rmbielby September 16, 2024 08:14
Contributor

@rmbielby rmbielby left a comment


Couple of minor things

@cjrace cjrace merged commit 45a6834 into main Sep 16, 2024
8 checks passed
@cjrace cjrace deleted the extend-wd_pcon_lad_la branch September 16, 2024 16:34
Development

Successfully merging this pull request may close these issues.

create_project() has too high a cyclomatic complexity score in linting
3 participants