
Add geography lookup data and functions #81

Merged
48 commits merged into main, Sep 16, 2024

Conversation

cjrace
Contributor

@cjrace cjrace commented Sep 5, 2024

Before you start reading, go make a brew ☕ no PR with this many file changes will be a quick read...

Brief overview of changes

Added the Ward-PCon-LAD-LA hierarchy lookup file into the package (currently at https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/ward_lad_la_pcon_hierarchy.csv), joined on region and country columns, and added some helper functions to pull unique locations from that file at Ward, PCon, LAD, LA and Region level.

Added a countries dataset too, to replace https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/country.csv.

Also moved some of the code around in a way that I'm hoping will future proof the package a bit. I had some internal debates on this that I've detailed at the end of the PR, as well as some specific questions I'd like a review to feedback on.

Why are these changes being made?

We want to make this easier for everyone in DfE: moving the lookup file into the package lets us start moving away from relying on the screener repo, and puts the lookups at the fingertips of R users.

It will also be a really nice benefit for us, for use in the screener and the edustats briefing tool (which is currently running off this branch).

Detailed description of changes

  1. ONS API query

Added a wrapper function for getting lookup data from the ONS API rather than downloading the files manually. Hoping this will prove useful for future lookups too. It has its quirks...

Exported it in case it's helpful for anyone, though strongly suspect it'll only be us who use it.
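The wrapper's internals aren't shown in this thread, but as a rough sketch of what querying the Open Geography Portal's ArcGIS REST API involves, here's a minimal URL builder. The function name, service name and field names below are my own illustrative assumptions, not the actual dfeR API:

```r
# Illustrative only: build a query URL for the Open Geography Portal's
# ArcGIS REST API. Function, service and field names are assumptions,
# not the actual dfeR wrapper.
build_ons_query <- function(service, fields = "*", where = "1=1") {
  base <- "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services"
  paste0(
    base, "/", service, "/FeatureServer/0/query",
    "?where=", utils::URLencode(where, reserved = TRUE),
    "&outFields=", paste(fields, collapse = ","),
    "&f=json"
  )
}

# Example: a ward / PCon / LAD lookup layer (layer name assumed)
url <- build_ons_query("WD17_PCON17_LAD17_UK_LU", fields = c("WD17CD", "WD17NM"))
```

A wrapper along these lines would then download and parse the JSON (e.g. with jsonlite) and page through results, which is where most of the quirks live.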

  2. Creating the data sets

This is documented in the data-raw/ folder, following the package guidance at https://r-pkgs.org/data.html.

Created functions in the internal_functions.R script to help neaten the code a bit so hopefully it's easier to use moving forwards.

  3. Lookup data

I've recreated the hierarchy file from the data screener using the ONS API so we can have a more reproducible method when we next come to update it.

I did various bits of QA, comparing the data added here against the existing data in the data screener repo and against what was downloadable from the Open Geography Portal, to dual run with the API. I'm happy that the data we have matches the screener data.

Have added some unit tests to make sure the expected numbers of locations are returned.

Using waldo::compare(), the only difference I found was the one below, which suggests the new lookup I've made here is a slight improvement on what we had before:

[screenshot: waldo::compare() output]

library(dplyr)

screener <- read.csv("https://raw.githubusercontent.com/dfe-analytical-services/dfe-published-data-qa/main/data/ward_lad_la_pcon_hierarchy.csv")

# Sort both data sets identically before comparing
waldo::compare(
  screener %>%
    arrange(ward_code, pcon_code, lad_code, new_la_code, ward_name, pcon_name, lad_name, la_name),
  dfeR::wd_pcon_lad_la %>%
    arrange(ward_code, pcon_code, lad_code, new_la_code, ward_name, pcon_name, lad_name, la_name)
)

One location in over 24,000 has a code that fails the usual pattern. I think it's safe to assume this was a typo in the ONS 2017 file, where they missed a 0 from the PCon code. I've fixed this in a commented section for manual fixes and noted it in the documentation for the data set so users are aware.

The change is to make Glasgow East's PCon code S14000030 instead of S1400030

[screenshot: corrected Glasgow East PCon code]
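As a rough illustration of the kind of pattern check and manual fix involved (the regex and helper name below are mine for illustration, not the actual package code):

```r
# Illustrative check: GSS geography codes are one letter followed by
# eight digits. Helper name and regex are assumptions, not dfeR code.
is_valid_gss <- function(codes) {
  grepl("^[A-Z][0-9]{8}$", codes)
}

pcon_codes <- c("E14000530", "S1400030") # second code is missing a zero
is_valid_gss(pcon_codes)                 # TRUE FALSE

# The kind of manual fix applied for the known typo in the ONS 2017 file
pcon_codes[pcon_codes == "S1400030"] <- "S14000030"
all(is_valid_gss(pcon_codes))            # TRUE
```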

  4. Fetch functions
    I'm hoping I've made these self-explanatory in the documentation - so rather than make this description any longer I'll use this as the first test of that!

Making new 'fetch' functions from lookups should be pretty speedy moving forwards too.
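Conceptually, a fetch function is just a distinct selection at one level of the hierarchy lookup. A toy sketch (the data and function name here are made up for illustration; the real fetch functions live in the package documentation):

```r
# Toy lookup standing in for the full ward / PCon / LAD / LA hierarchy
lookup <- data.frame(
  ward_name = c("Abbey", "Abbey", "Central"),
  ward_code = c("E05000026", "E05000026", "E05000031"),
  lad_name  = c("Barking and Dagenham", "Barking and Dagenham", "Camden"),
  lad_code  = c("E09000002", "E09000002", "E09000007")
)

# Illustrative fetch: unique locations at a given level of the hierarchy
fetch_level <- function(data, level) {
  cols <- paste0(level, c("_name", "_code"))
  unique(data[, cols])
}

fetch_level(lookup, "ward") # returns the two unique wards
```

Adding a new fetch function for a future lookup is then mostly a matter of pointing the same pattern at a new data set.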

  5. Package code reshuffle
  • Moved the code of conduct into the .github folder
  • Moved comma_sep() and round_five_up() into helper_functions.R (as we already list these under helper functions in the pkgdown site)
  • Made a new family for formatting years, combining the format_ay and format_fy scripts
  • Made a new prettying family for the pretty_num, file size and time taken functions
  6. create_project()

I haven't changed anything about this function, though I did a quick refactor of the validation code, as the way it was written before wasn't being styled properly by styler. You'd run styler and then lintr would complain; if you then addressed lintr's issues, styler would undo them the next time you reran it.

The refactor now plays more nicely with styler / lintr, and I think this has accidentally resolved #77 too. All the tests are still passing, so I don't think this needs much reviewing, though I'm tagging @JT-39 for sight as the original author of that function.

Additional information for reviewers

Future plans

We want to add more lookup files that are currently in the data screener, and associated 'fetch' functions:

  • LSIP / LAD
  • LEPs
  • EDAs

We should also make an extra table / options for the custom locations we've added, like 'Unknown' and 'Outside of England'.

Questions I'm interested in thoughts on

  1. Splitting of code / scripts
    I've tried to combine our scripts a bit more and lay out a template / standard for how we approach it moving forwards, as I'm wary that we can't have subdirectories in the R folder of a package, meaning we could end up with lots of scripts and could easily lose time to having code in inconsistent places and struggling to find it.

I've added some notes to the contributing file describing this, though I'm interested in thoughts on the approach. At a high level it is:

  • internal_functions script for all helper functions that are NOT exported (internal only)
  • helper functions script for all helper functions that are exported
  • all_datasets script for describing / giving metadata for all the data sets exported in the package

Then

  • One script per family (e.g. cookies in dfeshiny)
  • One script per function if not in a family

For the creation of data sets, they get their own scripts in data-raw/, I expect these will use some of the helper functions (both exported and internal only) to make them more readable and easier to update.

Question is, does that make sense as an approach?

  2. Testing
    There are some tests on the data sets and the exported functions.

There's nothing on the ONS API wrapper, as there didn't seem to be anything obvious to test.

There are also currently no tests on internal non-exported functions.

Question is, is there more we should be checking here, or are we fine with this?

  3. Regions
    Currently only England has regions. Which makes me wonder if:
    a) we stick as we are now
    a1) and therefore should we allow Scotland / Wales / NI as regions? (We know some want it sometimes)
    or
    b) we should include an alternative level for Scotland / Wales / NI as regions, and in the big lookup file join with those instead for those countries? (I thought we might be able to do counties, but ONS only seem to publish them for England too)

  4. To fetch or not to fetch
    Currently regions and countries have fetch functions that just pull in their data. Maybe they don't need them? Maybe it's helpful? Maybe we need to expand to have a flag for custom DfE locations? Thoughts?

Issue ticket number/s and link

#77 is the only issue linked to this

cjrace added 30 commits July 27, 2024 16:59
Merge branch 'create-geog-helpers' of https://github.com/dfe-analytical-services/dfeR into create-geog-helpers

# Conflicts:
#	DESCRIPTION
#	NAMESPACE
#	_pkgdown.yml
#	inst/WORDLIST
@cjrace
Contributor Author

cjrace commented Sep 6, 2024

Copying in @jen-machin for sight, though I'm expecting @rmbielby will have the most opinions on reviewing, as it's pulling across data from the data screener.

@cjrace cjrace marked this pull request as ready for review September 6, 2024 07:49
@JT-39
Collaborator

JT-39 commented Sep 6, 2024

I think the only thing I can comment on really is the splitting of scripts.

I'm in favour of your updated method and condensing the scripts.

Perhaps make clear that a family of functions always has a family prefix, e.g. "pretty_"? (I know this is touched upon in the contributing guide, so this is just me being picky.)

@cjrace
Contributor Author

cjrace commented Sep 9, 2024

@rmbielby
Contributor

rmbielby commented Sep 9, 2024

For the re-organisation, I'm wary it's just going to get messy the way you've got it. I agree with the lines:

  • One script per family (e.g. cookies in dfeshiny)
  • One script per function if not in a family

But I'm not massively into how you're using internal_functions and helper_functions.

Thoughts are:

  • internal_functions.R seems too arbitrary a grouping. At the moment, if I want to edit how things in fetch_geographies work, I don't quickly know where all the related functions are. As a contributor, I'm not searching for something based on whether it's imported or exported, I'm looking for it based on context. I'd be keener on extending the two principles above to be:
    • One script per family for primarily user-focussed functions (e.g. fetch_geographies.R)
    • An extra script per family where needed for supporting functions (e.g. fetch_geographies_utils.R)
    • One script per function if not in a family
    • utils.R script for any functions used widely across different families and functions within the package, primarily for internal purposes (sorry, I still don't like helper_functions.R!).
  • I think we should set a high bar for things getting into any generic hold all script like utils/helper_functions/internal_functions - i.e. they should really be used extensively across different scripts within the package, whilst being of minimal use to end users. For example, a validation rule that gets used across multiple individual functions / families of functions or a common error message that we're relying on a lot.
  • Re-hashing an old conversation, but I think the filename should be a recognisable commonly used term for this kind of collective script, so that experienced package developers can find it where they'd expect. I'm wary that helper_functions.R and internal_functions.R aren't recognisable terms within the wider R programming community and I'd rather play to that audience if we're going down the CRAN route, rather than the internal DfE analysts audience when it comes to structuring the package. And that's why I think I'll generally be in favour of utils.R as opposed to anything we create as a name ourselves.
  • round_five_up() and comma_sep() are primarily user-facing functions that also happen to be used internally in specific places; I'd stick with the one-script-per-function principle for these.

@cjrace
Contributor Author

cjrace commented Sep 9, 2024

Agree with you on the proposed reshuffling @rmbielby - I specifically asked about it as I wasn't fully happy with what I had; I'd already been through a couple of variations and found myself tripping over the code while trying to keep it easy to find as I went.

I like your solution the best of anything I've tried so far. Here are the updated contributing rules I've followed in my latest commit:

Folder and script structure conventions

In an R package you are not allowed to have any subdirectories in the R/ folder. Where possible we should have:

  • one script per function
    or
  • one script per function family if a function belongs to a family

The script should share the name of the function or family. If needed, a _utils.R script should be used for internal-only functions that relate to a specific family.

Documentation for all data shipped with the package is kept in R/datasets_documentation.R. Scripts used for preparing data used in the package are not in the R folder; they are in the data-raw/ folder. Helper functions for this can be found in the R/datasets_utils.R script.

utils.R should be used to hold any cross-package helpers that aren't exported as functions or specific to a family.

Every exported function or data set gets its own test script, called test-<function-name>.R, or test-data-<data-name>.R for a data set.

@rmbielby
Contributor

This may be counter-intuitive, but given you've taken my suggestion and gone with it, I feel like I need to post-justify it with some more informed opinions than just what seemed like a good idea to me at the time... I've done a bit of scanning through blogs, and there's some interesting stuff in the following articles that's relevant for anyone reading back through this chain (and for us going forward with other packages). I think it broadly fits in with our thinking.

Contributor

@rmbielby rmbielby left a comment


Finishing for the day, so sending comments for the bits I've looked through so far.

@cjrace cjrace requested a review from rmbielby September 16, 2024 08:14
Contributor

@rmbielby rmbielby left a comment


Couple of minor things

@cjrace cjrace merged commit 45a6834 into main Sep 16, 2024
8 checks passed
@cjrace cjrace deleted the extend-wd_pcon_lad_la branch September 16, 2024 16:34
Development

Successfully merging this pull request may close these issues.

create_project() has too high a cyclomatic complexity score in linting
3 participants