Schema for spend data #21

Closed
rossjones opened this issue May 6, 2016 · 12 comments

Comments

@rossjones
Contributor

Recently the topic of spend data has come up, particularly how everyone seems to publish their data in slightly different structures, sometimes the same organisation using a different structure each month.

I was aware of the Local Government Association's schema ( https://github.com/esd-org-uk/schemas/blob/master/Spend/Spend.json ) and an HMRC schema was mentioned. I haven't found the HMRC schema yet, although I presume it is core-department specific, so any pointers would be welcome.

There's likely to be a problem persuading people to use a specific schema, but we should at least have a schema to suggest and to that end, I'm hoping to gather opinions on the best approach for this. Should we be asking people like https://www.spendnetwork.com/ for guidance on what they would expect? Could @torgo arrange this if so, as I believe they are at ODI.

This might be of interest to @davidread as he's had some experience with the https://openspending.org/ codebase.

@davidread

The central government spend data guidance was produced by HM Treasury in 2010: https://www.gov.uk/government/publications/guidance-for-publishing-spend-over-25000

HMT describe the schema pretty accurately in prose and with a spreadsheet example. There is a JSON schema here: https://github.com/datagovuk/schemas/blob/master/spend-hmt/spend-25k.json

I've also put together a quick guide for publishers here: http://guidance.data.gov.uk/25k-spend-data.html although it's not been sent out methodically to publishers. But perhaps you would like to do a PR with a bit about how they should check their CSV complies with a schema in goodtables/csvlint.io? And clearly there has long been an opportunity to tie these into data.gov.uk and/or gov.uk.
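As a rough illustration of what checking a CSV against a schema involves, here is a minimal standard-library sketch. It does not use goodtables or csvlint.io, and it assumes the schema is a Table Schema style descriptor with a `fields` list; the two fields shown are hypothetical stand-ins, not the contents of spend-25k.json.

```python
import csv
import io
import json

# Hypothetical descriptor for illustration only; the real field list lives in
# the published schema and guidance, not here.
SCHEMA = json.loads("""
{"fields": [
  {"name": "Date", "type": "string"},
  {"name": "Amount", "type": "number"}
]}
""")


def check_against_schema(text, schema=SCHEMA):
    """Return a list of human-readable problems found in CSV text."""
    errors = []
    reader = csv.DictReader(io.StringIO(text, newline=""))
    headers = reader.fieldnames or []

    # Every field named in the schema must appear in the header row.
    for field in schema["fields"]:
        if field["name"] not in headers:
            errors.append(f"missing column: {field['name']}")

    # Columns declared as numbers must parse once thousands separators
    # are removed.
    number_fields = [
        f["name"] for f in schema["fields"]
        if f.get("type") == "number" and f["name"] in headers
    ]
    # Line numbers assume one record per line (no multi-line quoted cells).
    for line_no, row in enumerate(reader, start=2):
        for name in number_fields:
            raw = row[name] or ""
            try:
                float(raw.replace(",", ""))
            except ValueError:
                errors.append(f"line {line_no}: {name} is not a number: {raw!r}")
    return errors
```

An empty list means the file passed these two checks; a real validator would cover far more (dates, required values, encodings).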

@rossjones
Contributor Author

Thanks for the pointers, I'll send the suggested PR. Disappointing to see that HMRC aren't following their own guidance, specifically for the Amount column over the last three months at https://data.gov.uk/dataset/financial-transactions-data-hmrc - perhaps it is a bigger problem than I thought.

@davidread

I don't know where the HMRC references are coming from, or understand why they would be involved in standards.

@rossjones
Contributor Author

You're right, getting my HM(*) mixed up, but HMRC were mentioned and I don't know if it was a mistake when HMT was meant, or whether they also have a schema.

@MikeThacker1

The LGA page with the spend CSV schema, guidance and other resources is at http://schemas.opendata.esd.org.uk/Spend

I believe SpendNetwork was consulted along with lots of councils. I'll ask them for comment.

@rossjones
Contributor Author

@MikeThacker1 is there a reason why the ESD schema is so different from the HMT one? I understand a slightly different purpose, but am just trying to get my head around what might be missing from each.

@MikeThacker1

The ESD / LGA one was designed to meet the requirements of the "Local government transparency code" in 2014, updated in 2015. See https://www.gov.uk/government/publications/local-government-transparency-code-2015

DCLG did not want to be too prescriptive in how LAs should record things, so it allowed flexibility in how each requirement is met (it would have been easier if there had been less flexibility). There are also some differences in how local government operates.

That said, I'm not sure how much the HMT one was referenced. I've asked LGA people if they might chip in.

@robmckinnon

Five years ago, I found an HMT Guidance document, the same one @davidread linked to above. By mistake I referred to it as an HMRC guidance yesterday.

In 2011, I wrote a spending data CSV validator, which I ran over spending data files found on data.gov.uk via this query: http://data.gov.uk/search/apachesolr_search/spend%20over?filters=type:ckan_package

The validator checked whether files were valid CSV, and whether mandatory headers specified in the Guidance were in the first row. It produced a page reporting which files conformed and which had problems.

At the time my idea was that some part of government e.g. the National Audit Office, could run the validator and notify publishers when their files do not conform to the HMT mandated format.
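The two checks described above (is the file valid CSV, and are the mandatory headers in the first row) can be sketched as follows. This is not the 2011 validator, only a minimal reconstruction of the idea; the header names and the category labels are hypothetical placeholders, and the real mandatory list is in the HMT guidance.

```python
import csv
import io

# Hypothetical mandatory headers, for illustration only.
MANDATORY_HEADERS = ("Date", "Supplier", "Amount")


def classify_csv(text, mandatory=MANDATORY_HEADERS):
    """Classify CSV text as 'good', 'partial', 'bad' or 'unparseable'."""
    try:
        # strict=True makes the csv module raise on malformed quoting
        # instead of silently guessing.
        reader = csv.reader(io.StringIO(text, newline=""), strict=True)
        first_row = next(reader, None)
        for _ in reader:
            pass  # read every row so parse errors anywhere surface
    except csv.Error:
        return "unparseable"

    if first_row is None:
        return "bad"  # empty file: no headers at all

    # Compare case-insensitively and ignore stray whitespace.
    found = {cell.strip().lower() for cell in first_row}
    present = [h for h in mandatory if h.lower() in found]

    if len(present) == len(mandatory):
        return "good"
    if present:
        return "partial"
    return "bad"
```

Running this over each downloaded file and grouping by the returned label would give the kind of conformance report described, with files that could not be fetched counted separately.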

@davidread

I love your idea of seeing these things as an 'audit', the same way as any legit company provides accounts in standard format and is audited. I've no idea if the NAO could be interested in this.

However I'm also keen that checking is done at the earliest opportunity, so that the feedback loop is as strong as possible. As soon as you add a delay in time and place on the web then it's not as powerful. It's an obvious thing to do schema checking when adding the data file to the central publishing infrastructure, which means gov.uk Publications and/or data.gov.uk.
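To make the "check at upload time" idea concrete, here is a minimal sketch of a gate that rejects a file the moment it is submitted and tells the publisher what is missing. Everything here is hypothetical: `accept_upload`, `ValidationFailed` and the header list are illustrative names, not part of gov.uk Publications or data.gov.uk.

```python
import csv
import io

# Hypothetical mandatory headers, for illustration only.
MANDATORY_HEADERS = ("Date", "Supplier", "Amount")


class ValidationFailed(Exception):
    """Raised when an uploaded file fails the header check."""


def missing_headers(text, mandatory=MANDATORY_HEADERS):
    """Return the mandatory headers absent from the first row of CSV text."""
    first_row = next(csv.reader(io.StringIO(text, newline="")), [])
    found = {cell.strip().lower() for cell in first_row}
    return [h for h in mandatory if h.lower() not in found]


def accept_upload(text):
    """Accept the file only if it passes; otherwise explain why not."""
    missing = missing_headers(text)
    if missing:
        # Feedback arrives immediately, in the same place the file was submitted.
        raise ValidationFailed("missing mandatory headers: " + ", ".join(missing))
    return "accepted"
```

Whether a failure should block publication outright or only warn is a policy choice; raising an exception here simply models the stricter option.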

@robmckinnon

Ideally we should help organisations validate their data prior to publishing, and consider blocking publication when validation checks fail.

Thanks to the Internet Archive, you can see the UK Spending Data CSV validation report I generated in 2011. For the files analysed the validation report breakdown was:

Good Data - 36%
All mandatory headers in first row - 1,132 files

Partial Data - 19%
Some mandatory headers in first row - 605 files

Bad Data - 45%
No standard headers in first row - 795 files (25%)
Errors parsing file as CSV - 507 files (16%)
File not found - 112 files (4%)
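The percentages above can be reproduced from the file counts. The sketch below assumes the five counts cover every file analysed, which gives a total of 3,151 files; that total is derived here, not stated in the report.

```python
# File counts as listed in the breakdown above.
counts = {
    "All mandatory headers in first row": 1132,
    "Some mandatory headers in first row": 605,
    "No standard headers in first row": 795,
    "Errors parsing file as CSV": 507,
    "File not found": 112,
}


def percentage_shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


for label, share in percentage_shares(counts).items():
    print(f"{label}: {counts[label]} files ({share}%)")
```

The three "Bad Data" categories together come to 1,414 files, which matches the 45% quoted.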

@rossjones
Contributor Author

@robmckinnon I've added a ticket at datagovuk/ckanext-dgu#416 about discussing the feasibility of adding this to DGU for when people add the metadata. Unfortunately it won't solve the problem as they will already have uploaded the actual content to gov.uk, but if we have some code to share, then that might help encourage its use.

@edent
Contributor

edent commented Jan 5, 2017

It looks like there are schemas (schemata? schemae?) available which have been published as official guidance. Therefore I'm closing this issue.

If you think the Standards team should take another look at this, please let me know.

edent closed this as completed on Jan 5, 2017