Inventory gathered data and document relationships among sets #52

Open
mattgawarecki opened this issue Feb 23, 2017 · 9 comments

@mattgawarecki
Contributor

Status

The details of this issue are currently being discussed in the comments below. This issue may contain elements where development work is helpful, but is not primarily code-driven.

Task

We should take a look at all the data we've gathered to document how and by which fields various datasets are interconnected.

What we're looking for

To ensure it fits in with all our existing documentation, the result of work on this issue should go into a Markdown file in the /docs directory of the repo. This file should list out the following:

  • all the data sources we've gathered: what they're called, and a one-line description of what they contain
  • any "key" fields that join two or more datasets together: names of the field(s) and a one-line description of what they represent
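As a starting point for the second bullet, here is a minimal sketch of how candidate key fields could be found by looking for column names shared across datasets. The file names and headers below are hypothetical placeholders, not the project's real schemas, and a shared name only suggests a possible join; someone still has to confirm the two columns mean the same thing.

```python
import csv
import io
from collections import defaultdict


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def shared_columns(headers_by_dataset):
    """Map each column name to the sorted datasets that contain it,
    keeping only columns that appear in two or more datasets."""
    datasets_by_column = defaultdict(set)
    for dataset, columns in headers_by_dataset.items():
        for column in columns:
            # Normalize case and whitespace so "Drug_Name" matches "drug_name".
            datasets_by_column[column.strip().lower()].add(dataset)
    return {
        column: sorted(datasets)
        for column, datasets in sorted(datasets_by_column.items())
        if len(datasets) >= 2
    }


# Hypothetical headers, for illustration only.
example_headers = {
    "dataset_a.csv": read_header("drug_name,year,total_spending\n"),
    "dataset_b.csv": read_header("Drug_Name,manufacturer\n"),
    "dataset_c.csv": read_header("manufacturer,state\n"),
}

candidate_keys = shared_columns(example_headers)
for column, datasets in candidate_keys.items():
    print(f"{column}: {', '.join(datasets)}")
```

With real files, `read_header` could be fed the first line of each CSV instead of the inline strings used here.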

Optionally, it would be nice to have a graphical representation of how our datasets interconnect. This can be done programmatically, through the use of a graph visualization tool, or manually.
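One possible way to do the programmatic version is to write the dataset links out as Graphviz DOT text and let a graph tool draw it. This is only a sketch: the dataset names and key fields are made-up examples, and `to_dot` is a hypothetical helper, not something that exists in the repo.

```python
def to_dot(links):
    """Render (dataset_a, dataset_b, key_field) triples as Graphviz DOT text.

    Each dataset becomes a node and each shared key becomes a labeled,
    undirected edge between the two datasets it joins.
    """
    lines = ["graph datasets {"]
    for left, right, key in links:
        lines.append(f'  "{left}" -- "{right}" [label="{key}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical links, for illustration only.
example_links = [
    ("dataset_a.csv", "dataset_b.csv", "drug_name"),
    ("dataset_b.csv", "dataset_c.csv", "manufacturer"),
]

dot_text = to_dot(example_links)
print(dot_text)
```

If the output is saved to a file such as `datasets.dot`, Graphviz can render it with `dot -Tpng datasets.dot -o datasets.png`.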

How this will help

Knowing which data sets are related makes it much easier for people to think about what insights can be gathered from them. It also identifies gaps in our understanding of the data we have and shows us what we should try to collect in the future.

@jenniferthompson
Contributor

jenniferthompson commented Feb 23, 2017

Definitely needed. Should this replace, be in addition to, or be combined with our current /datadictionaries README (which needs some updating)?

@skirmer
Member

skirmer commented Feb 23, 2017

I think this file we're compiling with the one-line descriptions should link to the respective data dictionaries.

Edit: I gotta learn to read. I think that this might be more complicated than the README needs to be exactly, because of the specific info about linkages we're trying to build. But I could be wrong!

@jenniferthompson
Contributor

@skirmer Yep, I agree that this is trying to do more. Just trying to figure out what the role of each (if we keep both) would be!

I could see:

  • point 1 ("all the data sources we've gathered: what they're called, and a one-line description of what they contain") staying in the README, linked to each data dictionary and this new document, with point 2 (key fields + graphic description) being its own document

or possibly

  • this eventually becoming the README (including links to data dictionaries)

I currently have no opinions on which would be better! I like streamlined, so that would mean just having the README with everything we need, but that might be trying to do too much in one spot.

@skirmer
Member

skirmer commented Feb 24, 2017

As we were discussing in the Slack channel (just documenting it here): ggraph might be a good tool for illustrating the links between our datasets visually.

@sharonbrener
Member

Hey all! I know all of the data for this project lives on data.world, and that file descriptions and labels have been added to most files, but I wasn't sure if y'all knew that you can also add column descriptions to note key/joinable fields. That seems like it would be a great option for the second bullet point mentioned, and would keep that information living alongside the descriptions that already live on DW (which seems to fulfill the first bullet point).

From a column's info overlay, you can add a description:
[Screenshot, 2017-02-27 09:18:24]

We're also actively working on a new view that compiles all dictionaries for a dataset into a single view (@jenniferthompson just user-tested our prototype of this on Friday, actually 🙌).

I'd be happy to answer any questions around current functionality, share a preview of what's coming soon, or chat about any other feature requests from this team that we should consider building as part of our data dictionary initiative. As you might imagine, proper documentation is very near and dear to our hearts at data.world!

@jenniferthompson
Contributor

Definitely agreed - thanks for making sure we know about it, @sharonbrener!

(And I'm really excited about this coming-soon feature, guys! It looks awesome.)

@sharonbrener
Member

This issue does highlight that along with our new data dictionary view, we should prioritize a way to export data dictionaries as MD files so they can be added to the repo as well. I'll bring that note back to our team.
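Until an export like that exists, a data dictionary could be turned into Markdown by hand or with a small script. The sketch below is an assumption about what such a file might look like, not data.world's export format; `dictionary_to_markdown` and the sample columns are hypothetical.

```python
def dictionary_to_markdown(dataset_name, columns):
    """Render (column_name, description) pairs as a Markdown section
    containing a two-column table."""
    lines = [
        f"## {dataset_name}",
        "",
        "| Column | Description |",
        "| --- | --- |",
    ]
    for name, description in columns:
        # Escape pipes so a description cannot break the table layout.
        safe_description = description.replace("|", "\\|")
        lines.append(f"| `{name}` | {safe_description} |")
    return "\n".join(lines) + "\n"


# Hypothetical column descriptions, for illustration only.
example_columns = [
    ("drug_name", "Name of the drug (candidate join key)"),
    ("year", "Calendar year of the record"),
]

markdown_text = dictionary_to_markdown("dataset_a.csv", example_columns)
print(markdown_text)
```

Marking joinable columns in the description, as in the first row above, would keep the key-field notes next to the column definitions.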

@darya-akimova
Contributor

I'm officially reviving this issue, haha. I have some free time over the holidays and I'd like to spend it creating a useful data inventory/data dictionary for all of the datasets collected.

darya-akimova mentioned this issue Jan 6, 2018
@darya-akimova
Contributor

Submitted a pull request for data dictionary files that I created for most existing datasets on data.world.

Still to do:

  • figure out which of the following datasets is most up to date and create a dictionary for it: companies_drugs_keyed.csv, manufacturers_drugs_cleaned.csv, and drugdata_clean.csv (these seem to be the same dataset, slightly modified)
  • create a list or reference of which columns across all of the datasets are related or can be joined to each other
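For the first to-do item, one rough way to see how near-duplicate files differ is to compare their columns and the rows they have in common. This is a sketch on made-up CSV text, not the real contents of the three files, and `compare_versions` is a hypothetical helper. It shows how two versions differ but cannot say on its own which one is newest.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into (column_names, list_of_row_dicts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return reader.fieldnames or [], rows


def compare_versions(text_a, text_b):
    """Summarize column and row differences between two CSV versions."""
    cols_a, rows_a = load_rows(text_a)
    cols_b, rows_b = load_rows(text_b)
    common = [c for c in cols_a if c in cols_b]

    def project(rows):
        # Reduce each row to the shared columns so rows can be compared.
        return {tuple(row[c] for c in common) for row in rows}

    return {
        "only_in_a_columns": [c for c in cols_a if c not in cols_b],
        "only_in_b_columns": [c for c in cols_b if c not in cols_a],
        "row_counts": (len(rows_a), len(rows_b)),
        "shared_rows": len(project(rows_a) & project(rows_b)),
    }


# Hypothetical file contents, for illustration only.
version_a = "drug,company\naspirin,Acme\nibuprofen,Beta\n"
version_b = "drug,company,company_key\naspirin,Acme,1\nnaproxen,Gamma,3\n"

report = compare_versions(version_a, version_b)
for field, value in report.items():
    print(f"{field}: {value}")
```

A version that adds columns (like the hypothetical `company_key` here) while keeping most rows may be a later, keyed revision, but that is a guess to verify against commit history or the file uploader.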

5 participants