Inventory gathered data and document relationships among sets #52

mattgawarecki · 2017-02-23T16:03:53Z

Status

The details of this issue are currently being discussed in the comments below. This issue may contain elements where development work is helpful, but is not primarily code-driven.

Task

We should take a look at all the data we've gathered to document how and by which fields various datasets are interconnected.

What we're looking for

To ensure it fits in with all our existing documentation, the result of work on this issue should go into a Markdown file in the /docs directory of the repo. This file should list out the following:

all the data sources we've gathered: what they're called, and a one-line description of what they contain
any "key" fields that join together two or more datasets: names of the field(s) and a one-line description of what they represent

Optionally, it would be nice to have a graphical representation of how our datasets interconnect. This can be done programmatically, through the use of a graph visualization tool, or manually.
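
As a rough illustration of the programmatic route, here is a minimal Python sketch that finds column names shared between datasets and writes them out as a Graphviz DOT graph. The dataset names and column headers are hypothetical placeholders, not the project's real files, and a matching column name only suggests a possible join key; each link would still need to be checked by hand.

```python
from itertools import combinations

# Hypothetical dataset names and column headers, for illustration only.
# Real headers would be read from the project's actual files.
datasets = {
    "dataset_a": ["drug_id", "drug_name", "manufacturer"],
    "dataset_b": ["drug_id", "price", "year"],
    "dataset_c": ["manufacturer", "state"],
}


def shared_columns(datasets):
    """Map each pair of dataset names to the sorted column names they share."""
    links = {}
    for (name1, cols1), (name2, cols2) in combinations(sorted(datasets.items()), 2):
        common = sorted(set(cols1) & set(cols2))
        if common:  # skip pairs with no column names in common
            links[(name1, name2)] = common
    return links


def to_dot(links):
    """Render the links as an undirected Graphviz DOT graph with labeled edges."""
    lines = ["graph datasets {"]
    for (name1, name2), cols in sorted(links.items()):
        label = ", ".join(cols)
        lines.append(f'  "{name1}" -- "{name2}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


links = shared_columns(datasets)
dot_text = to_dot(links)
print(dot_text)
```

The DOT text can be pasted into any Graphviz renderer; the same edge list could equally feed another graph visualization tool.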

How this will help

Knowing which data sets are related makes it much easier for people to think about what insights can be gathered from them. It also identifies gaps in our understanding of the data we have and shows us what we should try to collect in the future.


jenniferthompson · 2017-02-23T16:07:21Z

Definitely needed. Should this replace, be in addition to, or be combined with our current /datadictionaries README (which needs some updating)?

skirmer · 2017-02-23T16:16:19Z

I think this file we're compiling with the one-line descriptions should link to the respective data dictionaries.

Edit: I gotta learn to read. I think this might be more complicated than the README needs to be, because of the specific info about linkages we're trying to build. But I could be wrong!

jenniferthompson · 2017-02-23T16:26:43Z

@skirmer Yep, I agree that this is trying to do more. Just trying to figure out what the role of each (if we keep both) would be!

I could see:

point 1 ("all the data sources we've gathered: what they're called, and a one-line description of what they contain") staying in the README, linked to each data dictionary and this new document, with point 2 (key fields + graphic description) being its own document

or possibly

this eventually becoming the README (including links to data dictionaries)

I currently have no opinions on which would be better! I like streamlined, so that would mean just having the README with everything we need, but that might be trying to do too much in one spot.

skirmer · 2017-02-24T20:47:49Z

As we were discussing in the Slack channel, and just to document it here: ggraph might be a good tool for illustrating the links between our datasets visually.

sharonbrener · 2017-02-27T15:36:07Z

Hey all! I know all of the data for this project lives on data.world, and that file descriptions and labels have been added to most files, but I wasn't sure if y'all knew that you can also add column descriptions to note key/joinable fields. That seems like it would be a great option for the second bullet point mentioned, and would keep that information living alongside the descriptions that already live on DW (which seems to fulfill the first bullet point).

From a column's info overlay, you can add a description.

We're also actively working on a new view that compiles all dictionaries for a dataset into a single view (@jenniferthompson just user-tested our prototype of this on Friday, actually 🙌).

I'd be happy to answer any questions around current functionality, share a preview of what's coming soon, or chat about any other feature requests from this team that we should consider building as part of our data dictionary initiative. As you might imagine, proper documentation is very near and dear to our hearts at data.world!

jenniferthompson · 2017-02-27T15:56:33Z

Definitely agreed - thanks for making sure we know about it, @sharonbrener!

(And I'm really excited about this coming-soon feature, guys! It looks awesome.)

sharonbrener · 2017-02-27T15:59:58Z

This issue does highlight that along with our new data dictionary view, we should prioritize a way to export data dictionaries as MD files so they can be added to the repo as well. I'll bring that note back to our team.
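
Until an export like that exists, a hand-rolled version is easy to sketch. The Python below is only an illustration of turning a data dictionary into a Markdown table for the repo; it is not data.world's export format, and the dataset name, column names, and descriptions are all made up for the example.

```python
def dictionary_to_markdown(dataset_name, description, columns):
    """Render one dataset's data dictionary as a Markdown section with a table.

    `columns` is a list of dicts with "name", "description", and an optional
    boolean "key" flag marking fields that can be used to join datasets.
    """
    lines = [
        f"## {dataset_name}",
        "",
        description,
        "",
        "| Column | Description | Key field |",
        "| --- | --- | --- |",
    ]
    for col in columns:
        # Escape pipes so a description cannot break the table layout.
        text = col["description"].replace("|", "\\|")
        key = "yes" if col.get("key") else ""
        lines.append(f"| {col['name']} | {text} | {key} |")
    return "\n".join(lines) + "\n"


# Hypothetical dictionary entries, for illustration only.
markdown_text = dictionary_to_markdown(
    "example_dataset.csv",
    "One-line description of what the dataset contains.",
    [
        {"name": "drug_id", "description": "Hypothetical identifier for a drug", "key": True},
        {"name": "price", "description": "Hypothetical price column"},
    ],
)
print(markdown_text)
```

The "Key field" column is one possible way to carry the key/joinable note alongside each description.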

darya-akimova · 2017-12-22T15:09:49Z

I'm officially reviving this issue, haha. I have some free time over the holidays and I'd like to spend it creating a useful data inventory/data dictionary for all of the datasets collected.

darya-akimova · 2018-01-06T19:17:00Z

Submitted a pull request for data dictionary files that I created for most existing datasets on data.world.

Still to do:

figure out which of the following datasets are most up to date and create a dictionary for it: companies_drugs_keyed.csv, manufacturers_drugs_cleaned.csv, and drugdata_clean.csv (which seem to be the same dataset, but slightly modified)
create a list or reference of which columns across all of the datasets are related or can be joined to each other
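
For the first item, one possible starting point is to compare the headers and row counts of the near-duplicate files. The sketch below uses only Python's standard library and runs on made-up in-memory CSV text; the column names and rows are assumptions for illustration and say nothing about what the three real files contain.

```python
import csv
import io


def summarize(csv_text):
    """Return the column names and data-row count of a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {"columns": list(reader.fieldnames or []), "row_count": len(rows)}


def column_diff(summary_a, summary_b):
    """Report which columns are unique to each summary and which are shared."""
    a, b = set(summary_a["columns"]), set(summary_b["columns"])
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Made-up stand-ins for two versions of the same dataset.
version_a = "drug_name,company\nDrug One,Company X\nDrug Two,Company Y\n"
version_b = "drug_name,company,company_key\nDrug One,Company X,1\n"

summary_a = summarize(version_a)
summary_b = summarize(version_b)
diff = column_diff(summary_a, summary_b)
print(summary_a["row_count"], summary_b["row_count"], diff)
```

With real files, the text would come from reading each CSV from disk; differences in columns and row counts are only a hint about which version is newest, not proof.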

mattgawarecki added documentation status-in-progress status-needs-grooming and removed status-in-progress labels Feb 23, 2017

mattgawarecki added the meta-needs-followup label Mar 18, 2017

darya-akimova mentioned this issue Jan 6, 2018

Master #69

Merged

darya-akimova added status-in-progress and removed meta-needs-followup status-needs-grooming labels Jan 12, 2018

darya-akimova self-assigned this Jan 12, 2018
