sunlightpolicy/popular-data-sets

popular-data-sets

An analysis of the most popular open datasets on U.S. local and state data portals.

See our blog post for background info and a summary of results for this project.

Highlights of this repository:

  • Our detailed methodology can be found in our Jupyter Notebook ("Socrata API Open Data Portal Analysis - Final Version.ipynb"), which contains both the code itself and explanations of the process and decisions that were made. If you want, you can use this to run your own analysis.
  • The analysis generated a set of 52 dataset topics, each of which represents a cluster of related datasets. We ranked these topics using a popularity measure that is explained in the Jupyter Notebook. The ranked list is in the table "final_topic_ranks.csv".
    • Note that the "Topic Content" is the set of words that the clustering algorithm used to define the cluster of related datasets for a topic. The words are listed in decreasing order of importance from left to right. For example, in the second-highest ranked topic (ID number 3), "transportation" was the most important word for the cluster while "bike" was the least important.
  • If you're wondering which individual datasets were classified into which topics, go to the "topic_datasets" folder, which lists all the datasets that were part of the cluster for each topic. Note that the individual tables are large and best viewed in a spreadsheet program like Excel.
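
If you would rather read the "Topic Content" words programmatically than in a spreadsheet, the sketch below shows one way to pull out the most and least important word for a topic. It is only an illustration under assumptions: the sample row is made up (only "transportation" and "bike" come from the description above), the "Topic ID" column name is hypothetical, and the words are assumed to be separated by commas or spaces. Check the actual layout of "final_topic_ranks.csv" before relying on it.

```python
import csv
import io
import re


def split_topic_words(topic_content):
    """Split a "Topic Content" string into words, most important first.

    Assumption: words are separated by commas and/or whitespace.
    """
    return [w for w in re.split(r"[,\s]+", topic_content.strip()) if w]


def read_topic_words(csv_file, content_column="Topic Content"):
    """Return one ordered word list per row of a topic-ranks table."""
    reader = csv.DictReader(csv_file)
    return [split_topic_words(row[content_column]) for row in reader]


# Made-up sample standing in for final_topic_ranks.csv. The "Topic ID"
# header and the middle word are placeholders, not values from the repo.
sample = io.StringIO(
    "Topic ID,Topic Content\n"
    '3,"transportation, transit, bike"\n'
)
topics = read_topic_words(sample)

# Leftmost word is the most important, rightmost the least important.
most_important, least_important = topics[0][0], topics[0][-1]
print(most_important, least_important)
```

To run this against the real table, replace `sample` with an opened file, for example `open("final_topic_ranks.csv", newline="")`, and adjust `content_column` if the header is spelled differently.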

Thanks for checking out this repo, and let us know if you have any questions by opening an issue or emailing [email protected].
