popular-data-sets

An analysis of the most popular open datasets on U.S. local and state data portals.

See our blog post for background info and a summary of results for this project.

Highlights of this repository:

- Our detailed methodology is in our Jupyter Notebook ("Socrata API Open Data Portal Analysis - Final Version.ipynb"), which contains both the code and explanations of the process and the decisions we made. You can use it to run your own analysis.
- The analysis generated a set of 52 dataset topics, each of which represents a cluster of related datasets. We ranked these topics using a popularity measure that is explained in the Jupyter Notebook. The ranked list is in the table "final_topic_ranks.csv".
  - Note that the "Topic Content" is the set of words that the clustering algorithm used to define the cluster of related datasets for a topic. The words are listed in decreasing order of importance from left to right. For example, in the second-highest ranked topic (ID number 3), "transportation" was the most important word for the cluster while "bike" was the least important.
- If you're wondering which individual datasets were classified into which topics, go to the "topic_datasets" folder, which lists all the datasets that were part of the cluster for each topic. Note that the individual tables are large and best viewed in a spreadsheet program like Excel.
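
If you would rather read the ranked topics in code than in a spreadsheet, here is a minimal sketch of one way to do it with Python's standard library. It is an illustration only: the column names `Rank` and `Topic ID`, the comma separator inside "Topic Content", and every value in the sample except topic 3's first and last words are assumptions made up for this example, so check the real header of "final_topic_ranks.csv" before reusing it.

```python
import csv
import io

# Hypothetical sample that mimics final_topic_ranks.csv. Only the
# "Topic Content" column name comes from the README; "Rank", "Topic ID",
# the comma separator, and the sample words are assumptions.
SAMPLE = '''Rank,Topic ID,Topic Content
1,7,"police, crime, incident"
2,3,"transportation, traffic, bike"
'''


def load_topics(handle):
    """Return one dict per topic, sorted by rank.

    Each topic's words keep their left-to-right order, which the README
    describes as most important first and least important last.
    """
    topics = []
    for row in csv.DictReader(handle):
        words = [w.strip() for w in row["Topic Content"].split(",") if w.strip()]
        topics.append(
            {
                "rank": int(row["Rank"]),
                "topic_id": int(row["Topic ID"]),
                "words": words,
            }
        )
    return sorted(topics, key=lambda t: t["rank"])


topics = load_topics(io.StringIO(SAMPLE))
second = topics[1]  # second-highest ranked topic
print(second["topic_id"], second["words"][0], second["words"][-1])
```

To try this on the real table, pass `open("final_topic_ranks.csv", newline="")` to `load_topics` in place of the in-memory sample, and adjust the column names and separator to match the file.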

Thanks for checking out this repo, and let us know if you have any questions by opening an issue or emailing [email protected].

