Overscripted Web: Data Analysis in the Open

The Systems Research Group (SRG) at Mozilla has created and open-sourced a data set of publicly available information that was collected by a November 2017 Web crawl. We want to empower the community to explore the unseen or otherwise non-obvious series of JavaScript execution events that are triggered when a user visits a webpage, along with all the first- and third-party events that are set in motion when people retrieve content. Some preliminary insights already uncovered from this data are illustrated in this blog post. Ongoing analyses can be tracked here.

Technical criteria for submitting an analysis:

  • Analyses should be performed in Python using the Jupyter notebook format and must execute in this environment.
  • An analysis can be submitted by filing a Pull Request against this repository with the analysis formatted as an *.ipynb file in the /analyses/ folder.
    • The environment can be configured locally by calling conda env create -f environment.yaml
  • Only *.ipynb format entries submitted via a pull request to the /analyses/ folder will be considered. Notebooks must be well documented and run on the environment described. Any entries not meeting these criteria will not be considered and no review will be carried out for error-generating code.
  • Any additional code submitted will not be considered. The *.ipynb notebook should be a self-contained analysis.

Accessing the Data

Each of the links below points to a bz2-compressed portion of the total dataset. A small sample of the data is available in safe_dataset.sample.tar.bz2 to get a feel for the content without committing to the full download.

Unzipped, the full Parquet data will be approximately 70 GB. Each compressed chunk is around 9 GB. SHA256SUMS contains the checksums for all datasets, including the sample.
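
Since each chunk is a multi-gigabyte download, it can be worth checking a file against SHA256SUMS before unpacking it. The following is a minimal sketch rather than an official tool: it assumes SHA256SUMS follows the common `sha256sum` layout (one `<hex digest>  <filename>` pair per line), and the helper names `sha256_of_file`, `parse_checksums`, and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that a 9 GB archive never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse sha256sum-style lines into a {filename: digest} dict.

    Assumption: each line looks like "<hex digest>  <filename>".
    Adjust this function if SHA256SUMS uses a different layout.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary-mode entries with a leading "*".
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(path, checksums_path="SHA256SUMS"):
    """Return True if the file's digest matches its entry in the checksum file."""
    expected = parse_checksums(Path(checksums_path).read_text())
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"{name} is not listed in {checksums_path}")
    return sha256_of_file(path) == expected[name]
```

With both files in the working directory, `verify("safe_dataset.sample.tar.bz2")` would return `True` for an intact download of the sample and `False` for a corrupted or truncated one.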