Transform the dependency relationships between repositories into a graph, then perform exploratory data analysis and visualization. See this blog post (in Portuguese) for more details.
```
.
│
├── dataset/
│   ├── sqlite/     <- SQLite GitHub database
│   └── json/      <- network as JSON files
│
├── notebooks/     <- Jupyter notebooks
│
└── make/
    ├── features/  <- features getter
    └── network/   <- network getter (dependencies)
```
- Clone this repo
- Create a virtual environment with `venv` (see the sketch after this list)
- Activate your environment:

  ```
  $ source [ENVIRONMENT_NAME]/bin/activate
  ```

- Install dependencies:

  ```
  $ pip install -r requirements.txt
  ```

- Create a GitHub personal access token (generate here) and insert it into `.env`
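A minimal sketch of the first and last steps, assuming the scripts read the token from a `GITHUB_TOKEN` key (both the environment name and the variable name are placeholders; check the scripts for the exact key):

```
$ python3 -m venv [ENVIRONMENT_NAME]
$ echo "GITHUB_TOKEN=[YOUR_TOKEN]" > .env
```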
There are mainly two Python scripts, a network visualization using D3.js, and some EDA in Jupyter notebooks.
Run this command:
```
$ python3 getGithubNetwork.py -r [REPO_NAME] -o [JSON_FILENAME] -d [DEPTH]
```
This script uses the GitHub GraphQL API and a depth-limited search (DLS) to fetch dependencies until the depth limit is reached. If you pass `--depth 0`, the script will try to find all dependencies, as far down as they go. The JSON file will be written to the `dataset/json/` directory.
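A minimal sketch of the traversal, not the script itself: it assumes the `dependencyGraphManifests` connection of the GitHub GraphQL API (historically behind a preview `Accept` header), treats `--depth 0` as unlimited, and omits pagination; function names and the query shape are illustrative.

```python
import os

import requests

GITHUB_GRAPHQL = "https://api.github.com/graphql"

# Illustrative query against GitHub's dependency graph; the script's
# actual query may differ. Pagination is omitted for brevity.
QUERY = """
query ($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    dependencyGraphManifests(first: 10) {
      nodes {
        dependencies(first: 50) {
          nodes { repository { nameWithOwner } }
        }
      }
    }
  }
}
"""

def fetch_dependencies(repo, token):
    """Return the set of 'owner/name' repos that `repo` depends on."""
    owner, name = repo.split("/")
    resp = requests.post(
        GITHUB_GRAPHQL,
        json={"query": QUERY, "variables": {"owner": owner, "name": name}},
        headers={
            "Authorization": f"bearer {token}",
            # Dependency graph fields were exposed behind a preview header.
            "Accept": "application/vnd.github.hawkgirl-preview+json",
        },
    )
    resp.raise_for_status()
    repository = resp.json()["data"]["repository"] or {}
    deps = set()
    for manifest in repository.get("dependencyGraphManifests", {}).get("nodes", []):
        for dep in manifest["dependencies"]["nodes"]:
            if dep["repository"]:  # some packages resolve to no GitHub repo
                deps.add(dep["repository"]["nameWithOwner"])
    return deps

def dls(repo, depth, token, edges, visited):
    """Depth-limited search; depth 0 at the top level means 'no limit'."""
    if repo in visited:
        return
    visited.add(repo)
    for dep in fetch_dependencies(repo, token):
        edges.append((repo, dep))
        if depth != 1:  # stop recursing once the limit is hit
            dls(dep, depth - 1, token, edges, visited)

# Example: edges of the network rooted at one repo, two levels deep.
edges, visited = [], set()
dls("d3/d3", 2, os.environ["GITHUB_TOKEN"], edges, visited)
```

The `visited` set is what keeps cycles in the dependency graph from recursing forever when the depth is unlimited.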
First you need to build a database using github-to-sqlite. Originally, only the scrape-dependents command is available in the upstream repository; our script to get dependencies was added in this forked repository.
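For example, the upstream tool populates the `dependents` table with a command along these lines (database filename and repository are placeholders):

```
$ github-to-sqlite scrape-dependents github.db [OWNER/REPO]
```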
Run this command:
```
$ python3 getNetworkFromSqlite.py -db [DATABASE_NAME] -s [MINIMUM_STARS] -o [JSON_FILENAME]
```
This script fetches a network from the dependents table and converts it to a JSON file. The `--stars` parameter indicates the minimum number of stars a repository must have in order to be added to the network. If you pass `--stars 0`, the script will add all repositories to the network. The JSON file will be written to the `dataset/json/` directory.
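A minimal sketch of what such a conversion looks like, assuming the schema github-to-sqlite produces (a `repos` table with `id`, `full_name`, `stargazers_count`, and a `dependents` table linking `repo` to `dependent`); table and column names, and the node-link JSON shape, are assumptions rather than the script's exact output:

```python
import json
import sqlite3

def network_from_sqlite(db_path, min_stars, out_path):
    """Build a node-link JSON file from a github-to-sqlite database."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT upstream.full_name, downstream.full_name
        FROM dependents
        JOIN repos AS upstream   ON upstream.id   = dependents.repo
        JOIN repos AS downstream ON downstream.id = dependents.dependent
        WHERE downstream.stargazers_count >= ?
        """,
        (min_stars,),  # --stars 0 keeps every repository
    ).fetchall()
    conn.close()

    # A typical D3 node-link layout; the script's actual JSON may differ.
    nodes = sorted({name for edge in rows for name in edge})
    links = [{"source": src, "target": dst} for src, dst in rows]
    with open(out_path, "w") as f:
        json.dump({"nodes": [{"id": n} for n in nodes], "links": links}, f)

network_from_sqlite("dataset/sqlite/github.db", 10, "dataset/json/network.json")
```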
D3.js was used to visualize a sample of the network. You can see it here or by opening `index.html` on a local host.
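For a quick local host, Python's built-in server works (assuming you serve from the directory containing `index.html`):

```
$ python3 -m http.server 8000
```

Then open `http://localhost:8000` in a browser.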
The `notebooks/` directory contains Jupyter notebooks with some exploratory data analysis.