This repository contains the scripts needed to build a dataset of open-source projects and to analyze how their reuse characteristics relate to their security vulnerabilities.
For the ICSR'19 paper version of this dataset, check out the icsr19 branch.
This repository consists of two main directories:
- data: stores all files that will be analyzed
- tooling: stores all scripts for building and analyzing the dataset
The analysis was performed using the following tools:
- Linux Mint (v. 19.3)
- Python (v. 3)
- Anaconda (v. 4.5.12)
- Java (v. > 8)
- Maven (v. 3.6)
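As a quick sanity check, the sketch below reports which of the command-line tools listed above are reachable on the PATH. The executable names (python3, conda, java, mvn) are assumptions about how the tools are installed on your machine, and the script does not verify version numbers.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["python3", "conda", "java", "mvn"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it once before and once after activating the conda environment, since the environment may add or shadow some of the executables.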
- Install Anaconda
- From a terminal, create a conda environment for the study.
$ conda create -n study-env
$ conda activate study-env
- From a terminal, install the necessary packages.
$ conda install -c conda-forge notebook maven xmltodict numpy scipy pandas matplotlib seaborn
- Now, from a terminal, execute the script that downloads the vendor tools.
$ tooling/download-vendor-tools.sh
- Next, open tooling/script.py and replace the STUDY_HOME path variable with the path of your locally cloned repository.
- Finally, create a JAVA_HOME system variable and export it to the PATH. See more instructions here.
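The last two steps are easy to get wrong, so the following minimal sketch checks them. The helper check_setup is hypothetical and is not part of the repository's tooling. It assumes that the path you assign to STUDY_HOME should contain the data and tooling directories described earlier, and that JAVA_HOME should point to an existing directory.

```python
import os
from pathlib import Path


# Hypothetical helper for illustration; not part of tooling/script.py.
def check_setup(study_home, env=None):
    """Return a list of human-readable problems with the local setup."""
    env = os.environ if env is None else env
    problems = []
    root = Path(study_home)
    # The repository is described as having `data` and `tooling` directories.
    for name in ("data", "tooling"):
        if not (root / name).is_dir():
            problems.append(f"missing directory: {root / name}")
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not Path(java_home).is_dir():
        problems.append(f"JAVA_HOME does not point to a directory: {java_home}")
    return problems


if __name__ == "__main__":
    # Replace "." with the path you assigned to STUDY_HOME.
    for problem in check_setup("."):
        print(problem)
```

An empty output means that both checks passed. It does not confirm that the JDK version itself is suitable.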
The steps for the data collection are described in the tooling/DataCollection.ipynb, tooling/DataCollectionRQ2.ipynb and tooling/DataCollectionRQ3.ipynb Jupyter notebooks.
More specific instructions are included before each substep.
The steps for the data analysis are described in the tooling/DataVisualization.ipynb Jupyter notebook.
The steps are linear, so each notebook should be executed from top to bottom.
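If you prefer to run the notebooks non-interactively, a top-to-bottom run can be sketched with jupyter nbconvert, as below. This is only an illustration. The order among the three data-collection notebooks is an assumption, and the approach fits only if none of the substeps needs manual intervention. Read the instructions inside each notebook first.

```python
import subprocess

# Collection notebooks first, visualization last.
# The order among the collection notebooks is an assumption.
NOTEBOOKS = [
    "tooling/DataCollection.ipynb",
    "tooling/DataCollectionRQ2.ipynb",
    "tooling/DataCollectionRQ3.ipynb",
    "tooling/DataVisualization.ipynb",
]


def build_command(notebook):
    """Build the nbconvert command that runs every cell of `notebook` in order."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]


def run_all(notebooks, runner=subprocess.run):
    """Execute the notebooks one after another, stopping at the first failure."""
    for notebook in notebooks:
        # check=True raises CalledProcessError if a notebook fails to execute.
        runner(build_command(notebook), check=True)
```

Calling run_all(NOTEBOOKS) from the repository root, inside the activated conda environment, would execute the four notebooks in sequence and overwrite each one with its executed version.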
Analyzing the dataset requires a local Maven .m2 directory that contains all built projects and the jars of their dependencies.
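To see whether a local Maven repository is populated before starting the analysis, a small check such as the following may help. It is a sketch that assumes Maven's default repository location, ~/.m2/repository. It only counts jar files and cannot tell whether every project of the dataset has been built.

```python
from pathlib import Path


def count_jars(repository):
    """Count the .jar files stored anywhere under `repository`."""
    repository = Path(repository)
    if not repository.is_dir():
        return 0
    return sum(1 for _ in repository.rglob("*.jar"))


if __name__ == "__main__":
    # ~/.m2/repository is Maven's default local repository location;
    # change the path if your settings.xml overrides it.
    default_repo = Path.home() / ".m2" / "repository"
    print(f"{default_repo}: {count_jars(default_repo)} jar file(s)")
```

A count of zero suggests that the projects have not been built yet, or that Maven is configured to use a different local repository.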