Project for DSP2022 - Data Science Project I - Detecting micro and macro spatial-temporal patterns in behaviour and habitat selection by a migratory bird
- Python 3.9 or higher
- numpy
- pandas
- geopandas
- scikit-learn (sklearn)
- matplotlib
- seaborn
- [pointpats](https://pypi.org/project/pointpats/)
- contextily
- Folium
- Plotly
- PySAL
- Jupyter Notebook
Anaconda can be installed, and the code and the notebooks can be executed in a conda environment. For example, numpy, pandas, and matplotlib are already included in Anaconda.
- Put the original data CSV files in `resources/data/original`
- Run `python3 src/data_wrangling/join_datasets.py` to join the two original datasets together by `NestID`.
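
  A minimal sketch of this kind of join with pandas; the file names here are assumptions, not the repository's actual schema:

  ```python
  import pandas as pd

  # Hypothetical file names; the real CSVs live in resources/data/original.
  nests = pd.read_csv("resources/data/original/nests.csv")
  observations = pd.read_csv("resources/data/original/observations.csv")

  # Join the two datasets on the shared NestID key.
  joined = nests.merge(observations, on="NestID", how="inner")
  joined.to_csv("resources/data/joined.csv", index=False)
  ```
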
- Run `python3 src/data_wrangling/clustering.py` to calculate the clusters using DBSCAN, with different distances for each year.
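
  A minimal sketch of per-year DBSCAN clustering with scikit-learn; the column names, projected coordinates, and thresholds are assumptions:

  ```python
  import pandas as pd
  from sklearn.cluster import DBSCAN

  df = pd.read_csv("resources/data/joined.csv")  # hypothetical path

  for year, group in df.groupby("Year"):
      # eps is in the units of the coordinates: with coordinates projected
      # to metres, eps=500 lets nests within 500 m form a cluster.
      labels = DBSCAN(eps=500, min_samples=3).fit_predict(group[["x", "y"]])
      df.loc[group.index, "cluster"] = labels  # -1 marks noise points
  ```
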
- Run `python3 src/data_wrangling/nearby_nests.py` to find each nest's closest neighbour and to count nearby neighbours at several distance thresholds.
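
  One common way to compute both with scikit-learn's `BallTree`; the column names and radii are assumptions:

  ```python
  import pandas as pd
  from sklearn.neighbors import BallTree

  df = pd.read_csv("resources/data/joined.csv")  # hypothetical path
  coords = df[["x", "y"]].to_numpy()             # projected coordinates in metres
  tree = BallTree(coords)

  # k=2 because the nearest neighbour of each point is the point itself.
  dist, _ = tree.query(coords, k=2)
  df["closest_nest_dist"] = dist[:, 1]

  # Neighbour counts within several radii, excluding the nest itself.
  for radius in (100, 500, 1000):
      counts = tree.query_radius(coords, r=radius, count_only=True)
      df[f"neighbours_within_{radius}m"] = counts - 1
  ```
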
- Run `python3 src/data_wrangling/missing_data.py` to impute values for null variables.
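
  A sketch of simple imputation with pandas; the median/mode strategy is an assumption, and the actual script may impute differently:

  ```python
  import pandas as pd

  df = pd.read_csv("resources/data/joined.csv")  # hypothetical path

  # Fill numeric nulls with the column median, a robust default.
  numeric = df.select_dtypes("number").columns
  df[numeric] = df[numeric].fillna(df[numeric].median())

  # Fill categorical nulls with the most frequent value per column.
  for col in df.select_dtypes("object").columns:
      df[col] = df[col].fillna(df[col].mode().iloc[0])
  ```
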
- Run `python3 src/data_wrangling/nest_features.py` to compute features related to nests, such as the percentage of shy birds in a cluster or in a neighbourhood, and to combine the features generated by the previous scripts into a single CSV file.
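
  A sketch of one such feature, the percentage of shy birds per cluster; the boolean `shy` column is an assumption:

  ```python
  import pandas as pd

  df = pd.read_csv("resources/data/joined.csv")  # hypothetical path

  # The mean of a boolean column within a group is the share of True values.
  df["pct_shy_in_cluster"] = df.groupby("cluster")["shy"].transform("mean") * 100
  ```
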
- Run `python3 src/data_wrangling/generate_test_train_split.py` to split the dataset into two parts (test and train) and to generate one-hot variables out of some of the features.
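
  A sketch with pandas and scikit-learn; the `habitat` column and the 80/20 split are assumptions:

  ```python
  import pandas as pd
  from sklearn.model_selection import train_test_split

  df = pd.read_csv("resources/data/features.csv")  # hypothetical path

  # One-hot encode a categorical feature.
  df = pd.get_dummies(df, columns=["habitat"], prefix="habitat")

  train, test = train_test_split(df, test_size=0.2, random_state=42)
  train.to_csv("resources/data/train.csv", index=False)
  test.to_csv("resources/data/test.csv", index=False)
  ```
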
- Run `python3 src/machine_learning/build_classification_models.py` to build and evaluate classification models.
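
  A sketch of building and evaluating one classifier; the binary `Propensity` target and the file paths are assumptions:

  ```python
  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score, f1_score

  train = pd.read_csv("resources/data/train.csv")  # hypothetical paths
  test = pd.read_csv("resources/data/test.csv")

  X_train, y_train = train.drop(columns="Propensity"), train["Propensity"]
  X_test, y_test = test.drop(columns="Propensity"), test["Propensity"]

  model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

  pred = model.predict(X_test)
  print("accuracy:", accuracy_score(y_test, pred))
  print("f1:", f1_score(y_test, pred))
  ```
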
- Run `python3 src/machine_learning/build_clustering_models.py` to build the clustering models, which try to find a better propensity threshold.
- Run `python3 src/machine_learning/build_regression_models.py` to build and evaluate regression models (these perform very poorly).
- Run `python3 src/machine_learning/explaining_models.py` to run MPI and PIMP feature-importance analyses over the decision trees built by the classification script.
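
  The repository's implementation is not reproduced here; as a sketch, a permutation-based importance in the same spirit with scikit-learn, reusing `model`, `X_test`, and `y_test` from the classification sketch above:

  ```python
  from sklearn.inspection import permutation_importance

  # Shuffle each feature in turn and measure the drop in test score.
  result = permutation_importance(model, X_test, y_test,
                                  n_repeats=30, random_state=42)
  for name, mean, std in zip(X_test.columns,
                             result.importances_mean,
                             result.importances_std):
      print(f"{name}: {mean:.3f} +/- {std:.3f}")
  ```
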
- Run `python3 src/machine_learning/explaining_model_lime.py` to run LIME for explaining individual predictions.
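
  A sketch of the `lime` tabular API, again reusing names from the classification sketch; the class names are assumptions:

  ```python
  from lime.lime_tabular import LimeTabularExplainer

  explainer = LimeTabularExplainer(
      X_train.to_numpy(),
      feature_names=list(X_train.columns),
      class_names=["shy", "aggressive"],  # hypothetical labels
      mode="classification",
  )
  explanation = explainer.explain_instance(
      X_test.to_numpy()[0], model.predict_proba, num_features=5
  )
  print(explanation.as_list())  # (feature, weight) pairs for one prediction
  ```
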
- Run `python3 src/data_wrangling/mobber_percentages_per_cluster.py` to draw plots of the percentages of shy and aggressive birds in clusters.
- Run `python3 src/data_wrangling/mobber_percentages_per_cluster_per_year.py` to draw the same plots separately for each year.
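
  A sketch of such a plot with seaborn, assuming the per-cluster percentage feature from the nest-features step:

  ```python
  import matplotlib.pyplot as plt
  import pandas as pd
  import seaborn as sns

  df = pd.read_csv("resources/data/features.csv")  # hypothetical path

  sns.barplot(data=df, x="cluster", y="pct_shy_in_cluster")
  plt.ylabel("% shy birds in cluster")
  plt.show()
  ```
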
Please note that all scripts must be run from the repository root. For example, to run the clustering phase we would run `python3 src/data_wrangling/clustering.py` and not `python3 data_wrangling/clustering.py`.
- Run `python3 src/analysis/analysis.py` to draw the histograms with the real values for each year, as well as print the values. `autocorrelation.py` contains the functions needed for `analysis.py` to run.
- The distance threshold may be edited in `analysis.py`.
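
The exact statistic computed by `autocorrelation.py` is not shown here; as an assumption, a typical spatial-autocorrelation check with PySAL (listed in the requirements) is Moran's I:

```python
import geopandas as gpd
from libpysal.weights import KNN
from esda.moran import Moran

gdf = gpd.read_file("resources/data/nests.gpkg")  # hypothetical file

# Spatial weights from each nest's 5 nearest neighbours.
w = KNN.from_dataframe(gdf, k=5)

# Moran's I for a hypothetical numeric attribute.
mi = Moran(gdf["propensity"].to_numpy(), w)
print("Moran's I:", mi.I, "p-value:", mi.p_sim)
```
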
Before running the notebooks, the data files must be created. If this was not already done in the machine learning part, run the following commands:
$ python3 src/data_wrangling/join_datasets.py
$ python3 src/data_wrangling/clustering.py
Then go to the visualisation folder:
$ cd src/visualisation/
- Open Visual Studio Code, navigate to the location `src/visualisation/`, and open the chosen notebook file. Choose the kernel by clicking `Select Kernel` in the upper-right corner before running the code. If you have installed conda, you may want to choose the kernel whose name starts with `base`.
- The notebooks can also be executed on the server. Open the notebook server with the command
$ jupyter notebook
which opens the folder in a browser. You can then choose the notebook you want to look at more closely. More info can be found here.
The code for map clustering with the K-means algorithm is also written in a Python notebook. To open and run the notebook, follow the same instructions as for running the visualisations.
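
A sketch of the kind of K-means map clustering such a notebook might perform; the column names and cluster count are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("resources/data/joined.csv")  # hypothetical path

# Unlike DBSCAN, K-means needs the number of clusters up front.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
df["kmeans_cluster"] = kmeans.fit_predict(df[["x", "y"]])
```
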