Project files for the pattern recognition group assignment.
The repository currently contains the following files:
data/WikiEssentials_L4.7z
: output file of the WikiVitalArticles program. Each document is included in its entirety (but split by paragraph).
preprocess_utils.py
: preprocessing functions for Wiki data.
model_utils.py
: various utility functions used for modeling (e.g. loading embeddings).
1_preprocess_raw_data.py
: preprocessing of raw input data. Currently shortens each article to its first 8 sentences.
2_baseline_model.py
: tokenization and vectorization of input data, and the baseline model (1-layer NN with softmax classifier).
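The truncation step described for 1_preprocess_raw_data.py can be illustrated with a minimal sketch. It assumes a naive rule where a sentence ends at ".", "!" or "?" followed by whitespace; the helper name `first_n_sentences` is hypothetical and not taken from the repository, and the actual script may split sentences differently.

```python
import re

# Hypothetical helper for illustration only; not taken from the repository.
# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_n_sentences(text, n=8):
    """Return the first n sentences of text, joined by single spaces."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return " ".join(sentences[:n])
```

A rule this simple will mis-split abbreviations such as "e.g.", so a tokenizer-based splitter is likely a better choice for real Wikipedia text.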
- Download and install Anaconda Python 3
- Download the latest version of RStudio. It is needed to run the Python scripts from within RStudio.
- In a terminal, go to this repository's folder and set up the Conda environment:
conda env create -f environment.yml
- Install PyTorch with CUDA 9.2 support:
conda activate VitalWikiClassifier
conda install pytorch torchvision cudatoolkit=9.2 -c pytorch -c defaults -c numba/label/dev
- In R, install the reticulate library:
install.packages("reticulate")
- Check the .Rprofile file to ensure that R knows where to find your Anaconda distribution.
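Once the setup steps are done, a quick sanity check from inside the activated VitalWikiClassifier environment can confirm that the packages were installed. The following is a minimal sketch, not part of the repository; the function name `check_environment` is hypothetical, and the package list is an assumption based on the install command above.

```python
import importlib.util


def check_environment(packages=("torch", "torchvision")):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    status = check_environment()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'MISSING'}")
    # Only import torch if it is actually installed.
    if status.get("torch"):
        import torch
        print("CUDA available:", torch.cuda.is_available())
```

If torch is reported as missing, the Conda environment is probably not activated in the current shell.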