KDD Cup 2018 held a month-long competition to predict the concentration of air pollutants in Beijing and London over the next 48 hours.
This is the code of our team (NIPL Rises), which achieved 16th position (among 4000 teams) in the last-10-days category, executed on a single laptop.
We used `tensorflow` to build a hybrid model composed of CNNs, LSTMs, and MLPs for end-to-end prediction over 2D grid data, time series, and categorical data. This project includes (1) fetching and crawling weather and pollutant data from multiple sources,
(2) data cleaning and integration, (3) visualization for insights,
and (4) prediction of pollutants (PM2.5, PM10, O3) for the next 48 hours in Beijing and London.
- Download the CSV files from the KDD 2018 data repository (requires sign-up),
- Install the required packages, including `tensorflow` and `keras` (for deep learning) and `selenium` (for web crawling),
- Copy `default.config.ini` to `config.ini`,
- Download `chromedriver.exe` for web crawling, and set its address: `CHROME_DRIVER_PATH = path to chromedriver.exe`
- Set the addresses of the downloaded data sets,
  - `BJ_AQ` = Beijing air quality history
  - `BJ_AQ_REST` = Beijing air quality history for the 2nd and 3rd months of 2018
  - `BJ_AQ_STATIONS` = Beijing air quality stations
  - `BJ_MEO` = Beijing meteorological history
  - `BJ_GRID_DATA` = Beijing grid weather history
  - `LD_*` = same for London
- Set the addresses where fetched/cleaned data will be stored,
  - `BJ_AQ_LIVE` = fetched Beijing air quality live data
  - `BJ_MEO_LIVE` = fetched Beijing meteorology live data
  - `BJ_OBSERVED` = cleaned Beijing observed air quality and meteorology time series
  - `BJ_OBSERVED_MISS` = marked missing data in `BJ_OBSERVED`
  - `BJ_STATIONS` = cleaned data of stations in Beijing
  - `BJ_GRID_LIVE` = fetched grid of current weather in Beijing
  - `BJ_GRID_FORECAST` = fetched grid of forecast weather in Beijing
  - `BJ_GRIDS` = history of grid data in Beijing
  - `BJ_GRID_COARSE` = grid data coarsened to lower resolutions
  - `LD_*` = same for London
- Set [lower, upper] bounds for the date intervals of URLs,
  - `BJ_AQ_URL` = */2018-06-05-0/2k0d1d8
  - `BJ_MEO_URL` = */2018-06-05-0/2k0d1d8
  - `BJ_GRID_URL` = */2018-06-05-0/2k0d1d8
  - `LD_*_URL` = same for London
- Set paths for generated features and models,
  - `FEATURE_DIR` = directory for extracted features
  - `MODEL_DIR` = directory for generated models
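Assuming `config.ini` follows standard INI key-value syntax, the entries above might look like the following sketch. All paths and file names here are placeholders chosen for illustration, not the project's actual values:

```ini
; Hypothetical sketch of config.ini -- every path below is a placeholder
CHROME_DRIVER_PATH = C:\tools\chromedriver.exe
BJ_AQ = data/bj_aq.csv
BJ_MEO = data/bj_meo.csv
BJ_GRID_DATA = data/bj_grid.csv
BJ_AQ_URL = */2018-06-05-0/2k0d1d8
FEATURE_DIR = features/
MODEL_DIR = models/
```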
- Data pre-processing
  - Run `src/preprocess/preprocess_all.py` to create the cleaned data sets at your pre-specified addresses,
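Conceptually, the cleaning stage marks missing readings in a separate mask (cf. `BJ_OBSERVED_MISS`) and fills gaps so that models see a complete series. The sketch below is an illustrative assumption about that approach, not the actual code of `preprocess_all.py`; the function name `mark_and_fill` is hypothetical:

```python
import numpy as np

def mark_and_fill(values):
    """Mark missing readings (NaN) in a 0/1 mask and fill gaps
    by linear interpolation (illustrative sketch only)."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values).astype(int)        # 1 where data is absent
    idx = np.arange(len(values))
    ok = ~np.isnan(values)
    # Interpolate the missing positions from the observed ones
    filled = np.interp(idx, idx[ok], values[ok])
    return filled, missing
```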
- Data visualization
  - Run the scripts in `src/statistics` to gain basic insights into value distributions, time series, and geographical positions,
  - Change `BJ_*` to `LD_*` for London data,
- Feature generation
  - Go to the main method of `src/feature_generators/hybrid_fg.py`,
  - Uncomment the desired (city-pollutant, sample rate) pairs in the `cases` variable (all pairs are eventually required); the higher the sample rate, the larger the generated data,
  - Run the script,
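Feature generation of this kind typically slides a window over each hourly series to produce (history, target) training pairs. The sketch below is an assumption about the general technique, not the project's actual feature generator; the window sizes and the name `make_windows` are illustrative:

```python
import numpy as np

def make_windows(series, history=24 * 7, horizon=48):
    """Slide over `series`, emitting 7 days of history as input
    and the following 48 hours as the prediction target."""
    xs, ys = [], []
    for start in range(len(series) - history - horizon + 1):
        xs.append(series[start:start + history])
        ys.append(series[start + history:start + history + horizon])
    return np.array(xs), np.array(ys)
```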
- Model training
  - Go to `src/methods/lstm_pre_train.py`,
  - Run the script; simple LSTM models are pre-trained for all pollutants. These models are fed (unchanged) into the final model for better performance,
  - Go to the main method of `src/methods/hybrid.py`,
  - Uncomment the desired city-pollutant pairs,
  - Run the script; the best model so far is saved automatically,
- Model testing
  - Go to `src/methods/model_tests.py`,
  - Uncomment the desired city-pollutant pairs, and set a time interval in `TEST_FROM` and `TEST_TO`,
  - Run the script; the SMAPE score will be printed,
  - Go to `src/methods/model_investigate.py`,
  - Run the script to see the SMAPE score per station, sorted and geographically visualized,
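For reference, SMAPE (symmetric mean absolute percentage error) in its usual form can be computed as below. This is a minimal sketch of the standard formula; how the official scorer treats terms where both values are zero is an assumption here (we define them as contributing 0):

```python
import numpy as np

def smape(actual, forecast):
    """SMAPE = mean(|F - A| / ((|A| + |F|) / 2)); ranges from 0 to 2."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    diff = np.abs(forecast - actual)
    # Assumption: terms where both values are 0 contribute 0
    with np.errstate(invalid="ignore", divide="ignore"):
        terms = np.where(denom == 0.0, 0.0, diff / denom)
    return float(np.mean(terms))
```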
- Prediction
  - Go to `src/predict_next_48.py`,
  - Change `timedelta` if you wish to predict the previous 48 hours instead,
  - Run the script,
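Shifting the prediction window with `timedelta` might look like the following sketch; the reference time and variable names are hypothetical, not taken from `predict_next_48.py`:

```python
from datetime import datetime, timedelta

now = datetime(2018, 6, 5, 0, 0)      # assumed reference time
offset = timedelta(hours=-48)          # go back 48 hours to a past window
window_start = now + offset            # start of the 48-hour window
window_end = window_start + timedelta(hours=48)
```

Predicting a past window is useful because its ground truth is already known, so the forecast can be scored.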
- Examples
  - `examples/gcforest` includes basic examples of using forests instead of neurons for deep learning, as proposed in this paper,
  - `examples/tensorflow` includes basic examples of using `tensorflow`.