This is the GitHub repository for the paper: A. Y. Yıldız, E. Koç, A. Koç, “Multivariate Time Series Imputation with Transformers”, IEEE Signal Processing Letters, 2022. The work builds on the Multivariate Time Series Transformer Framework and extends it to imputation tasks.
The PhysioNet Healthcare dataset and the Beijing Air Quality dataset are used for the imputation task.
For every dataset, including Healthcare and Air Quality, the Pandas Time Series Data (ptsd) format is used. We preprocess the datasets following BRITS: each dataset is converted into numpy arrays of shape (number of samples × features × time points). Afterwards, using the create_df function in the create_df.py file, the data is saved with the to_pickle function as .pickle files named train_inputs, train_labels, test_inputs, and test_labels in a folder of your choice. The --data_dir option parameter holds the path of that folder.
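As a rough sketch of this step (the array sizes, folder name, and the saving loop below are illustrative assumptions; the actual logic lives in create_df.py):

```python
import os
import numpy as np
import pandas as pd

# Hypothetical preprocessed arrays of shape (samples, features, time points);
# in practice these come from the BRITS-style preprocessing.
train_inputs = np.random.randn(100, 36, 48)
train_labels = np.random.randn(100, 36, 48)

data_dir = "healthcare_data/"  # the folder later passed via --data_dir
os.makedirs(data_dir, exist_ok=True)

# Save each array as a .pickle file under the expected name,
# analogous to what create_df does via to_pickle.
for name, arr in {"train_inputs": train_inputs, "train_labels": train_labels}.items():
    pd.to_pickle(arr, os.path.join(data_dir, name + ".pickle"))
```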
The code is implemented on Linux-based systems, e.g. Ubuntu. The packages that are used, with their versions, are listed in requirements.txt. Additionally, a venv.yml file is included for conda users.
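The environment can be set up with the standard commands:

pip install -r requirements.txt

or, for conda users:

conda env create -f venv.yml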
Models are trained and saved in the experiments folder, which you are expected to create beforehand with mkdir experiments. Models can be tested using the best checkpoint of any model saved in experiments. Additionally, for any implemented task, e.g. imputation in our case, results are recorded in the file specified by --records_file, under the row name given by --name. Sample terminal commands with the corresponding option parameters are shown below.
Training
For the Air Quality experiment:
python src/main.py --output_dir experiments --name imputation_air_quality --records_file imputation_air_quality.xls --data_dir air_quality_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation
For the Healthcare experiment:
python src/main.py --output_dir experiments --name imputation_healthcare --records_file imputation_healthcare.xls --data_dir healthcare_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation
Test
In --load_model, $experiment_name is the folder of the trained model to be tested. The --masking_ratio and --mask_distribution parameters are specific to the test requirements and may be omitted if not needed; their default values are given in options.py.
For the Air Quality experiment:
python src/main.py --output_dir experiments --name imputation_air_quality --records_file imputation_air_quality.xls --data_dir air_quality_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation --test_only testset --test_pattern test --load_model experiments/$experiment_name/checkpoints/model_best.pth
For the Healthcare experiment:
python src/main.py --output_dir experiments --name imputation_healthcare --records_file imputation_healthcare.xls --data_dir healthcare_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation --test_only testset --test_pattern test --load_model experiments/$experiment_name/checkpoints/model_best.pth --masking_ratio 0.1 --mask_distribution bernoulli
After testing, three numpy array files are saved under the visualize_data folder: target.npy, target_mask.npy, and predictions.npy, each of shape (number of samples × features × time points). These files correspond to the ground-truth values, the masked indices, and the imputed values of the test data, respectively. They can be used to visualize the time points of the test data by selecting any sample index and any feature index.
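For example, a minimal plotting sketch (the sample and feature indices are arbitrary, and we assume a nonzero entry of target_mask marks a masked time point):

```python
import numpy as np
import matplotlib.pyplot as plt

# Arrays saved after testing; shape: (samples, features, time points)
target = np.load("visualize_data/target.npy")
mask = np.load("visualize_data/target_mask.npy")
preds = np.load("visualize_data/predictions.npy")

sample, feature = 0, 0  # any sample index and any feature index
t = np.arange(target.shape[2])

plt.plot(t, target[sample, feature], label="ground truth")
plt.plot(t, preds[sample, feature], label="imputed")

# Mark the masked time points on the imputed curve
# (assumption: nonzero entries of target_mask indicate masked values).
masked = mask[sample, feature].astype(bool)
plt.scatter(t[masked], preds[sample, feature, masked], marker="x", label="masked points")

plt.xlabel("time point")
plt.legend()
plt.show()
```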