Home
This repository contains the scripts used to benchmark ExeTera, Pandas, Dask and PostgreSQL relative to each other. It contains scripts for generating artificial data, and running benchmarks on both artificial and real data.
These scripts import data from CSV to HDF5. The data used in the official benchmarks is the Covid Symptom Study dataset, a copy of which (excluding a few fields containing identifiable information) can be obtained through the Health Data Gateway by searching for 'Covid Symptom Study'.
Importing is benchmarked on the patients, assessments, and tests tables.
ExeTera import is performed using ExeTera's import command, details of which can be found on the ExeTera wiki.
The script generate_sql_import_scripts.py is used to generate SQL files that can be run via psql. This script is run as follows:
python generate_sql_import_scripts.py <schema_filename> <table_name> <import_filename>
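To give a feel for what a generated SQL file might contain, here is a minimal sketch. It is not the code in generate_sql_import_scripts.py. The function name build_import_sql, the helper quote_ident, and the schema layout (a flat mapping of column name to PostgreSQL type) are all assumptions made for illustration; the real schema file may be structured differently.

```python
def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_import_sql(schema, table_name, import_filename):
    """Return psql script text that creates a table and loads a CSV into it.

    `schema` is a hypothetical flat mapping of column name -> PostgreSQL type.
    `import_filename` is inserted as-is, so it must not contain single quotes.
    """
    columns = ",\n".join(
        f"    {quote_ident(name)} {sql_type}" for name, sql_type in schema.items()
    )
    create = f"CREATE TABLE {quote_ident(table_name)} (\n{columns}\n);"
    # \copy is a psql meta-command that reads the file on the client side.
    copy = (
        f"\\copy {quote_ident(table_name)} FROM '{import_filename}' "
        "WITH (FORMAT csv, HEADER true)"
    )
    return f"{create}\n{copy}\n"


if __name__ == "__main__":
    # Column names and types below are placeholders, not the real patients schema.
    example_schema = {"id": "TEXT", "year_of_birth": "INTEGER"}
    print(build_import_sql(example_schema, "patients", "patients.csv"))
```

The resulting text could be saved to a file and run with psql's -f option.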
Pandas import can be carried out using the import_patients_pandas.py script. Despite its name, it is used to import all three tables described above, as follows:
python import_patients_pandas.py <csv_file_name> <hdf5_file_name>
Dask import is carried out using the import_patients_dask.py script. It is used as follows:
python import_patients_dask.py <csv_file_name> <hdf5_file_name>
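This page does not say how the import timings are collected. One simple approach, sketched below under that caveat, is to run each import command several times and record wall-clock durations. The helper name time_command and the repeat count are assumptions for illustration, not part of this repository.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run `argv` `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(argv, check=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command; substitute one of the import invocations above.
    timings = time_command([sys.executable, "-c", "pass"])
    print(f"best of {len(timings)} runs: {min(timings):.3f} s")
```

Reporting the minimum of several runs reduces noise from other processes, though it hides cold-cache effects that may matter for large CSV imports.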