Ben Murray edited this page Jun 23, 2021 · 17 revisions

ExeTeraEval

This repository contains the scripts used to benchmark ExeTera, Pandas, Dask and PostgreSQL against one another. It contains scripts for generating artificial data and for running benchmarks on both artificial and real data.

Import scripts

These scripts are used to import data from CSV to HDF5. The data used in official benchmarks is the Covid Symptom Study, a copy of which (with the exception of a few fields containing identifiable information) can be obtained through the Health Data Gateway by searching for 'Covid Symptom Study'.

Importing is benchmarked on the patients, assessments, and tests tables.

ExeTera import

ExeTera import is performed using ExeTera's import command, details of which can be found on the ExeTera wiki.

Pandas import

Pandas import can be carried out using the import_patients_pandas.py script. Despite its name, it is used to import the three tables described above, as follows:

python import_patients_pandas.py <csv_file_name> <hdf5_file_name>

Dask import

Dask import is carried out using the import_patients_dask.py script. It is used as follows:

python import_patients_dask.py <csv_file_name> <hdf5_file_name>

PostgreSQL import

The script generate_sql_import_scripts.py generates SQL files that can be run via psql. It is run as follows:

python generate_sql_import_scripts.py <schema_filename> <table_name> <import_filename>
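
The schema file format and the exact SQL that generate_sql_import_scripts.py emits are not described on this page. The following is only a minimal sketch of the general idea, assuming a hypothetical schema given as a list of (column name, SQL type) pairs and assuming the generated file loads the CSV with psql's \copy meta-command. The function name build_import_sql and the example columns are illustrative, not taken from the repository.

```python
def build_import_sql(table_name, columns, csv_path):
    """Return psql script text that creates a table and bulk-loads a CSV into it.

    `columns` is a list of (column_name, sql_type) pairs. This layout is a
    hypothetical stand-in for whatever the real schema file contains.
    """
    column_defs = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    column_names = ", ".join(name for name, _ in columns)
    return (
        f"DROP TABLE IF EXISTS {table_name};\n"
        f"CREATE TABLE {table_name} (\n{column_defs}\n);\n"
        # \copy is a psql meta-command: it reads the file on the client side.
        f"\\copy {table_name} ({column_names}) FROM '{csv_path}' "
        f"WITH (FORMAT csv, HEADER true)\n"
    )


if __name__ == "__main__":
    # Example columns are assumptions chosen only for illustration.
    example_columns = [("id", "TEXT"), ("year_of_birth", "INTEGER")]
    print(build_import_sql("patients", example_columns, "patients.csv"))
```

Text produced this way could be written to a file and executed with psql -d <database> -f <import_filename>.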

Reading scripts

The reading scripts measure the performance of reading specific columns from the imported Covid Symptom Study datasets.

ExeTera read

ExeTera reading is measured using read_patients_exetera.py. It is called as follows:

python read_patients_exetera.py <imported hdf5 file> <column_count>

Pandas read

Pandas reading is measured using read_patients_pandas.py. It is called as follows:

python read_patients_pandas.py <imported hdf5 file> <column_count>
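
Both read scripts take a column count, but how they choose and time the columns is not documented here. As a rough sketch only, assuming that <column_count> selects the first N columns and that a single wall-clock measurement is taken, the measurement could look like the following. The helper time_column_read and the reader callable are hypothetical; the real scripts would read from the imported HDF5 file instead of the toy in-memory table used here.

```python
import time


def time_column_read(read_columns, all_columns, column_count):
    """Time one call of `read_columns` on the first `column_count` column names.

    `read_columns` is any callable that accepts a list of column names and
    loads them; it is a hypothetical stand-in for the real reader.
    Returns a tuple (elapsed_seconds, result).
    """
    selected = list(all_columns[:column_count])
    start = time.perf_counter()
    result = read_columns(selected)
    elapsed = time.perf_counter() - start
    return elapsed, result


if __name__ == "__main__":
    # Toy in-memory "table" used only to exercise the timer.
    table = {"id": [1, 2, 3], "age": [30, 41, 25], "country": ["GB", "US", "SE"]}
    elapsed, data = time_column_read(
        lambda names: {name: table[name] for name in names},
        list(table),
        2,
    )
    print(f"read {len(data)} columns in {elapsed:.6f}s")
```

Passing the reader as a callable keeps the timing logic identical across back ends, so only the loading step differs between the ExeTera and Pandas measurements.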

Dask read

No Dask read script was written, because Dask was unable to import either the patients or the assessments table.

PostgreSQL read

PostgreSQL read is measured by running a standard SELECT statement in psql with \timing switched on.
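
The exact queries used are not listed on this page. As an illustration only, a small helper can produce the psql input for such a measurement. The function name build_timed_select and the table and column names below are assumptions.

```python
def build_timed_select(table_name, columns):
    """Return psql input that turns timing on and selects the given columns."""
    column_list = ", ".join(columns)
    # \timing on makes psql report how long each following statement takes.
    return f"\\timing on\nSELECT {column_list} FROM {table_name};\n"


if __name__ == "__main__":
    # Table and column names are hypothetical examples.
    print(build_timed_select("patients", ["id", "year_of_birth"]))
```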

Joins - Covid Symptom Study

ExeTera

ExeTera joins are carried out by the following scripts:

  • exetera_join_p_a.py: left join of assessments on the left and patients on the right
  • exetera_join_a_p.py: left join of patients on the left and assessments on the right
  • exetera_join_p_t.py: left join of tests on the left and patients on the right
  • exetera_join_t_p.py: left join of patients on the left and tests on the right

These scripts don't take parameters and must be called separately.

Pandas

Pandas joins are carried out by the following scripts:

  • execute_hdf_pandas_p_to_a_join_scenario.py: left join of assessments on the left and patients on the right
  • execute_hdf_pandas_a_to_p_join_scenario.py: left join of patients on the left and assessments on the right
  • execute_hdf_pandas_p_to_t_join_scenario.py: left join of tests on the left and patients on the right
  • execute_hdf_pandas_t_to_p_join_scenario.py: left join of patients on the left and tests on the right
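
To make the "p to a" scenario concrete, here is a minimal sketch of a left join with assessments on the left and patients on the right, using pandas' DataFrame.merge. The tiny tables and the key column names (id on patients, patient_id on assessments) are assumptions for illustration; the benchmark scripts themselves load the full tables from HDF5 and may be organised differently.

```python
import pandas as pd

# Tiny stand-in tables; the column names `id` and `patient_id` are assumptions.
patients = pd.DataFrame({"id": ["p1", "p2"], "year_of_birth": [1980, 1992]})
assessments = pd.DataFrame(
    {"patient_id": ["p1", "p1", "p3"], "fever": [True, False, True]}
)

# Left join: every assessment row is kept. Patient columns are filled in
# where a matching patient exists and left as NaN where none does.
joined = assessments.merge(
    patients, how="left", left_on="patient_id", right_on="id"
)

print(joined)
```

Swapping the two frames (patients.merge(assessments, ...) with the key arguments exchanged) gives the opposite "a to p" direction, and the tests table can be substituted for assessments in the same way.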