Using Clickhouse OLAP to support Study View cohort queries (pilot)

Description

This repo provisions and runs a Clickhouse instance loaded with data from the msk_met_2021, msk_ch_2020 and msk_impact_2017 datahub studies. A modified cBioPortal backend can use this Clickhouse instance to run cohort/filter queries in Study View.

Connection with cBioPortal MySQL database

Clickhouse performs well for analytical queries (filtering on column values) but is less suitable for retrieving every column value of an entity (typically SELECT * FROM ...). In the current implementation, the samples table in Clickhouse contains a column holding the internal sample identifiers used in the cBioPortal MySQL database. Once Clickhouse has determined which sample identifiers belong to the cohort, the matching sample objects can be retrieved efficiently from the MySQL database (with SELECT * FROM sample ...).
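The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the backend's actual code: the Clickhouse table name `samples`, the column names `sample_internal_id` and `internal_id`, and the filter column in the usage note are hypothetical placeholders (the real schema is in clickhouse_provisioning/), and the `%(name)s` / `%s` parameter styles are assumed to match whichever database drivers are in use.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical names for illustration only; check clickhouse_provisioning/
# for the real Clickhouse schema.
CLICKHOUSE_TABLE = "samples"
INTERNAL_ID_COLUMN = "sample_internal_id"


def build_cohort_query(filters: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Step 1 (Clickhouse): select only the internal id of matching samples."""
    conditions = []
    params = {}
    for i, (column, value) in enumerate(sorted(filters.items())):
        # Column names cannot be bound as parameters, so reject anything
        # that is not a plain identifier.
        if not column.isidentifier():
            raise ValueError(f"unsafe column name: {column!r}")
        key = f"p{i}"
        conditions.append(f"{column} = %({key})s")
        params[key] = value
    where = " AND ".join(conditions) if conditions else "1"
    sql = f"SELECT {INTERNAL_ID_COLUMN} FROM {CLICKHOUSE_TABLE} WHERE {where}"
    return sql, params


def build_sample_query(internal_ids: Sequence[int]) -> Tuple[str, List[int]]:
    """Step 2 (MySQL): fetch full sample rows for the ids found in step 1."""
    if not internal_ids:
        # An empty IN () list is invalid SQL, so return a query with no rows.
        return "SELECT * FROM sample WHERE 1 = 0", []
    placeholders = ", ".join(["%s"] * len(internal_ids))
    sql = f"SELECT * FROM sample WHERE internal_id IN ({placeholders})"
    return sql, list(internal_ids)
```

For example, `build_cohort_query({"cancer_type": "Breast Cancer"})` yields a narrow, single-column query for Clickhouse, and the identifiers it returns are then passed to `build_sample_query` to fetch the full rows from MySQL.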

The Clickhouse schema is defined in the clickhouse_provisioning/ directory.

Installation

  1. Edit the study_configs section in the create_clickhouse_db_table_files.py file so that it points to your local copies of the msk_met_2021, msk_ch_2020 and msk_impact_2017 datahub studies:
study_configs = [
    {
        "study_dir": "/home/pnp300/git/datahub/public/msk_met_2021",
        "name": "msk_met_2021"
    },
    {
        "study_dir": "/home/pnp300/git/datahub/public/msk_ch_2020",
        "name": "msk_ch_2020"
    },
    {
        "study_dir": "/home/pnp300/git/datahub/public/msk_impact_2017",
        "name": "msk_impact_2017"
    }
]
  1. Create Clickhouse staging files in the clickhouse_provisioning directory (in this repo) by running the create_clickhouse_db_table_files.py script:
python3 create_clickhouse_db_table_files.py
  1. Provision and run Clickhouse by starting the services defined in the docker-compose.yml file:
docker-compose up

or for detached mode:

docker-compose up -d

This starts a Clickhouse instance with port 8123 (the Clickhouse HTTP interface) exposed on the host system.
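Once the container is up, the exposed port can be checked from the host with a small script. The sketch below uses only the Python standard library and sends a query over the Clickhouse HTTP interface; it assumes the instance is reachable at localhost:8123 and accepts unauthenticated requests from the default user, which depends on how docker-compose.yml configures the server. The helper names are illustrative.

```python
import urllib.parse
import urllib.request


def clickhouse_url(query: str, host: str = "localhost", port: int = 8123) -> str:
    """Build a URL that passes the query in the 'query' URL parameter."""
    encoded = urllib.parse.urlencode({"query": query}, quote_via=urllib.parse.quote)
    return f"http://{host}:{port}/?{encoded}"


def run_query(query: str, host: str = "localhost", port: int = 8123,
              timeout: float = 5.0) -> str:
    """Send a read-only query over HTTP and return the raw text response."""
    url = clickhouse_url(query, host, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `run_query("SELECT 1")` against a running instance should return the result as plain text; a connection error here usually means the container is not running or the port is not mapped.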