Merge pull request #361 from broadinstitute/development
Release 1.34.0
bistline authored Aug 22, 2024
2 parents 1f0dcf5 + b820bbb commit a0742f6
Showing 14 changed files with 3,220 additions and 65 deletions.
17 changes: 9 additions & 8 deletions README.md
@@ -3,7 +3,7 @@
File Ingest Pipeline for Single Cell Portal

[![Build status](https://img.shields.io/circleci/build/github/broadinstitute/scp-ingest-pipeline.svg)](https://circleci.com/gh/broadinstitute/scp-ingest-pipeline)
[![Code coverage](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline/branch/master/graph/badge.svg)](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline)
[![Code coverage](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline/branch/main/graph/badge.svg)](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline)

The SCP Ingest Pipeline is an ETL pipeline for single-cell RNA-seq data.

@@ -27,21 +27,21 @@ cd scp-ingest-pipeline
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt
scripts/setup-mongo-dev.sh <PATH_TO_YOUR_VAULT_TOKEN> # E.g. ~/.github-token
source scripts/setup-mongo-dev.sh
```

### Docker

With Docker running and Vault active on your local machine, run:
With Docker running and `gcloud` authenticated on your local machine, run:

```
scripts/docker-compose-setup.sh -t <PATH_TO_YOUR_VAULT_TOKEN> # E.g. ~/.github-token
scripts/docker-compose-setup.sh
```

If on Apple silicon Mac (e.g. M1), and performance seems poor, consider generating a docker image using the arm64 base. Example test image: gcr.io/broad-singlecellportal-staging/single-cell-portal:development-2.2.0-arm64, usage:

```
scripts/docker-compose-setup.sh -i development-2.2.0-arm64 -t <PATH_TO_YOUR_VAULT_TOKEN>
scripts/docker-compose-setup.sh -i development-2.2.0-arm64
```

To update dependencies when in Docker, you can pip install from within the Docker Bash shell after adjusting your requirements.txt.
@@ -132,10 +132,10 @@ Pro-Tip: For local builds, you can try adding docker build options `--progress=p

### 2. Set up environment variables

Run the following to pull database-specific secrets out of vault (passing in the path to your vault token):
Run the following to pull database-specific secrets out of Google Secrets Manager (GSM):

```
source scripts/setup-mongo-dev.sh ~/.your-vault-token
source scripts/setup-mongo-dev.sh
```

Now run `env` to make sure you've set the following values:
@@ -152,7 +152,8 @@ DATABASE_HOST=<ip address>
Run the following to export out your default service account JSON keyfile:

```
vault read -format=json secret/kdux/scp/development/$(whoami)/scp_service_account.json | jq .data > /tmp/keyfile.json
GOOGLE_PROJECT=$(gcloud info --format="value(config.project)")
gcloud secrets versions access latest --project=$GOOGLE_PROJECT --secret=default-sa-keyfile | jq > /tmp/keyfile.json
```

### 4. Start the Docker container
42 changes: 36 additions & 6 deletions ingest/anndata_.py
@@ -15,19 +15,19 @@ def _field_template(self, field, precision):


try:
from ingest_files import IngestFiles
from ingest_files import DataArray, IngestFiles
from expression_files.expression_files import GeneExpression
from monitor import log_exception
from monitor import log_exception, bypass_mongo_writes
from validation.validate_metadata import list_duplicates
except ImportError:
# Used when importing as external package, e.g. imports in single_cell_portal code
from .ingest_files import IngestFiles
from .ingest_files import DataArray, IngestFiles
from .expression_files.expression_files import GeneExpression
from .monitor import log_exception
from .monitor import log_exception, bypass_mongo_writes
from .validation.validate_metadata import list_duplicates


class AnnDataIngestor(GeneExpression, IngestFiles):
class AnnDataIngestor(GeneExpression, IngestFiles, DataArray):
ALLOWED_FILE_TYPES = ['application/x-hdf5']

def __init__(self, file_path, study_file_id, study_id, **kwargs):
@@ -57,6 +57,36 @@ def basic_validation(self):
except ValueError:
return False

def create_cell_data_arrays(self):
"""Extract cell name DataArray documents for raw data"""
adata = self.obtain_adata()
cells = list(adata.obs_names)
# use filename denoting a raw 'fragment' to allow successful ingest and downstream queries
raw_filename = "h5ad_frag.matrix.raw.mtx.gz"
data_arrays = []
for data_array in GeneExpression.create_data_arrays(
name=f"{raw_filename} Cells",
array_type="cells",
values=cells,
linear_data_type="Study",
linear_data_id=self.study_file_id,
cluster_name=raw_filename,
study_file_id=self.study_file_id,
study_id=self.study_id
):
data_arrays.append(data_array)

return data_arrays

def ingest_raw_cells(self):
"""Insert raw count cells into MongoDB"""
arrays = self.create_cell_data_arrays()
if not bypass_mongo_writes():
self.load(arrays, DataArray.COLLECTION_NAME)
else:
dev_msg = f"Extracted {len(arrays)} DataArray for {self.study_file_id}:{arrays[0]['name']}"
IngestFiles.dev_logger.info(dev_msg)

@staticmethod
def generate_cluster_header(adata, clustering_name):
"""
@@ -117,7 +147,7 @@ def generate_metadata_file(adata, output_name):
headers = adata.obs.columns.tolist()
types = []
for header in headers:
if pd.api.types.is_number(adata.obs[header]):
if pd.api.types.is_numeric_dtype(adata.obs[header]):
types.append("NUMERIC")
else:
types.append("GROUP")
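
The `ingest_raw_cells` method added in this file writes to MongoDB only when `bypass_mongo_writes()` returns false, and otherwise just logs a summary. A minimal sketch of that gating pattern follows. It assumes the helper reads the `BYPASS_MONGO_WRITES` environment variable that `scripts/ingest-local-setup.sh` sets; the helper body, the `FakeLoader` class, and the `data_arrays` collection name are hypothetical stand-ins, not the pipeline's actual code.

```python
import os


def bypass_mongo_writes_sketch() -> bool:
    """Hypothetical stand-in: treat BYPASS_MONGO_WRITES=yes as 'skip DB writes'."""
    return os.environ.get("BYPASS_MONGO_WRITES", "").lower() == "yes"


class FakeLoader:
    """Hypothetical loader that records documents instead of writing to MongoDB."""

    def __init__(self):
        self.inserted = []
        self.messages = []

    def load(self, documents, collection_name):
        # Record what would have been inserted, keyed by collection.
        self.inserted.extend((collection_name, doc) for doc in documents)

    def ingest_raw_cells(self, arrays, study_file_id):
        # Mirror the diff: write unless bypassing, otherwise only log a summary.
        if not bypass_mongo_writes_sketch():
            self.load(arrays, "data_arrays")  # collection name is an assumption
        else:
            self.messages.append(
                f"Extracted {len(arrays)} DataArray for {study_file_id}:{arrays[0]['name']}"
            )
```

As in the diff, the logging branch indexes `arrays[0]`, so it presumes at least one array was produced.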
10 changes: 10 additions & 0 deletions ingest/expression_files/expression_files.py
@@ -101,6 +101,16 @@ def is_raw_count_file(study_id, study_file_id, client):
QUERY = {"_id": study_file_id, "study_id": study_id}

study_file_doc = list(client[COLLECTION_NAME].find(QUERY)).pop()
# special handling of non-reference AnnData files to always return false
# this will allow normal extraction of expression data as raw count cells are already ingested during
# the "raw_counts" extract phase
if (
study_file_doc.get("file_type") == "AnnData" and
"ann_data_file_info" in study_file_doc.keys() and
not study_file_doc["ann_data_file_info"].get("reference_file")
):
return False

# Name of embedded document that holds 'is_raw_count_files is named expression_file_info.
# If study files does not have document expression_file_info
# field, "is_raw_count_files", will not exist.:
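
The new guard in `is_raw_count_file` can be read as a standalone predicate on the study file document. The sketch below restates that condition on plain dictionaries; the function name `is_non_reference_anndata` and the sample documents are illustrative assumptions, and real study file documents carry many more fields.

```python
def is_non_reference_anndata(study_file_doc: dict) -> bool:
    """Return True when the document describes a non-reference AnnData file."""
    return bool(
        study_file_doc.get("file_type") == "AnnData"
        and "ann_data_file_info" in study_file_doc
        and not study_file_doc["ann_data_file_info"].get("reference_file")
    )


# Illustrative documents only, shaped to exercise each branch of the condition.
uploaded = {"file_type": "AnnData", "ann_data_file_info": {"reference_file": False}}
reference = {"file_type": "AnnData", "ann_data_file_info": {"reference_file": True}}
matrix = {"file_type": "Expression Matrix"}

print(is_non_reference_anndata(uploaded))   # True
print(is_non_reference_anndata(reference))  # False
print(is_non_reference_anndata(matrix))     # False
```

Because the check uses `.get("reference_file")`, a document whose `ann_data_file_info` lacks that key is also treated as non-reference.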
6 changes: 6 additions & 0 deletions ingest/ingest_pipeline.py
@@ -54,6 +54,9 @@
# Ingest AnnData - happy path processed expression data only extraction
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['processed_expression']"
# Ingest AnnData - happy path raw count cell name only extraction
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['raw_counts']"
# Ingest AnnData - happy path cluster and metadata extraction
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['cluster', 'metadata']" --obsm-keys "['X_umap','X_tsne']"
@@ -537,6 +540,9 @@ def extract_from_anndata(self):
"extract"
):
self.anndata.generate_processed_matrix(self.anndata.adata)

if self.kwargs.get('extract') and "raw_counts" in self.kwargs.get('extract'):
self.anndata.ingest_raw_cells()
self.report_validation("success")
return 0
# scanpy unable to open AnnData file
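
The example commands pass `--extract` as a Python-style list literal such as `"['raw_counts']"`, and `extract_from_anndata` then tests membership in that list. The sketch below shows one way such a value could be parsed and checked. Using `ast.literal_eval` here is an assumption about how the argument might be handled, and `parse_extract` and `phases_to_run` are hypothetical helpers, not functions from the pipeline.

```python
import ast


def parse_extract(raw: str) -> list:
    """Parse a Python-style list literal such as "['raw_counts']" into a list."""
    value = ast.literal_eval(raw)
    if not isinstance(value, list):
        raise ValueError(f"--extract must be a list literal, got {raw!r}")
    return value


def phases_to_run(kwargs: dict) -> list:
    """Return the requested phases, in a fixed order, that this sketch knows about."""
    extract = kwargs.get("extract") or []
    # Phase names taken from the example commands in ingest_pipeline.py.
    known = ["cluster", "metadata", "processed_expression", "raw_counts"]
    return [phase for phase in known if phase in extract]


print(phases_to_run({"extract": parse_extract("['raw_counts']")}))
print(phases_to_run({"extract": parse_extract("['cluster', 'metadata']")}))
print(phases_to_run({}))
```

Guarding with `kwargs.get("extract") or []` matches the diff's pattern of checking that `extract` is set before testing for `"raw_counts"`.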
4 changes: 2 additions & 2 deletions ingest/monitor.py
@@ -129,13 +129,13 @@ def integrate_sentry():
See also: links to Sentry resources atop this module
"""

# Ultimately stored in Vault, passed in as environmen variable to PAPI
# Ultimately stored in GSM, passed in as environment variable to PAPI
sentry_DSN = os.environ.get("SENTRY_DSN")

if sentry_DSN is None:
# Don't log to Sentry unless its DSN is set.
# This disables Sentry logging in development and test (i.e.,
# environments without a SENTRY_DSN in their scp_config vault secret).
# environments without a SENTRY_DSN in their scp-config-json GSM secret).
return

sentry_logging = LoggingIntegration(
14 changes: 2 additions & 12 deletions scripts/docker-compose-setup.sh
@@ -10,23 +10,18 @@ usage=$(
cat <<EOF
$0 [OPTION]
-i Set URL for GCR image; helpful if not using latest development
-t Set GitHub Vault token (e.g. ~/.github-token)
-h print this text
EOF
)

GCR_IMAGE=""
VAULT_TOKEN_PATH=""
while getopts "i:t:h" OPTION; do
while getopts "i:h" OPTION; do
case $OPTION in
i)
echo "### SETTING GCR IMAGE ###"
export GCR_IMAGE="$OPTARG"
;;
t)
echo "### SETTING VAULT TOKEN ###"
VAULT_TOKEN_PATH="$OPTARG"
;;
h)
echo "$usage"
exit 0
@@ -45,13 +40,8 @@ if [[ $GCR_IMAGE = "" ]]; then
export GCR_IMAGE="${IMAGE_NAME}:${LATEST_TAG}"
fi

if [[ $VAULT_TOKEN_PATH = "" ]]; then
echo "Did not provide VAULT_TOKEN_PATH"
exit 1
fi

echo "### SETTING UP ENVIRONMENT ###"
./scripts/ingest-local-setup.sh $VAULT_TOKEN_PATH
./scripts/ingest-local-setup.sh

docker pull $GCR_IMAGE

17 changes: 3 additions & 14 deletions scripts/ingest-local-setup.sh
@@ -4,25 +4,14 @@
#
# Keep "Dev env vars" synced with `setup-mongo-dev.sh`

VAULT_TOKEN_PATH="$1"
if [[ -z "$VAULT_TOKEN_PATH" ]]
then
echo "You must provide a path to a GitHub token to proceed, e.g. ~/.github-token"
exit 1
fi
vault login -method=github token=$(cat $VAULT_TOKEN_PATH)
if [[ $? -ne 0 ]]
then
echo "Unable to authenticate into Vault"
exit 1
fi
GOOGLE_PROJECT=$(gcloud info --format="value(config.project)")

# Dev env vars
BROAD_USER=`whoami`
MONGODB_USERNAME='single_cell'
DATABASE_NAME='single_cell_portal_development'
MONGODB_PASSWORD=`vault read secret/kdux/scp/development/$BROAD_USER/mongo/user | grep password | awk '{ print $2 }' `
DATABASE_HOST=`vault read secret/kdux/scp/development/$BROAD_USER/mongo/hostname | grep ip | awk '{ $2=substr($2,2,length($2)-2); print $2 }' `
MONGODB_PASSWORD=$(gcloud secrets versions access latest --project=$GOOGLE_PROJECT --secret=mongo-user | jq .password)
DATABASE_HOST=$(gcloud secrets versions access latest --project=$GOOGLE_PROJECT --secret=mongo-hostname| jq -r '.ip[0]')
BYPASS_MONGO_WRITES='yes'
BARD_HOST_URL="https://terra-bard-dev.appspot.com"
