Commit

Increase documentation on cloud-based data sources (#138)
* update pycytominer-transform mentions

* update docs on cloud sources

* add custom css for h5 header display

* add issue link

* enhance cloud auth mentions

* Apply suggestions from code review

Co-authored-by: Gregory Way <[email protected]>

* simplify subheader for cloud data sources

---------

Co-authored-by: Gregory Way <[email protected]>
d33bs and gwaybio authored Jan 9, 2024
1 parent 3ae1606 commit 85f447b
Showing 6 changed files with 57 additions and 9 deletions.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -59,7 +59,7 @@ references:
scope: "ExampleHuman"
notes: >-
ExampleHuman CellProfiler data is used to help validate expected results
for pycytominer-transform.
for CytoTable.
identifiers:
- description: "README.md with Citation Information"
type: url
@@ -74,7 +74,7 @@ references:
scope: "all_cellprofiler.sqlite"
notes: >-
CellProfiler generated data from NF1_SchwannCell_data project is used to help validate
expected results for pycytominer-transform.
expected results for CytoTable.
identifiers:
- description: "Github Link with Contributors"
type: url
4 changes: 2 additions & 2 deletions cytotable/presets.py
@@ -1,5 +1,5 @@
"""
Presets for common pycytominer-transform configurations.
Presets for common CytoTable configurations.
"""

config = {
@@ -206,5 +206,5 @@
},
}
"""
Configuration presets for pycytominer-transform
Configuration presets for CytoTable
"""
4 changes: 4 additions & 0 deletions docs/source/_static/custom.css
@@ -0,0 +1,4 @@
div.body h5 {
font-size: 110%;
font-weight: bold;
}
8 changes: 4 additions & 4 deletions docs/source/architecture.data.md
@@ -1,10 +1,10 @@
# Data Architecture

Documentation covering data architecture for pyctyominer-transform.
Documentation covering data architecture for CytoTable.

## Sources

Data sources for pyctyominer-transform are measurement data created from other cell biology image analysis tools.
Data sources for CytoTable are measurement data created from other cell biology image analysis tools.

See below for a brief overview of these sources and data types.

@@ -42,7 +42,7 @@ erDiagram
Image ||--o{ Others : includes
```

The above diagram shows an example of image data relationships found within the data that is used by pycytominer-transform.
The above diagram shows an example of image data relationships found within the data that is used by CytoTable.
Namely, each image may include zero or many compartment objects (Cytoplasm, Cells, Nuclei, etc.).

### Cytoplasm Compartment Data Relationships
@@ -64,6 +64,6 @@ erDiagram
Nuclei ||--|| Cytoplasm : related-to
```

The above diagram shows canonical relationships of the Cytoplasm compartment data to other compartments found within the data that is used by pycytominer-transform.
The above diagram shows canonical relationships of the Cytoplasm compartment data to other compartments found within the data that is used by CytoTable.
Each Cytoplasm object is related to Cells via the Parent_Cells field and Nuclei via the Parent_Nuclei field.
These Parent\_\* fields are ObjectNumbers in their respective compartments.
4 changes: 4 additions & 0 deletions docs/source/conf.py
@@ -78,3 +78,7 @@

# enable anchor creation
myst_heading_anchors = 3

# add custom css
html_static_path = ["_static"]
html_css_files = ["custom.css"]
42 changes: 41 additions & 1 deletion docs/source/overview.md
@@ -25,19 +25,59 @@ flowchart LR
classDef green fill:#97F0B4,stroke:#333
```

Data sources for pyctyominer-transform are measurement data created from other cell biology image analysis tools.
Data sources for CytoTable are measurement data created from other cell biology image analysis tools.
These measurement data are the focus of the data source content which follows.

### Data Source Locations

```{eval-rst}
Data sources may be provided to CytoTable using local filepaths or remote object-storage filepaths (for example, AWS S3, GCP Cloud Storage, Azure Storage).
We use `cloudpathlib <https://cloudpathlib.drivendata.org/~latest/>`_ under the hood to reference files in a unified way, whether they're local or remote.
```

#### Cloud Data Sources

CytoTable uses [cloudpathlib](https://cloudpathlib.drivendata.org/~latest/) to access cloud-based data sources.
CytoTable supports:

- [Amazon S3](https://en.wikipedia.org/wiki/Amazon_S3): `s3://bucket_name/object_name`
- [Google Cloud Storage](https://en.wikipedia.org/wiki/Google_Cloud_Storage): `gs://bucket_name/object_name`
- [Azure Blob Storage](https://en.wikipedia.org/wiki/Microsoft_Azure#Storage_services): `az://container_name/blob_name`
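
As a rough illustration of how these path prefixes distinguish providers, the sketch below classifies a source path by its URI scheme using only the Python standard library. The `describe_source` helper and the `CLOUD_SCHEMES` mapping are hypothetical names for this example and are not part of CytoTable or cloudpathlib. The `gs` entry reflects the `gs://` prefix that cloudpathlib uses for Google Cloud Storage.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical mapping of URI schemes to provider names (illustration only).
# "gs" is the prefix cloudpathlib uses for Google Cloud Storage paths.
CLOUD_SCHEMES = {
    "s3": "Amazon S3",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
}


def describe_source(path: str) -> Tuple[str, str, str]:
    """Return (provider, bucket_or_container, object_name) for a source path.

    Paths without a recognized cloud scheme are reported as local.
    """
    parsed = urlparse(path)
    provider = CLOUD_SCHEMES.get(parsed.scheme)
    if provider is None:
        # No recognized cloud scheme: treat the whole string as a local path.
        return ("local", "", path)
    # For cloud URIs the network location holds the bucket or container name
    # and the remaining path holds the object or blob name.
    return (provider, parsed.netloc, parsed.path.lstrip("/"))
```

For example, `describe_source("s3://bucket_name/object_name")` yields the provider name, `bucket_name`, and `object_name` as separate values.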

##### Cloud Service Configuration and Authentication

```{eval-rst}
Remote object storage paths which require authentication or other specialized configuration may use cloudpathlib client arguments (`S3Client <https://cloudpathlib.drivendata.org/~latest/api-reference/s3client/>`_, `AzureBlobClient <https://cloudpathlib.drivendata.org/~latest/api-reference/azblobclient/>`_, `GSClient <https://cloudpathlib.drivendata.org/~latest/api-reference/gsclient/>`_) and :code:`convert(..., **kwargs)` (:mod:`convert() <cytotable.convert.convert>`).
For example, remote AWS S3 paths which are public-facing and do not require authentication (like, or similar to, :code:`aws s3 ... --no-sign-request`) may be used via :code:`convert(..., no_sign_request=True)` (:mod:`convert() <cytotable.convert.convert>`).
```

Each cloud service provider may have different requirements for authentication (there is no fully unified API for these).
Please see the [cloudpathlib](https://cloudpathlib.drivendata.org/~latest/) client documentation for more information on which arguments may be used for configuration with specific cloud providers (for example, [`S3Client`](https://cloudpathlib.drivendata.org/stable/api-reference/s3client/), [`GSClient`](https://cloudpathlib.drivendata.org/stable/api-reference/gsclient/), or [`AzureBlobClient`](https://cloudpathlib.drivendata.org/stable/api-reference/azblobclient/)).
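
A minimal sketch of the public-bucket case described above follows. It separates collecting the `convert()` arguments from running the conversion. The helper names `build_convert_arguments` and `run_conversion` are hypothetical, the bucket path and destination are placeholders, and the `source_path`, `dest_path`, and `dest_datatype` argument names are assumptions that should be checked against the CytoTable `convert()` API reference.

```python
from typing import Any, Dict


def build_convert_arguments(
    source_path: str, dest_path: str, **client_kwargs: Any
) -> Dict[str, Any]:
    """Collect arguments for cytotable.convert (hypothetical helper)."""
    arguments: Dict[str, Any] = {
        # Argument names below are assumed from CytoTable's convert() API.
        "source_path": source_path,
        "dest_path": dest_path,
        "dest_datatype": "parquet",
    }
    # Extra keyword arguments (for example, no_sign_request) are the ones
    # passed through convert(..., **kwargs) for cloudpathlib client setup.
    arguments.update(client_kwargs)
    return arguments


def run_conversion(arguments: Dict[str, Any]) -> Any:
    """Run the conversion; requires CytoTable and access to the source."""
    # Imported lazily so the sketch can be loaded without CytoTable installed.
    from cytotable import convert

    return convert(**arguments)


# Placeholder public S3 path that does not require request signing.
arguments = build_convert_arguments(
    "s3://bucket_name/object_name",
    "converted.parquet",
    no_sign_request=True,
)
```

Calling `run_conversion(arguments)` would then perform the conversion. Providers other than S3 would need the client arguments described in their respective cloudpathlib client documentation.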

##### Cloud Service File Type Parsing Differences

Data sources retrieved from cloud services are not all treated the same due to technical constraints.
See below for a description of how each file type is treated for a better understanding of expectations.

__Comma-separated values (.csv)__:

CytoTable reads cloud-based CSV files directly.

__SQLite Databases (.sqlite)__:

CytoTable downloads cloud-based SQLite databases locally before other CytoTable processing.
This is necessary to account for differences in how [SQLite's virtual file system (VFS)](https://www.sqlite.org/vfs.html) operates in context with cloud service object storage.

Note: Large SQLite files stored in the cloud may benefit from explicit local cache specification through a special keyword argument (`**kwarg`) passed through CytoTable to `cloudpathlib` called `local_cache_dir`. See [the cloudpathlib documentation on caching](https://cloudpathlib.drivendata.org/~latest/caching/#keeping-the-cache-around).
This argument helps ensure constraints surrounding temporary local file storage locations do not impede the ability to download or work with the data (for example, file size limitations and periodic deletions outside of CytoTable might be encountered within default OS temporary file storage locations).

```{eval-rst}
A quick example of how this argument is used: :code:`convert(..., local_cache_dir="non_temporary_directory", ...)` (:mod:`convert() <cytotable.convert.convert>`).
```
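
The following sketch expands on the `local_cache_dir` usage above by first creating a persistent cache directory and then passing it to `convert()`. The helper names `prepare_cache_dir` and `convert_cloud_sqlite` and the `cytotable_cache` directory name are hypothetical, and the `source_path`, `dest_path`, and `dest_datatype` argument names are assumptions to verify against the CytoTable API reference.

```python
from pathlib import Path
from typing import Any


def prepare_cache_dir(base_dir: str, name: str = "cytotable_cache") -> str:
    """Create (if needed) and return a cache directory outside OS temp storage."""
    cache_dir = Path(base_dir).expanduser() / name
    # exist_ok=True makes repeated calls safe when the directory already exists.
    cache_dir.mkdir(parents=True, exist_ok=True)
    return str(cache_dir)


def convert_cloud_sqlite(source_path: str, dest_path: str, cache_dir: str) -> Any:
    """Convert a cloud-hosted SQLite file using an explicit local cache."""
    # Imported lazily so the sketch can be loaded without CytoTable installed.
    from cytotable import convert

    return convert(
        source_path=source_path,
        dest_path=dest_path,
        dest_datatype="parquet",
        # Passed through to cloudpathlib so the downloaded database is kept
        # in a location that is not subject to temporary-storage cleanup.
        local_cache_dir=cache_dir,
    )
```

Choosing a base directory on a volume with enough free space for the full SQLite file avoids the size limits that default temporary locations may impose.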

Future work to enable direct SQLite data access from cloud locations for CytoTable will be documented within GitHub issue [CytoTable/#70](https://github.com/cytomining/CytoTable/issues/70).

### Data Source Types

Data source compatibility for CytoTable is focused on (but not explicitly limited to) the following.