diff --git a/docs/source/_static/custom.css b/docs/source/_static/custom.css
index 3319b7de..4cc6b8b8 100644
--- a/docs/source/_static/custom.css
+++ b/docs/source/_static/custom.css
@@ -2,3 +2,19 @@ div.body h5 {
   font-size: 110%;
   font-weight: bold;
 }
+
+html body table td,
+html body table th {
+  border: 1px solid #d6d6d6;
+  padding: 6px 13px;
+}
+
+
+html body table table th {
+  background: #eee;
+}
+
+table {
+  border-spacing: 0;
+  border-collapse: collapse;
+}
diff --git a/docs/source/overview.md b/docs/source/overview.md
index f5c1c0bc..f7ca3862 100644
--- a/docs/source/overview.md
+++ b/docs/source/overview.md
@@ -117,3 +117,163 @@ Specify the converted data destination using the :code:`convert(..., dest_path=
 ```{eval-rst}
 Parquet data destination type may be specified by using :code:`convert(..., dest_datatype="parquet", ...)` (:mod:`convert() `).
 ```
+
+## Data Transformations
+
+CytoTable performs various types of data transformations.
+This section defines the terminology for these transformations and sets expectations for how it is used.
+CytoTable might use one or all of these depending on user configuration.
+
+### Data Chunking
+
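+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+
+"Data source"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+<td>
+
+"Chunk 1"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+</table>
+
+"Chunk 2"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+</tr>
+</table>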
+
+_Example of data chunking performed on a simple table of data._
+
+```{eval-rst}
+Data chunking within CytoTable involves slicing a data source into "chunks" of rows, each of which contains the same columns as the original data source but fewer rows.
+CytoTable uses data chunking through the ``chunk_size`` argument value (:code:`convert(..., chunk_size=1000, ...)` (:mod:`convert() `)) to reduce the memory footprint of operations on subsets of data.
+CytoTable may be used to create chunked data output by disabling concatenation and joins, e.g. :code:`convert(..., concat=False, join=False, ...)` (:mod:`convert() `).
+Parquet "datasets" are an abstraction which may be used to read CytoTable output data chunks which are not concatenated or joined (for example, see `PyArrow documentation `_ or `Pandas documentation `_ on using source paths which are directories).
+```
+
+### Data Concatenations
+
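+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+
+"Chunk 1"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+</table>
+
+"Chunk 2"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+<td>
+
+"Concatenated data"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+</tr>
+</table>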
+
+_Example of data concatenation performed on simple tables of similar data "chunks"._
+
+Data concatenation within CytoTable involves bringing two or more data "chunks" with the same columns together as a unified dataset.
+Just as chunking slices data apart, concatenation brings the chunks back together.
+Data concatenation within CytoTable typically occurs using a [ParquetWriter](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to assist with composing a single file from many individual files.
+
+### Data Joins
+
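+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+
+"Table 1" (notice __Col_C__)
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+</table>
+
+"Table 2" (notice __Col_Z__)
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_Z</th></tr>
+<tr><td>1</td><td>a</td><td>2024-01-01</td></tr>
+</table>
+
+</td>
+<td>
+
+"Joined data" (as Table 1 left-joined with Table 2)
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th><th>Col_Z</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td><td>2024-01-01</td></tr>
+</table>
+
+</td>
+</tr>
+<tr>
+<th colspan="2">Join Specification in SQL</th>
+</tr>
+<tr>
+<td colspan="2">
+
+```sql
+SELECT *
+FROM Table_1
+LEFT JOIN Table_2 ON
+Table_1.Col_A = Table_2.Col_A;
+```
+
+</td>
+</tr>
+</table>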
+
+_Example of a data join performed on simple example tables._
+
+```{eval-rst}
+Data joins within CytoTable involve bringing two or more data sources with differing columns together as a new dataset.
+The word "join" here is interpreted through `SQL-based terminology on joins `_.
+Joins may be specified in CytoTable using `DuckDB-style SQL `_ through :code:`convert(..., joins="SELECT * FROM ... JOIN ...", ...)` (:mod:`convert() `).
+Also see CytoTable's presets found here: :data:`presets.config ` or via `GitHub source code for presets.config `_.
+```
+
+Note: data software outside of CytoTable sometimes uses the term "merge" to describe capabilities which are similar to join (for example, [`pandas.DataFrame.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+Within CytoTable, we opt to describe these operations with "join" to stay consistent with the terminology of the technologies used (for example, [DuckDB SQL](https://duckdb.org/docs/archive/0.9.2/sql/introduction) includes no `MERGE` keyword).
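
The three transformations described in this change can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not CytoTable's implementation: the helper names `chunk_rows`, `concat_chunks`, and `left_join` are hypothetical, plain tuples stand in for Parquet data, and `sqlite3` stands in for DuckDB. The join lists its output columns explicitly so that the result has the four columns shown in the "Joined data" example.

```python
import sqlite3
from itertools import chain

# Rows mirror the example tables: (Col_A, Col_B, Col_C).
data_source = [(1, "a", 0.01), (2, "b", 0.02)]


def chunk_rows(rows, chunk_size):
    """Slice rows into consecutive chunks of at most chunk_size rows."""
    return [rows[i : i + chunk_size] for i in range(0, len(rows), chunk_size)]


def concat_chunks(chunks):
    """Bring chunks that share the same columns back together, in order."""
    return list(chain.from_iterable(chunks))


def left_join(table_1, table_2):
    """Left-join Table_1 with Table_2 on Col_A (sqlite3 is a stand-in for DuckDB)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Table_1 (Col_A INTEGER, Col_B TEXT, Col_C REAL)")
    con.execute("CREATE TABLE Table_2 (Col_A INTEGER, Col_B TEXT, Col_Z TEXT)")
    con.executemany("INSERT INTO Table_1 VALUES (?, ?, ?)", table_1)
    con.executemany("INSERT INTO Table_2 VALUES (?, ?, ?)", table_2)
    rows = con.execute(
        "SELECT Table_1.Col_A, Table_1.Col_B, Table_1.Col_C, Table_2.Col_Z "
        "FROM Table_1 LEFT JOIN Table_2 ON Table_1.Col_A = Table_2.Col_A "
        "ORDER BY Table_1.Col_A"
    ).fetchall()
    con.close()
    return rows


# Chunking: one row per chunk, as in the "Chunk 1" / "Chunk 2" example.
chunks = chunk_rows(data_source, chunk_size=1)

# Concatenation: the chunks are combined back into a single dataset.
restored = concat_chunks(chunks)

# Join: Table 1 (with Col_C) left-joined with Table 2 (with Col_Z).
joined = left_join([(1, "a", 0.01)], [(1, "a", "2024-01-01")])
```

A left join keeps every row of `Table_1`; a row with no match in `Table_2` would carry `None` in `Col_Z`.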