Add version metadata to CytoTable Parquet output #134

Merged Dec 20, 2023 (25 commits)
Changes from 18 commits

Commits
165ed3f add version detection utility (d33bs, Dec 18, 2023)
4edfbb9 manage semver with poetry-dynamic-versioning (d33bs, Dec 19, 2023)
71735b2 comments to further describe what's happening (d33bs, Dec 19, 2023)
3dd6394 update github actions workflows and simplify (d33bs, Dec 19, 2023)
af31b07 remove version util and lint (d33bs, Dec 19, 2023)
7008d8a update pre-commit check versions (d33bs, Dec 19, 2023)
325f4fa add docs on semver for release publishing process (d33bs, Dec 19, 2023)
933ff19 move setup-poetry appropriately (d33bs, Dec 19, 2023)
a522559 correct action location (d33bs, Dec 19, 2023)
2feef8d readd version getter util and test (d33bs, Dec 19, 2023)
1a26cdf add metadata writer (d33bs, Dec 19, 2023)
419bf82 simplify metadata parquet write util (d33bs, Dec 19, 2023)
70f8ddb add a test for _write_parquet_table_with_metadata (d33bs, Dec 19, 2023)
c503214 move to constants module for reuse capabilities (d33bs, Dec 19, 2023)
f936390 update convert with constants and new writer fxn (d33bs, Dec 19, 2023)
a41d697 add tool.setuptools_scm to avoid warnings (d33bs, Dec 19, 2023)
616d9fc linting update (d33bs, Dec 20, 2023)
6420cd2 Merge remote-tracking branch 'upstream/main' into data-versioned-output (d33bs, Dec 20, 2023)
540904f move dunamai to dev deps and update try block (d33bs, Dec 20, 2023)
8d136d9 Apply suggestions from code review (d33bs, Dec 20, 2023)
04a136b add additional notes about release drafts (d33bs, Dec 20, 2023)
8fe4a41 linting (d33bs, Dec 20, 2023)
33d09b5 Merge remote-tracking branch 'upstream/main' into data-versioned-output (d33bs, Dec 20, 2023)
09f4a09 expand docs on kwargs (d33bs, Dec 20, 2023)
8552190 add colons to docstring (d33bs, Dec 20, 2023)
11 changes: 11 additions & 0 deletions .github/actions/setup-poetry/action.yml
@@ -0,0 +1,11 @@
name: Setup Environment and Cache
description: |
Setup poetry for use with GitHub Actions workflows.
Note: presumes pre-installed Python.
runs:
using: "composite"
steps:
- name: Setup poetry and poetry-dynamic-versioning
shell: bash
run: |
python -m pip install poetry poetry-dynamic-versioning
6 changes: 2 additions & 4 deletions .github/workflows/publish-docs.yml
@@ -19,10 +19,8 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: install poetry
uses: abatilo/actions-poetry@v2
with:
poetry-version: "1.6.1"
- name: Setup for poetry
uses: ./.github/actions/setup-poetry
- name: poetry deps
run: poetry install
- name: Build documentation
6 changes: 2 additions & 4 deletions .github/workflows/publish-pypi.yml
@@ -22,10 +22,8 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: install poetry
uses: abatilo/actions-poetry@v2
with:
poetry-version: "1.6.1"
- name: Setup for poetry
uses: ./.github/actions/setup-poetry
- name: poetry deps
run: poetry install
- name: poetry build distribution content
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -24,8 +24,8 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python_version }}
- name: Install poetry
run: pip install poetry
- name: Setup for poetry
uses: ./.github/actions/setup-poetry
- name: Install environment
run: poetry install --no-interaction --no-ansi
- name: Run sphinx-docs build test
18 changes: 9 additions & 9 deletions .pre-commit-config.yaml
@@ -4,15 +4,15 @@ default_language_version:
python: python3.10
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
rev: v4.5.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files
- id: check-toml
- repo: https://github.com/codespell-project/codespell
rev: v2.2.5
rev: v2.2.6
hooks:
- id: codespell
exclude: >
@@ -29,37 +29,37 @@ repos:
- mdformat-myst
- mdformat-gfm
- repo: https://github.com/adrienverge/yamllint
rev: v1.32.0
rev: v1.33.0
hooks:
- id: yamllint
- repo: https://github.com/psf/black
rev: 23.9.1
rev: 23.12.0
hooks:
- id: black
- repo: https://github.com/asottile/blacken-docs
rev: 1.16.0
hooks:
- id: blacken-docs
- repo: https://github.com/PyCQA/bandit
rev: 1.7.5
rev: 1.7.6
hooks:
- id: bandit
args: ["-c", "pyproject.toml"]
additional_dependencies: ["bandit[toml]"]
- repo: https://github.com/PyCQA/isort
rev: 5.12.0
rev: 5.13.2
hooks:
- id: isort
- repo: https://github.com/jendrikseipp/vulture
rev: v2.9.1
rev: v2.10
hooks:
- id: vulture
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.5.1
rev: v1.7.1
hooks:
- id: mypy
- repo: https://github.com/PyCQA/pylint
rev: v3.0.0a7
rev: v3.0.3
hooks:
- id: pylint
name: pylint
4 changes: 4 additions & 0 deletions cytotable/__init__.py
@@ -1,6 +1,10 @@
"""
__init__.py for cytotable
"""

# note: version data is maintained by poetry-dynamic-versioning
__version__ = "0.0.0"

from .convert import convert
from .exceptions import (
CytoTableException,
74 changes: 74 additions & 0 deletions cytotable/constants.py
@@ -0,0 +1,74 @@
"""
CytoTable: constants - storing various constants to be used throughout cytotable.
"""

import multiprocessing
import os
from typing import cast

from cytotable.utils import _get_cytotable_version

# read max threads from environment if necessary
# max threads will be used with default Parsl config and Duckdb
MAX_THREADS = (
multiprocessing.cpu_count()
if "CYTOTABLE_MAX_THREADS" not in os.environ
else int(cast(int, os.environ.get("CYTOTABLE_MAX_THREADS")))
)

# enables overriding default memory mapping behavior with pyarrow memory mapping
CYTOTABLE_ARROW_USE_MEMORY_MAPPING = (
os.environ.get("CYTOTABLE_ARROW_USE_MEMORY_MAPPING", "1") == "1"
)

DDB_DATA_TYPE_SYNONYMS = {
"real": ["float32", "float4", "float"],
"double": ["float64", "float8", "numeric", "decimal"],
"integer": ["int32", "int4", "int", "signed"],
"bigint": ["int64", "int8", "long"],
}

# A reference dictionary for SQLite affinity and storage class types
# See more here: https://www.sqlite.org/datatype3.html#affinity_name_examples
SQLITE_AFFINITY_DATA_TYPE_SYNONYMS = {
"integer": [
"int",
"integer",
"tinyint",
"smallint",
"mediumint",
"bigint",
"unsigned big int",
"int2",
"int8",
],
"text": [
"character",
"varchar",
"varying character",
"nchar",
"native character",
"nvarchar",
"text",
"clob",
],
"blob": ["blob"],
"real": [
"real",
"double",
"double precision",
"float",
],
"numeric": [
"numeric",
"decimal",
"boolean",
"date",
"datetime",
],
}

CYTOTABLE_DEFAULT_PARQUET_METADATA = {
"data-producer": "https://github.com/cytomining/CytoTable",
"data-producer-version": str(_get_cytotable_version()),
}
41 changes: 29 additions & 12 deletions cytotable/convert.py
@@ -302,7 +302,11 @@ def _source_chunk_to_parquet(
from cloudpathlib import AnyPath
from pyarrow import parquet

from cytotable.utils import _duckdb_reader, _sqlite_mixed_type_query_to_parquet
from cytotable.utils import (
_duckdb_reader,
_sqlite_mixed_type_query_to_parquet,
_write_parquet_table_with_metadata,
)

# attempt to build dest_path
source_dest_path = (
@@ -339,7 +343,7 @@ def _source_chunk_to_parquet(
# read data with chunk size + offset
# and export to parquet
with _duckdb_reader() as ddb_reader:
parquet.write_table(
_write_parquet_table_with_metadata(
table=ddb_reader.execute(
f"""
{base_query}
@@ -358,7 +362,7 @@
"Mismatch Type Error" in str(e)
and str(AnyPath(source["source_path"]).suffix).lower() == ".sqlite"
):
parquet.write_table(
_write_parquet_table_with_metadata(
# here we use sqlite instead of duckdb to extract
# data for special cases where column and value types
# may not align (which is valid functionality in SQLite).
@@ -414,7 +418,8 @@ def _prepend_column_name(

import pyarrow.parquet as parquet

from cytotable.utils import CYTOTABLE_ARROW_USE_MEMORY_MAPPING
from cytotable.constants import CYTOTABLE_ARROW_USE_MEMORY_MAPPING
from cytotable.utils import _write_parquet_table_with_metadata

targets = tuple(metadata) + tuple(compartments)

@@ -499,7 +504,7 @@ def _prepend_column_name(
updated_column_names.append(column_name)

# perform table column name updates
parquet.write_table(
_write_parquet_table_with_metadata(
table=table.rename_columns(updated_column_names), where=table_path
)

@@ -569,8 +574,12 @@ def _concat_source_group(
import pyarrow as pa
import pyarrow.parquet as parquet

from cytotable.constants import (
CYTOTABLE_ARROW_USE_MEMORY_MAPPING,
CYTOTABLE_DEFAULT_PARQUET_METADATA,
)
from cytotable.exceptions import SchemaException
from cytotable.utils import CYTOTABLE_ARROW_USE_MEMORY_MAPPING
from cytotable.utils import _write_parquet_table_with_metadata

# build a result placeholder
concatted: List[Dict[str, Any]] = [
@@ -600,7 +609,9 @@ def _concat_source_group(
destination_path.parent.mkdir(parents=True, exist_ok=True)

# build the schema for concatenation writer
writer_schema = pa.schema(common_schema)
writer_schema = pa.schema(common_schema).with_metadata(
CYTOTABLE_DEFAULT_PARQUET_METADATA
)

# build a parquet file writer which will be used to append files
# as a single concatted parquet file, referencing the first file's schema
@@ -713,7 +724,7 @@ def _join_source_chunk(

import pyarrow.parquet as parquet

from cytotable.utils import _duckdb_reader
from cytotable.utils import _duckdb_reader, _write_parquet_table_with_metadata

# Attempt to read the data to parquet file
# using duckdb for extraction and pyarrow for
@@ -757,7 +768,7 @@ def _join_source_chunk(
)

# write the result
parquet.write_table(
_write_parquet_table_with_metadata(
table=result,
where=result_file_path,
)
@@ -797,7 +808,11 @@ def _concat_join_sources(

import pyarrow.parquet as parquet

from cytotable.utils import CYTOTABLE_ARROW_USE_MEMORY_MAPPING
from cytotable.constants import (
CYTOTABLE_ARROW_USE_MEMORY_MAPPING,
CYTOTABLE_DEFAULT_PARQUET_METADATA,
)
from cytotable.utils import _write_parquet_table_with_metadata

# remove the unjoined concatted compartments to prepare final dest_path usage
# (we now have joined results)
@@ -811,7 +826,7 @@
shutil.rmtree(path=dest_path)

# write the concatted result as a parquet file
parquet.write_table(
_write_parquet_table_with_metadata(
table=pa.concat_tables(
tables=[
parquet.read_table(
@@ -826,7 +841,9 @@
# build a parquet file writer which will be used to append files
# as a single concatted parquet file, referencing the first file's schema
# (all must be the same schema)
writer_schema = parquet.read_schema(join_sources[0])
writer_schema = parquet.read_schema(join_sources[0]).with_metadata(
CYTOTABLE_DEFAULT_PARQUET_METADATA
)
with parquet.ParquetWriter(str(dest_path), writer_schema) as writer:
for table_path in join_sources:
writer.write_table(