750 configure workers #777
Merged
Conversation
ddobie approved these changes on Nov 4, 2024
ddobie added a commit that referenced this pull request on Nov 7, 2024
* Organise v1.1.1-dev
* Fix changelog formatting and update changelog instructions (#772)
    * Initial changelog formatting issues
    * Update changelog + instructions
    * Updated changelog
* Updated Code of conduct (#773)
    * Updated Code of conduct
    * Updated changelog
    * Fixed grammar
* Fix zenodo DOI
* Fixed typo in README
* Shorten forced fit measurement names (#734)
    * Shorten names
    * Updated changelog
* Update clearpiperun to use raw SQL (#775)
    * Timing and memory benchmark
    * Delete raw initial
    * Adding profiler
    * Optimisation handling exceptions
    * Added logging
    * Updated delete_run
    * Fix syntax errors
    * Disable triggers to see if that fixes speed issues
    * Remove memory profiling
    * Re-enabled logging
    * Add end-of-loop logging, remove tqdm
    * Remove all tqdm, improve logging slightly
    * Added timing
    * Fixed missing tqdm
    * Fix logging
    * Added units to logging
    * Specify source id in logging
    * Toggle triggers
    * Clean up clearpiperun
    * Other minor updates
    * Fix variable name
    * Correctly handle images and skyregions that are associated with multiple runs
    * PEP8
    * Updated changelog
    * Remove commented code
    * Remove whitespace - don't know why the linter didn't pick this up
    * Review-suggestion updates to vast_pipeline/management/commands/clearpiperun.py and vast_pipeline/utils/delete_run.py (Co-authored-by: Tom Mauch <[email protected]>)
    * Fix logging count
    * Clean up logging statements
    * Co-authored-by: Shibli Saleheen <[email protected]>, Tom Mauch <[email protected]>
* Quick memory optimisations (#776)
    * Use itertuples over iterrows, since iterrows is an enormous memory hog
    * Drop sources_df columns before renaming the id column, to avoid a copy of the whole dataframe in memory
    * Decrease default partition size to 15 MB
    * Don't split the (large-in-memory) list of DataFrames into Dask bags (no performance hit)
    * Don't write forced parquets in parallel (no performance hit for this)
    * Don't overwrite the input DataFrame when writing parquets
    * Update CHANGELOG.md
    * Address review comments
    * Copy YAML objects before revalidation so they can be garbage collected
    * Appease flake8
* 750 configure workers (#777)
    * Includes the memory-optimisation commits from #776 above, plus:
    * Initial configuration updates for processing options
    * Add processing options as optional with defaults
    * Filter processing config to parallel association
    * Add a function to determine the number of workers and partitions for Dask
    * Use config values for num_workers and max_partition_size throughout the pipeline
    * Correct wording in config template
    * Update CHANGELOG.md
    * Remove unused imports
    * Bump strictyaml to 1.6.2
    * Use YAML 'null' to create Python None for the all-cores option
    * Make None the default in `calculate_workers_and_partitions` instead of 0
    * Updated run config docs
    * Allow null for num_workers_io and improve validation of processing parameters
    * Update num_workers_io default in docs
    * Co-authored-by: Dougal Dobie <[email protected]>
* Prepare v1.2.0 release

Co-authored-by: Shibli Saleheen <[email protected]>, Tom Mauch <[email protected]>
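The commits above mention a `calculate_workers_and_partitions` helper whose `num_workers` default is `None` (mapped from YAML `null`), meaning "use all available cores". The following is a minimal sketch of how such a helper might work; the function name comes from the commit list, but the signature, defaults, and partition-sizing logic here are assumptions rather than the actual vast-pipeline implementation:

```python
import os


def calculate_workers_and_partitions(df_size_mb, num_workers=None,
                                     max_partition_mb=15):
    """Pick a Dask worker count and a partition count (hypothetical sketch).

    num_workers=None (YAML 'null' in the run config) means use all
    available cores. Partitions are sized so that none exceeds
    max_partition_mb, with at least one partition per worker.
    """
    if num_workers is None:
        num_workers = os.cpu_count()
    # Ceiling division: number of max_partition_mb-sized chunks needed.
    npartitions = max(num_workers, -(-int(df_size_mb) // max_partition_mb))
    return num_workers, npartitions
```

A caller would then pass `npartitions` to e.g. `dask.dataframe.from_pandas` and `num_workers` to the scheduler; the point of the `None` default is that an unset config option degrades gracefully to using the whole machine.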
Add configuration options to specify the number of cores to use for Dask operations. A separate config option (`num_workers_io`) is available for I/O operations. Also add a config option (`max_partition_mb`) to specify the size of partitions in MB where required. These options are optional parameters with defaults, so they should not change the config from how it existed previously.

This PR addresses issue #750
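As a rough illustration, the run-config additions described above might look like the following YAML fragment. The option names `num_workers`, `num_workers_io`, and `max_partition_mb` come from this PR, but the section name, layout, and comments are assumptions, not the exact vast-pipeline config template:

```yaml
# Hypothetical processing section of a pipeline run config.
processing:
  num_workers: null      # null -> use all available cores for Dask operations
  num_workers_io: null   # separate worker count for I/O operations
  max_partition_mb: 15   # maximum Dask partition size in MB
```

Because every option has a default, an existing config file without a `processing` section would validate and behave exactly as before.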