Improve multi-process handling across CDMs #121

azimov · 2024-01-30T16:07:11Z

Running a job across multiple CDMs does not take advantage of multi-process execution within targets (which would also support clusters).
As a result, job execution often sits waiting on I/O from the CDMs, or users have to configure execution across the CDMs manually.

The individual tasks are also not fine-grained enough to separate I/O-blocking tasks from CPU-intensive ones (such as PLP, SCCS or CohortMethod, which operate on local Andromeda objects with multiple processes).
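To illustrate the separation described above, here is a minimal, language-agnostic sketch in Python (not Strategus or targets code). The function names `extract_cohort_data`, `fit_model` and `run_across_cdms` are hypothetical stand-ins: extraction from every CDM is submitted to one pool up front, and each result is handed to a second pool as soon as it arrives, so a slow database does not hold up the analysis of the others.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed


def extract_cohort_data(cdm: str) -> dict:
    """Hypothetical I/O-bound step: stands in for a database pull from one CDM."""
    return {"cdm": cdm, "rows": len(cdm) * 100}


def fit_model(data: dict) -> dict:
    """Hypothetical CPU-bound step: stands in for analysis on already-local data."""
    return {"cdm": data["cdm"], "estimate": data["rows"] / 1000}


def run_across_cdms(cdms, io_pool: Executor, cpu_pool: Executor) -> dict:
    # Submit every extraction at once so waiting on one CDM does not block the rest.
    extractions = [io_pool.submit(extract_cohort_data, cdm) for cdm in cdms]
    fits = []
    for done in as_completed(extractions):
        # As soon as one CDM's data is local, hand it to the CPU pool.
        fits.append(cpu_pool.submit(fit_model, done.result()))
    results = [fit.result() for fit in fits]
    return {r["cdm"]: r["estimate"] for r in results}
```

In this sketch the two pools are passed in, so the I/O pool could be a `ThreadPoolExecutor` while the CPU pool could be a `ProcessPoolExecutor` sized to the available cores.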

Some aspects of this may stem from having multiple steps within an individual analytics package, which would be difficult to resolve with the way tasks are currently set up. Even so, the way we expose the targets workflow could be significantly improved by using meta-targets and internal targets functions to spawn multiple jobs.

This would also give us the advanced functionality of targets (e.g. use of SLURM clusters to execute multiple jobs). Even if it didn't, it would be healthy to decouple our execution infrastructure from targets, which is currently used only for dependency trees.

Current approach

Send study execution per CDM to Strategus
Strategus creates the targets task script internally
Strategus executes the tasks by calling targets

Proposed approach

Strategus takes 1) an analysis script and 2) the CDMs to execute on
Strategus creates the list of targets for the targets file across the configured CDMs
User executes targets::tar_make in a custom way (or the call is just masked by targets).

Note that in both cases, uploading results to a results DB and executing the meta-analysis step remain optional target types. In the latter case, however, we will be able to clearly see a task that depends on all CDM executions.
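The shape of the proposed dependency graph can be sketched as follows. This is an illustration in Python using the standard-library `graphlib` module, not the targets API; the task names (`execute:<cdm>`, `upload:<cdm>`, `meta_analysis`) and the functions `build_study_graph` and `ready_batches` are assumptions made for the example. It assumes the meta-analysis depends directly on every CDM execution and that each upload depends only on its own CDM.

```python
from graphlib import TopologicalSorter


def build_study_graph(cdms, upload_results=True, meta_analysis=True):
    """Return {task: set of prerequisite tasks} for one study across several CDMs."""
    graph = {}
    for cdm in cdms:
        graph[f"execute:{cdm}"] = set()
        if upload_results:
            # Optional target: upload depends only on its own CDM's execution.
            graph[f"upload:{cdm}"] = {f"execute:{cdm}"}
    if meta_analysis:
        # Optional target: one visible task that depends on all CDM executions.
        graph["meta_analysis"] = {f"execute:{cdm}" for cdm in cdms}
    return graph


def ready_batches(graph):
    """Yield groups of tasks whose prerequisites are complete; each group could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        yield batch
        sorter.done(*batch)
```

With two CDMs, the first batch contains both executions, which is where a scheduler could run them side by side; the uploads and the meta-analysis only become ready afterwards.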

Stretch goal

Allow package maintainers to split out tasks within analytics packages by exposing an interface that allows targets to see them. For example, in PLP there can be a single-process task "pull covariates" and a multi-process task "train models". Internally, we still take advantage of multithreaded calls, e.g. in C++ code or external libraries, but in this case there are multiple models that use independent parameters and/or hyperparameters and so will finish execution at different times. The same applies to any study with many Target/Comparator/Indication comparisons, each of which needs an independent propensity score model, for example.
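One possible shape for such an interface, sketched in Python purely for illustration: a package returns a description of its internal steps, and the scheduler decides how to run them. `SubTask` and `plp_subtasks` are hypothetical names, not part of PLP, Strategus or targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    name: str
    depends_on: tuple = ()
    parallel: bool = False  # True if the scheduler may run it alongside its siblings


def plp_subtasks(model_settings):
    """Hypothetical interface: describe a package's internal steps so a scheduler can see them."""
    # One single-process step that pulls covariates once for all models.
    pull = SubTask("pull_covariates")
    # One independent training step per model; each may finish at a different time.
    trains = [
        SubTask(f"train_model:{name}", depends_on=(pull.name,), parallel=True)
        for name in model_settings
    ]
    return [pull, *trains]
```

For instance, `plp_subtasks({"lasso": {}, "gbm": {}})` would describe one covariate pull followed by two training tasks that can be scheduled independently.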

anthonysena added this to the v1.0.0 milestone Jan 30, 2024

anthonysena added execution targets labels Jul 16, 2024

anthonysena modified the milestones: v1.0.0, Backlog Aug 8, 2024
