
feat: Validate dataset properties in jobs #1473

Open
wants to merge 5 commits into base: release-jobs

Conversation

sbliven
Member

@sbliven sbliven commented Oct 30, 2024

Description

Allows jobs to validate properties of the dataset.

Motivation

The validate action currently allows operators to enforce attributes of the DTO.

However, many jobs depend on jobParams.datasetList to link a job to specific datasets. This change allows validation of properties on the datasets associated with the job.

The motivating example would be checking datasetLifecycle properties for certain jobs. These are currently hard-coded for special job types. This would allow them to be applied to custom job types as well.

configVersion: v1.0
jobs:
  - jobType: custom_archive
    create:
      auth: "#datasetAccess"
      actions:
        - actionType: validate
          datasets:
            datasetlifecycle.archivable:
              const: true

Changes:

  • Add datasets property to validate actions (only in create operations) to validate linked dataset properties

Tests included

(WIP pending tests and documentation)

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated


Docs are included in SciCatProject/documentation#56

@sbliven sbliven self-assigned this Oct 30, 2024
@sbliven sbliven requested a review from despadam October 30, 2024 14:28
@sbliven
Member Author

sbliven commented Nov 3, 2024

Documentation is now added to SciCatProject/documentation#56

@sbliven
Member Author

sbliven commented Nov 5, 2024

The unit tests pass because I'm manually mocking the datasetService, so the core functionality works.

The mocha tests fail because the datasetService doesn't get injected properly by nest. I need to figure out how to properly initialize the ValidateCreateAction before merging this.

During a `create` operation, the `validate` action can take a
`datasets` option listing path/schema pairs that will be applied
to any datasets in jobParams.datasetList.

Datasets are fetched from the database during the DTO validation step
if needed.

Validation of archive/retrieve/public lifecycle properties is no longer
done automatically, but must be configured in the jobConfig file.

Details:
- Separate ValidateCreateJobAction (create) and ValidateJobAction (update)
- Fix 'Internal Server Error' from job controller if a session timed out.
  If the session times out then only public jobs will be visible.
- Update documentation for validate action
- Remove checkDatasetState from the job controller.
  This must now be configured with a validate action
- Add unit test
- Remove hard-coded job types and dataset states
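For illustration, a fuller jobConfig sketch combining the pieces described above: the `datasets` option on a create-time validate action, plus a plain DTO check on update. This is a hypothetical example; the `update` block, its `#jobOwnerUser` auth value, and the `request`/`statusCode` keys are illustrative assumptions, not taken from this PR.

```yaml
configVersion: v1.0
jobs:
  - jobType: custom_archive
    create:
      auth: "#datasetAccess"
      actions:
        - actionType: validate
          # Each key is a path into the datasets referenced by
          # jobParams.datasetList; each value is a JSON-schema fragment.
          datasets:
            datasetlifecycle.archivable:
              const: true
    update:
      auth: "#jobOwnerUser"   # illustrative auth value
      actions:
        - actionType: validate
          # illustrative DTO check; key names are assumptions
          request:
            statusCode:
              type: string
```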
@sbliven sbliven marked this pull request as ready for review December 18, 2024 10:56
@sbliven
Member Author

sbliven commented Dec 18, 2024

I rebased this onto the current release-jobs.

Tests are currently failing due to a circular dependency in the module DI. I'm looking into how to fix it.

DatasetsService is request scoped, so it can't be added on module init.
This also avoids the circular dependency from importing datasetService
during module initialization.
Mongoose documents do some Proxy shenanigans that confuse JSONPath.
They should first be converted to plain objects.
- Update error messages for archivable, retrievable, and public jobs.
  These used to be bespoke errors, but now follow the standard 'Invalid
  request' error message.
- Change dataset property validation to throw 409 errors, to be
  consistent with the previous behavior
- Add datasetList to many job DTOs that now validate it
- Set archivable/retrievable properly in test datasets
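The Proxy issue above can be reproduced without Mongoose. A minimal sketch: a Proxy that answers property reads but hides its keys defeats enumeration-based path walking (which JSONPath-style lookups rely on), while the plain underlying object walks fine. In the actual fix the conversion is Mongoose's `doc.toObject()`; here the underlying object stands in for it.

```typescript
// A proxy that serves property reads but hides keys from enumeration,
// loosely mimicking how Mongoose documents wrap their internal state.
const raw = { datasetlifecycle: { archivable: true } };
const doc: any = new Proxy(raw, {
  ownKeys: () => [],                          // enumeration sees no keys
  get: (target, prop) => (target as any)[prop],
});

// Direct access still works...
console.log(doc.datasetlifecycle.archivable); // true

// ...but key enumeration, which path-walking libraries depend on
// when resolving "datasetlifecycle.archivable", comes up empty:
console.log(Object.keys(doc)); // []

// The fix: convert the document to a plain object first
// (Mongoose's doc.toObject(); here simply the underlying object).
const plain = raw;
console.log(Object.keys(plain)); // [ 'datasetlifecycle' ]
```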
@sbliven
Member Author

sbliven commented Feb 2, 2025

Tests are fixed, so this is ready for review. Some tests changed, so please double-check that we intended the new behavior (e.g. requiring more properties in the example jobs and datasets).

Comment on lines +8 to +10

    datasets:
      "datasetlifecycle.archivable":
        const: true
Member


I have to admit that I am a bit confused by this approach. I understand that the idea is to avoid hardcoded job types and allow users to easily configure their own jobs, but isn't this now also hardcoded in the config file? And it's now more prone to errors, since it's provided by the user. I realize that for new facility-specific job types that we don't cover in the core, perhaps this is the only way. But thinking about it the other way round, why would someone define an archive job for datasets with archivable=false? Shouldn't this already be the default? Otherwise we always force users to add archivable=true for archive jobs even though it is self-explanatory.
So my suggestion would be that for archive/retrieve jobs we do the dataset lifecycle validation by default, and only ask users to provide this when they want to define a new job type. But that leads us back to hardcoded job types, I think? I am happy to discuss this further, as I am clearly confused 😅 Perhaps I am misunderstanding the approach?

Member Author

@sbliven sbliven Feb 7, 2025


why would someone define an archive job for datasets with archivable=false?

They shouldn't. This was previously enforced by scicat for the type:archive job type; now it can also be configured with custom job names.

Shouldn't this already be the default?

I could see an argument for automatically adding this validate action to any job called archive (and the equivalent for retrieve and public) as a special case, for backwards compatibility. Or at least warning if it isn't configured. I think that can be a later discussion, as I can think of several ways to implement it, most of which boil down to syntactic sugar for the syntax here.

Member

@despadam despadam left a comment


Perhaps then we could proceed with this approach for now and discuss #1692 also with Max before making a final decision. I just have one question below.

Comment on lines +196 to +202

    const filter = {
      where: {
        pid: {
          $in: datasetIds,
        },
      },
    };
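The filter above only fetches the datasets named in datasetList; per the commit notes, the configured path/schema pairs are then applied to each returned dataset. A hypothetical sketch of that second step (not the PR's code, which delegates to a real JSON-schema validator), handling only the `const` keyword from the example config:

```typescript
// Minimal stand-in for a JSON-schema fragment; only `const` is handled.
type SchemaFragment = { const?: unknown };

// Resolve a dot-separated path like "datasetlifecycle.archivable".
function getPath(obj: unknown, path: string): unknown {
  return path
    .split(".")
    .reduce<any>((o, key) => (o == null ? undefined : o[key]), obj);
}

// Apply every path/schema rule to every fetched dataset.
function validateDatasets(
  datasets: Record<string, unknown>[],
  rules: Record<string, SchemaFragment>,
): boolean {
  return datasets.every((ds) =>
    Object.entries(rules).every(
      ([path, schema]) =>
        !("const" in schema) || getPath(ds, path) === schema.const,
    ),
  );
}

// Usage with the rule from the example config:
const rules = { "datasetlifecycle.archivable": { const: true } };
console.log(validateDatasets([{ datasetlifecycle: { archivable: true } }], rules));  // true
console.log(validateDatasets([{ datasetlifecycle: { archivable: false } }], rules)); // false
```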
Member


I fail to understand how the datasetlifecycle state is being checked here. Could you provide an explanation please?
