Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/thumbnails #73

Merged
merged 20 commits into from
Jun 3, 2024
Merged

Feature/thumbnails #73

merged 20 commits into from
Jun 3, 2024

Conversation

mgdaily
Copy link
Contributor

@mgdaily mgdaily commented May 16, 2024

Getting this PR out just to make sure that things mostly look okay. I still need to make sure this is documented correctly in the API docs and write an additional article in the OCS doc about how to ingest thumbnails. The tests are currently failing because this branch relies on changes made to the ocs_archive library (see that PR: observatorycontrolsystem/ocs_archive#11).

This PR adds a Thumbnail model to science archive. It has a foreign key relationship to frame, so a Frame may have many thumbnails. Frames can be filtered in the usual way, with thumbnails optionally included in the response by using the query param include_thumbnails=true. We also handle the case that a thumbnail arrives before its associated frame by serializing the payload into a Frame object, creating it, then creating the thumbnail (see ingester PR - we send the same frame payload along with a couple of extra bits of metadata to create the thumbnail and associate it).

I've tested this on dev (http://archive-api-dev.lco.gtn) by pulling down 500 frames, generating a small and large thumbnail for each, and ingesting each frame/small+large thumbnail in a random order to ensure that we handle the case where frames arrive first, or thumbnails arrive first.

I also tested the migration against the staging DB and it took less than 10 seconds, so we should have little downtime when deploying this.

Namely, we'd like to check that the thumbnail data coming in is valid using the frame serializer, and just use the filestore key and metadata to generate the URL. Then we create a minimal frame object and associate that with the thumbnail.
Also don't try to generate a url or filename for frames that don't yet have a version_set
Make the size choices for thumbnails configurable, clean up the frame creation on thumbnail create, also change up the fields on thumbnail slightly to just store basename/extension so we can derive the filename and minimize changes to the ingester.
@mgdaily mgdaily requested a review from jnation3406 May 16, 2024 22:49
@@ -87,7 +88,7 @@ class Meta:
# }

def create(self, validated_data):
version_data = validated_data.pop('version_set')
version_data = validated_data.pop('version_set') if 'version_set' in validated_data else {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handle the case that we may create a Frame that doesn't yet have a version set - this is when we create a thumbnail whose frame has not yet been created.

logger.info('Got request to process thumbnail', extra=logger_tags)
# Make sure we have the minimum information to make a frame object associated with the thumbnail if one doesn't already exist
frame_serializer = FrameSerializer(data=request.data)
if frame_serializer.is_valid():
Copy link
Contributor Author

@mgdaily mgdaily May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This view is probably the most complicated part of this. I've designed this so that we always expect a valid frame payload from the ingester so we're able to create a basic Frame object. I've done this to keep the changes in the ingester minimal - the ingester still uploads to s3 and gets the version information, but rather than storing the key and extension on the frame within the version_set, we grab it and store it on the thumbnail instead, and store the Frame without any version info.

Copy link
Member

@jnation3406 jnation3406 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably add a ThumbnailFile class to the ocs_archive that only requires what it needs and plumb that through. Should make things a little more efficient and explicit.

archive/frames/models.py Show resolved Hide resolved
archive/frames/models.py Outdated Show resolved Hide resolved
@cached_property
def url(self):
metadata = self.frame.get_header_dict()
# include frame basename and size so that this passes metadata validation in the DataFile class
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the proper approach would be to add a thumbnail type to the ocs_archive library - we might want to store thumbnails in their own /thumbnails directory in the filestore, not in raw or processed like it would now. This could also be done in the ocs_archive file type we create for thumbnails. It should also not need to use the related frames basename then, just its own basename and the frame fields (to get site/inst/dayobs/ directory) is probably enough. Also shouldn't need to get the header, which would make another DB call or table join - it just needs the site, inst, dayobs to determine its directory if we set that up in the ocs_archive thumbnail file class, and it can get those directly from the frame model.

if thumbnail_serializer.is_valid():
# Remove the version set as this version does not correspond to the frame object, but rather the thumbnail.
del frame_serializer.validated_data['version_set']
frame = frame_serializer.save(basename=request.data['frame_basename'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to just get the existing frame if it exists, or save/create it if it doesn't?

@jnation3406
Copy link
Member

Also have you thought about what effect this might have on things that use the archive and make frames queries. I.e. the frontend or other apps we have? What will happen to those if they query frames that have a thumbnail but not the actual frame yet?

@mgdaily
Copy link
Contributor Author

mgdaily commented May 17, 2024

Also have you thought about what effect this might have on things that use the archive and make frames queries. I.e. the frontend or other apps we have? What will happen to those if they query frames that have a thumbnail but not the actual frame yet?

Good point. The easiest way I can think of would be to filter out any Frames from the list view that don't have any version information associated with them. That way users only see the data that's been completely ingested.

We always want a frame associated with the thumbnail. Re-make migration and rename it.
Make sure we check for a frame in the thumbnail serializer before we attempt to create one. Also update the url property on the thumbnail model to use a simplified get_filestore_path method so that we don't need to pull all of the header information.

@cached_property
def url(self):
path = ThumbnailFile.get_filestore_path_from_frame_metadata(self.frame.site_id, self.frame.instrument_id,
Copy link
Contributor Author

@mgdaily mgdaily May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnation3406 you left a comment about the previous version of this function needing the header to construct the filestore path using the DataFile class. Problem was, we needed to construct the DataFile class with all valid metadata so it could determine the filestore path using only part of that metadata, which as you mentioned, would lead to unnecessary table joins and accesses. So what I've done in this commit was add a static method to the ThumbnailFile class to take in only the metadata needed to construct the filestore path. To preserve the DataFile interface (subclasses of DataFile need to implement get_filestore_path for the filestore to work), I just call the static method and use the data stored in the ThumbnailFile's metadata.

This allows the science archive to provide a minimal set of metadata without having to do too much DB work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is okay I guess, but if you want to avoid importing the ThumbnailFile directly, you could keep using the get_file_store_path util function but instead of passing the header, just pass a metadata dict of the header keys you need, i.e.
get_file_store_path(self.filename, {'site_id': self.frame.site_id, 'instrument_id': self.frame.instrument_id, 'observation_day': self.frame.observation_day.strftime('%Y%m%d')})

@mgdaily mgdaily requested a review from jnation3406 May 29, 2024 00:48
Copy link
Member

@jnation3406 jnation3406 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one comment


@cached_property
def url(self):
path = ThumbnailFile.get_filestore_path_from_frame_metadata(self.frame.site_id, self.frame.instrument_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is okay I guess, but if you want to avoid importing the ThumbnailFile directly, you could keep using the get_file_store_path util function but instead of passing the header, just pass a metadata dict of the header keys you need, i.e.
get_file_store_path(self.filename, {'site_id': self.frame.site_id, 'instrument_id': self.frame.instrument_id, 'observation_day': self.frame.observation_day.strftime('%Y%m%d')})

@mgdaily mgdaily merged commit c7f29d9 into main Jun 3, 2024
11 of 12 checks passed
@mgdaily mgdaily deleted the feature/thumbnails branch June 3, 2024 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants