Potential high memory usage at new sources rms measurements step #649
Labels: help wanted, low priority, python
When there is a very high number of new sources, likely single-epoch ones, combined with a lot of images, the new source analysis has the potential to become unwieldy.
In the example error below - this was from a run of short-timescale images that are very susceptible to single-epoch artefacts - there were roughly 3000 images, each with 1-10 measurements. Assuming half of the total measurements were single-epoch new sources, say an average of 5 per image, that is 3000 * 5 new sources each needing an rms measurement in the 2999 other images - just short of 45 million rms measurements required.
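For reference, a quick back-of-the-envelope sketch of how that count scales (the numbers are just the illustrative ones from this run):

```python
# Rough scaling of the rms measurement count - illustrative numbers from this run.
n_images = 3000
new_sources_per_image = 5  # assumed average of single-epoch new sources per image

# Each new source needs an rms value extracted from every *other* image.
total_rms_measurements = n_images * new_sources_per_image * (n_images - 1)
print(f"{total_rms_measurements:,}")  # 44,985,000 - just short of 45 million
```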
In particular, this became a problem at the stage of the new source analysis where the dataframes are merged after fetching the rms pixel measurements.
This could be reduced by addressing #327 and making sure the dataframes are as lightweight as possible. There may also be scope to improve this dataframe stage of the new source analysis to avoid such a huge merge.
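As a rough illustration of the 'lightweight dataframes' point, downcasting dtypes and categorising repeated strings before the merge can cut the footprint of the inputs considerably. This is only a sketch - the column and variable names are hypothetical, not the pipeline's actual ones:

```python
import pandas as pd

def shrink_for_merge(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and categorise repeated strings ahead of a merge."""
    df = df.copy()
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")    # float64 -> float32
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")  # int64 -> smallest int
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category")                  # e.g. repeated image names
    return df

# Hypothetical usage before the big new-source merge:
# new_sources_df = shrink_for_merge(new_sources_df)
# rms_df = shrink_for_merge(rms_df)
# merged = new_sources_df.merge(rms_df, on=["source", "image_id"], how="left")
```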
This problem can also be mitigated by tweaking the pipeline settings, namely raising the new source minimum rms image threshold in the config to a high value - this acts to pretty much 'turn off' the new source stage. Source monitoring should probably be turned off as well. Basic association could also be employed to eliminate many-to-one and many-to-many associations.
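Something along the lines of the run-config sketch below - the option names here are placeholders for the settings described above, not necessarily the pipeline's exact keys:

```python
# Placeholder run-config snippet; the option names are illustrative only.

# Raise the new source minimum rms image threshold so high that effectively
# nothing qualifies as a new source, 'turning off' the stage.
NEW_SOURCE_MIN_RMS_IMAGE_THRESHOLD = 1e6  # arbitrarily high

# Skip the forced (monitoring) measurements as well.
MONITOR = False

# Use basic association to avoid many-to-one and many-to-many relations.
ASSOCIATION_METHOD = "basic"
```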
Eventually some stages of the pipeline will have to be revisited in general to see how the pandas memory footprint can be reduced, either by refactoring or by bringing in other tools. The Dask Cluster transition (#335) could also open up other avenues in how to process the data.
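For example, one hedged sketch of what the Dask route might look like for the merge step, using toy stand-in dataframes and illustrative column names:

```python
import dask.dataframe as dd
import pandas as pd

# Toy stand-ins for the real new-source and rms-measurement dataframes.
new_sources_df = pd.DataFrame(
    {"source": [1, 1, 2], "image_id": [10, 11, 10], "flux": [1.2, 0.9, 3.4]}
)
rms_df = pd.DataFrame(
    {"source": [1, 1, 2], "image_id": [10, 11, 10], "rms": [0.30, 0.25, 0.40]}
)

# Partition the inputs so the merge is processed chunk-by-chunk rather than
# materialising one enormous in-memory frame.
new_sources_ddf = dd.from_pandas(new_sources_df, npartitions=2)
rms_ddf = dd.from_pandas(rms_df, npartitions=2)

merged = dd.merge(new_sources_ddf, rms_ddf, on=["source", "image_id"], how="left")

# Only compute (or, better, aggregate first) when the result is actually needed.
print(merged.compute())
```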