handling of dynamicDatasetCache is not thread safe #51

MatsSundqvist · 2020-03-13T10:26:02Z

It seems like the handling of dynamicDatasetCache needs to be made thread safe. Currently the dynamic datasets is not working in a stable way.
Own test with readlock and writelock seems to work well

rod-glover · 2020-10-15T00:31:03Z

@MatsSundqvist , I believe I may be encountering problems with the same root. We have configured a large collection of large datasets as dynamic datasets. We get frequent, unpredictable and inconsistent errors from ncWMS when these datasets are accessed (both by our web mapping app and by some test code I created). The obvious suspicion is that the dynamic dataset code is not handling these parallel requests safely, i.e., is not thread-safe.

I am a complete Java noob and so is everyone else in our shop (we primarily use Python and JS). Since ncWMS dynamic datasets (working stably) would solve some major headaches we have, I would appreciate any advice you may have on resolving this difficulty. In particular, to what did you refer in your remark

Own test with readlock and writelock seems to work well

?

Any other thoughts or advice are also very welcome.

guygriffiths · 2020-10-15T13:58:24Z

If you turn the dynamic dataset cache off, does it fix the issue? This would be helpful to know, since it would narrow down the issue to definitely being cache-based.
Are you able to manage your data as non-dynamic datasets? There is an admin API which could be used to automate dataset configuration as part of your general data-management, but obviously this depends on your specific use-case.

Any further info you can give us (e.g. what sorts of errors you're getting, how frequently you're accessing various things, etc.) may be helpful.

rod-glover · 2020-10-15T17:40:29Z

Thank you for your quick response, Guy.

As I note below, I don't think my particular issue is a caching issue, although it might well be related to a general issue with thread safety, which is to say requests that demand multiple access to the same objects and/or files. If you'd like me to file a separate issue on this, I'd be happy to do so. Meantime I am responding here:

Regarding dynamic datasets:

Dynamic dataset cache is off. I should have stated this, as it is directly relevant to @MatsSundqvist 's issue, which I believe is related but distinct.
We get two types of errors:
- LayerNotDefined: The layer <layer id> was not found on this server.
  - The layer (i.e., file) is in fact present.
- InvalidDimensionValue: <datetime> is not part of the temporal domain for the variable <variable>
  - The time is in fact valid in the file. FWIW, phenomenologically, this seems only to occur for the first time point (index 0) in any given file.
Both these errors exhibit a strange combination of persistent and transient behaviour, specifically:
- Our app provokes these errors fairly consistently, even to being consistent on which dataset(s) they are provoked on.
- And, requests at different times with the same parameters sometimes succeed.
- As an example, I have seen a map tile request fail (verified in the browser developer console as failing), then fill in as I panned the map and the request was repeated (verified in the dev console as succeeding, with exactly the same parameters).
- A Python test script I wrote specifically to issue controlled sets of requests (at a variable rate) shows the same patterns.
These requests typically are issued quickly:
- The application is a browser (JS) based web mapping client. Examining its network activity shows unsurprisingly that the web map launches dozens of requests for tiles more or less simultaneously.
- The test script can (under user control) issue requests at a very high rate, which is to say with sub-millisecond delays between requests. This is to simulate the application's behaviour in a controllable way.
- Another oddity is that for some (indeed, many) datasets, a high rate of requests does not provoke an error. For others, it does. And this behaviour varies over time, meaning that clusters of errors on a particular can disappear after the passage of some human-scale period of time (e.g., minutes). This is quite mystifying.

Regarding non-dynamic datasets:

We currently have about 12,000 datasets. That collection is updated and extended regularly.
We have a script (created in our shop some while ago for ncWMS 1.x, since updated) that creates an ncWMS configuration file with static (non-dynamic) dataset configurations. It takes our instance of ncWMS about 8 hours to start with this file. That makes updates to the configuration a bit of a problem, but it has to happen fairly regularly (on a period of days to weeks). You can see why we'd like to use dynamic datasets.
It may be that the admin API would be somewhat faster, as it would avoid the encoding and decoding of the configuration file. If you have any thoughts on the difference here, I would like to hear them. Even a relatively modest factor would help. Meantime I will look at the API.
Unless the admin API route is an order of magnitude faster, I think we will continue looking into dynamic datasets.

rod-glover · 2020-10-15T18:29:04Z

Regarding the Admin API, do you mean the REST API implemented in uk.ac.rdg.resc.edal.ncwms.NcwmsApiServlet? That service implements only a GET request, and we would need a POST in order to load configuration info via it. Am I missing something?

mtsales · 2020-10-16T06:04:21Z

I'm facing the same problem as @rod-glover . This has been reported long time ago in Edal Reading-eScience-Centre/edal-java#106

I can read the datasets if they are defined in the ncWMS admin but it fails multiple times and randomly when using the datasets as dynamic datasets

The error is always related to reading the time coordinate
2020-10-16 07:26:59 ERROR GridCoordSys:352 - N/A: Error reading time coord= time
java.lang.ArrayIndexOutOfBoundsException: Index 159200 out of bounds for length 8092

After this error occurs, I'm no longer able to use the dataset. It's always reports
2020-10-16 07:51:55 WARN WmsServlet:2641 - Wms Exception caught: "The layer local/NDVI_5600m_2019.nc/NDVI was not found on this server"

Please notice I have cache for dynamic datasets disabled but the same happens when enabling cache

MatsSundqvist · 2020-10-16T07:14:36Z

here is my version of NcwmsCatalogue. I can not guarantee that it's flawless, but I experienced that it worked quite well. The original version did not work at all when several requests for tiles came in.
NcwmsCatalogue.zip

guygriffiths · 2020-10-16T09:52:52Z

@rod-glover The API is an admin API. It was written by a 3rd party and I don't think they documented it. It's implemented in uk.ac.rdg.resc.edal.ncwms.NcwmsAdminServlet and takes POST requests to paths such as .../admin/updateConfig, .../admin/addDataset.

Thanks for the other information. I'll try this with a bunch of data today and see if I can replicate anything. For the timestep problem, can you confirm whether making a request to /wms?request=GetMetadata&version=1.3.0&item=timesteps&day=yyyy-mm-dd&layerName=<whatever> is showing a timestamp which is subsequently invalid in a GetMap request.

For the LayerNotDefined issue, this is a case of ncWMS failing and swallowing the error. I'm going to do some more investigation today, but if I can't replicate anything, I'll create a build which prints the stack trace to the Tomcat logs so you can (hopefully) get some information about the real cause of the error.

guygriffiths · 2020-10-16T10:10:40Z

@mtsales - ~~Can you provide a copy of the NDVI data which is failing, and the request string you're using? I've tried with some NDVI data I picked at random, but can't replicate the problem.~~ Actually I think I can replicate this (or at least get a similar error) with a bunch of concurrent requests. Let me investigate further

mtsales · 2020-10-16T10:29:53Z

@guygriffiths do you need the full stack trace?

guygriffiths · 2020-10-16T10:32:09Z

@mtsales - No, I don't think I need anything for now, thanks. I can trigger similar errors by making enough concurrent requests, which suggests a threading issue. I'll see how far I can get with it and report back.

guygriffiths · 2020-10-16T10:47:53Z

I've implemented a potential fix and would be grateful if you could try out the version here: https://github.com/Reading-eScience-Centre/ncwms/releases/tag/1.5.1-SNAPSHOT-dynamic-dataset-fix

Let me know if this fixes the issues, and if not, what the Tomcat logs say when datasets fail.

mtsales · 2020-10-16T11:13:00Z

@guygriffiths this seems to be working now but I need further testing. I'll let you know beginning of next week

I have dynamic datasets cache disabled but still get lots of exceptions (one per GetMap request) related to cache in ncWMS.log

2020-10-16 13:07:22 ERROR DataCatalogue:495 - Problem adding features to cache
java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.lang.Object jdk.internal.loader.AbstractClassLoaderValue$Sub.key accessible: module java.base does not "opens jdk.internal.loader" to unnamed module @724945

guygriffiths · 2020-10-16T14:42:26Z

@mtsales That's a strange one, I've not seen it before. Are you using Java 11? This latest build is expecting Java 11, and most of the results I get from searching that error list it as a specific Java 9 issue.

mtsales · 2020-10-17T08:21:00Z

@guygriffiths I'm using JDK 11.0.8 (as the only installed java) and Tomcat 9.0.39. I keep getting this exception once per GetMap request although the image renders correctly.

Attached a ncWMS.log (check the time stamp of the exceptions - one per GetMap request) and my config.xml where dynamic datasets cache is disabled.

ncwms.log
config.xml.txt

rod-glover · 2020-10-19T17:07:47Z

@guygriffiths I was off on Friday but will turn my attention to this today.

rod-glover · 2020-10-20T18:40:56Z

@guygriffiths , your update seems to have solved both error cases I encountered. I haven't worked it extremely hard yet, but test cases that fail for our current installation of 2.4.2 work on your snapshot. I will test it harder this afternoon and report back to you.

rod-glover · 2020-10-21T00:57:39Z

@guygriffiths , further testing indicates that the updated version works

for a wide range of datasets, including several that 2.4.2 was failing consistently on
for requests for many tiles simultaneously for the same dataset, across many datasets tested

I'd say this looks like a success. Thank you!

mtsales · 2020-10-21T06:04:58Z

@guygriffiths I can also confirm it solved the issues I was facing with dynamic datasets

guygriffiths · 2020-10-21T11:03:39Z

Good to hear it's working.

@mtsales - The error you're seeing is referring to the feature cache, which caches data for individual map tiles, meaning things like changing the colour scale very fast. I haven't been able to replicate it yet though.

rod-glover · 2020-10-22T16:22:38Z

@guygriffiths , a fairly rigorous testing programme for your fix shows no problems.

rod-glover · 2020-10-26T20:14:32Z

@guygriffiths , I do not want to hassle you, but I do have a question that may affect some deployment scheduling in my work group: Do you know when we might see an official release incorporating this fix?

guygriffiths · 2020-10-27T09:53:08Z

I don't have any plans for a release in the near future, since I'm reasonably busy with other projects. Are you able to use this release? A non-snapshot release would not be any different from this one, other than in name.

rod-glover · 2020-10-27T16:29:51Z

Yes, we can use the snapshot. Thank you.

guygriffiths mentioned this issue Dec 19, 2020

Problem when using Dynamic Services #76

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handling of dynamicDatasetCache is not thread safe #51

handling of dynamicDatasetCache is not thread safe #51

MatsSundqvist commented Mar 13, 2020

rod-glover commented Oct 15, 2020 •

edited

Loading

guygriffiths commented Oct 15, 2020

rod-glover commented Oct 15, 2020 •

edited

Loading

rod-glover commented Oct 15, 2020 •

edited

Loading

mtsales commented Oct 16, 2020 •

edited

Loading

MatsSundqvist commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

guygriffiths commented Oct 16, 2020 •

edited

Loading

mtsales commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

mtsales commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

mtsales commented Oct 17, 2020

rod-glover commented Oct 19, 2020

rod-glover commented Oct 20, 2020

rod-glover commented Oct 21, 2020

mtsales commented Oct 21, 2020

guygriffiths commented Oct 21, 2020

rod-glover commented Oct 22, 2020

rod-glover commented Oct 26, 2020

guygriffiths commented Oct 27, 2020

rod-glover commented Oct 27, 2020

handling of dynamicDatasetCache is not thread safe #51

handling of dynamicDatasetCache is not thread safe #51

Comments

MatsSundqvist commented Mar 13, 2020

rod-glover commented Oct 15, 2020 • edited Loading

guygriffiths commented Oct 15, 2020

rod-glover commented Oct 15, 2020 • edited Loading

rod-glover commented Oct 15, 2020 • edited Loading

mtsales commented Oct 16, 2020 • edited Loading

MatsSundqvist commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

guygriffiths commented Oct 16, 2020 • edited Loading

mtsales commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

mtsales commented Oct 16, 2020

guygriffiths commented Oct 16, 2020

mtsales commented Oct 17, 2020

rod-glover commented Oct 19, 2020

rod-glover commented Oct 20, 2020

rod-glover commented Oct 21, 2020

mtsales commented Oct 21, 2020

guygriffiths commented Oct 21, 2020

rod-glover commented Oct 22, 2020

rod-glover commented Oct 26, 2020

guygriffiths commented Oct 27, 2020

rod-glover commented Oct 27, 2020

rod-glover commented Oct 15, 2020 •

edited

Loading

rod-glover commented Oct 15, 2020 •

edited

Loading

rod-glover commented Oct 15, 2020 •

edited

Loading

mtsales commented Oct 16, 2020 •

edited

Loading

guygriffiths commented Oct 16, 2020 •

edited

Loading