Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handling of dynamicDatasetCache is not thread safe #51

Open
MatsSundqvist opened this issue Mar 13, 2020 · 23 comments
Open

handling of dynamicDatasetCache is not thread safe #51

MatsSundqvist opened this issue Mar 13, 2020 · 23 comments

Comments

@MatsSundqvist
Copy link

It seems like the handling of dynamicDatasetCache needs to be made thread safe. Currently the dynamic datasets is not working in a stable way.
Own test with readlock and writelock seems to work well

@rod-glover
Copy link

rod-glover commented Oct 15, 2020

@MatsSundqvist , I believe I may be encountering problems with the same root. We have configured a large collection of large datasets as dynamic datasets. We get frequent, unpredictable and inconsistent errors from ncWMS when these datasets are accessed (both by our web mapping app and by some test code I created). The obvious suspicion is that the dynamic dataset code is not handling these parallel requests safely, i.e., is not thread-safe.

I am a complete Java noob and so is everyone else in our shop (we primarily use Python and JS). Since ncWMS dynamic datasets (working stably) would solve some major headaches we have, I would appreciate any advice you may have on resolving this difficulty. In particular, to what did you refer in your remark

Own test with readlock and writelock seems to work well

?

Any other thoughts or advice are also very welcome.

@guygriffiths
Copy link
Collaborator

  • If you turn the dynamic dataset cache off, does it fix the issue? This would be helpful to know, since it would narrow down the issue to definitely being cache-based.
  • Are you able to manage your data as non-dynamic datasets? There is an admin API which could be used to automate dataset configuration as part of your general data-management, but obviously this depends on your specific use-case.

Any further info you can give us (e.g. what sorts of errors you're getting, how frequently you're accessing various things, etc.) may be helpful.

@rod-glover
Copy link

rod-glover commented Oct 15, 2020

Thank you for your quick response, Guy.

As I note below, I don't think my particular issue is a caching issue, although it might well be related to a general issue with thread safety, which is to say requests that demand multiple access to the same objects and/or files. If you'd like me to file a separate issue on this, I'd be happy to do so. Meantime I am responding here:

Regarding dynamic datasets:

  • Dynamic dataset cache is off. I should have stated this, as it is directly relevant to @MatsSundqvist 's issue, which I believe is related but distinct.
  • We get two types of errors:
    • LayerNotDefined: The layer <layer id> was not found on this server.
      • The layer (i.e., file) is in fact present.
    • InvalidDimensionValue: <datetime> is not part of the temporal domain for the variable <variable>
      • The time is in fact valid in the file. FWIW, phenomenologically, this seems only to occur for the first time point (index 0) in any given file.
  • Both these errors exhibit a strange combination of persistent and transient behaviour, specifically:
    • Our app provokes these errors fairly consistently, even to being consistent on which dataset(s) they are provoked on.
    • And, requests at different times with the same parameters sometimes succeed.
    • As an example, I have seen a map tile request fail (verified in the browser developer console as failing), then fill in as I panned the map and the request was repeated (verified in the dev console as succeeding, with exactly the same parameters).
    • A Python test script I wrote specifically to issue controlled sets of requests (at a variable rate) shows the same patterns.
  • These requests typically are issued quickly:
    • The application is a browser (JS) based web mapping client. Examining its network activity shows unsurprisingly that the web map launches dozens of requests for tiles more or less simultaneously.
    • The test script can (under user control) issue requests at a very high rate, which is to say with sub-millisecond delays between requests. This is to simulate the application's behaviour in a controllable way.
    • Another oddity is that for some (indeed, many) datasets, a high rate of requests does not provoke an error. For others, it does. And this behaviour varies over time, meaning that clusters of errors on a particular can disappear after the passage of some human-scale period of time (e.g., minutes). This is quite mystifying.

Regarding non-dynamic datasets:

  • We currently have about 12,000 datasets. That collection is updated and extended regularly.
  • We have a script (created in our shop some while ago for ncWMS 1.x, since updated) that creates an ncWMS configuration file with static (non-dynamic) dataset configurations. It takes our instance of ncWMS about 8 hours to start with this file. That makes updates to the configuration a bit of a problem, but it has to happen fairly regularly (on a period of days to weeks). You can see why we'd like to use dynamic datasets.
  • It may be that the admin API would be somewhat faster, as it would avoid the encoding and decoding of the configuration file. If you have any thoughts on the difference here, I would like to hear them. Even a relatively modest factor would help. Meantime I will look at the API.
  • Unless the admin API route is an order of magnitude faster, I think we will continue looking into dynamic datasets.

@rod-glover
Copy link

rod-glover commented Oct 15, 2020

Regarding the Admin API, do you mean the REST API implemented in uk.ac.rdg.resc.edal.ncwms.NcwmsApiServlet? That service implements only a GET request, and we would need a POST in order to load configuration info via it. Am I missing something?

@mtsales
Copy link

mtsales commented Oct 16, 2020

I'm facing the same problem as @rod-glover . This has been reported long time ago in Edal Reading-eScience-Centre/edal-java#106

I can read the datasets if they are defined in the ncWMS admin but it fails multiple times and randomly when using the datasets as dynamic datasets

The error is always related to reading the time coordinate
2020-10-16 07:26:59 ERROR GridCoordSys:352 - N/A: Error reading time coord= time
java.lang.ArrayIndexOutOfBoundsException: Index 159200 out of bounds for length 8092

After this error occurs, I'm no longer able to use the dataset. It's always reports
2020-10-16 07:51:55 WARN WmsServlet:2641 - Wms Exception caught: "The layer local/NDVI_5600m_2019.nc/NDVI was not found on this server"

Please notice I have cache for dynamic datasets disabled but the same happens when enabling cache

@MatsSundqvist
Copy link
Author

here is my version of NcwmsCatalogue. I can not guarantee that it's flawless, but I experienced that it worked quite well. The original version did not work at all when several requests for tiles came in.
NcwmsCatalogue.zip

@guygriffiths
Copy link
Collaborator

@rod-glover The API is an admin API. It was written by a 3rd party and I don't think they documented it. It's implemented in uk.ac.rdg.resc.edal.ncwms.NcwmsAdminServlet and takes POST requests to paths such as .../admin/updateConfig, .../admin/addDataset.

Thanks for the other information. I'll try this with a bunch of data today and see if I can replicate anything. For the timestep problem, can you confirm whether making a request to /wms?request=GetMetadata&version=1.3.0&item=timesteps&day=yyyy-mm-dd&layerName=<whatever> is showing a timestamp which is subsequently invalid in a GetMap request.

For the LayerNotDefined issue, this is a case of ncWMS failing and swallowing the error. I'm going to do some more investigation today, but if I can't replicate anything, I'll create a build which prints the stack trace to the Tomcat logs so you can (hopefully) get some information about the real cause of the error.

@guygriffiths
Copy link
Collaborator

guygriffiths commented Oct 16, 2020

@mtsales - Can you provide a copy of the NDVI data which is failing, and the request string you're using? I've tried with some NDVI data I picked at random, but can't replicate the problem. Actually I think I can replicate this (or at least get a similar error) with a bunch of concurrent requests. Let me investigate further

@mtsales
Copy link

mtsales commented Oct 16, 2020

@guygriffiths do you need the full stack trace?

@guygriffiths
Copy link
Collaborator

@mtsales - No, I don't think I need anything for now, thanks. I can trigger similar errors by making enough concurrent requests, which suggests a threading issue. I'll see how far I can get with it and report back.

@guygriffiths
Copy link
Collaborator

I've implemented a potential fix and would be grateful if you could try out the version here: https://github.com/Reading-eScience-Centre/ncwms/releases/tag/1.5.1-SNAPSHOT-dynamic-dataset-fix

Let me know if this fixes the issues, and if not, what the Tomcat logs say when datasets fail.

@mtsales
Copy link

mtsales commented Oct 16, 2020

@guygriffiths this seems to be working now but I need further testing. I'll let you know beginning of next week

I have dynamic datasets cache disabled but still get lots of exceptions (one per GetMap request) related to cache in ncWMS.log

2020-10-16 13:07:22 ERROR DataCatalogue:495 - Problem adding features to cache
java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.lang.Object jdk.internal.loader.AbstractClassLoaderValue$Sub.key accessible: module java.base does not "opens jdk.internal.loader" to unnamed module @724945

@guygriffiths
Copy link
Collaborator

@mtsales That's a strange one, I've not seen it before. Are you using Java 11? This latest build is expecting Java 11, and most of the results I get from searching that error list it as a specific Java 9 issue.

@mtsales
Copy link

mtsales commented Oct 17, 2020

@guygriffiths I'm using JDK 11.0.8 (as the only installed java) and Tomcat 9.0.39. I keep getting this exception once per GetMap request although the image renders correctly.

Attached a ncWMS.log (check the time stamp of the exceptions - one per GetMap request) and my config.xml where dynamic datasets cache is disabled.

ncwms.log
config.xml.txt

@rod-glover
Copy link

@guygriffiths I was off on Friday but will turn my attention to this today.

@rod-glover
Copy link

@guygriffiths , your update seems to have solved both error cases I encountered. I haven't worked it extremely hard yet, but test cases that fail for our current installation of 2.4.2 work on your snapshot. I will test it harder this afternoon and report back to you.

@rod-glover
Copy link

@guygriffiths , further testing indicates that the updated version works

  • for a wide range of datasets, including several that 2.4.2 was failing consistently on
  • for requests for many tiles simultaneously for the same dataset, across many datasets tested

I'd say this looks like a success. Thank you!

@mtsales
Copy link

mtsales commented Oct 21, 2020

@guygriffiths I can also confirm it solved the issues I was facing with dynamic datasets

@guygriffiths
Copy link
Collaborator

Good to hear it's working.

@mtsales - The error you're seeing is referring to the feature cache, which caches data for individual map tiles, meaning things like changing the colour scale very fast. I haven't been able to replicate it yet though.

@rod-glover
Copy link

@guygriffiths , a fairly rigorous testing programme for your fix shows no problems.

@rod-glover
Copy link

@guygriffiths , I do not want to hassle you, but I do have a question that may affect some deployment scheduling in my work group: Do you know when we might see an official release incorporating this fix?

@guygriffiths
Copy link
Collaborator

I don't have any plans for a release in the near future, since I'm reasonably busy with other projects. Are you able to use this release? A non-snapshot release would not be any different from this one, other than in name.

@rod-glover
Copy link

Yes, we can use the snapshot. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants