Solr Resiliency Issues #3647

Closed
nickumia-reisys opened this issue Jan 17, 2022 · 4 comments
Labels
bug Software defect or bug component/ssb

Comments

@nickumia-reisys
Contributor

nickumia-reisys commented Jan 17, 2022

Component: Solr
Version: 8.11
Environment: cloud.gov

This issue is primarily for historical reference. The specific issues may be no-ops in reality.

How to reproduce

  1. Specify solr configuration
  2. Create solr service
  3. Load data into solr
  4. Load test (high volume or deep data)

or

  1. Specify solr configuration
  2. Create solr service
  3. Load some data into solr
  4. Wait a while
  5. Try to reindex the rest of the data

Expected behavior

  • Reliable sub-200ms response times

  • Resilient operations
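
To make the sub-200ms target checkable during a load test, here is a minimal sketch of a latency summary. It is not from the original investigation: `measure_latencies_ms` and `summarize` are hypothetical helper names, and `run_query` stands for any zero-argument callable that issues one search request (for example, a small wrapper around `urllib.request.urlopen` pointed at your Solr select handler; the URL is not given in this issue).

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies_ms(run_query: Callable[[], object], n: int) -> List[float]:
    """Call run_query n times; return each call's wall-clock duration in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float], threshold_ms: float = 200.0) -> Dict[str, float]:
    """Summarize latencies against a target (200 ms by default, per this issue)."""
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    # Count requests that missed the "sub-threshold" target.
    over = sum(1 for value in ordered if value >= threshold_ms)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
        "fraction_over_threshold": over / len(ordered),
    }
```

Reporting the 95th percentile and the fraction over the threshold, rather than only an average, makes occasional 2000ms outliers visible even when most requests are fast.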

Actual behavior

  • Variable response times (up to 2000ms)

  • Shard lost. If a shard's leader is lost before another replica can take over, CKAN fails completely.
    image

  • Search reindex fails (reason unknown; suspected cause: ZooKeeper lost)
    image
    image
    image
    image
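
For the lost-leader case, one way to spot the condition before CKAN fails is to inspect the JSON returned by the Solr Collections API `CLUSTERSTATUS` action. The sketch below only parses an already-fetched response; `shards_without_active_leader` is a hypothetical helper, and it assumes the usual response layout (`cluster` → `collections` → `shards` → `replicas`, with each replica carrying `state` and, for the leader, `leader`). Treat it as an illustration, not as the check that was actually run for this issue.

```python
from typing import Any, Dict, List


def shards_without_active_leader(cluster_status: Dict[str, Any]) -> List[str]:
    """Return "collection/shard" names that have no replica that is both leader and active."""
    problems = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {}).values()
            has_leader = any(
                # Accept either the string "true" or a boolean True for the leader flag.
                str(replica.get("leader", "")).lower() == "true"
                and replica.get("state") == "active"
                for replica in replicas
            )
            if not has_leader:
                problems.append(f"{coll_name}/{shard_name}")
    return problems
```

An empty list means every shard currently has an active leader; any entry names a shard where searches and reindexing are likely to fail until a new leader is elected.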

Sketch

[TO DO]

References

nickumia-reisys added the bug label Jan 17, 2022
@nickumia-reisys
Contributor Author

I think we've proven that solr is not resilient across changes. If we define a specification and let it run, it's stable. But if we update-service, solr has a high probability of breaking.

More References for the future:

@jbrown-xentity
Contributor

After discussion, we want to be able to raise the RAM limit that is hardcoded in the code.

  • Update solr brokerpak to allow for variable node size (RAM)
  • Merge and tag the release
  • Utilize this tag in the ssb broker to create more nodes and specify RAM (~32G)
    Focus on prod, but do in parallel on staging.

nickumia-reisys added a commit to GSA/catalog.data.gov that referenced this issue Jan 19, 2022
upon further research, this config seems to be more promising, see discussion in GSA/data.gov#3647
@nickumia-reisys
Contributor Author

nickumia-reisys commented Jan 19, 2022

@nickumia-reisys
Contributor Author

This is more of a historical issue now. It is specifically related to SolrCloud, which we have abandoned in favor of the stability of Solr on ECS. Investigation into this would be covered by the following ticket, although the specific investigation of SolrCloud would need to be a deliberate effort.

As more context for the future, per @jbrown-xentity,

Due to load/compatibility issues, we do not plan to use solr cloud without significant research and development.
