BufferedAdd with many documents slowly increases flush cycle time #523
I have tried to reproduce this with a buffer size of 30,000 and a dataset of 12,000,000 documents against a local Solr on my laptop. There is a noticeable difference between smaller (< 100 KiB) and larger (~600 KiB) docs. There are periodic peaks in the time between flushes that I didn't look into because they're not what this issue is about. The trendlines are near horizontal. There is a small upward slope (m = 0.0002 for small docs and m = 0.0003 for large docs), but not nearly as much as @rms2219 was experiencing. My document ids were generated as in the script below.

I think it's safe to assume that larger buffers lead to slower flushes if everything else remains the same. One thing that could cause a steadily growing delay is a loop bleeding data over into the next iteration. I tried this:

```php
set_time_limit(0);
ini_set('memory_limit', ini_get('suhosin.memory_limit') ?: -1);

$data = [
    'name' => 'Test doc',
    'ignored_cat' => [],
];

for ($i = 0; $i < 12000000; ++$i) {
    $data['id'] = 'test_'.$i;
    $data['ignored_cat'][] = $i;
    $buffer->createDocument($data);
}
```

I couldn't get it to complete even one flush cycle in a reasonable time. The size of the first buffer would already be a couple of gigabytes by the time it flushes. If something like this is at play, it's much more subtle.

@mkalkbrenner @wickedOne Do you agree that this was either caused by BufferedAdd behaviour that has since been changed, or by something external to the BufferedAdd implementation entirely?

My working hypothesis is that the documents themselves grew slightly larger with every buffer. If it was existing data spanning a larger period, perhaps across multiple versions of the platform that generated it, newer documents might be naturally larger because over time more attributes (more fields for Solr) and/or more accurate or elaborate content (larger fields for Solr) would be stored that was never added to existing older documents. Bulk import of that kind of data in chronological order would result in ever increasing buffer sizes and thus slower flushes. We would need sample documents and the script that was used to fill the buffer to confirm or rule out this hypothesis.
@rms2219 I know it's been a while since you reported this. Are you still experiencing it with the latest Solarium version? Could you provide sample documents evenly spread over the corpus and the script you used to add them to the buffer? If it is indeed caused by your documents slowly growing themselves, we wouldn't be able to tell from a contiguous block unless it spans multiple buffers. It would be even better if you could run a document size analysis on the full corpus yourself.
@thomascorthals i like the analysis you've done! i do suspect something is bleeding in the BufferedAdd plugin as i've been experiencing similar behaviour. tried to look into it a couple of times without finding a major pitfall. using a NoopEventListener makes it a bit more performant though
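The no-op listener idea can be sketched without the library. In a real setup the class would implement `Psr\EventDispatcher\EventDispatcherInterface` and be passed to the Solarium `Client` constructor in place of a full Symfony dispatcher (an assumption based on Solarium's PSR-14 support); here it's duck-typed so the sketch stands alone:

```php
<?php
// Minimal no-op event dispatcher sketch: every event is returned
// untouched and no listeners are ever consulted, so per-document
// event overhead during buffered adds drops to a single method call.
final class NoopEventDispatcher
{
    public function dispatch(object $event): object
    {
        return $event;
    }
}

$dispatcher = new NoopEventDispatcher();
$event = new stdClass();

// The dispatcher hands back the exact same event instance.
var_dump($dispatcher->dispatch($event) === $event); // bool(true)
```

This trades away all plugin/event extensibility for speed, which is usually acceptable in a one-off bulk import script.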
@wickedOne You wouldn't happen to have a publicly available corpus and test script that can reproduce this consistently? There's one more thing I'm thinking of, but it's just a wild guess at the moment. It might be something in the Solr configuration or the server it's running on; or something in the configuration of the machine Solarium is running on. If I try with the exact same script and data and can't reproduce it, we're probably looking at a cause outside Solarium. That would be a whole other rabbit hole. If I can reproduce it, we might get through the looking glass.
it's not really publicly available, but here's the command i use to populate the cores. as you can see i disable the debug class loader for performance reasons, just as my solarium client is equipped with no-op dispatchers. i usually run it with the default buffer size (200) as i've never seen noteworthy improvements from increasing it. one of the cores is ~28.000.000 documents and ends up taking about ~45 min for a complete rebuild. feel free to ask for more information / code as this has been a thorn in my side for quite a while 😎
I have been wondering if using an … Haven't tried it yet because it would introduce a BC break for …

It would also change the timing of the next flush when setting a lower buffer size mid-run. That shouldn't make a functional difference for the user though.

Not sure if this is worth pursuing as long as I can't reproduce this issue.
willing to do a test run if you have an implementation. |
I have a working implementation in … Averaged over 3 runs of both the current …
thanks for the implementation and extensive analysis!
Time to revisit this once again. #1037 added support for JSON requests to …

It still annoys me that I could never reproduce the issue to begin with. If the recursive calls in our code caused it, it will be gone completely with JSON requests. If it has something to do with the string manipulation, the effect could be much smaller. If it's something on the Solr side after all, it might be XML specific. But there is no way for me to tell. @wickedOne? 😉
I've also noticed that PHP 8.2 performs noticeably faster than previous versions in the benchmarks. |
I'm using the BufferedAdd plugin to index my documents. In some instances, I'll need to index tens of millions of documents. While testing, I set my buffer size to 30,000. I noticed that when my indexing process begins, flushing those 30K documents takes about a second. After a couple million documents, that cycle takes a little longer (~1 second more). This goes on and on, and at around 10 or 11 million documents indexed, I'm up to about 11 seconds between cycles. Is there something that Solarium is holding onto, or writing to, that would be slowing this down? I should add that this happens regardless of whether the Solr index is empty or already populated with millions of other documents.
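To turn the reported symptom into data, per-cycle timings can be collected with a small harness. This is a hedged sketch: `timeFlushCycles` and the no-op callable are made up for illustration, with the callable standing in for a real `$buffer->createDocument()` call against the Solarium BufferedAdd plugin.

```php
<?php
// Record the wall-clock duration of each flush cycle of $bufferSize
// documents, so the numbers can be plotted and a trendline fitted.
function timeFlushCycles(callable $addDocument, int $total, int $bufferSize): array
{
    $timings = [];
    $start = microtime(true);
    for ($i = 0; $i < $total; ++$i) {
        $addDocument($i);
        if (($i + 1) % $bufferSize === 0) {
            // BufferedAdd would flush here; time the whole cycle.
            $now = microtime(true);
            $timings[] = $now - $start; // seconds for this cycle
            $start = $now;
        }
    }
    return $timings;
}

// 100,000 no-op "documents" with a 30,000 buffer gives 3 full cycles.
$timings = timeFlushCycles(fn (int $i) => null, 100000, 30000);
print count($timings) . " cycles timed\n";
```

If the recorded durations climb steadily even with documents of constant size, the cause is in the client or server; if they only climb when the documents themselves grow, the corpus is the culprit.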