Large XML file seems to not be "streaming", eating GBs of RAM #29
Comments
@jakeonrails If this issue is still on your radar, can you see if the same problem occurs with current master? The behavior you describe does sound like a bug, but before digging deeper, I'd like to verify the issue wasn't resolved by upgrading Nokogiri to v1.5
@ezkl I can't take time to test this right now, but I will try to test in the next couple of days. I hope it does work now, since the parsing with sax-machine was a lot cleaner than what I resorted to a few months ago, which is to use a monkey-patched Nokogiri Reader to parse out the chunk of XML for the node I want and pass that to sax-machine.
@jakeonrails What method are you using to load the XML file?
@ezkl I ended up using this technique here: http://stackoverflow.com/a/9223767/586983 You can see my original code, which spurred me to write this GitHub issue, at the top of that SO question.
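For context, here is a minimal sketch of the chunking technique from that Stack Overflow answer: walk the file with Nokogiri's pull parser and hand each record's subtree to SAXMachine as a small standalone string. Class and element names are illustrative assumptions, not code from the thread.

```ruby
require 'nokogiri'
require 'sax-machine'

# Hypothetical record mapping; field names are illustrative only.
class Record
  include SAXMachine
  element :id
  element :title
end

# Walk the file with Nokogiri's pull parser. Each matching subtree is
# serialized as a small standalone string and handed to SAXMachine,
# so only one record's XML is held in memory at a time.
reader = Nokogiri::XML::Reader(File.open('huge.xml'))
reader.each do |node|
  next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT &&
              node.name == 'record'
  record = Record.parse(node.outer_xml)
  puts record.title
end
```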
@jakeonrails Thanks for the link and background info. I had a bit of a brain fart yesterday. Don't bother testing against HEAD at the moment. A streaming interface was implemented by @gregwebs, but his work was never merged into master (see: #18 and #24). I've been using a fork that includes Greg's work in production without issue for nearly a year, but never with XML files quite as large as yours. Once I've finished merging Greg's work, I'd love to get your feedback on performance with large files.
I have been using my fork on files of about that size in production.
So now we have 2 issues open; we should probably close one.
Has this been merged?
+1
I suppose another question might be: would pauldix prefer that gregwebs be made the new maintainer of the Ruby gem? It's a bit confusing having multiple versions.
I will not be a maintainer, but @ezkl might.
If someone submits a PR, I'll merge it in.
Hi, I opened PR #47; I think it should solve the problem.
That solves a different use case than the one I had. My branch allows for giant streaming collections.
From my point of view, the main point of using the SAX interface is streaming, rather than reading everything into memory at once. Does the current sax-machine release support any kind of streaming? I think not. I'm curious what uses people have for SAX without streaming, but that's another topic, I guess.
Has this issue been abandoned?
It looks like all the work by @ezkl and @gregwebs has fallen far behind current master, so it's not possible to review/merge those changes as-is. I don't feel like streaming features will be added to sax-machine in the near future, unless someone is willing to reimplement/port that work. So it basically stays usable for small XML files, especially for RSS/Atom feeds via feedjira. For streaming, I'd suggest considering Nokogiri SAX or Ox SAX.
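For the streaming route, here is a minimal Nokogiri SAX sketch; the element name is an illustrative assumption.

```ruby
require 'nokogiri'

# A plain Nokogiri SAX handler: callbacks fire as the parser walks the
# document, so nothing requires the whole file to be resident in memory.
class TitleHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @in_title = (name == 'title')
  end

  def characters(text)
    print text if @in_title
  end

  def end_element(name)
    puts if @in_title && name == 'title'
    @in_title = false
  end
end

# Parsing from an IO keeps memory flat regardless of file size.
Nokogiri::XML::SAX::Parser.new(TitleHandler.new).parse(File.open('huge.xml'))
```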
I'm using sax-machine 1.2.0 and nokogiri 1.6.3, parsing a 1 GB XML file by passing an IO object to a SAXMachine parser, and it appears to be streaming, seeing as the virtual memory usage of the process doesn't go above 100 MB (see the sketch below).
I didn't test with Ox or Oga, but it looks like the Ox handler expects xml_text to be a String and can't currently stream from an IO. I don't see an obvious reason why it couldn't accept an IO directly, making a StringIO only when given a String.
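For reference, a minimal sketch of the IO-based SAXMachine usage described above, assuming the default Nokogiri-backed handler; class and element names are hypothetical.

```ruby
require 'sax-machine'

# Hypothetical mapping; class and element names are assumptions.
class Item
  include SAXMachine
  element :name
end

class Catalog
  include SAXMachine
  elements :item, as: :items, class: Item
end

# Passing an IO instead of a String lets the Nokogiri-backed handler
# read the file incrementally rather than slurping it all at once.
File.open('catalog.xml') do |io|
  catalog = Catalog.parse(io)
  puts catalog.items.size
end
```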
@torbiak Thanks for looking into this. Regarding IO parsing, you're totally correct; I've changed the Ox handler to support both String and IO, please see the attached commit. I'm wondering why you're getting such good memory footprint results. Can I see the full example you're running? I'm thinking about putting some benchmarks together, so your example could help.
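A sketch of one way a handler can accept either a String or an IO, in the spirit of that change; this is an illustration, not the actual commit.

```ruby
require 'ox'
require 'stringio'

# Dispatch on the input type: pass an IO straight through to Ox, and
# wrap a String in StringIO so Ox always parses from an IO.
def sax_parse(handler, xml)
  io = xml.respond_to?(:read) ? xml : StringIO.new(xml)
  Ox.sax_parse(handler, io)
end
```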
I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to be streaming or eating the file in chunks; rather, it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs upwards of 2.5 GB of RAM. I don't know where it stops growing, because I ran out of memory.
On a smaller file (50 MB) it also appears to be loading the whole file. My task iterates over the records in the XML file and saves each record to a database (a sketch of this pattern is included below). It takes about 30 seconds of "idling", and then all of a sudden the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing into memory.
Is there something I am overlooking?
Many thanks,
@jakeonrails
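For illustration, a minimal sketch of the kind of non-streaming usage that produces the behavior in the report above: SAXMachine builds the entire object graph before iteration begins, which matches the "idle, then queries" pattern. Class and element names are hypothetical, not the reporter's actual code.

```ruby
require 'sax-machine'

# Hypothetical record mapping for the file described above.
class Listing
  include SAXMachine
  element :id
end

class Feed
  include SAXMachine
  elements :listing, as: :listings, class: Listing
end

feed = Feed.parse(File.read('big.xml'))  # whole file read into a String first
feed.listings.each { |l| puts l.id }     # work starts only after parsing ends
```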