gumbo-next #295
base: master
Conversation
@vmg,
Note that the document-fragment is followed by the string "svg path". We will have to modify html5lib_adapter_test.py to deal with these, and because you pass in the beginning context as a GumboTag, we will need to allow a namespace to be passed in as well (to separate an svg "a" or svg "title" from an html "a" or an html "title"). Furthermore, we will have to grow the list of recognized tags in the gumbo enum to include all of the svg elements and all of the mathml elements so that an actual GumboTag enum value exists with which to specify the context. Right now the svg path element has no enum assigned to it and so can't be specified as the context for the document fragment even if we wanted to. FWIW, I already grew my own master to include the extra 100 or so tags from svg and mathml if you want them. Also note, the hack to remove the template "content" line is still needed in the html5lib-tests master, as "content" is not part of how gumbo has decided to implement templates. It was that hack that was removing the svg attributes beginning with "content" from my test output, which literally drove me crazy.
@vmg, that hack to remove the content lines in templates in html5lib_adapter_test.py is also what messed up the alignment when nested (double or triple) templates were used with html5lib-tests master. So instead of 12 failures, altversion2 really had just 3 failures. This was, of course, before all of the new document fragment support and foreign document fragment tests were added.
Thanks for the heads up. I'll see about implementing the foreign contexts for fragments now.
For your convenience: I've pushed a commit that unifies all the Obviously all the generated files are already checked in, so you don't need |
OK, I'll try to push some more work for the tip of the html5lib tests later this week. For now I would like to get this branch merged and/or at least approved by @nostrademons, as it is already passing 100% of html5lib 0.95 tests and it's a solid improvement over what we had before. Any comments or code review?
… set it as appropriate.
…for lexing, to ease debugging non-printable characters.
…or. (The whole error handling really needs to be redone, it's not very helpful to users.)
…make sure their input mechanisms can accept this without relying on strlen.
@vmg
(Is this even a valid test failure?) Either way, your tree is in excellent shape!
The 'if there is no such element, then abort these steps and instead act as described in the "any other end tag" entry above' clause in the AAA results in the final
@gdnedders
FWIW, there is a newer version of the AAA in the template pull that I did not incorporate as it did not specifically deal with templates and so I did not think it was different from the version in master. I will incorporate the new AAA code from the official template pull request to see if that fixes that problem case. |
Yeah, I'd rewritten the AAA in the template branch to bring it up to the current spec. I think my reason was that there was some bug I was trying to track down, and it was impossible while the code was so far out of sync with the spec. I agree that we should get the template code merged into a unified branch first, so people can start building off it without worrying about conflicts. I think a useful milestone for releasing v0.10.0 would be to have it pass all html5lib tests from trunk.
@vmg @nostrademons @gdnedders With this commit in place, we do pass the AAA test we failed earlier.
@nostrademons If it would help, my altversion3 branch has the full template stuff and the updated AAA algorithm on top of your master, with no API changes from your master. If you have any interest, let me know and I will create a pull, or you can grab it yourself from that branch.
These are the same commits as in @vmg's pull request, plus the fixed AAA algorithm and template tag tests, right? Yes, it'd help to have that around as a pull request. I'm still figuring out the GitHub way to get that merged onto a separate branch within google/gumbo - I think I need to create the new branch, check out your altversion3 locally, merge the commits, and then push it upstream. I also want to cherry-pick a couple of vmg's fixes for the Travis Mac build back to 0.9.3 before doing the release for that.
All of kevinhendricks & my commits in this pull request have now been merged. Status on @vmg's:
Thanks for all your hard work on this. Gumbo's progressed a lot since last week, and features that were on my vague far-future wishlist are nearly in. We're at 3 failing html5lib tests (once my pull request for handling gets in) now, and that includes all of the document-fragment ones.
Thank you very much for the thorough feedback, @nostrademons! I've replied to your concerns on the individual commits. Could you please consider merging the memory revamp in the 1.0 branch? Most of the stuff I'm working on depends on it, and I'd love to be able to track one of the upstream branches.
FWIW, I agree here; the memory changes are needed in my tree as well, and it would be nice to have them in an official tree. Passing the parser around where it is not needed just to allocate memory clutters the code. And devising your own memory allocators that are thread safe and overrun safe is not trivial, which is good reason why system malloc and free should not be replaced except for testing purposes. As for those who like to play with the tree as a DOM, I think you overstate the memory issues. We need to relocate the html page relative to its images, so finding and updating link nodes after parsing is crucial for us. Right now, I have to end-run all this by parsing the tree, reserializing it on the fly with the changes, and then reparsing it! This is silly. Please don't handcuff the usefulness of your code. Instead, make a clear statement that "gumbo-parser will not handle bug reports where the parse tree has ever been modified" and then let people who need to build on your code do so instead of having to permanently fork it. Kevin
@vmg if you are building off of v1.0.0, please see the patch in my newest issue here about the broken rtc handling fix. It is not needed in your tree; it is only needed if you are rebasing off of v0.10.0, v1.0.0 or master here. It fixes the ruby-related test failure, leaving us with just 3 foreign document fragment failures to track down.
@vmg, for completeness, can you include in the docs which "mph" implementation you are using? I could not find an "mph" project that had a command line that grokked your options: /*
Where exactly can I download this?
I assume 'mph' refers to CMPH, the C Minimal Perfect Hash library. I'd like to see a more thorough pointer in the source code as well.
So I did some investigation into the memory allocator patch. No firm decision yet, but I see a few issues, a question, maybe an alternative, and some extra data points:

Removing the GumboParser argument from create_node and callers isn't going to work, unfortunately. The problem is the next/prev patch that was recently merged into 0.10.0. This creates an intrusive linked list between nodes in document order (which is sometimes distinct from DOM order, since nodes may be moved around and reparented). It's necessary for BeautifulSoup compatibility, likely for a few other libraries (IIRC libxml also had a similar linked list), and for refactoring tools, where the best way to preserve the original formatting of the document is to build up a list of edit operations in reverse order of where they appear in the doc. The state needed to build this doubly-linked list is stored in the GumboParserState.

One option might be to split the creation of the node and its insertion into the document into two functions. This is arguably a good idea anyway, but the reason we don't do this currently is because insert_element is called to reparent nodes in the AAA/foster-parenting cases, and so this would mess up the linked list upon reparenting and create a cycle in it. Since it's used for deletion of the output tree, this would lead to a crash on cleanup. Client code that wants to create a node and add it into the tree would face the same issue; unless they're careful to splice the next/prev pointers appropriately, it'll crash on cleanup. (This, incidentally, is a good example of why I'm wary of encouraging people to treat the parse tree as mutable. I would much rather have people take a Gumbo parse tree and walk it to build up, say, a libxml DOM tree - this also has the benefit of making all the tooling based on libxml available to Gumbo.)
Another option - one I'm warming up to, since it would fix #68, #6, and #25 as well - is to separate the parser and treebuilder, such that the parser invokes a number of callback functions to perform tree construction and has no direct knowledge of the particular tree data structure. html5lib, Hubbub, and html5ever all function like this, and IIUC the reason Servo didn't use Gumbo and instead created html5ever was so they could get fast DOM construction (there are probably all sorts of other reasons as well - Gumbo was never designed to be part of a browser engine, and wouldn't fulfill their main goal of showcasing Rust).

The reason I didn't do this in Gumbo's original design is because there are complex interactions between parsing and the DOM tree in HTML5, with some nodes needing to be reparented and the existing tree sometimes needing to be inspected to determine what to do. Hubbub's interface consists of 18 callback functions, with nodes represented as void* and assumed to be reference counted, and Hubbub's functionality is less than Gumbo's (it doesn't record source positions, original text, or document order). So there's a definite complexity tradeoff, but perhaps it could be hidden by making the GumboNode* interface the default treebuilder and exposing the callbacks only for people who really need the additional flexibility and performance.

As for memory allocator thread-safety - I'm curious, would the proposed design be able to handle the case where you have a per-request arena that is freed all at once when the request completes, with multiple requests in multiple threads? This is a really common pattern in server software - it's why custom allocators are supported to begin with - and it also applies to large-scale data processing, where you may want to parse a number of documents in parallel, free the parse trees once each document is done processing, and not pay any allocation cost beyond a pointer bump.
It seems like a global allocator doesn't allow this - the point is to let each request/doc allocate into its own arena, not a global one. Am I missing anything here?

Finally, some benchmarks:

- Current v1.0.0 candidate master: hacker_news.html in 4828 microseconds.
- A dead-simple arena allocator that just allocates a gig of memory and doesn't worry about going over it: hacker_news.html in 4145 microseconds. This is also a good proxy for how much time is spent in the memory allocator in total (~15-20%... yeowch, that could use some improvement), since it is about as close to a theoretical maximum as we can get (allocation is just an aligned pointer bump, deallocation is free).
- Removing the custom memory allocator entirely: hacker_news.html in 4480 microseconds. I wanted to float this option since I had a hunch that the pointer indirection on every allocation may kill as much performance as using a different allocator gains, but it turns out the penalty is only ~2-3%, not nearly as much as a good arena can get you.

I'd tried to get a benchmark for @vmg's patch as well, but was stymied by the create_node issue above.
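For readers unfamiliar with arenas: the "dead-simple arena allocator" configuration above boils down to a bump allocator, where allocation is an aligned pointer bump and teardown of the whole parse tree is a single free. This is a hedged sketch for illustration only, not Gumbo's code; the `Arena` type and function names are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;  /* start of the reserved block */
    size_t   size;  /* total capacity in bytes */
    size_t   used;  /* bytes handed out so far */
} Arena;

/* Reserve one block up front; individual allocations never call malloc. */
static int arena_init(Arena *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Allocation is just an aligned pointer bump. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (a->used + (sizeof(void *) - 1)) & ~(sizeof(void *) - 1);
    if (aligned + n > a->size) return NULL;  /* out of arena space */
    a->used = aligned + n;
    return a->base + aligned;
}

/* Deallocating the whole tree is one free(). */
static void arena_destroy(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->used = a->size = 0;
}
```

Per-request arenas in separate threads need no locking, since each thread bumps only its own pointer - which is the server-software pattern described above.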
Thanks for the detailed analysis as usual, @nostrademons. Here are a couple notes:
FWIW,
Actually it does not. I finally found mph-1.2.tar.gz in an old Linux FTP repository. Once you have it, the generated code needs to be very slightly modified to allow for case conversion on lookup. Since I use a larger recognized tag set, I will document the process in a README of some sort and post it for you. One other side effect is the need to use GNU sed when remaking the pieces, as the standard sed on BSD/Mac OS X will not work. But since automake and friends do not exist as standard on Mac OS X anyway, this is minor.
Update: this README is now posted as an "Issue" here, as it seemed to be the easiest way to convey the information.
Hmm, I guess clone_node does leak memory. I'd thought that I had a leak-checking custom allocator in the unit tests to catch things like this, but it looks like it was only in the tokenizer/utf8/utility tests. Anyway, that's an oversight - other parser-inserted nodes (eg. from create_element_from_token) are added to the next/prev list at the point that the parser creates them, and so clone_node should probably insert the cloned node right after the original. The more I think about it, the more I think Gumbo should just use an arena as its default allocator and remove the pluggable allocator machinery, because:
Another low-hanging memory trick I'd like to try is to fiddle with the default allocation sizes for vectors and stringbuffers. These were arbitrarily chosen; I could pretty easily run some statistical analysis on common attribute/child/string sizes to pick ones that would require minimal reallocs. With an arena, the amount of wasted memory is less of an issue, and avoiding a realloc will give much better locality. Do either of these sound like they'd solve your performance problems? Basically the change would be to remove the allocator functions, add an arena to GumboOutput that's destroyed when the parse tree is destroyed, and get a blanket ~15-20% speedup across the board from memory-allocation improvements. I unfortunately don't have Google's corpus to play with anymore, but I do have a few local corpora of ~100K documents each that I could run some stats over.
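The default-size tuning described above is a simple trade-off: with a doubling growth policy, a larger initial capacity cuts the number of reallocs at the cost of wasted space, and an arena makes the waste cheap. A toy illustration (not Gumbo code; the function name is made up):

```c
#include <stddef.h>

/* How many realloc calls a doubling vector performs to reach `final`
 * elements starting from `initial` capacity. Larger initial sizes trade
 * wasted space for fewer reallocs and better locality. */
static unsigned reallocs_needed(size_t initial, size_t final) {
    unsigned n = 0;
    size_t cap = initial;
    while (cap < final) {
        cap *= 2;
        n++;
    }
    return n;
}
```

For example, a vector that starts at 4 slots and ends up holding 100 children reallocates 5 times, while one that starts at 16 slots reallocates only 3 times - the kind of statistic the corpus analysis above would feed.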
Forgive me, but what I know about arenas you could put in a thimble. My use case is within a C++ (Qt-based) ebook editing app that is multithreaded. We would be using it to parse anywhere from 1 to, say, 200 documents (chapters). We would not need the tree to be kept around after each parse, but we do need a very small memory footprint, fast parsing time, and the ability to make small changes once parsed, all in parallel in a multithreaded environment.
Reserving a gig of memory would not be something that fits within our or our users needs at all.
Unless I am missing something… if instead you want the order after potential parser reshuffling and with implied nodes added, then walking the DOM with a depth-first recursion should do that. Either way, gumbo keeps more than enough state information in the DOM to create those links any time after parsing, without having to carry along a counter in the parser state.
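The depth-first walk suggested here can be sketched as below. The `Node` struct is a simplified stand-in, not the real GumboNode, and note the caveat from earlier in the thread: a post-parse walk yields DOM order, which can differ from source-document order when the parser reparents nodes.

```c
#include <stddef.h>

/* Simplified stand-in for GumboNode (the real struct differs). */
typedef struct Node {
    struct Node **children;
    size_t        num_children;
    struct Node  *next;  /* filled in by the walk below */
    struct Node  *prev;
} Node;

/* Pre-order depth-first walk that threads next/prev links through the
 * tree after parsing, instead of tracking them in GumboParserState.
 * `last` carries the most recently visited node across the recursion. */
static void link_nodes(Node *n, Node **last) {
    n->prev = *last;
    n->next = NULL;
    if (*last) (*last)->next = n;
    *last = n;
    for (size_t i = 0; i < n->num_children; i++)
        link_nodes(n->children[i], last);
}
```

Calling `link_nodes(root, &last)` with `last` initialized to NULL links the whole tree in one pass, so the links could live in a wrapper or binding rather than in the parser itself.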
I gotta side with @kevinhendricks here. Unfortunately we don't have much weight on the direction of the project, but it seems like many of the features being added or proposed to Gumbo just make common use cases (or supposedly common -- it seems like @kevinhendricks and us at GitHub are working on completely different projects, and yet we need the exact same features) either significantly more complex or significantly more expensive, with very dubious benefits.

The memory model is still in a bad spot, and switching to an arena seems like the sloppy approach to fixing the memory issues, instead of actually refactoring the existing code to be less wasteful when allocating. Again, it also makes any kind of DOM modification more complex because of the need to insert external objects into the arena.

I'm afraid I'm running out of open-source time, so I'm going to freeze our fork of Gumbo at v1.0.0, with my memory handling patches, and with the prev/next pointers removed to simplify all the external APIs, and carry on with that. I'd really love to be able to contribute more stuff upstream, but all my changes from now on will be based on the new memory model, and hence not trivially portable. You can follow the tree here: https://github.com/vmg/gumbo-parser/tree/v1.0.0 and of course you're free to cherry-pick anything you want.
@vmg sorry to hear that. I really would love to see your improvements that build off of the memory changes. If you do not mind, Sigil (the project I volunteer for) will use your repo as our starting point, as our views on what is needed now and in the short-term future seem to coincide: correctness first, fast parsing and small memory footprint second, and the ability to modify the gumbo tree before serializing third. Please let me know if anyone is at all interested in a small, simplistic set of easy-to-bolt-on changes that allow for limited xhtml parsing - in this case, adding the ability to recognize self-closed non-void tags as being valid and injecting the proper end tag token to keep gumbo happy. These changes are easy to add after the fact and can be enabled and disabled with an additional options flag. With these changes in place, things like -
no longer freak out the parser. Speaking on behalf of Sigil and myself, thank you for your contributions.
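The self-closed-tag idea can be approximated outside the parser as a naive source pre-pass that rewrites `<tag .../>` as `<tag ...></tag>`. This is only a sketch of the concept: Kevin's actual change injects an end-tag token inside the tokenizer, and a real implementation must also skip void elements (br, img, ...), comments, CDATA, and quoted `>` characters in attribute values, all of which this toy version ignores.

```c
#include <ctype.h>
#include <string.h>

/* Naive pre-pass (NOT how the real patch works): rewrite "<tag .../>"
 * as "<tag ...></tag>". `out` must be large enough; attributes are kept.
 * Void elements, comments, and quoted '>' are deliberately not handled. */
static void expand_self_closed(const char *in, char *out) {
    const char *p = in;
    while (*p) {
        if (*p == '<' && p[1] != '/' && p[1] != '!') {
            const char *gt = strchr(p, '>');
            if (gt && gt > p + 1 && gt[-1] == '/') {
                /* copy "<tag attrs" without the trailing '/' */
                size_t body = (size_t)(gt - 1 - p);
                memcpy(out, p, body);
                out += body;
                *out++ = '>';
                /* extract the tag name and append "</tag>" */
                const char *q = p + 1;
                while (q < gt && !isspace((unsigned char)*q) && *q != '/') q++;
                *out++ = '<';
                *out++ = '/';
                memcpy(out, p + 1, (size_t)(q - p - 1));
                out += q - p - 1;
                *out++ = '>';
                p = gt + 1;
                continue;
            }
        }
        *out++ = *p++;
    }
    *out = '\0';
}
```

Doing this inside the tokenizer, as the actual patch does, avoids the extra copy and composes correctly with the rest of the tokenizer's state machine.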
This may be a bit late, since it seems like you've both decided to work off of Vicente's branch, but I ran some numbers on allocator behavior, performance, and data shape on a segment of the CommonCrawl corpus. The code and cursory results are over in a new GitHub repository for Gumbo stats, and I've opened some pull requests for the fixes themselves: Summary:
And median speed/memory benchmarks for major releases & branches of Gumbo:
I'd encourage folks to run the benchmarks for themselves. They aren't particularly robust or portable, but they are simple and it shouldn't be too much work to get them working on multiple systems. There was a pretty wide variance in parse times and memory usage just working with different Common Crawl segments, but the ratios stayed pretty constant at 60us/1K doc length and 10K memory/1K doc length (5K without the arena). There are a number of low-hanging optimizations that I've already submitted pull requests for, and I'm leaning toward removing the custom allocators and making the default an arena for 1.0.0. This will have to go up for public comment, though, and I'm a little worried about its effect on embedded systems. Also, responses to various other messages posted here while I was writing code:
It's not a gig - more like 720K in the average case, 2.4M at the 95th percentile, and about 100M for the HTML5 spec itself, which is the largest single HTML doc I've come across (8M).
Hmm, actually this is true. I'm inclined to take them out and add this to the BeautifulSoup adaptor itself rather than pollute the C source, before releasing 0.10.0.
Why not use libxml or some other XML parser? The point of XHTML is that it's a reformulation of HTML that's legal XML, and so you can use an XML parser on it.
I guess I've done a poor job explaining why Gumbo is not intended to have a mutable parse tree. The point of Gumbo is to serve as a base for other libraries and tools, so that they do not have to implement the HTML5 parsing algorithm themselves. Its API is intentionally minimal, because that limits the amount that such a toolsmith needs to learn to get started.

As soon as you start adding mutability to the library, you have to make a number of design decisions that are specific to the particular use case it's serving. We've seen this in this thread already, with the tension between next/prev and your patches. You either lose the ability to use a custom allocator, or you need to pass the allocator & memory block into each mutation function, or you lose thread-safety, or you require that allocators that might otherwise run locklessly on local data use a shared global memory block and locks. You may find yourself wishing you had other information on the nodes, e.g. a dirty bit for refactoring tools, or a RenderLayer for a browser, or a pointer to an editor buffer for a WYSIWYG editor, or a series of event handlers for a desktop app.

I do not want to be the arbiter of which person's particular DOM variations go into Gumbo - I don't have time for that, nor do I really like making decisions that are guaranteed to please one person and piss off everybody else. So instead, I'm going to say no to all of them and piss everybody off. Instead, I recommend that people wrap the library with the DOM implementation that works for them. I recognize that not everyone will know what I mean by this or want to bother with it, so I created gumbo-libxml yesterday, which parses with Gumbo and gives you back a libxml2 document. This has several advantages over trying to manipulate Gumbo's DOM:
Users are, of course, free to use their own wrappers as necessary - it doesn't have to be libxml, but I figure that writing that one serves as a good example.
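The wrap-don't-mutate pattern is essentially one recursive walk that copies the Gumbo tree into whatever DOM you prefer. A toy sketch with stand-in types - `SrcNode` plays the role of GumboNode and `DstNode` the role of a target DOM node (say, libxml2's xmlNode); neither matches the real structs:

```c
#include <stdlib.h>
#include <string.h>

typedef struct SrcNode {
    const char      *name;
    struct SrcNode **children;
    size_t           num_children;
} SrcNode;

typedef struct DstNode {
    char           *name;
    struct DstNode *first_child;
    struct DstNode *next_sibling;
} DstNode;

static char *dup_str(const char *s) {
    char *d = malloc(strlen(s) + 1);
    if (d) strcpy(d, s);
    return d;
}

/* Recursive walk: build the destination tree, after which the source
 * (Gumbo) tree can be destroyed and all mutation happens on the copy. */
static DstNode *convert(const SrcNode *src) {
    DstNode *dst = calloc(1, sizeof *dst);
    dst->name = dup_str(src->name);
    DstNode **link = &dst->first_child;
    for (size_t i = 0; i < src->num_children; i++) {
        *link = convert(src->children[i]);
        link = &(*link)->next_sibling;
    }
    return dst;
}
```

With the real types, the walk would also copy attributes and text, which is exactly the shape of the gumbo-libxml adapter mentioned above.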
Each design choice made limits the usefulness of the library, and each design choice also comes with a set of advantages. Sometimes the limitations won't apply until the future, and sometimes they affect silent users of the library who are not so vocal with their opinions. The trick is to choose a set of design trade-offs that maximize its usefulness within the open-source ecosystem. Some examples from previous history of this library:
I've followed a fairly simple general guideline when deciding what to add and what not to, and that's "Can this functionality be implemented outside of the library, without knowledge of the HTML5 spec?" Because ultimately, the reason this library exists is that parsing HTML sucks, and as few people as possible should have to write an HTML parser. Writing tree traversals does not suck; there is nothing wrong with constructing a data structure of your choice and throwing the Gumbo parse tree away. Writing scripting language bindings does not suck. Writing utility functions does not suck, though if the utility functions require knowledge of the HTML spec I consider them in scope for inclusion.

Anyway, regardless of whether you choose to use mainline Gumbo or your own private forks, thank you for your contributions so far. If you do intend to continue with the fork, I would like to make some API cleanups before releasing v0.10.0 (basically moving the fragment context and namespace to GumboOptions instead of having a separate gumbo_parse_fragment function - this is to prevent a combinatorial explosion in functions if libraries need to add their own wrappers, so we don't get gumbo_parse_fragment_with_options_keeping_original_tree).
Yes, I saw some of your pull requests come in to google/gumbo-parser. I should probably do the same for xhtml-based ebook chapters in epub2 and epub3 ebooks. There are many public repositories of books in the public domain that volunteers have created epubs for. I would guess that use of id and class together might be a bit higher, with much less javascript and much larger text nodes than general-purpose web pages. Also, epubs have much higher use of svg (for auto-zooming images to fill the screen) than normal web pages.
I will definitely download your GumboStats and give it a try on my own epub library just to see if it differs by much from what you have observed. Thank you.
I am glad to hear that. When a person first opens an ebook with 50 to 100 chapters, we are basically using QtConcurrent (multiple threads) to scan and clean all of those documents in parallel using multiple instances of gumbo. So keeping a small memory footprint is essential for us.
That is probably one of the biggest sticking points. If there is a way to do that without needing next/prev inside gumbo that would be grand.
We used to do that, but Xerces is a resource hog, cumbersome and slow, and so is QDom. Please understand that the epub3 spec calls for the xhtml serialization of the html5 spec, and none of the xml-based parsers grok it well. Sigil is not an ebook reader; it is an ebook editor that allows wysiwyg and code-based editing of html5 documents, and users can and do mess up the syntax, so having a robust html5 parser is essential. No libxml parser is robust enough (especially without an official html5 dtd to work with), and using python in a multithreaded environment is a joke (even with multiple virtual machines fired up - it basically has the BKL problem). I thought about using libxml2, but it seems to have many bugs and nuances and a horrible codebase (BADCAST, anyone?). So gumbo has been a lifesaver for us. We can use it to replace Xerces and Tidy, at least for html5/xhtml5-based documents. We are incorporating an embedded python 3.4 interpreter, using python's lxml to deal with the two pure xml documents (the content.opf and the toc.ncx) while using gumbo to deal with the xhtml serialization of html5 in a very robust way.
Understood, but why wrap a poor/heavy xml-processor-based DOM tree around a gumbo tree just to make simple modifications? Right now, with vmg's tree and slight api changes, I can easily replace attribute values (with no memory leaking), which helps when a user moves images, videos, audio, or any href/src, or, for that matter, when either the source or destination of the xhtml files is changed. I can create nodes, append nodes, and destroy nodes easily, again without leaking memory. It simply is not that hard to do. Removing in-line styles and adding them to external stylesheets is another common issue; with those simple capabilities I can take care of that. These types of basic modifications are simple and safe to do in a load -> parse -> modify -> serialize -> destroy output sequence. There is no need for libxml or any other DOM; a gumbo node tree is more than enough. You already have most of the calls needed in the code someplace; all they need is to remove the unneeded link to the parser (for memory allocation), remove static, move them to a header, and preface them with gumbo_. This adds virtually no overhead, no noticeable increase in code size or complexity, and makes the library that much more valuable for fixing up the tree before re-serializing it.
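The attribute-value replacement described here amounts to free-the-old-string, duplicate-the-new, assuming (as in vmg's branch) that values are plain malloc'd strings. A hypothetical sketch - `Attr` and `attr_set_value` are illustrative names, not real Gumbo APIs:

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for GumboAttribute (the real struct differs). */
typedef struct {
    char *name;
    char *value;
} Attr;

/* Replace an attribute value without leaking the old one - e.g.
 * rewriting an href/src after a document has been relocated relative
 * to its images. Returns 0 on allocation failure, leaving `a` intact. */
static int attr_set_value(Attr *a, const char *new_value) {
    char *copy = malloc(strlen(new_value) + 1);
    if (!copy) return 0;
    strcpy(copy, new_value);
    free(a->value);
    a->value = copy;
    return 1;
}
```

The point of the "happy compromise" below is that a handful of such helpers could live in a separate edit.c outside the parser proper, as long as the node create/destroy routines are made public.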
That is of course your choice as you are the owner/creator of the project. There is of course a happy compromise:
This then protects your core code from needing to support multiple DOM implementations, but allows me (and others) to do what they need
As I said before, most xml libs have a hard time with html5 parsing without dtds or schemas, which are no longer ("needed") and really do not bring much to the table. XPath underneath does nothing more than node filtering by recursively walking the tree. I can do that myself directly. It is not hard.
I agree, see my happy compromise above ;-)
Agreed, but the routines to safely create and destroy a node already exist. Again, this is not that hard to do.
And most of the functions I want already exist and could be stored in their own header and not needed for the majority of the people who just want to wrap their language of choice around the parser.
Or keep it and modify it slightly and safely (this is not hard) and then reserialize it with your changes and then throw it away! :-)
Understood, but why add anything if you do not need it? The interfaces and calls already exist; we just need to make a very few functions public, and with that we could add our own edit.c code that does the rest, completely separate from gumbo. If you look at this from my perspective, it appears you are making decisions that are simply not needed: forcing the parser to be passed along, refusing to remove the static from just a few functions that already exist, tying everything to an arena allocator when one is not truly needed, implementing next/prev in parser order inside gumbo, which would inhibit even simple tree changes (especially when it could have been done in the python interface), and then talking about adding callbacks and unnecessary complication for treewalkers/treebuilders. None of this helps the general usability of your library, and these choices (while completely yours to make) are in fact driving away people who were trying to actively help move the project forward, contributing patches and fixes and, in vmg's case, the promise of more. I really do not want to have to support my own fork of gumbo, but I need to be able to make small simple changes without dragging along an entire xml library. Please at least consider the "happy compromise".
Understood. And thank you for creating a wonderfully robust html5 parser! Please at least consider whether the "happy compromise" is doable from your perspective. If so, I would be happy to stick around and help out when I can. I just need to be able to make a few simple gumbo tree changes, and with the right hooks I could keep that code outside of the gumbo-parser itself - but I need the hooks, and none of the extra overhead in memory or anything else. I think you will see that it would help many users. Either way, thanks! Take care, Kevin
@nostrademons
I'm thinking about this. Understand, though, that as maintainer I have a responsibility toward all the people who have already used Gumbo in their projects and all the people who will discover it in the future. New public APIs are a comprehension tax on all of them (how do you think Xerces and libxml got so complicated?), they're a restriction on the future evolution of the library (look at the problems caused by next/prev), and changes create unnecessary work for people who have already built software against previous versions of the library. I've got to do some real work for a bit, but I'll create some issue threads and invite public comment and check back on them in a few days.
After @kevinhendricks kindly ported my TagSet patches on top of the current template branch, I've taken over and implemented the features that he requested (and that we had already implemented in our internal fork ;)).

This branch enables support for Fragment Parsing and Templates and has 100% coverage on the html5lib 0.95 tests (including the fragment tests, which have now been enabled). It is also significantly faster in many parsing cases thanks to the new tag lookup machinery. Some further optimizations have not been ported yet.
As they say, you cannot make an omelette without breaking some eggs (or APIs in this case). In order to support the new features (and to keep parity with our internal fork), a couple external APIs have been broken. This makes this branch a good candidate for a "development" or "next" branch that hopefully will become part of Gumbo 1.0.
Every change, optimization and design decision has been carefully documented in the commit messages of my 4 patches. The remaining commits by @kevinhendricks are assumed "correct" insofar as they are based on the existing template branch and are a rebase/backport (we've also extensively tested them, and I performed some cursory review).

Please consider this branch for merging. Thanks! :)
cc @nostrademons @aroben