gumbo-next #295
base: master
Conversation
@vmg,
Note that the document-fragment is followed by the string "svg path". We will have to modify html5lib_adapter_test.py to deal with these, and because you pass in the beginning context as a GumboTag, we will need to allow a namespace to be passed in as well (to separate an svg "a" or svg "title" from an html "a" or an html "title"). Furthermore, we will have to grow the list of recognized tags in the gumbo enum to include all of the svg elements and all of the mathml elements so that an actual GumboTag enum value exists with which to specify the context. Right now the svg path element has no enum assigned to it and so can't be specified as the context for the document fragment even if we wanted to. FWIW, I already grew my own master to include the extra 100 or so tags from svg and mathml if you want them. Also note, the hack to remove the template "content" line is still needed in the html5lib-tests master, as "content" is not part of how gumbo has decided to implement templates. It was that hack that was removing the svg attributes beginning with "content" from my test output, which literally drove me crazy.
@vmg, that hack to remove the content lines in templates in html5lib_adapter_test.py is also what messed up the alignment when nested (double or triple) templates were used with html5lib-tests master. So instead of 12 failures, altversion2 really had just 3 failures. This was, of course, before all of the new document fragment support and foreign document fragment tests were added.
Thanks for the heads up. I'll see about implementing the foreign contexts for fragments now.
For your convenience: I've pushed a commit that unifies all the Obviously all the generated files are already checked in, so you don't need |
OK, I'll try to push some more work for the tip of the html5lib tests later this week. For now I would like to get this branch merged and/or at least approved by @nostrademons, as it is already passing 100% of html5lib 0.95 tests and it's a solid improvement over what we had before. Any comments or code review?
… set it as appropriate.
…for lexing, to ease debugging non-printable characters.
…or. (The whole error handling really needs to be redone, it's not very helpful to users.)
…make sure their input mechanisms can accept this without relying on strlen.
@vmg
(Is this even a valid test failure?) Either way, your tree is in excellent shape!
The 'if there is no such element, then abort these steps and instead act as described in the "any other end tag" entry above' clause in the AAA results in the final
@gdnedders
FWIW, there is a newer version of the AAA in the template pull that I did not incorporate as it did not specifically deal with templates and so I did not think it was different from the version in master. I will incorporate the new AAA code from the official template pull request to see if that fixes that problem case. |
Yeah, I'd rewritten the AAA in the template branch to bring it up to the current spec. I think my reason was that there was some bug I was trying to track down, and it was impossible while the code was so far out of sync with the spec. I agree that we should get the template code merged into a unified branch first, so people can start building off it without worrying about conflicts. I think a useful milestone for releasing v0.10.0 would be to have it pass all html5lib tests from trunk.
@vmg @nostrademons @gdnedders With this commit in place, we do pass the AAA test we failed earlier.
@nostrademons If it would help, my altversion3 branch has the full template stuff and the updated AAA algorithm on top of your master, with no API changes from your master. If you have any interest, let me know and I will create a pull, or you can grab it yourself from that branch.
These are the same commits as in @vmg's pull request, plus the fixed AAA algorithm and template tag tests, right? Yes, it'd help to have that around as a pull request. I'm still figuring out the GitHub way to get that merged onto a separate branch within google/gumbo - I think I need to create the new branch, check out your altversion3 locally, merge the commits, and then push it upstream. I also want to cherry-pick a couple of vmg's fixes for the Travis Mac build back to 0.9.3 before doing the release for that.
All of kevinhendricks & my commits in this pull request have now been merged. Status on @vmg's:
Thanks for all your hard work on this. Gumbo's progressed a lot since last week, and features that were on my vague far-future wishlist are nearly in. We're at 3 failing html5lib tests (once my pull request for handling gets in) now, and that includes all of the document-fragment ones.
Thank you very much for the thorough feedback, @nostrademons! I've replied to your concerns on the individual commits. Could you please consider merging the memory revamp in the 1.0 branch? Most of the stuff I'm working on depends on it, and I'd love to be able to track one of the upstream branches.
FWIW, I agree here; the memory changes are needed in my tree as well, and it would be nice to have them in an official tree. Passing the parser around where it is not needed just to allocate memory clutters the code. And devising your own memory allocators that are thread safe and overrun safe is not trivial, which is good reason why system malloc and free should not be replaced except for testing purposes. As for those who like to play with the tree as a DOM, I think you overstate the memory issues. We need to relocate the html page relative to its images, so finding and updating link nodes after parsing is crucial for us. Right now, I have to end-run all this by parsing the tree, reserializing it on the fly with the changes, and then reparsing it! This is silly. Please don't handcuff the usefulness of your code. Instead, make a clear statement that "gumbo-parser will not handle bug reports where the parse tree has ever been modified" and then let people who need to build on your code do so instead of having to permanently fork it. Kevin
@vmg if you are building off of v1.0.0, please see the patch in my newest issue here about the broken rtc handling fix. It is not needed in your tree; it is only needed if you are rebasing off of v0.10.0, v1.0.0 or master here. It fixes the ruby-related test failure, leaving us with just 3 foreign document fragment failures to track down.
@vmg, for completeness, can you include in the docs which "mph" implementation you are using? I could not find an "mph" project that had a command line that grokked your options: /*
Where exactly can I download this?
I assume 'mph' refers to CMPH, the C Minimal Perfect Hash library. I'd like to see a more thorough pointer in the source code as well.
So I did some investigation into the memory allocator patch. No firm decision yet, but I see a few issues, a question, maybe an alternative, and some extra data points:

Removing the GumboParser argument from create_node and callers isn't going to work, unfortunately. The problem is the next/prev patch that was recently merged into 0.10.0. This creates an intrusive linked list between nodes in document order (which is sometimes distinct from DOM order, since nodes may be moved around and reparented). It's necessary for BeautifulSoup compatibility, likely for a few other libraries (IIRC libxml also had a similar linked list), and for refactoring tools, where the best way to preserve the original formatting of the document is to build up a list of edit operations in reverse order of where they appear in the doc. The state needed to build this doubly-linked list is stored in the GumboParserState.

One option might be to split the creation of the node and its insertion into the document into two functions. This is arguably a good idea anyway, but the reason we don't do this currently is because insert_element is called to reparent nodes in the AAA/foster-parenting cases, and so this would mess up the linked list upon reparenting and create a cycle in it. Since it's used for deletion of the output tree, this would lead to a crash on cleanup. Client code that wants to create a node and add it into the tree would face the same issue; unless they're careful to splice the next/prev pointers appropriately, it'll crash on cleanup. (This, incidentally, is a good example of why I'm wary of encouraging people to treat the parse tree as mutable. I would much rather have people take a Gumbo parse tree and walk it to build up, say, a libxml DOM tree - this also has the benefit of making all the tooling based on libxml available to Gumbo.)
Another option - one I'm warming up to, since it would fix #68, #6, and #25 as well - is to separate the parser and treebuilder, such that the parser invokes a number of callback functions to perform tree construction and has no direct knowledge of the particular tree data structure. html5lib, Hubbub, and html5ever all function like this, and IIUC the reason Servo didn't use Gumbo and instead created html5ever was so they could get fast DOM construction (there are probably all sorts of other reasons as well - Gumbo was never designed to be part of a browser engine, and wouldn't fulfill their main goal of showcasing Rust).

The reason I didn't do this in Gumbo's original design is because there are complex interactions between parsing and the DOM tree in HTML5, with some nodes needing to be reparented and the existing tree sometimes needing to be inspected to determine what to do. Hubbub's interface consists of 18 callback functions, with nodes represented as void* and assumed to be reference counted, and Hubbub's functionality is less than Gumbo's (it doesn't record source positions, original text, or document order). So there's a definite complexity tradeoff, but perhaps it could be hidden by making the GumboNode* interface the default treebuilder and exposing the callbacks only for people who really need the additional flexibility and performance.

As for memory allocator thread-safety - I'm curious, would the proposed design be able to handle the case where you have a per-request arena that is freed all at once when the request completes, with multiple requests in multiple threads? This is a really common pattern in server software - it's why custom allocators are supported to begin with - and it also applies to large-scale data processing, where you may want to parse a number of documents in parallel, free the parse trees once each document is done processing, and not pay any allocation cost beyond a pointer bump.
It seems like a global allocator doesn't allow this - the point is to let each request/doc allocate into its own arena, not a global one. Am I missing anything here?

Finally, some benchmarks:

- Current v1.0.0 candidate master: hacker_news.html in 4828 microseconds.
- A dead-simple arena allocator that just allocates a gig of memory and doesn't worry about going over it: hacker_news.html in 4145 microseconds. This is also a good proxy for how much time is spent in the memory allocator in total (~15-20%... yeowch, that could use some improvement), since it is about as close to a theoretical maximum as we can get (allocation is just an aligned pointer bump, deallocation is free).
- Removing the custom memory allocator entirely: hacker_news.html in 4480 microseconds. I wanted to float this option since I had a hunch that the pointer indirection on every allocation may kill as much performance as using a different allocator gains, but it turns out the penalty is only ~2-3%, not nearly as much as a good arena can get you.

I'd tried to get a benchmark for @vmg's patch as well, but was stymied by the create_node issue above.
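For readers unfamiliar with arenas: the "dead-simple arena allocator" configuration above boils down to a bump allocator, where allocation is an aligned pointer bump and teardown of the whole parse tree is a single free. This is a hedged sketch for illustration only, not Gumbo's code; the `Arena` type and function names are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;  /* start of the reserved block */
    size_t   size;  /* total capacity in bytes */
    size_t   used;  /* bytes handed out so far */
} Arena;

/* Reserve one block up front; individual allocations never call malloc. */
static int arena_init(Arena *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Allocation is just an aligned pointer bump. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (a->used + (sizeof(void *) - 1)) & ~(sizeof(void *) - 1);
    if (aligned + n > a->size) return NULL;  /* out of arena space */
    a->used = aligned + n;
    return a->base + aligned;
}

/* Deallocating the whole tree is one free(). */
static void arena_destroy(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->used = a->size = 0;
}
```

Per-request arenas in separate threads need no locking, since each thread bumps only its own pointer - which is the server-software pattern described above.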
Thanks for the detailed analysis as usual, @nostrademons. Here are a couple notes:
FWIW,
Actually it does not. I finally found mph-1.2.tar.gz in an old Linux FTP repository. Once you have it, the generated code needs to be very slightly modified to allow for case conversion on lookup. Since I use a larger recognized tag set, I will document the process in a README of some sort and post it for you. One other side effect is the need to use GNU sed when remaking the pieces, as the standard sed on BSD/Mac OS X will not work. But since automake and friends do not exist as standard on Mac OS X anyway, this is minor.
Update: this README is now posted as an "Issue" here, as it seemed to be the easiest way to convey the information.
Hmm, I guess clone_node does leak memory. I'd thought that I had a leak-checking custom allocator in the unit tests to catch things like this, but it looks like it was only in the tokenizer/utf8/utility tests. Anyway, that's an oversight - other parser-inserted nodes (eg. from create_element_from_token) are added to the next/prev list at the point that the parser creates them, and so clone_node should probably insert the cloned node right after the original. The more I think about it, the more I think Gumbo should just use an arena as its default allocator and remove the pluggable allocator machinery, because:
Another low-hanging memory trick I'd like to try is to fiddle with the default allocation sizes for vectors and stringbuffers. These were arbitrarily chosen; I could pretty easily run some statistical analysis on common attribute/child/string sizes to pick ones that would require minimal reallocs. With an arena, the amount of wasted memory is less of an issue, and avoiding a realloc will give much better locality. Do either of these sound like they'd solve your performance problems? Basically the change would be to remove the allocator functions, add an arena to GumboOutput that's destroyed when the parse tree is destroyed, and get a blanket ~15-20% speedup across the board from memory-allocation improvements. I unfortunately don't have Google's corpus to play with anymore, but I do have a few local corpora of ~100K documents each that I could run some stats over.
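The default-size tuning described above is a simple trade-off: with a doubling growth policy, a larger initial capacity cuts the number of reallocs at the cost of wasted space, and an arena makes the waste cheap. A toy illustration (not Gumbo code; the function name is made up):

```c
#include <stddef.h>

/* How many realloc calls a doubling vector performs to reach `final`
 * elements starting from `initial` capacity. Larger initial sizes trade
 * wasted space for fewer reallocs and better locality. */
static unsigned reallocs_needed(size_t initial, size_t final) {
    unsigned n = 0;
    size_t cap = initial;
    while (cap < final) {
        cap *= 2;
        n++;
    }
    return n;
}
```

For example, a vector that starts at 4 slots and ends up holding 100 children reallocates 5 times, while one that starts at 16 slots reallocates only 3 times - the kind of statistic the corpus analysis above would feed.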
Forgive me, but what I know about arenas you could put in a thimble. My use case is within a C++ (Qt-based) ebook editing app that is multithreaded. We would be using it to parse anywhere from 1 to, say, 200 documents (chapters). We would not need the tree to be kept around after each parse, but we do need a very small memory footprint, fast parsing time, and the ability to make small changes once parsed, all in parallel in a multithreaded environment.
Reserving a gig of memory would not be something that fits within our or our users needs at all.
Unless I am missing something… if instead you want the order after potential parser reshuffling and with implied nodes added, then walking the DOM with a depth-first recursion should do that. Either way, gumbo keeps more than enough state information in the DOM to create those links any time after parsing, without having to carry along a counter in the parser state.
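The depth-first walk suggested here can be sketched as below. The `Node` struct is a simplified stand-in, not the real GumboNode, and note the caveat from earlier in the thread: a post-parse walk yields DOM order, which can differ from source-document order when the parser reparents nodes.

```c
#include <stddef.h>

/* Simplified stand-in for GumboNode (the real struct differs). */
typedef struct Node {
    struct Node **children;
    size_t        num_children;
    struct Node  *next;  /* filled in by the walk below */
    struct Node  *prev;
} Node;

/* Pre-order depth-first walk that threads next/prev links through the
 * tree after parsing, instead of tracking them in GumboParserState.
 * `last` carries the most recently visited node across the recursion. */
static void link_nodes(Node *n, Node **last) {
    n->prev = *last;
    n->next = NULL;
    if (*last) (*last)->next = n;
    *last = n;
    for (size_t i = 0; i < n->num_children; i++)
        link_nodes(n->children[i], last);
}
```

Calling `link_nodes(root, &last)` with `last` initialized to NULL links the whole tree in one pass, so the links could live in a wrapper or binding rather than in the parser itself.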
I gotta side with @kevinhendricks here. Unfortunately we don't have much weight on the direction of the project, but it seems like many of the features being added or proposed to Gumbo just make common use cases (or supposedly common -- it seems like @kevinhendricks and us at GitHub are working on completely different projects, and yet we need the exact same features) either significantly more complex or significantly more expensive, with very dubious benefits.

The memory model is still in a bad spot, and switching to an arena seems like the sloppy approach to fixing the memory issues, instead of actually refactoring the existing code to be less wasteful when allocating. Again, it also makes any kind of DOM modification more complex because of the need to insert external objects into the arena.

I'm afraid I'm running out of open-source time, so I'm going to freeze our fork of Gumbo at v1.0.0, with my memory handling patches, and with the prev/next pointers removed to simplify all the external APIs, and carry on with that. I'd really love to be able to contribute more stuff upstream, but all my changes from now on will be based on the new memory model, and hence not trivially portable. You can follow the tree here: https://github.com/vmg/gumbo-parser/tree/v1.0.0 and of course you're free to cherry-pick anything you want.
@vmg sorry to hear that. I really would love to see your improvements that build off of the memory changes. If you do not mind, Sigil (the project I volunteer for) will use your repo as our starting point, as our views on what is needed now and in the short-term future seem to coincide: correctness first, fast parsing and small memory footprint second, and the ability to modify the gumbo tree before serializing third. Please let me know if anyone is at all interested in a small, simplistic set of easy-to-bolt-on changes that allow for limited xhtml parsing - in this case, adding the ability to recognize self-closed non-void tags as being valid and injecting the proper end tag token to keep gumbo happy. These changes are easy to add after the fact and can be enabled and disabled with an additional options flag. With these changes in place, things like -
no longer freak out the parser. Speaking on behalf of Sigil and myself, thank you for your contributions.
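The self-closed-tag idea can be approximated outside the parser as a naive source pre-pass that rewrites `<tag .../>` as `<tag ...></tag>`. This is only a sketch of the concept: Kevin's actual change injects an end-tag token inside the tokenizer, and a real implementation must also skip void elements (br, img, ...), comments, CDATA, and quoted `>` characters in attribute values, all of which this toy version ignores.

```c
#include <ctype.h>
#include <string.h>

/* Naive pre-pass (NOT how the real patch works): rewrite "<tag .../>"
 * as "<tag ...></tag>". `out` must be large enough; attributes are kept.
 * Void elements, comments, and quoted '>' are deliberately not handled. */
static void expand_self_closed(const char *in, char *out) {
    const char *p = in;
    while (*p) {
        if (*p == '<' && p[1] != '/' && p[1] != '!') {
            const char *gt = strchr(p, '>');
            if (gt && gt > p + 1 && gt[-1] == '/') {
                /* copy "<tag attrs" without the trailing '/' */
                size_t body = (size_t)(gt - 1 - p);
                memcpy(out, p, body);
                out += body;
                *out++ = '>';
                /* extract the tag name and append "</tag>" */
                const char *q = p + 1;
                while (q < gt && !isspace((unsigned char)*q) && *q != '/') q++;
                *out++ = '<';
                *out++ = '/';
                memcpy(out, p + 1, (size_t)(q - p - 1));
                out += q - p - 1;
                *out++ = '>';
                p = gt + 1;
                continue;
            }
        }
        *out++ = *p++;
    }
    *out = '\0';
}
```

Doing this inside the tokenizer, as the actual patch does, avoids the extra copy and composes correctly with the rest of the tokenizer's state machine.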
This may be a bit late, since it seems like you've both decided to work off of Vicente's branch, but I ran some numbers on allocator behavior, performance, and data shape on a segment of the CommonCrawl corpus. The code and cursory results are over in a new GitHub repository for Gumbo stats, and I've opened some pull requests for the fixes themselves: Summary:
And median speed/memory benchmarks for major releases & branches of Gumbo:
I'd encourage folks to run the benchmarks for themselves. They aren't particularly robust or portable, but they are simple and it shouldn't be too much work to get them working on multiple systems. There was a pretty wide variance in parse times and memory usage just working with different Common Crawl segments, but the ratios stayed pretty constant at 60us/1K doc length and 10K memory/1K doc length (5K without the arena). There are a number of low-hanging optimizations that I've already submitted pull requests for, and I'm leaning toward removing the custom allocators and making the default an arena for 1.0.0. This will have to go up for public comment, though, and I'm a little worried about its effect on embedded systems. Also, responses to various other messages posted here while I was writing code:
It's not a gig - more like 720K in the average case, 2.4M at the 95th percentile, and about 100M for the HTML5 spec itself, which is the largest single HTML doc I've come across (8M).
Hmm, actually this is true. I'm inclined to take them out and add this to the BeautifulSoup adaptor itself rather than pollute the C source, before releasing 0.10.0.
Why not use libxml or some other XML parser? The point of XHTML is that it's a reformulation of HTML that's legal XML, and so you can use an XML parser on it.
I guess I've done a poor job explaining why Gumbo is not intended to have a mutable parse tree. The point of Gumbo is to serve as a base for other libraries and tools, so that they do not have to implement the HTML5 parsing algorithm themselves. Its API is intentionally minimal, because that limits the amount that such a toolsmith needs to learn to get started.

As soon as you start adding mutability to the library, you have to make a number of design decisions that are specific to the particular use case it's serving. We've seen this in this thread already, with the tension between next/prev and your patches. You either lose the ability to use a custom allocator, or you need to pass the allocator & memory block into each mutation function, or you lose thread-safety, or you require that allocators that might otherwise run locklessly on local data use a shared global memory block and locks. You may find yourself wishing you had other information on the nodes, e.g. a dirty bit for refactoring tools, or a RenderLayer for a browser, or a pointer to an editor buffer for a WYSIWYG editor, or a series of event handlers for a desktop app.

I do not want to be the arbiter of which person's particular DOM variations go into Gumbo - I don't have time for that, nor do I really like making decisions that are guaranteed to please one person and piss off everybody else. So instead, I'm going to say no to all of them and piss everybody off. Instead, I recommend that people wrap the library with the DOM implementation that works for them. I recognize that not everyone will know what I mean by this or want to bother with it, so I created gumbo-libxml yesterday, which parses with Gumbo and gives you back a libxml2 document. This has several advantages over trying to manipulate Gumbo's DOM:
Users are, of course, free to use their own wrappers as necessary - it doesn't have to be libxml, but I figure that writing that one serves as a good example.
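The wrap-don't-mutate pattern is essentially one recursive walk that copies the Gumbo tree into whatever DOM you prefer. A toy sketch with stand-in types - `SrcNode` plays the role of GumboNode and `DstNode` the role of a target DOM node (say, libxml2's xmlNode); neither matches the real structs:

```c
#include <stdlib.h>
#include <string.h>

typedef struct SrcNode {
    const char      *name;
    struct SrcNode **children;
    size_t           num_children;
} SrcNode;

typedef struct DstNode {
    char           *name;
    struct DstNode *first_child;
    struct DstNode *next_sibling;
} DstNode;

static char *dup_str(const char *s) {
    char *d = malloc(strlen(s) + 1);
    if (d) strcpy(d, s);
    return d;
}

/* Recursive walk: build the destination tree, after which the source
 * (Gumbo) tree can be destroyed and all mutation happens on the copy. */
static DstNode *convert(const SrcNode *src) {
    DstNode *dst = calloc(1, sizeof *dst);
    dst->name = dup_str(src->name);
    DstNode **link = &dst->first_child;
    for (size_t i = 0; i < src->num_children; i++) {
        *link = convert(src->children[i]);
        link = &(*link)->next_sibling;
    }
    return dst;
}
```

With the real types, the walk would also copy attributes and text, which is exactly the shape of the gumbo-libxml adapter mentioned above.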
Each design choice made limits the usefulness of the library, and each design choice also comes with a set of advantages. Sometimes the limitations won't apply until the future, and sometimes they affect silent users of the library who are not so vocal with their opinions. The trick is to choose a set of design trade-offs that maximize its usefulness within the open-source ecosystem. Some examples from previous history of this library:
I've followed a fairly simple general guideline when deciding what to add and what not to, and that's "Can this functionality be implemented outside of the library, without knowledge of the HTML5 spec?" Because ultimately, the reason this library exists is that parsing HTML sucks, and as few people as possible should have to write an HTML parser. Writing tree traversals does not suck; there is nothing wrong with constructing a data structure of your choice and throwing the Gumbo parse tree away. Writing scripting language bindings does not suck. Writing utility functions does not suck, though if the utility functions require knowledge of the HTML spec I consider them in scope for inclusion.

Anyway, regardless of whether you choose to use mainline Gumbo or your own private forks, thank you for your contributions so far. If you do intend to continue with the fork, I would like to make some API cleanups before releasing v0.10.0 (basically moving the fragment context and namespace to GumboOptions instead of having a separate gumbo_parse_fragment function - this is to prevent a combinatorial explosion in functions if libraries need to add their own wrappers, so we don't get gumbo_parse_fragment_with_options_keeping_original_tree).
Yes, I saw some of your pull requests come in to google/gumbo-parser. I should probably do the same for xhtml-based ebook chapters in epub2 and epub3 ebooks. There are many public repositories of books in the public domain that volunteers have created epubs for. I would guess that use of id and class together might be a bit higher, with much less javascript and much larger text nodes than general-purpose web pages. Also, epubs have much higher use of svg (for auto-zooming images to fill the screen) than normal web pages.
I will definitely download your GumboStats and give it a try on my own epub library just to see if it differs by much from what you have observed. Thank you.
I am glad to hear that. When a person first opens an ebook with 50 to 100 chapters, we are basically using QtConcurrent (multiple threads) to scan and clean all of those documents in parallel using multiple instances of gumbo. So keeping a small memory footprint is essential for us.
That is probably one of the biggest sticking points. If there is a way to do that without needing next/prev inside gumbo that would be grand.
We used to do that, but Xerces is a resource hog, cumbersome and slow, and so is QDom. Please understand that the epub3 spec calls for the xhtml serialization of the html5 spec, and none of the xml-based parsers grok it well. Sigil is not an ebook reader; it is an ebook editor that allows wysiwyg and code-based editing of html5 documents, and users can and do mess up the syntax, so having a robust html5 parser is essential. No libxml parser is robust enough (especially without an official html5 dtd to work with), and using python in a multithreaded environment is a joke (even with multiple virtual machines fired up - it basically has the BKL problem). I thought about using libxml2, but it seems to have many bugs and nuances and a horrible codebase (BADCAST, anyone?). So gumbo has been a lifesaver for us. We can use it to replace Xerces and Tidy, at least for html5/xhtml5-based documents. We are incorporating an embedded python 3.4 interpreter, using python's lxml to deal with the two pure xml documents (the content.opf and the toc.ncx) while using gumbo to deal with the xhtml serialization of html5 in a very robust way.
Understood, but why wrap a poor/heavy xml-processor-based DOM tree around a gumbo tree just to make simple modifications? Right now, with vmg's tree and slight api changes, I can easily replace attribute values (with no memory leaking), which helps when a user moves images, videos, audio, or any href/src, or, for that matter, when either the source or destination of the xhtml files is changed. I can create nodes, append nodes, and destroy nodes easily, again without leaking memory. It simply is not that hard to do. Removing in-line styles and adding them to external stylesheets is another common issue; with those simple capabilities I can take care of that. These types of basic modifications are simple and safe to do in a load -> parse -> modify -> serialize -> destroy output sequence. There is no need for libxml or any other DOM; a gumbo node tree is more than enough. You already have most of the calls needed in the code someplace; all they need is to remove the unneeded link to the parser (for memory allocation), remove static, move them to a header, and preface them with gumbo_. This adds virtually no overhead, no noticeable increase in code size or complexity, and makes the library that much more valuable for fixing up the tree before re-serializing it.
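The attribute-value replacement described here amounts to free-the-old-string, duplicate-the-new, assuming (as in vmg's branch) that values are plain malloc'd strings. A hypothetical sketch - `Attr` and `attr_set_value` are illustrative names, not real Gumbo APIs:

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for GumboAttribute (the real struct differs). */
typedef struct {
    char *name;
    char *value;
} Attr;

/* Replace an attribute value without leaking the old one - e.g.
 * rewriting an href/src after a document has been relocated relative
 * to its images. Returns 0 on allocation failure, leaving `a` intact. */
static int attr_set_value(Attr *a, const char *new_value) {
    char *copy = malloc(strlen(new_value) + 1);
    if (!copy) return 0;
    strcpy(copy, new_value);
    free(a->value);
    a->value = copy;
    return 1;
}
```

The point of the "happy compromise" below is that a handful of such helpers could live in a separate edit.c outside the parser proper, as long as the node create/destroy routines are made public.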
That is of course your choice as you are the owner/creator of the project. There is of course a happy compromise:
This then protects your core code from needing to support multiple DOM implementations, but allows me (and others) to do what they need
As I said before, most xml libs have a hard time with html5 parsing without dtds or schemas, which are no longer ("needed") and really do not bring much to the table. XPath underneath does nothing more than node filtering by recursively walking the tree. I can do that myself directly. It is not hard.
I agree, see my happy compromise above ;-)
Agreed, but the routines to safely create and destroy a node already exist. Again, this is not that hard to do.
And most of the functions I want already exist and could be stored in their own header and not needed for the majority of the people who just want to wrap their language of choice around the parser.
Or keep it and modify it slightly and safely (this is not hard) and then reserialize it with your changes and then throw it away! :-)
Understood, but why add anything if you do not need it? The interfaces and calls already exist; we just need to make a very few functions public, and with that we could add our own edit.c code that does the rest, completely separate from gumbo. If you look at this from my perspective, it appears you are making decisions that are simply not needed: forcing the parser to be passed along, refusing to remove the static from just a few functions that already exist, tying everything to an arena allocator when one is not truly needed, implementing next/prev in parser order inside gumbo, which would inhibit even simple tree changes (especially when it could have been done in the python interface), and then talking about adding callbacks and unnecessary complication for treewalkers/treebuilders. None of this helps the general usability of your library, and these choices (while completely yours to make) are in fact driving away people who were trying to actively help move the project forward, contributing patches and fixes and, in vmg's case, the promise of more. I really do not want to have to support my own fork of gumbo, but I need to be able to make small simple changes without dragging along an entire xml library. Please at least consider the "happy compromise".
Understood. And thank you for creating a wonderfully robust html5 parser! Please at least consider whether the "happy compromise" is doable from your perspective. If so, I would be happy to stick around and help out when I can. I just need to be able to make a few simple gumbo tree changes, and with the right hooks I could keep that code outside of the gumbo-parser itself - but I need the hooks, and none of the extra overhead in memory or anything else. I think you will see that it would help many users. Either way, thanks! Take care, Kevin
@nostrademons
I'm thinking about this. Understand, though, that as maintainer I have a responsibility toward all the people who have already used Gumbo in their projects and all the people who will discover it in the future. New public APIs are a comprehension tax on all of them (how do you think Xerces and libxml got so complicated?), they're a restriction on the future evolution of the library (look at the problems caused by next/prev), and changes create unnecessary work for people who have already built software against previous versions of the library. I've got to do some real work for a bit, but I'll create some issue threads and invite public comment and check back on them in a few days.
After @kevinhendricks kindly ported my TagSet patches on top of the current template branch, I've taken over and implemented the features that he requested (and that we had already implemented in our internal fork ;)).

This branch enables support for Fragment Parsing and Templates and has 100% coverage on the html5lib 0.95 tests (including the fragment tests, which have now been enabled). It is also significantly faster in many parsing cases thanks to the new tag lookup machinery. Some further optimizations have not been ported yet.
As they say, you cannot make an omelette without breaking some eggs (or APIs in this case). In order to support the new features (and to keep parity with our internal fork), a couple external APIs have been broken. This makes this branch a good candidate for a "development" or "next" branch that hopefully will become part of Gumbo 1.0.
Every change, optimization and design decision has been carefully documented in the commit messages of my 4 patches. The remaining commits by @kevinhendricks are assumed "correct" insofar as they are based on the existing template branch and are a rebase/backport (we've also extensively tested them, and I performed some cursory review).

Please consider this branch for merging. Thanks! :)
cc @nostrademons @aroben