Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: DOM conversion not correctly recognizing prefixed attributes indicating non-HTML namespaces #1839

Open
garretwilson opened this issue Sep 15, 2022 · 0 comments

Comments

@garretwilson
Copy link
Contributor

garretwilson commented Sep 15, 2022

My first description of the bug was incorrect as to the behavior, because I was inadvertently testing on the wrong element. However there is a small bug in attribute processing related to namespaces. I'll leave some of the background information here for those new to (X)HTML namespaces, though.

The description page for jsoup says:

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

The WHATWG HTML5 Specification § 2.1.3 XML compatibility says:

Except where otherwise stated, all elements defined or mentioned in this specification are in the HTML namespace ("http://www.w3.org/1999/xhtml"), and all attributes defined or mentioned in this specification have no namespace.

Let me explain a little about what the WHATWG specification means here, for those who are not very familiar with XML and (X)HTML namespaces. The WHATWG is not saying that attributes should not be processed in a namespace-aware manner. Rather it means that if an attribute has no prefix, it doesn't inherit the currently defined namespace as elements do; instead, the namespace is set to null. And if an attribute does have a prefix, then it does get a namespace; an example from the spec is xlink:href.

In either case, whether the attribute has a namespace defined or not, the attribute must be given a "local name". The spec doesn't say it in this paragraph, but throughout the specification it discuses matching the "local name" of attributes. What I have described has been standard namespace processing for well over 20 years! A non-namespace-aware (X)HTML processor will only set the "node name" and "node value" of a node (including Attr), but a namespace-aware processor will also set its "namespace URI" and "local name".

In addition the DOM provides two sets of methods: those that are not namespace aware, and those that are. The methods that are namespace aware usually have …NS() on the end, and take a namespace parameter.

This code in W3CDom.W3CBuilder creates attributes from the jsoup node tree:

private void copyAttributes(org.jsoup.nodes.Node source, Element el) {
  for (Attribute attribute : source.attributes()) {
    String key = Attribute.getValidKey(attribute.getKey(), syntax);
    if (key != null) { // null if couldn't be coerced to validity
      el.setAttribute(key, attribute.getValue());
    }
  }
}

This code assumes that W3CDom is not namespace aware, so it is using Element.setAttribute(…). But if this W3CDom instance is namespace aware, it should be using Element.setAttributeNS(…). In fact there is a namespaceAware flag in W3CDom.W3CBuilder, so the code should look like this:

private void copyAttributes(org.jsoup.nodes.Node source, Element el) {
  for (Attribute attribute : source.attributes()) {
    String key = Attribute.getValidKey(attribute.getKey(), syntax);
    if (key != null) { // null if couldn't be coerced to validity
      if(namespaceAware) {  // partial fix for #1839
        el.setAttributeNS(null, key, attribute.getValue());  // partial fix for #1839
      } else {
        el.setAttribute(key, attribute.getValue());
      }
    }
  }
}

The existing code works, however, with the current XML implementation. You seem to be getting lucky here, because apparently el.setAttribute(key, attribute.getValue()) "does the right thing" and sets the local name to that given, and sets the namespace URI to null. You are doing something similar (although in the opposite way) when you create create elements using doc.createElementNS(namespace, tagName). Presumably if the factory isn't namespace aware, it just "does the right thing" when you assign a namespace. But I don't know if this behavior is actually required by the API spec or not; it may be dependent on the XML implementation the JVM happens to be using. Personally I would check the flag and use the correct namespace-aware or non-namespace-aware API each time, both for elements and for attributes.

But the bigger problem (which admittedly probably only affects less than 0.1% of documents in the wild) is that the code for converting attributes to the DOM completely ignores any indicated namespace. Attributes can have namespaces too, if they have a prefix (e.g. the xlink:href attribute I mentioned in the description). So for a full fix, you need to be doing a similar check for a prefix as you do with elements, using the namespacesStack (because attributes use the same namespace inheritance algorithm for explicit namespace declarations), with one change: if an attribute does not indicate a namespace (i.e. it has no prefix), it does not use the currently defined default namespace, but instead gets a null namespace.

As I mention at the beginning of this description, I originally filed this bug thinking that namespace-aware attributes were broken altogether, but I had been testing on the wrong node. Nevertheless there is a minor bug here that would surface if anyone were to ever try to use an attribute from a non-HTML namespace.

@garretwilson garretwilson changed the title Bug: DOM conversion not correctly processing attributes in a namespace-aware manner Bug: DOM conversion not correctly recognizing prefixed attributes indicating non-HTML namespaces Sep 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant