Escaping of note content in RDF export #81
Comments
This is not all that crazy; it is pretty standard to escape HTML in Atom/RSS. I think that implementing the RDF changes (effectively, permitting parseType="Literal") is relatively trivial. What might be harder is writing code that reliably converts the output of TinyMCE from HTML to valid XML so that we don't end up with corrupt RDF by doing this. |
TinyMCE output should be valid XHTML already. |
We can probably ignore it for titles until Zotero has real rich text support for other fields, since there's no way to enforce proper use of
We still need to namespace this out of the main RDF namespace and into the XHTML one, but maybe that is easy too. |
@dstillman Even if TinyMCE's output is XHTML, what do we do about Word copy/paste and people who use the HTML editor? We could escape it in that case, but then someone parsing Zotero RDF with a plain XML parser might be confused by the inconsistency (and to an RDF parser there shouldn't be any difference anyway). |
We have TinyMCE set to automatically clean up code. |
What are the issues with storing data that contains |
Since it would be the same to an XML parser, I'm not sure putting things in a CDATA section is worth the hassle (unless people are really parsing Zotero RDF without an XML parser, which seems like a bad idea to me). I admit I'm not entirely convinced there's anything wrong with the status quo. Even if our HTML happens to be XHTML, as a standard, XHTML is dead. |
I don't follow. From Avram's initial post (shortened): To make it look nice: `rdf:value 1256 to 1272` Clearly invalid XML, and it will not parse. `rdf:value<![CDATA[ 1256 to 1272]]>` Valid and looks nice. I feel like I'm missing something about this discussion though. |
What I'm saying is that there is no difference between the first and last example to a parser, i.e.,
(new DOMParser()).parseFromString("<value>&lt;h6&gt;1256 to 1272&lt;/h6&gt;&lt;p&gt;"+
"&amp;nbsp;&lt;/p&gt;</value>", "text/xml").documentElement.firstChild.nodeValue
== (new DOMParser()).parseFromString("<value><![CDATA[<h6>1256 to 1272</h6><p>"+
"&nbsp;</p>]]></value>", "text/xml").documentElement.firstChild.nodeValue
/*
true
*/
If someone is trying to parse these files with something besides an XML parser, they are doing it wrong. I don't think it's worth putting any effort into this for the pleasure of people who sit down and read the XML, because that's not what it's meant for. If there's any change needed, it'd be to something more like the second example:
<rdf:value rdf:parseType="Literal"><h6>1256 to 1272</h6>
<p> </p></rdf:value>
This is valid and should look the same as the other examples to an RDF parser, but it is (slightly) cleaner and easier to deal with if you want to parse note contents with an XML parser. However, in the age of HTML5, I'm not convinced that people should be parsing (X)HTML with an XML parser at all as a matter of principle, because XHTML is dead. |
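For comparison, here is a minimal sketch of the two serializations discussed above: escaped text versus an XML literal. The helper names (`escapeXml`, `noteToRdfValue`) and the wrapping `div` in the XHTML namespace are assumptions made for illustration; this is not Zotero's or Tabulator's actual code, and it assumes the note markup is already well-formed XML when the literal form is used.

```javascript
// Hypothetical helpers for illustration only; not part of Zotero or Tabulator.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Serialize a note either as escaped text or as an XML literal.
// Assumption: `noteXhtml` is already well-formed XML when `literal` is true.
function noteToRdfValue(noteXhtml, literal) {
  if (!literal) {
    return "<rdf:value>" + escapeXml(noteXhtml) + "</rdf:value>";
  }
  // Assumed wrapper: a div that places the note markup in the XHTML namespace.
  return '<rdf:value rdf:parseType="Literal">' +
    '<div xmlns="' + XHTML_NS + '">' + noteXhtml + "</div>" +
    "</rdf:value>";
}

const note = "<h6>1256 to 1272</h6><p>&#160;</p>";
console.log(noteToRdfValue(note, false));
console.log(noteToRdfValue(note, true));
```

The escaped form is safe for any input, while the literal form only yields valid RDF/XML if the note itself is well-formed, which is the reliability concern raised earlier in the thread.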
That's probably overstating things—there's still a valid XML serialization of HTML5 (XHTML5), and I imagine TinyMCE will continue to produce valid XML markup even when it adds HTML5 elements. The general point is reasonable, but there is something to be said for readability, and if we don't see ourselves outputting non-XML HTML at any point, is there any reason not to use the literal example? We do embed XHTML directly in various server API modes, so this would be consistent with that. |
Translator fixes and additions
People in the TEI world have noticed that our RDF export makes a mess of HTML tags in item data:
We are presumably doing the same with things like
<i>
in item titles. A proper solution to this, as suggested in the linked thread on eXist-TEIXML, is to namespace those tags. We would also need to replace non-XML entities like
. Unfortunately, this behavior has its roots in the underlying Tabulator RDF engine; I don't know how we'd convince it to handle this with namespacing.
I would like help on this, if we have anyone still on the team who has experience with the RDF engine.
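As a rough sketch of the entity replacement mentioned above, HTML named entities that XML does not define can be rewritten as numeric character references before the markup is embedded. The function name `replaceHtmlEntities` and the small entity table are hypothetical; a real implementation would need the full HTML entity list and would have to hook into the RDF serializer, which this does not attempt.

```javascript
// Illustrative subset of HTML named entities (assumption: a real table is far larger).
const HTML_ENTITIES = new Map([
  ["nbsp", 160],
  ["copy", 169],
  ["eacute", 233],
  ["ndash", 8211],
  ["mdash", 8212],
]);

// The five entities that XML itself predefines; these must be left alone.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Replace known HTML-only named entities with numeric character references.
// Unknown names are left untouched so the caller can decide how to handle them.
function replaceHtmlEntities(markup) {
  return markup.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name)) return match;
    const codePoint = HTML_ENTITIES.get(name);
    return codePoint === undefined ? match : "&#" + codePoint + ";";
  });
}

console.log(replaceHtmlEntities("<h6>1256 to 1272</h6><p>&nbsp;</p>"));
```

Numeric references are valid in any XML document without a DTD, so the result can be parsed by a plain XML parser.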