Escaping of note content in RDF export #81

avram · 2011-12-21T06:47:42Z

People in the TEI world have noticed that our RDF export makes a mess of HTML tags in item data:

<rdf:value>&lt;h6>1256 to 1272&lt;/h6>
&lt;p>&amp;nbsp;&lt;/p>
&lt;p>page 32&amp;nbsp; roll 1218a 1272 John the Clerk against William de Grendon regarding the warrant of 8 acres&lt;/p>
&lt;p>page 40 ditto&lt;/p>
&lt;p>&amp;nbsp;p108 roll 144 1269 Claim by Margery who was the wife of Henry of Ashbourne&amp;nbsp; re dower from various individuals including Stephen of Ireton the third part of an acre of meadow in Snelston, and ?( William de ) Hulton in Clifton .&amp;nbsp; William de hylton gives up dower amongst others.&amp;nbsp; Makes one wonder whether&amp;lt;per corresp='#williamofhultonclerk' role='m'&amp;gt;William de Hulton&amp;lt;/per&amp;gt; and William the clerk are the same person.&lt;/p>
&lt;p>page 109 ditto Roger is the son of Henry of Ashbourne and is in the custody of Margaret countess Derby&amp;nbsp; and lands in the custody of Edmund king's son&lt;/p>
&lt;p>page 9&amp;nbsp; and 10 1258 Information re Henry of Ashbourne.&amp;nbsp; Holds a court. Case of villeinage.&amp;nbsp; Confirms Henry heir of&amp;nbsp; Robert of Ashbourne.&amp;nbsp; Stephen of Ireton one of the pledges for Henry.&lt;/p>
</rdf:value>
</bib:Memo>

We are presumably doing the same with things like <i> in item titles. A proper solution to this, as suggested in the linked thread on eXist-TEIXML, is to namespace those tags. We would also need to replace non-XML entities like &nbsp;.
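As a rough illustration of the entity replacement mentioned above, here is a minimal sketch that rewrites named HTML entities as numeric character references, which any XML parser accepts. The function name `replaceNamedEntities` and the small entity table are hypothetical and deliberately incomplete; nothing here is taken from the Zotero code base.

```javascript
// Illustrative, incomplete table of HTML named entities (an assumption, not Zotero's list).
const NAMED_ENTITIES = new Map([
  ["nbsp", 160],
  ["copy", 169],
  ["eacute", 233],
  ["mdash", 8212],
]);

// The five entities XML predefines; these are already valid and must be kept.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Hypothetical helper: turn known named entities into numeric references.
function replaceNamedEntities(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name)) return match;
    const codePoint = NAMED_ENTITIES.get(name);
    // Unknown names are left untouched so the caller can decide what to do.
    return codePoint === undefined ? match : `&#${codePoint};`;
  });
}

console.log(replaceNamedEntities("<p>page 32&nbsp; roll 1218a &amp; more</p>"));
// prints: <p>page 32&#160; roll 1218a &amp; more</p>
```

A real implementation would need the full HTML entity list, or a parser that resolves entities before serializing.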

Unfortunately, this behavior has its roots in the underlying Tabulator RDF engine; I don't know how we'd convince it to handle this with namespacing.

I would like help on this, if we have anyone still on the team who has experience with the RDF engine.

simonster · 2011-12-21T06:59:49Z

This is not all that crazy; it is pretty standard to escape HTML in Atom/RSS. I think that implementing the RDF changes (effectively, permitting parseType="Literal") is relatively trivial. What might be harder is writing code that reliably converts the output of TinyMCE from HTML to valid XML so that we don't end up with corrupt RDF by doing this.
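For reference, the escaping approach described here (HTML carried as plain character data, as in Atom/RSS) can be sketched as follows. This is an illustration only: `escapeXmlText` is a hypothetical name, and the choice to leave a lone `>` unescaped is an assumption made to mirror the export sample at the top of the issue.

```javascript
// Hypothetical helper: escape a string for use as XML element content.
// "&" and "<" must always be escaped; ">" only needs escaping when it would
// complete the sequence "]]>", which is not allowed in character data.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/\]\]>/g, "]]&gt;");
}

console.log(escapeXmlText("<h6>1256 to 1272</h6><p>&nbsp;</p>"));
// prints: &lt;h6>1256 to 1272&lt;/h6>&lt;p>&amp;nbsp;&lt;/p>
```

Because the whole note becomes one text node, the result stays well-formed XML even when the note itself is not valid XHTML.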

dstillman · 2011-12-21T15:54:59Z

TinyMCE output should be valid XHTML already.

avram · 2011-12-21T18:08:21Z

We can probably ignore it for titles until Zotero has real rich text support for other fields, since there's no way to enforce proper use of <i>, etc. in other fields at present.

> I think that implementing the RDF changes (effectively, permitting parseType="Literal") is relatively trivial.

We still need to namespace this out of the main RDF namespace and into the XHTML one, but maybe that is easy too.

simonster · 2012-01-26T05:27:51Z

@dstillman Even if TinyMCE's output is XHTML, what do we do about Word copy/paste and people who use the HTML editor? We could escape it in that case, but then someone parsing Zotero RDF with a plain XML parser might be confused by the inconsistency (and to an RDF parser there shouldn't be any difference anyway).

dstillman · 2012-01-26T07:59:39Z

We have TinyMCE set to automatically clean up code. <div><blink>Foo</blink></div (without the trailing ">") entered into the code editor becomes <div>Foo</div>.

aurimasv · 2012-05-14T07:14:38Z

What are the issues with storing data that contains < in CDATA sections?

simonster · 2012-05-16T03:56:11Z

Since it would be the same to an XML parser, I'm not sure putting things in a CDATA section is worth the hassle (unless people are really parsing Zotero RDF without an XML parser, which seems like a bad idea to me). I admit I'm not entirely convinced there's anything wrong with the status quo. Even if our HTML happens to be XHTML, as a standard, XHTML is dead.
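One practical wrinkle with CDATA, independent of the points above, is that the content itself must not contain the terminator `]]>`. A minimal sketch of the usual workaround, which splits the terminator across two CDATA sections, is shown below; `toCdata` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper: wrap arbitrary text in CDATA. Any "]]>" in the input is
// split so that "]]" ends one section and ">" begins the next.
function toCdata(text) {
  return "<![CDATA[" + text.replace(/\]\]>/g, "]]]]><![CDATA[>") + "]]>";
}

console.log(toCdata("<h6>1256 to 1272</h6><p>&nbsp;</p>"));
// prints: <![CDATA[<h6>1256 to 1272</h6><p>&nbsp;</p>]]>

console.log(toCdata("a]]>b"));
// prints: <![CDATA[a]]]]><![CDATA[>b]]>
```

An XML parser concatenates adjacent CDATA sections into the same character data, so the consumer still sees the original string.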

aurimasv · 2012-05-16T08:07:55Z

I don't follow. From Avram's initial post (shortened):
<rdf:value><h6>1256 to 1272</h6> <p>&nbsp;</p></rdf:value>

To make it look nice:

<rdf:value>
<h6>1256 to 1272</h6>
<p>&nbsp;</p>
</rdf:value>

Clearly invalid XML and will not parse.

<rdf:value><![CDATA[
<h6>1256 to 1272</h6>
<p>&nbsp;</p>
]]></rdf:value>

Valid and looks nice.

I feel like I'm missing something about this discussion though.

simonster · 2012-05-16T10:00:40Z

What I'm saying is that there is no difference between the first and last example to a parser, i.e.,

(new DOMParser()).parseFromString("<value>&lt;h6>1256 to 1272&lt;/h6>&lt;p>"+
"&amp;nbsp;&lt;/p></value>", "text/xml").documentElement.firstChild.nodeValue 
== (new DOMParser()).parseFromString("<value><![CDATA[<h6>1256 to 1272</h6><p>"+
"&nbsp;</p>]]></value>", "text/xml").documentElement.firstChild.nodeValue
/*
true
*/

If someone is trying to parse these files with something besides an XML parser, they are doing it wrong. I don't think it's worth putting any effort into this for the pleasure of people who sit down and read the XML, because that's not what it's meant for.

If there's any change needed, it'd be to something more like the second example:

<rdf:value rdf:parseType="Literal"><h6>1256 to 1272</h6>
<p>&#160;</p></rdf:value>

This is valid and should look the same as the other examples to an RDF parser, but is (slightly) cleaner and easier to deal with if you want to parse note contents with an XML parser. However, in the age of HTML5, I'm not convinced that people should be parsing (X)HTML with an XML parser at all as a matter of principle, because XHTML is dead.
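To make the literal variant concrete, here is a hedged sketch of how a note could be serialized as an XML literal, combined with the earlier suggestion to move the markup into the XHTML namespace. The helper name `toXmlLiteral` and the wrapping `<div>` that carries the namespace declaration are assumptions made for illustration, not what Zotero or the Tabulator engine does. The sketch also assumes the input is already well-formed XHTML apart from `&nbsp;`.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: emit note markup as an RDF/XML XML literal.
// Assumes `xhtml` is well-formed XML except for the HTML-only &nbsp; entity.
function toXmlLiteral(xhtml) {
  // &nbsp; is not defined in plain XML, so use the numeric reference instead.
  const body = xhtml.replace(/&nbsp;/g, "&#160;");
  // The wrapper <div> puts all child elements in the XHTML namespace.
  return `<rdf:value rdf:parseType="Literal"><div xmlns="${XHTML_NS}">${body}</div></rdf:value>`;
}

console.log(toXmlLiteral("<h6>1256 to 1272</h6><p>&nbsp;</p>"));
// prints: <rdf:value rdf:parseType="Literal"><div xmlns="http://www.w3.org/1999/xhtml"><h6>1256 to 1272</h6><p>&#160;</p></div></rdf:value>
```

If the input is not well-formed, this produces corrupt RDF, which is the risk raised earlier in the thread about content that bypasses TinyMCE's cleanup.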

dstillman · 2012-05-16T19:16:03Z

> However, in the age of HTML5, I'm not convinced that people should be parsing (X)HTML with an XML parser at all as a matter of principle, because XHTML is dead.

That's probably overstating things—there's still a valid XML serialization of HTML5 (XHTML5), and I imagine TinyMCE will continue to produce valid XML markup even when it adds HTML5 elements. The general point is reasonable, but there is something to be said for readability, and if we don't see ourselves outputting non-XML HTML at any point, is there any reason not to use the literal example? We do embed XHTML directly in various server API modes, so this would be consistent with that.
