The result is that multiple words get bunched together into long "invalid" words.
This becomes a problem when you index the scraped text and want to use ngram search on it.
https://xp.readthedocs.io/en/stable/developer/search/query-functions/ngram.html
I don't know how closely the cheerio evaluator's textContent matches the browser variant,
but it might be behaving correctly.
https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent
One might have slightly better results using innerText but that is not supported by surgeon (yet).
Notice that textContent ignores `<br/>` while innerText does not: http://perfectionkills.com/the-poor-misunderstood-innerText/
A workaround I could try is to modify all block elements by appending a single space to each of them and then using textContent, but I'm uncertain whether it's actually smart, or even possible, to rely on an element's style.display property.
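To make the idea concrete, here is a rough sketch of that workaround. It does not use surgeon or cheerio: the node shape (`{ tag, children }` with plain strings as text nodes), the `BLOCK_TAGS` list, and the helper names are all hypothetical. It also assumes a fixed list of tag names in place of style.display, since as far as I know a non-browser parser has no computed styles to read.

```javascript
// Hypothetical, simplified node shape: a string is a text node,
// anything else is { tag: 'p', children: [...] }.

// Assumption: a hard-coded (and incomplete) list of block-level tags
// stands in for checking style.display.
const BLOCK_TAGS = new Set([
  'article', 'aside', 'blockquote', 'br', 'div', 'dd', 'dl', 'dt',
  'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header',
  'li', 'ol', 'p', 'pre', 'section', 'table', 'tr', 'ul',
]);

// Plain textContent-like behaviour: concatenate all text nodes as-is.
function textContent(node) {
  if (typeof node === 'string') return node;
  return node.children.map(textContent).join('');
}

// Same walk, but append a single space after every block element.
function spacedText(node) {
  if (typeof node === 'string') return node;
  const inner = node.children.map(spacedText).join('');
  return BLOCK_TAGS.has(node.tag) ? inner + ' ' : inner;
}

// Collapse the extra whitespace the previous step introduces.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const doc = {
  tag: 'div',
  children: [
    { tag: 'p', children: ['Hello ', { tag: 'b', children: ['big'] }] },
    { tag: 'p', children: ['world'] },
    'line one',
    { tag: 'br', children: [] },
    'line two',
  ],
};

console.log(textContent(doc));           // Hello bigworldline oneline two
console.log(normalize(spacedText(doc))); // Hello big world line one line two
```

The tag list will misclassify any element whose display is changed by CSS, so this only approximates what innerText does in a browser, but it should be enough to keep words from merging before indexing.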