URLNormalizer - removeFragments should ignore SPA pages #1061

stejacob · 2024-09-11T18:59:30Z

https://opensource.norconex.com/commons/lang/v2/apidocs/com/norconex/commons/lang/url/URLNormalizer.html?is-external=true#removeFragment--

The removeFragments option in the URL normalizer seems to be removing the pound sign (#) in URLs used in Single Page Application (SPA) schemes.

For example, the URL https://forces.ca/en/events/#/details/14742 should remain intact, even when the removeFragments option is applied.

I believe the fragment should only be removed if it appears at the end of the URL, after the last forward slash, like in https://forces.ca/en/career/emergency-medicine/#sec-training.

Thank you.

essiembre · 2024-09-13T02:25:26Z

Hello Stephen!

I am marking this as a feature request to add it as an extra normalization option. In the meantime, you can achieve an equivalent result with replacements and a regular expression. Here is an example (untested):

<urlNormalizer class="GenericURLNormalizer">
  <normalizations>
    <!-- Your current normalizations here -->
  </normalizations>
  <replacements>
    <replace>
      <match>(.*?)(#[^\/]*)$</match>
      <replacement>$1</replacement>
    </replace>
  </replacements>
</urlNormalizer>
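
The pattern can also be checked outside the crawler. The following is a minimal sketch, assuming the replacement is applied with the same semantics as Java's `Matcher.replaceAll`; the class and method names are hypothetical and are not part of the Norconex API.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the pattern; not part of Norconex.
final class TrailingFragmentDemo {

    // Same pattern as in the <match> element. In Java source, "\\/" is the
    // regex "\/", which is simply a literal forward slash.
    private static final Pattern TRAILING_FRAGMENT =
            Pattern.compile("(.*?)(#[^\\/]*)$");

    // Removes a '#...' suffix only when no '/' follows the '#'.
    // Assumption: the crawler applies replacements like replaceAll does.
    static String stripTrailingFragment(String url) {
        return TRAILING_FRAGMENT.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // SPA-style URL: a '/' follows the '#', so nothing matches.
        System.out.println(stripTrailingFragment(
                "https://forces.ca/en/events/#/details/14742"));
        // Trailing fragment: '#sec-training' is removed.
        System.out.println(stripTrailingFragment(
                "https://forces.ca/en/career/emergency-medicine/#sec-training"));
    }
}
```

With the two URLs from this issue, the first should come back unchanged and the second should lose only `#sec-training`.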

Does that work for you?

stejacob · 2024-09-16T19:11:48Z

Hello Pascal,

That approach should work.

We previously used a similar pattern, but we’ll give yours a try. It might be safer than the one we developed, since we are not experts in Java regex. :)

Here’s the pattern we used:

<replacements>
  <replace>
    <match>(.*)(#[^/]*$)</match>
    <replacement>$1</replacement>
  </replace>
</replacements>
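
For comparison, here is a minimal sketch of how the greedy pattern and the lazy one behave, again assuming `Matcher.replaceAll` semantics. The class name, method names, and the double-`#` URL are hypothetical and used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two patterns; not part of Norconex.
final class FragmentPatternComparison {

    // Lazy variant: group 1 stops at the first '#' with no '/' after it.
    private static final Pattern LAZY = Pattern.compile("(.*?)(#[^\\/]*)$");

    // Greedy variant: group 1 extends to the last '#' in the URL.
    private static final Pattern GREEDY = Pattern.compile("(.*)(#[^/]*$)");

    static String stripLazy(String url) {
        return LAZY.matcher(url).replaceAll("$1");
    }

    static String stripGreedy(String url) {
        return GREEDY.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://forces.ca/en/events/#/details/14742",
            "https://forces.ca/en/career/emergency-medicine/#sec-training",
            // Hypothetical, non-standard URL containing two '#' characters.
            "https://example.com/page#one#two"
        };
        for (String url : urls) {
            System.out.println(url);
            System.out.println("  lazy:   " + stripLazy(url));
            System.out.println("  greedy: " + stripGreedy(url));
        }
    }
}
```

Both patterns should give the same result for the two URLs in this issue. They differ only when more than one `#` follows the last slash: the lazy pattern removes everything from the first such `#`, while the greedy one removes only the last `#` and what follows it.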

Thank you very much, Pascal.

essiembre · 2024-10-15T01:55:51Z

3.1.0-SNAPSHOT was just released and now supports a new normalization rule: removeTrailingFragment. It behaves the same as removeFragment, except that it only treats a hash sign (#) as the start of a fragment when it appears after the last URL segment (/...).

Please give it a try and confirm.

stejacob · 2024-10-16T18:25:50Z

Excellent, Pascal. We will give it a try and let you know. Thank you very much.

essiembre added the feature-request label Sep 13, 2024

essiembre added the resolved label Oct 15, 2024
