URLNormalizer - removeFragments should ignore SPA pages #1061

Open
stejacob opened this issue Sep 11, 2024 · 4 comments

Comments

@stejacob

https://opensource.norconex.com/commons/lang/v2/apidocs/com/norconex/commons/lang/url/URLNormalizer.html?is-external=true#removeFragment--

The removeFragment option in the URL normalizer also strips the pound sign (#) portion of URLs that use Single Page Application (SPA) routing schemes.

For example, the URL https://forces.ca/en/events/#/details/14742 should remain intact, even when the removeFragment option is applied.

I believe the fragment should only be removed if it appears at the end of the URL, after the last forward slash, like in https://forces.ca/en/career/emergency-medicine/#sec-training.

Thank you.

@essiembre
Contributor

Hello Stephen!

I am marking this as a feature request to add it as an extra normalization option. In the meantime, you can achieve the equivalent with a replacement and a regular expression. Here is an example (untested):

<urlNormalizer class="GenericURLNormalizer">
  <normalizations>
    <!-- Your current normalizations here -->
  </normalizations>
  <replacements>
    <replace>
      <match>(.*?)(#[^\/]*)$</match>
      <replacement>$1</replacement>
    </replace>
  </replacements>
</urlNormalizer>

Does that work for you?
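
To see what the suggested pattern does to the two URLs from this issue, here is a minimal standalone sketch. It assumes the replacement is applied with `String`/`Matcher` `replaceAll` semantics; the class and method names are made up for illustration and are not part of Norconex.

```java
import java.util.regex.Pattern;

class TrailingFragmentRegexDemo {
    // Pattern from the comment above: a lazy prefix, then '#' followed
    // only by non-slash characters up to the end of the URL.
    static final Pattern TRAILING_FRAGMENT = Pattern.compile("(.*?)(#[^\\/]*)$");

    // Hypothetical helper; assumes replaceAll-style application of the rule.
    static String stripTrailingFragment(String url) {
        return TRAILING_FRAGMENT.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // SPA route: a '/' follows the '#', so the pattern does not match.
        System.out.println(stripTrailingFragment(
                "https://forces.ca/en/events/#/details/14742"));
        // Plain trailing fragment: removed.
        System.out.println(stripTrailingFragment(
                "https://forces.ca/en/career/emergency-medicine/#sec-training"));
    }
}
```

The SPA URL passes through unchanged because a slash appears after the `#`, while the second URL loses `#sec-training`.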

@stejacob
Author

stejacob commented Sep 16, 2024

Hello Pascal,

That approach should work.

We previously used a similar pattern, but we’ll give yours a try—it might be safer than the one we developed since we are not experts in Java regex. :)

Here’s the pattern we used:

<replacements>
  <replace>
    <match>(.*)(#[^/]*$)</match>
    <replacement>$1</replacement>
  </replace>
</replacements>

Thank you very much, Pascal.
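
For anyone comparing the two patterns in this thread: they agree on the URLs from the issue, but differ when a URL contains more than one `#` after the last slash. The sketch below assumes `replaceAll` semantics; the class name, helper, and the `example.com` URL are hypothetical.

```java
import java.util.regex.Pattern;

class FragmentPatternComparison {
    // Lazy prefix: matches from the first '#' that has no '/' after it.
    static final Pattern LAZY = Pattern.compile("(.*?)(#[^\\/]*)$");
    // Greedy prefix: matches from the last '#' in the URL.
    static final Pattern GREEDY = Pattern.compile("(.*)(#[^/]*$)");

    // Hypothetical helper; keeps only capture group 1 of each match.
    static String apply(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://forces.ca/en/events/#/details/14742",
            "https://forces.ca/en/career/emergency-medicine/#sec-training",
            "https://example.com/page#a#b" // made-up URL with two '#'
        };
        for (String url : urls) {
            System.out.println(url);
            System.out.println("  lazy:   " + apply(LAZY, url));
            System.out.println("  greedy: " + apply(GREEDY, url));
        }
    }
}
```

With two hash signs, the lazy version removes everything from the first `#`, whereas the greedy version removes only the part from the last `#`.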

@essiembre
Contributor

3.1.0-SNAPSHOT was just released and now supports a new normalization rule: removeTrailingFragment. It behaves the same as removeFragment, except that it treats a hash sign (#) as the start of a fragment only if it appears after the last URL segment (/...).

Please give it a try and confirm.
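
The rule as described can be illustrated without a regular expression. The sketch below is only an approximation based on the description in this comment, not the library's actual implementation; the class and method names are invented, and treating the first `#` after the last slash as the fragment start is an assumption.

```java
class TrailingFragmentSketch {
    // Hypothetical illustration of the described rule: a '#' counts as a
    // fragment delimiter only when no '/' follows it.
    static String removeTrailingFragment(String url) {
        int lastSlash = url.lastIndexOf('/');
        // Assumption: look for the first '#' located after the last slash.
        int hash = url.indexOf('#', lastSlash + 1);
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingFragment(
                "https://forces.ca/en/events/#/details/14742"));
        System.out.println(removeTrailingFragment(
                "https://forces.ca/en/career/emergency-medicine/#sec-training"));
    }
}
```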

@stejacob
Author

Excellent, Pascal. We will give it a try and let you know. Thank you very much.
