URLNormalizer - encodeNonURICharacters giving an error #1063

Open
stejacob opened this issue Sep 16, 2024 · 7 comments

@stejacob

On the Forces.ca webpage (https://forces.ca/fr/temps-partiel/), we found an issue with an href link pointing to https://forces.ca/"%url%/".

When using the normalization with the encodeNonURICharacters option, it throws an error because the link is not well-formed.

We attempted to apply a replacement, but normalization runs before any replacements are made. If we could perform the replacement before normalization, it would likely resolve the issue.

For now, we're exploring other solutions. Thank you for your understanding.

@essiembre
Contributor

What is the desired outcome with such a URL? Do you know if it is supposed to be a valid URL? When I access it in my browser, the server sends a "Bad Request." What would you replace the URL with so it resolves to a valid page that can be downloaded properly?

If the URL is indeed bad, even if appropriately encoded, this exception should be harmless, and you can ignore it.

If the concern is to keep your logs clean, I suggest one of the following:

  1. Turn off logging for the GenericURLNormalizer in the log4j2.xml file:
    <Logger name="com.norconex.collector.http.url.impl.GenericURLNormalizer" level="OFF" additivity="false">
      <AppenderRef ref="Console"/>
    </Logger>
  2. Filter out faulty URLs before they reach the URL normalizer with a reference filter, e.g.:
    <referenceFilters>
      <filter class="ReferenceFilter" onMatch="exclude">
        <valueMatcher method="regex">.*%url%.*</valueMatcher>
      </filter>
    </referenceFilters>

Does this help resolve your issue?

@stejacob
Author

stejacob commented Sep 17, 2024

Hello Pascal,

Yes, it seems like the issue is due to a faulty URL. We’ve already asked the webmaster to address the problem, but we’re not sure when it will be resolved.

Your second option would definitely work for us—thanks again for the suggestion.

From what the search admins have shared, when the error occurs, the crawler either stops working or halts at a certain point, but I have not noticed that myself.

The ticket serves two purposes: to report the issue and to potentially raise an enhancement request for the URL normalizer. This type of issue is very rare, so the workaround you proposed would be a good solution for us.

For the enhancement, it would be helpful if we had the ability to perform pre-normalization and post-normalization using the replacement tag.

Here is a suggestion:
<urlNormalizer>
  <preNormalizations>
    <replacement>....</replacement>
  </preNormalizations>
  <normalizations>...</normalizations>
  <postNormalizations>
    <replacement>....</replacement>
  </postNormalizations>
</urlNormalizer>

Again, we’re really pleased with our decision to go with Norconex Crawler for our transition to Elasticsearch. The features and support have been outstanding.

Thanks!

@essiembre
Contributor

Very much appreciated!

In my tests, the exception did not stop the crawler. It would be nice of you to share a config file that can reproduce the crawler stopping if you have one, as I think that would be a bug.

I am marking this as a feature request to allow replacements before and after normalization rules. Along the lines of what you propose, I am considering adding support for configuring multiple URL normalizers so that you can mix and match them in the desired order.

@stejacob
Author

Regarding the issue, I was simply relaying what my admin team reported, but I believe there are multiple factors contributing to that behavior. I don’t think it's an actual bug. If we can accurately reproduce the issue, I'll open a new bug report.

Thank you so much Pascal.

@essiembre
Contributor

3.1.0-SNAPSHOT was just released and now supports multiple URL normalizers, which will address your initial request. Example:

<urlNormalizers>
  <urlNormalizer class="GenericURLNormalizer">
    <replacements>
      <replace>
        <match>...</match>
        <replacement>...</replacement>
      </replace>
    </replacements>
  </urlNormalizer>
  <urlNormalizer class="GenericURLNormalizer">
    <normalizations>
      addWWW,
      decodeUnreservedCharacters,
      encodeNonURICharacters,
      encodeSpaces,
      lowerCaseSchemeHost,
      removeDefaultPort,
      removeDuplicateSlashes,
      removeQueryString,
      secureScheme,
      upperCaseEscapeSequence
    </normalizations>
    <replacements>
      <replace>
        <match>...</match>
        <replacement>...</replacement>
      </replace>
    </replacements>
  </urlNormalizer>
</urlNormalizers>

Please give it a try and confirm.

@stejacob
Author

Wow, I wasn’t expecting this feature request to be completed already. Amazing!

Do the urlNormalizer entries run in the sequence defined in the file?

We’ll give it a try and keep you posted. Thanks again.

@essiembre
Contributor

Yes, they run in the order you define them.
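
For the forces.ca link from the original report, the ordered normalizers could be combined roughly as follows. This is only a sketch: the `<match>` pattern and the empty replacement are illustrative assumptions, not something confirmed in this thread, and it assumes the normalizer receives the link with its literal quotes (`https://forces.ca/"%url%/"`).

```xml
<urlNormalizers>
  <!-- Step 1 (runs first): strip the malformed fragment.
       The pattern below is a hypothetical example; adjust it to the
       faulty links actually seen on your site. -->
  <urlNormalizer class="GenericURLNormalizer">
    <replacements>
      <replace>
        <match>"%url%/"</match>
        <!-- Assumed: an empty replacement removes the matched text. -->
        <replacement></replacement>
      </replace>
    </replacements>
  </urlNormalizer>
  <!-- Step 2 (runs second): encode what is left of the URL. -->
  <urlNormalizer class="GenericURLNormalizer">
    <normalizations>
      encodeNonURICharacters
    </normalizations>
  </urlNormalizer>
</urlNormalizers>
```

With this pattern, the first normalizer would reduce the URL to `https://forces.ca/` before `encodeNonURICharacters` runs in the second one. Whether that is the right target page is a separate question, so the reference filter suggested earlier may remain the safer option.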
