URLNormalizer - encodeNonURICharacters giving an error #1063

stejacob · 2024-09-16T19:33:09Z

On the Forces.ca webpage (https://forces.ca/fr/temps-partiel/), we found an issue with an href link pointing to https://forces.ca/"%url%/".

When the URL normalizer runs with the encodeNonURICharacters normalization enabled, it throws an error because the link is not well-formed.

We attempted to apply a replacement, but normalization runs before any replacements are made. If we could perform the replacement before normalization, it would likely resolve the issue.

For now, we're exploring other solutions. Thank you for your understanding.

essiembre · 2024-09-17T02:04:14Z

What is the desired outcome with such a URL? Do you know if it is supposed to be a valid URL? When I access it in my browser, the server sends a "Bad Request." What would you replace the URL with so it resolves to a valid page that can be downloaded properly?

If the URL is indeed bad, even if appropriately encoded, this exception should be harmless, and you can ignore it.

If the concern is to keep your logs clean, I suggest one of the following:

Turn off logging for the GenericURLNormalizer in the log4j2.xml file:

    <Logger name="com.norconex.collector.http.url.impl.GenericURLNormalizer" level="OFF" additivity="false">
      <AppenderRef ref="Console"/>
    </Logger>

Filter out faulty URLs before they reach the URL normalizer with a reference filter. E.g.:

      <referenceFilters>
        <filter class="ReferenceFilter" onMatch="exclude">
          <valueMatcher method="regex">.*%url%.*</valueMatcher>
        </filter>
      </referenceFilters>

Does this help resolve your issue?

stejacob · 2024-09-17T18:14:29Z

Hello Pascal,

Yes, it seems like the issue is due to a faulty URL. We’ve already asked the webmaster to address the problem, but we’re not sure when it will be resolved.

Your second option would definitely work for us—thanks again for the suggestion.

From what the search admins have shared, when the error occurs, the crawler either stops working or halts at a certain point, but I have not noticed that myself.

The ticket serves two purposes: to report the issue and to potentially raise an enhancement request for the URL normalizer. This type of issue is very rare, so the workaround you proposed would be a good solution for us.

For the enhancement, it would be helpful if we had the ability to perform pre-normalization and post-normalization using the replacement tag.

Here is a suggestion:

    <urlNormalizer>
      <preNormalizations>
        <replacement>....</replacement>
      </preNormalizations>
      <normalizations>...</normalizations>
      <postNormalizations>
        <replacement>....</replacement>
      </postNormalizations>
    </urlNormalizer>

Again, we’re really pleased with our decision to go with Norconex Crawler for our transition to Elasticsearch. The features and support have been outstanding.

Thanks!

essiembre · 2024-09-17T19:15:43Z

Very much appreciated!

In my tests, the exception did not stop the crawler. It would be nice of you to share a config file that can reproduce the crawler stopping if you have one, as I think that would be a bug.

I am marking this as a feature request to allow replacements before and after normalization rules. Along the lines of what you propose, I am considering adding support for configuring multiple URL normalizers so that you can mix and match them in the desired order.

stejacob · 2024-09-23T17:55:19Z

Regarding the issue, I was simply relaying what my admin team reported, but I believe there are multiple factors contributing to that behavior. I don’t think it's an actual bug. If we can accurately reproduce the issue, I'll open a new bug report.

Thank you so much Pascal.

essiembre · 2024-10-15T01:53:43Z

3.1.0-SNAPSHOT was just released and now supports multiple URL normalizers, which will address your initial request. Example:

<urlNormalizers>
  <urlNormalizer class="GenericURLNormalizer">
    <replacements>
      <replace>
        <match>...</match>
        <replacement>...</replacement>
      </replace>
    </replacements>
  </urlNormalizer>
  <urlNormalizer class="GenericURLNormalizer">
    <normalizations>
      addWWW,
      decodeUnreservedCharacters,
      encodeNonURICharacters,
      encodeSpaces,
      lowerCaseSchemeHost,
      removeDefaultPort,
      removeDuplicateSlashes,
      removeQueryString,
      secureScheme,
      upperCaseEscapeSequence
    </normalizations>
    <replacements>
      <replace>
        <match>...</match>
        <replacement>...</replacement>
      </replace>
    </replacements>
  </urlNormalizer>
</urlNormalizers>

Please give it a try and confirm.

stejacob · 2024-10-16T18:44:20Z

Wow, I wasn’t expecting this feature request to be completed already. Amazing!

Do the urlNormalizer entries run in the sequence defined in the file?

We’ll give it a try and keep you posted. Thanks again.

essiembre · 2024-10-17T18:33:37Z

Yes, they run in the order you define them.
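
For illustration only, here is a minimal sketch of how ordered normalizers might be applied to the `%url%` link from the original report. The match pattern and the choice to strip the bad segment entirely are assumptions, not something confirmed in this thread; nobody settled what the faulty link should resolve to. It also assumes that a normalizer declared with only `<replacements>` does nothing else first, which is worth checking against the GenericURLNormalizer documentation.

```xml
<urlNormalizers>
  <!-- Runs first: remove the malformed "%url%/" segment (assumed pattern). -->
  <urlNormalizer class="GenericURLNormalizer">
    <replacements>
      <replace>
        <!-- Hypothetical regex; adjust it to the links actually extracted. -->
        <match>"?%url%/?"?</match>
        <replacement></replacement>
      </replace>
    </replacements>
  </urlNormalizer>
  <!-- Runs second: regular normalizations on the cleaned-up URL. -->
  <urlNormalizer class="GenericURLNormalizer">
    <normalizations>
      encodeNonURICharacters,
      lowerCaseSchemeHost
    </normalizations>
  </urlNormalizer>
</urlNormalizers>
```

If the pattern matches the reference as extracted, `https://forces.ca/"%url%/"` would be reduced to `https://forces.ca/` before `encodeNonURICharacters` runs. If the link has no sensible target, the reference filter shown earlier in the thread is probably the simpler option.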

essiembre added the feature-request label Sep 17, 2024

essiembre added the resolved label Oct 15, 2024
