Crawler on update: Checksum remains unchanged even if HTML document has been modified #1071

Open
stejacob opened this issue Oct 21, 2024 · 3 comments


@stejacob

Hi,

We are running a demo site on a Microsoft IIS web server and using the latest version of Norconex Crawler.

We've configured both the documentChecksummer and the metadataChecksummer, but the checksum value stays the same even after we modify the HTML file on the server.

We've tried both the "Last Modified" field and an MD5 checksum on specific fields. In both cases the document is still rejected as unmodified, because the generated checksum is identical before and after the change.

....
Line  8680: 14:12:28.256 [es-node2.deimscloud.mil.ca#3] INFO  REJECTED_UNMODIFIED - http://es-node2.deimscloud.mil.ca/about.html - MD5DocumentChecksummer - Checksum=4529f56e11c85023cd3b815ffd1c2b1e|

....
Line 10840: 14:30:39.085 [es-node2.deimscloud.mil.ca#3] INFO  REJECTED_UNMODIFIED - http://es-node2.deimscloud.mil.ca/about.html - MD5DocumentChecksummer - Checksum=4529f56e11c85023cd3b815ffd1c2b1e|

Any insights or suggestions would be greatly appreciated. Thanks.

@essiembre
Contributor

Can you confirm that something changed in the fields you use to create the MD5 checksum? A change anywhere else in the document will not change the checksum.

If that is not the case, can you share your configuration?

@essiembre
Contributor

Also, the checksum is created AFTER the document is imported. That means you need to ensure the fields you use to create the checksum are still present in the document once the import is done.
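To illustrate the two points above, here is a minimal Python sketch. It is not Norconex's actual implementation: the `field_checksum` helper, the way values are concatenated, and the sample metadata are all assumptions made for illustration, and it ignores the `combineFieldsAndContent` option. It only shows why a checksum built from selected fields stays the same when the edit lands elsewhere, or when those fields are no longer present after import.

```python
import hashlib

def field_checksum(metadata, fields):
    """Hypothetical helper: MD5 over the selected fields only.

    Fields that are absent from the metadata contribute nothing.
    """
    digest = hashlib.md5()
    for name in sorted(fields):
        for value in metadata.get(name, []):
            digest.update(f"{name}={value};".encode("utf-8"))
    return digest.hexdigest()

# Field names taken from the configuration discussed in this thread.
FIELDS = ["title", "body_content", "description"]

# Case 1: the edit touches a field that is not part of the checksum.
before = {"title": ["About"], "description": ["About us"], "keywords": ["a"]}
after_other = {**before, "keywords": ["b"]}
after_title = {**before, "title": ["About our team"]}

same_when_other_field_changes = (
    field_checksum(before, FIELDS) == field_checksum(after_other, FIELDS)
)
differs_when_tracked_field_changes = (
    field_checksum(before, FIELDS) != field_checksum(after_title, FIELDS)
)

# Case 2: the tracked fields did not survive the import step, so both
# versions hash the same (empty) input, whatever changed at the source.
imported_v1 = {"keywords": ["a"]}
imported_v2 = {"keywords": ["b"]}
same_when_fields_missing = (
    field_checksum(imported_v1, FIELDS) == field_checksum(imported_v2, FIELDS)
)

print(same_when_other_field_changes)
print(differs_when_tracked_field_changes)
print(same_when_fields_missing)
```

If the third case matches what you see, checking which fields actually exist on the document after import is a reasonable first step.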

@stejacob
Author

stejacob commented Oct 22, 2024

Morning Pascal,

Our configuration is:

....

<documentChecksummer class="MD5DocumentChecksummer"
                     combineFieldsAndContent="true"
                     keep="true"
                     toField="checksum">
  <fieldMatcher ignoreCase="false"
                ignoreDiacritic="false"
                method="CSV"
                partial="false"
                replaceAll="false">title,body_content,description</fieldMatcher>
</documentChecksummer>

.....

<metadataChecksummer class="com.norconex.collector.core.checksum.impl.GenericMetadataChecksummer"
                     keep="true"
                     targetField="metachecksum">
  <sourceFieldsRegex>title|description</sourceFieldsRegex>
</metadataChecksummer>

Thanks Pascal.
