Recompress and Re-indexing Errors #4

Open
logpanic opened this issue Jan 30, 2020 · 0 comments

We've run into two issues while trying to recompress and re-index some of our older ARCs.

1) When running warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz, we get:

IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read as a WARC or ARC

Could anyone elaborate on what's going on here, or suggest a possible workaround?
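
To narrow this down, a quick look at the first bytes of the file may help. The following is a minimal diagnostic sketch using only the Python standard library, not warcio's own detection logic. The function name `sniff_archive` and the returned dictionary keys are hypothetical. It assumes an ARC file starts with a `filedesc://` line and a WARC file starts with `WARC/`.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member


def sniff_archive(path, nbytes=64):
    """Return a rough guess at what `path` contains (hypothetical helper)."""
    with open(path, "rb") as fh:
        compressed = fh.read(2) == GZIP_MAGIC

    opener = gzip.open if compressed else open
    try:
        with opener(path, "rb") as fh:
            start = fh.read(nbytes)
    except (OSError, EOFError, zlib.error) as exc:
        # Truncated or corrupt gzip data ends up here.
        return {"gzip": compressed, "kind": "unreadable", "error": str(exc)}

    if start.startswith(b"WARC/"):
        kind = "warc"
    elif start.startswith(b"filedesc://"):
        kind = "arc"
    else:
        kind = "unknown"
    return {"gzip": compressed, "kind": kind, "start": start}
```

Calling `sniff_archive("IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz")` would show whether the file is gzip at all and whether its first record looks like an ARC header. A result of `unknown` or `unreadable` would point at the file itself rather than at the recompress step.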

2) For some of the ARCs that are successfully recompressed, we get this error after running cdxj-indexer:

UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 403: ordinal not in range(128)

We've hand-checked a few of these ARCs, and the offending resource always seems to be a binary image. Any suggestions on how to move forward? I can also post the first error to the warcio repository if that's more appropriate.
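
For reference, the same class of error can be reproduced in isolation. This is a minimal sketch, not taken from cdxj-indexer's code: the sample line and the `write_line` helper are hypothetical, and it assumes the failure happens when a string containing U+00ED is written to a text stream whose encoding is ASCII.

```python
import io

# Hypothetical index line containing the character '\xed' (U+00ED).
line = "example.org/im\xedg.gif 20041020093524 {}"


def write_line(text, encoding, errors="strict"):
    """Write `text` through a text stream with the given encoding, return the bytes."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return buf.getvalue()


# An ASCII stream cannot represent U+00ED, so this raises UnicodeEncodeError.
try:
    write_line(line, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# The same line is written without error to a UTF-8 stream.
utf8_bytes = write_line(line, "utf-8")
```

If the indexer output is going to stdout or a file opened with the locale's default encoding, one thing that may be worth trying is running it with `PYTHONIOENCODING=utf-8` set in the environment. Whether that applies here depends on where cdxj-indexer actually raises the error.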
