Feature Requests / questions on use --> Pipe, Readme #7
You can pipe to cdxj-indexer by passing the input in. If you are using CommonCrawl data, I believe the cdxj indices should also be available for download along with the WARCs, so you don't need to reindex the data. The trick with parallel processing is finding the boundaries of the gzip records, which may not be too bad.
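A hedged sketch of that boundary-finding idea (the magic-byte scan is my own assumption about an approach, not anything cdxj-indexer ships): per-record gzipped WARCs are just concatenated gzip members, so candidate boundaries can be found by scanning for the gzip magic bytes, then verifying each hit by attempting decompression from that offset.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip magic bytes plus the deflate method byte

def candidate_boundaries(data: bytes):
    """Yield byte offsets that look like the start of a gzip member.

    This is only a heuristic: the magic sequence can also occur inside
    compressed payloads, so real code must verify each candidate by
    attempting to decompress from that offset.
    """
    pos = data.find(GZIP_MAGIC)
    while pos != -1:
        yield pos
        pos = data.find(GZIP_MAGIC, pos + 1)

# Per-record gzipped WARCs are concatenated gzip members, so a
# two-member stream stands in for a tiny .warc.gz here.
stream = gzip.compress(b"record one") + gzip.compress(b"record two")
offsets = list(candidate_boundaries(stream))
print(offsets[0])  # 0: the first member always starts at offset 0
```

Each verified boundary then becomes a unit of work that can be handed to a separate process.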
Thanks for the prompt reply, and the pointer with regard to piping. As for CommonCrawl: while I am using that data, I am functionally only interested in website homepages. I took a step back and realized that I am in a good position to multiprocess on a per-file basis, as my eventual files follow a naming convention that I get the impression will make sense to you. This may be off-topic for this issue; if so, I understand.
A few feature requests and/or requests for help using cdxj-indexer!
--> Also, my timing is good based on the reply by @ikreymer in another issue; it seems we're both coming back to our respective projects. Nothing like a global pandemic to make time for hobby projects. haha
One of the first things I tried was piping the output of a command to cdxj-indexer, but that simply does not work. What's the recommended method to index output produced this way? (Forgive me if I'm missing something primitive; I'm still learning.)
--> While simple bash scripts are the most likely culprit for trying to pipe to cdxj-indexer, I have a gzip hardware accelerator (an FPGA with real-world throughput over 1 GB/s in either direction), which would work really well if I could pipe:

```
my-fast-funzip file.warc.gz | cdxj-indexer
```

I am working on a Python wrapper for `my-fast-funzip`, though, as this need keeps popping up.

As well, when looking at `--help`, I see some other flags which I am having trouble finding documentation for, such as `--compress` and `--lines`. Is there a more robust readme kicking about somewhere that I simply missed?
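For what it's worth, the plumbing for such a wrapper can be sketched with the stdlib `subprocess` module. Both stages below are stand-ins implemented with `python -c` so the sketch runs anywhere; the real `my-fast-funzip` and cdxj-indexer invocations would replace them and are assumptions on my part:

```python
import gzip
import os
import subprocess
import sys
import tempfile

# Create a tiny stand-in for file.warc.gz.
with tempfile.NamedTemporaryFile(suffix=".warc.gz", delete=False) as f:
    f.write(gzip.compress(b"WARC/1.0\r\n"))
    path = f.name

# Stage 1: decompressor (stand-in for my-fast-funzip) writes to a pipe.
decompress = subprocess.Popen(
    [sys.executable, "-c",
     "import gzip,sys;"
     "sys.stdout.buffer.write(gzip.decompress(open(sys.argv[1],'rb').read()))",
     path],
    stdout=subprocess.PIPE,
)

# Stage 2: consumer (stand-in for cdxj-indexer) reads that pipe as stdin.
consume = subprocess.Popen(
    [sys.executable, "-c",
     "import sys;print(len(sys.stdin.buffer.read()))"],
    stdin=decompress.stdout,
    stdout=subprocess.PIPE,
)
decompress.stdout.close()  # let the producer see a broken pipe if stage 2 dies
out, _ = consume.communicate()
os.unlink(path)
print(out.decode().strip())  # number of decompressed bytes seen by stage 2
```

The same wiring works regardless of what the two executables actually are, which is why a thin wrapper keeps being worth writing.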
Lastly, multiprocessing would be a godsend.
My machine's CPU threads are relatively slow, as it's an old server, but it does have 48 cores / 96 threads.
Generally speaking, I am likely not the only one who will find their way here by working with CommonCrawl WARCs. I have ~40 TB of warc.gz data to work through, so the gzip FPGA plus multiprocessing would cut the time required for this step by a few orders of magnitude.
I'll likely work on a multiprocessing solution myself. In the past, I've handled multiprocessed writing to one file with the logging library. I believe the cdxj format is fine with an arbitrary line order, since I see sorting functionality here; is that correct?
--> Unless someone volunteers to help a beginner clean up their code, I likely won't make a pull request.
TLDR: