
Feature Requests / questions on use --> Pipe, Readme #7

Open
jwest75674 opened this issue Sep 7, 2020 · 2 comments

jwest75674 commented Sep 7, 2020

A few feature requests and/or requests for help using cdxj-indexer!
--> Also, my timing seems good, based on the reply by @ikreymer in another issue; it looks like we're both coming back to our respective projects. Nothing like a global pandemic to make time for hobby projects. haha

One of the first things I tried was piping the output of a command to cdxj-indexer, but that simply does not work. What's the recommended way to feed cdxj-indexer the output of a command run in this fashion? (Forgive me if I'm missing something basic, I'm still learning.)
--> While simple bash scripts are the most likely reason to pipe to cdxj-indexer, I have a gzip hardware accelerator (FPGA, real-world throughput over 1 GB per second in either direction), which would work really well if I could run my-fast-funzip file.warc.gz | cdxj-indexer. I am working on a Python wrapper for my-fast-funzip, though, as this need keeps popping up.

Also, when looking at --help, I see some other flags that I am having trouble finding documentation for, such as --compress and --lines. Is there a more complete README kicking about somewhere that I simply missed?

Lastly, multiprocessing would be a godsend.
My machine's CPU threads are relatively slow, as it's an old server, but it does have 48 cores / 96 threads.
Generally speaking, I am likely not the only one who will find their way here while working with CommonCrawl WARCs. I have ~40 TB of warc.gz data to work through, so the gzip FPGA and multiprocessing would reduce the time required for this step by a few orders of magnitude.

I'll likely work on a multiprocessing solution myself. In the past, I've handled multiprocessed writing to one file with the logging library. I believe the cdxj format is fine with arbitrary line order, since I see sorting functionality here; is that correct?
--> Unless someone volunteers to help a beginner clean their code up, I likely won't make a pull request.

TLDR:

  1. Is there a way to pipe to cdxj-indexer? If not, consider this a feature request.
  2. Multiprocessing capabilities, for those of us with more WARC data than time to wait.

ikreymer commented Sep 8, 2020

You can pipe to cdxj-indexer by passing - as the input filename, e.g.: cat ./my-warc.warc.gz | cdxj-indexer -
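For example, the hardware-decompressor pipeline from the question above would then look something like this (assuming my-fast-funzip writes the decompressed WARC data to stdout):

my-fast-funzip file.warc.gz | cdxj-indexer -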

The --compress and --lines features are designed to generate a compressed index, similar to the one used by CommonCrawl: the index is compressed in blocks of N lines, and an outer secondary index is produced alongside it.
Yes, the docs need a bit of updating!
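A hedged example of what an invocation might look like (the exact arguments each flag takes are an assumption here, so check cdxj-indexer --help on your version; 3000 lines per block is just an illustrative value):

cdxj-indexer --compress my-index --lines 3000 my-warc.warc.gz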

If you are using CommonCrawl data, I believe the cdxj indices should also be available for download along with the WARCs, so you don't need to reindex the data.

The trick with parallel processing is finding the boundaries of the gzip records, which may not be too bad.
I'll track this issue to update the README; unfortunately, I don't think I'll have time for parallel processing in the near future.

jwest75674 (Author) commented

Thanks for the prompt reply, and for the pointer about piping; using - is obviously new to me.

With regard to CommonCrawl: while I am using that data, I am functionally only interested in website homepages.
My project to date has revolved around sorting through the dataset to extract only homepages, and ideally nothing rated XXX (lol).
As such, the existing indexes are of little value to me (which is why I am here, looking to reindex).

I took a step back and realized that I am in a good position to multiprocess on a per-file basis, as my eventual files match this format / naming convention, which I get the impression will make sense to you:
CC-MAIN-2020-16_cdx-00040.warc.gz

This may be off-topic for this issue; if so, I understand:
At first glance, it seems most straightforward to output to corresponding CC-MAIN-2020-16_cdx-00040.cdxj indexes.
Is this expected to negatively impact search time during playback? (Having many small indexes, instead of a single large, sorted index?)
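For reference, a minimal per-file sketch of what I'm picturing, just shelling out to the CLI from Python (assumptions: cdxj-indexer is on PATH and prints the index to stdout so it can be redirected to a per-WARC file; adjust per --help):

# Rough per-file sketch: one cdxj-indexer process per WARC, one worker per core.
import glob
import subprocess
from multiprocessing import Pool
from pathlib import Path

def index_one(warc_path):
    # CC-MAIN-2020-16_cdx-00040.warc.gz -> CC-MAIN-2020-16_cdx-00040.cdxj
    out_path = Path(warc_path).with_suffix("").with_suffix(".cdxj")
    with open(out_path, "w") as out:
        subprocess.run(["cdxj-indexer", warc_path], stdout=out, check=True)
    return str(out_path)

if __name__ == "__main__":
    warcs = sorted(glob.glob("CC-MAIN-2020-16_cdx-*.warc.gz"))
    with Pool(processes=48) as pool:  # one worker per physical core
        for done in pool.imap_unordered(index_one, warcs):
            print("indexed:", done)

If it turns out that many small indexes do hurt lookup time, the per-file .cdxj outputs could presumably be concatenated and sorted into one large index afterwards.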
