
Workers should also be able to process corpus #37

Open
turicas opened this issue Jun 10, 2012 · 0 comments

turicas (Contributor) commented Jun 10, 2012

Currently workers can only process documents: the broker gets the document's information from MongoDB and passes it to the worker's wrapper (which then calls the worker's main function, passing the document as a parameter).
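In other words, something like this (a minimal sketch; broker_dispatch, the database layout and the wrapper's signature are simplified and hypothetical, not pypln's actual code):

```python
from multiprocessing import Pool

from pymongo import MongoClient


def wrapper(args):
    # Hypothetical wrapper: receives the already-fetched document from the
    # broker and calls the worker's main function with it.
    worker_main, document = args
    return worker_main(document)


def broker_dispatch(worker_main, document_id):
    # Hypothetical broker side: fetch the full document from MongoDB, then
    # hand it to the worker process (multiprocessing pickles the arguments,
    # so worker_main must be a top-level, importable function).
    documents = MongoClient().pypln.documents
    document = documents.find_one({'_id': document_id})
    with Pool(processes=1) as pool:
        return pool.apply(wrapper, ((worker_main, document),))
```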
With the current approach we can't create a worker for, for example, a corpus word-cloud analysis (or any analysis that needs to process an entire corpus instead of a single document).
We could change the code a little so the broker fetches an entire corpus from MongoDB and passes it to the worker's wrapper, but there is a problem with this simple approach: a corpus is much larger than a document (since it is a collection of documents, each with its own analyses), and it is not a good idea to pass an entire corpus from the broker process to the worker process (multiprocessing uses pickle for this job, with temporary files to store the pickled objects).
So the best way to do it is to fetch the data from MongoDB inside the worker process, but we don't want to give the worker code itself direct MongoDB access. I think we need a solution like this:

  • The broker should pass MongoDB access information to workers.wrapper when the worker needs to work on a corpus (key from = 'corpus' in the worker's __meta__).
  • workers.wrapper should connect to MongoDB, fetch the entire corpus lazily and pass this lazy object to the worker's main function (see the sketch after this list).
  • workers.wrapper should also pass corpus-specific information, as it does for documents (for example, so the worker knows the results of previous analyses: the worker freqdist needs the key tokens, which is the output of the worker tokenizer).
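A sketch of what that wrapper could look like (the mongo_config layout, the 'requires' key and the worker interface are assumptions, not existing pypln API; the important part is that a pymongo cursor is already lazy):

```python
from pymongo import MongoClient


def wrapper(worker, mongo_config):
    # Runs inside the worker process: only the small mongo_config dict
    # crossed the process boundary, never the corpus itself.
    client = MongoClient(mongo_config['host'], mongo_config['port'])
    db = client[mongo_config['database']]
    # Fetch only the keys the worker declared it needs (e.g. 'tokens' for
    # freqdist); an empty projection here means "all fields".
    projection = {key: 1 for key in worker.__meta__.get('requires', [])} or None
    # A pymongo cursor is lazy: documents are fetched in batches while the
    # worker iterates, so the whole corpus is never in memory at once.
    corpus = db.documents.find({'corpus_id': mongo_config['corpus_id']}, projection)
    return worker.main(corpus)
```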

There is a problem once we permit workers to do corpus analysis: if a corpus changes (a document is added, modified or deleted), we need to re-run all the corpus analyses. We must create a way to re-schedule the corpus pipeline when a job that adds, modifies or deletes a document from that corpus finishes (we'll probably need a heuristic so we don't schedule 100 corpus pipelines for the same corpus when we add 100 documents to it).
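One possible heuristic, sketched below: mark the corpus as dirty whenever one of its document jobs finishes, and only schedule the corpus pipeline after a quiet period, so adding 100 documents triggers one corpus run instead of 100 (the function names and the quiet-period value are made up):

```python
import time

PENDING = {}       # corpus_id -> timestamp of the last document change
QUIET_PERIOD = 60  # seconds without changes before the pipeline re-runs


def document_job_finished(corpus_id):
    # Called whenever an add/modify/delete job for a document finishes:
    # just remember that the corpus is dirty, don't schedule anything yet.
    PENDING[corpus_id] = time.time()


def schedule_due_pipelines(schedule_corpus_pipeline):
    # Run periodically by the broker: fires at most one pipeline per corpus,
    # and only after the corpus has been quiet for QUIET_PERIOD seconds.
    now = time.time()
    for corpus_id, last_change in list(PENDING.items()):
        if now - last_change >= QUIET_PERIOD:
            del PENDING[corpus_id]
            schedule_corpus_pipeline(corpus_id)
```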

Note: maybe a map-reduce approach would be better, for example: passing each document to a worker.map function and then passing all the resulting information to worker.reduce.
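Under that scheme a corpus worker would expose map/reduce functions instead of a single main; a hypothetical frequency-distribution worker could look like this:

```python
from collections import Counter


class FreqDistCorpusWorker:
    # Hypothetical map-reduce corpus worker for a frequency distribution.

    @staticmethod
    def map(document):
        # Runs once per document, so it can be spread across worker
        # processes just like document workers are today.
        return Counter(document['tokens'])

    @staticmethod
    def reduce(partial_results):
        # Combines the per-document counters into a corpus-level analysis.
        total = Counter()
        for counts in partial_results:
            total.update(counts)
        return {'freqdist': total.most_common()}
```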
