Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pybossa.js assumes stable order #2

Open
jtbates opened this issue Sep 5, 2016 · 6 comments
Open

pybossa.js assumes stable order #2

jtbates opened this issue Sep 5, 2016 · 6 comments

Comments

@jtbates
Copy link

jtbates commented Sep 5, 2016

While the user is working on a task, pybossa.js loads the next task using the offset parameter. After the user completes the task and the task run is submitted, pybossa.js assumes that the task it retrieved previously at offset=1 is now at offset=0 and that newtask?offset=1 will return a different task. If Issue #1 is fixed but the scheduler still violates this assumption, there can be an error where the task presenter tells the user that they have completed all tasks while there is still one left.

Here is what could happen:

  1. There are three candidate tasks that the user has not completed: x y z .
  2. The task presenter retrieves task x at offset=0 and y at offset=1.
  3. The user completes task x and the task run is submitted.
  4. There are now two candidate tasks left: y z.
  5. The user is presented the preloaded task y.
  6. Another request is made to newtask?offset=1 and the response is again y.
  7. pybossa.js recognizes that this is the same task that is now presented and requests newtask?offset=2. The response is empty as there are only two candidate tasks left.
  8. The user completes task y. The next preloaded task is empty and the task presenter tells the user that they have completed all tasks leaving z unfinished.
@therealmarv
Copy link
Contributor

Big thanks for the deep analysis. I will use your steps for test and validation once rework on the random scheduler is done. The random scheduler is currently not intended to use for production setups (it breaks PYBOSSA currently) and needs to be migrated to the current PYBOSSA.

@georgeslabreche
Copy link

georgeslabreche commented Oct 13, 2016

For our Decode Darfur microtasking project, I implemented this random task fetching function in task_repository.py:

    def get_random_ongoing_task(self, project_id, user_id, user_ip):

        # If an authenticated user requests for a random task
        if user_id is not None:
            sql = text('''
                SELECT * FROM "task"
                LEFT JOIN "task_run"
                ON task_run.task_id = task.id
                WHERE task.project_id = :project_id
                AND task.state = 'ongoing'
                AND (task_run.task_id IS NULL OR (task_run.task_id IS NOT NULL AND (task_run.user_id != :user_id OR task_run.user_id IS NULL)))
                ORDER BY random() LIMIT 1;
            ''')

            task_row_proxy = self.db.session.execute(sql, dict(project_id=project_id, user_id=user_id)).fetchone()
            return Task(task_row_proxy)

        # If an anonymous user requests for a random task
        elif user_ip is not None:
            sql = text('''
                SELECT * FROM "task"
                LEFT JOIN "task_run"
                ON task_run.task_id = task.id
                WHERE task.project_id = :project_id
                AND task.state = 'ongoing'
                AND (task_run.task_id IS NULL OR (task_run.task_id IS NOT NULL AND (task_run.user_ip != :user_ip OR task_run.user_ip IS NULL)))
                ORDER BY random() LIMIT 1;
            ''')

            task_row_proxy = self.db.session.execute(sql, dict(project_id=project_id, user_ip=user_ip)).fetchone()
            return Task(task_row_proxy)

        # Normally, this is an unreachable return statement
        return None

The queries make sure that the retrieved random task fulfills the following three rules:

  1. The task's status is ongoing.
  2. The task does not have task runs, or...
  3. ... if it has task runs, then none of them are marked as authored by the requesting user's ID or IP.

Randomness is handled directly in the query rather than at the application level with random.choice().

@teleyinex
Copy link
Member

GREAT. This scheduler was never meant to be used in a production environment. It is a template for developing new ones like yours @georgeslabreche

@georgeslabreche
Copy link

@teleyinex it certainly did help as a template to get started! Thank you!

@jtbates
Copy link
Author

jtbates commented Oct 19, 2016

@georgeslabreche Is there some reason to prefer that the randomness be handled in the query rather than at the application level?

The suggested code does not solve the problem identified in this issue, which is that pybossa.js assumes that the task scheduler provides a stable ordering with respect to the offset in between task runs. I think this could be fixed by setting the random seed by user. That could be done at the query level with setseed or at the application level (as I did in my pull request). There still seems to be a bug there though, but I haven't had a chance to go back to review it.

@teleyinex
Copy link
Member

The best thing is always doing it at the DB level, so that it's handled there, and the scheduler only gets a set of available tasks for the user. Then, PYBOSSA.JS will only take care of taking the current task or the next one for the user from that random set of tasks for the user.

I guess, that what you want is that you need that the randomness should be stable for a given user, so PYBOSSA.JS always uses correctly the offset, is that right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants