Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

predict on test set rows #706

Closed
mb706 opened this issue Oct 19, 2022 · 5 comments
Closed

predict on test set rows #706

mb706 opened this issue Oct 19, 2022 · 5 comments

Comments

@mb706
Copy link
Collaborator

mb706 commented Oct 19, 2022

go through each pipeop and let it predict directly, since some pipeops need the predicted set

@mb706
Copy link
Collaborator Author

mb706 commented Jan 26, 2024

  • PipeOpTaskPreproc checks if there are test set rows, if so calls .predict_task
  • tests should verify that all pipeops that have "Task" inputs and outputs handle this correctly: $predict() and $train()[test set rows, ] are the same. The autotests have some flag that indicates if $predict() is random? in expect_datapreproc_pipeop_class
  • some pipeops that do task preprocessing but dont inherit from pipeoptaskpreproc are worrying here: pipeopimpute. It should probably just inherit from pipeoptaskpreproc.
  • PipeOpTaskPreproc always handles test rows if they are present. GraphLearner removes test rows in most cases, unless a PipeOp needs them. Determines this by checking each PipeOp: Does it contain a $uses_test_rows active binding / field? (choose not to just look for attached $learner, if present. Problem here: e.g. ImputeLearner should not forward its 'needs-test-set' since it can not be used)
  • GraphLearner can set its own relevant property

@sebffischer
Copy link
Member

sebffischer commented Jan 29, 2024

There is a memory inefficiency with the suggested approach:

Let's say we have a task with data of the form

row_id y x1
1
...
n_use
n_use + 1
...
n_use + n_test

Lets say we apply two preprocessing operations to x1 in two subsequent PipeOps.

In the approach we discussed, we first task$cbind() the preprocessed train rows (pipeop$.train_task()and then task$rbind() the preprocessed test rows, resulting from pipeop$.predict_task().
The problem with this is now, that the task$rbind() of the test rows, happens in both preprocessing PipeOps.
This means in both PipeOps, a table of the form below is rbinded.

row_id y x1
1
...
n_test

The wasteful thing here is that we rbind column y, as it contains information that is simply already present in the tasks's backend and is now being duplicated. This problem comes worse when there are many columns that are left untouched by a preprocessing pipeop.

The virtual backend that would result of applying two preprocessing operations on x1 would internally store n_use + n_test + n_test * n_pipeop rows, (containing all un-preprocessed columns + newly created columns) which is not so great.
Fortunately there is a more memory efficient way to achieve the same results, which works (at least) for the standard scenario of having distinct and unique use rows and test rows (which is usually the case, except for bootstrapping and insample resampling). That approach requires the following:

  • we modify mlr3's task$cbind() function, to support $cbind()-ing test rows (straightforward)
  • Preprocessing PipeOps must (I think) call $.train_dt() and$.predict_dt(), then rbind their result and cbind them to the task. (not so straightforward, because we are changing the structure of PipeOpTaskPreproc)

@sebffischer
Copy link
Member

In addition to that, retrieving the use roles will become more and more expensive, because of the many DataBackendRbinds. This is, because calling databackendrbind$data(), first goes through the rbinded backend (b2) to find the rows and then through b1.

@sebffischer
Copy link
Member

sebffischer commented Jan 29, 2024

Some possible solutions:

  1. Accept this inefficiency (I am not a huge fan)
  2. Hack our way around it (removing the internal cbinded backend containing the preprocesses use rows after $.train_task(), then rbind its data with the preprocessed test rows and $cbind() it again to the task) (I am also not a huge fan)
  3. Change the $.train_task() method everywhere (maybe (?))
  4. Somehow not $rbind() the test and use tasks, to avoid the problem (I think this should be considered)

@mb706
Copy link
Collaborator Author

mb706 commented Aug 21, 2024

Now solved by #770

@mb706 mb706 closed this as completed Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants