Combined method of RF and Symbolic regression models #273
Hi Dr. Miles Cranmer, thank you for this wonderful tool. I am trying to derive a mathematical expression (as an index) that estimates the impact of weather stress on agricultural yields using your package "pysr", based on 40 years of county-level monthly data. I have 9 features, narrowed down with a Random Forest (RF) model (Wadeker et al., 2020, 2021), and I would like a single equation that captures the relationships shown in the RF partial dependence (accumulated local effect) plots. My questions are as follows:
1.1) The features that the RF model ranks as most important (including interaction effects) do not show up in the final pysr equations, across multiple runs with different random states. I am confused by this inconsistency. Do you have any input on this?
1.2) In addition, the final equation often uses just a couple of features, each repeated two or three times. Is there any way to force particular features to be included in the final equation?
1.3) I assume the selected features absorb the effects of the remaining features that do not show up in the final equation (or that those remaining features offer only minor improvements)? Or do I need to adjust default parameters, such as the ones for mutation or migration?
1.4) Given the inconsistency in the selected features, I am not sure whether combining RF with a symbolic model is a good approach, especially since pysr does not support plotting. Any input?
2.1) After changing parameters such as populations and denoise, the search became extremely slow (even on HPC), and it does not seem to stop with the timeout option either.
2.2) FYI, I constrained some parameters following Lemos et al., 2022 and Wadeker et al., 2020, but left the others at their default values. I have been testing different parameters based on the "PySRRegressor reference" page, but with my limited understanding it has not been easy. What are your strategies for tuning parameters across different datasets?
Thank you in advance for your input and time.
Replies: 1 comment 5 replies
If they don't show up over multiple different runs, maybe those features aren't actually important? Feature importance computed with RF isn't a guarantee. Note that with the versions of PySR released after 2022 (after Jay's paper was done), pre-selecting important features isn't really necessary, because the crossover operation does this implicitly. So I would just pass in all the features you have, and PySR should be able to handle it.
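One way to check whether a feature really is absent across runs is to tally the variables in each run's final equation. The following is a minimal sketch: the three equation strings are made up for illustration, and it assumes the features carry PySR's default names (x0, x1, ...).

```python
import re
from collections import Counter

# Hypothetical final equations from three runs with different random states.
# The expressions are invented; only the x0, x1, ... naming follows PySR's default.
equations = [
    "x0 * x3 + sin(x3)",
    "x3 / (x5 + 1.2)",
    "exp(x0) - x3 * x3",
]


def features_used(equation):
    """Return the set of distinct feature names (x0, x1, ...) in one equation string."""
    return set(re.findall(r"\bx\d+\b", equation))


# Count in how many runs each feature appears (not how often within one run).
runs_per_feature = Counter(
    name for eq in equations for name in features_used(eq)
)

for name, count in sorted(runs_per_feature.items()):
    print(f"{name}: appears in {count} of {len(equations)} runs")
```

A feature that the RF ranks highly but that never appears in this tally, even after long searches, is a candidate for being redundant with the features that do appear.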
Not sure I follow. If a feature doesn't improve performance, it won't be included. PySR tries to find simple and accurate expressions, and simplicity means fewer features. You could write a custom loss function that requires certain features to be used (see #161 for a related question), but I'm not sure why you would want that.
Not sure I follow.
I would just use PySR by itself and run it for long enough.
You can plot PySR outputs with matplotlib, like:
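A minimal sketch of such a plot, under stated assumptions: `StandInModel` is a hypothetical placeholder so the example runs without a PySR installation, and the data are synthetic. A fitted `PySRRegressor` exposes `predict(X)` in the same way and could be passed in its place.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs on a cluster node
import matplotlib.pyplot as plt


class StandInModel:
    """Hypothetical placeholder exposing predict(X), as a fitted regressor does."""

    def predict(self, X):
        return 2.0 * X[:, 0] + np.cos(X[:, 1])


def plot_truth_vs_prediction(model, X, y):
    """Scatter predicted against observed values, with a 1:1 reference line."""
    y_pred = model.predict(X)
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(y, y_pred, s=8, alpha=0.6)
    lims = [min(y.min(), y_pred.min()), max(y.max(), y_pred.max())]
    ax.plot(lims, lims, color="k", linewidth=1)  # perfect-fit line
    ax.set_xlabel("Observed")
    ax.set_ylabel("Predicted")
    return fig, ax


# Synthetic data that the stand-in model fits up to noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + np.cos(X[:, 1]) + 0.1 * rng.normal(size=200)

fig, ax = plot_truth_vs_prediction(StandInModel(), X, y)
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk. The same pattern works for partial-dependence style plots: vary one column of `X` over a grid, hold the others fixed, and plot `model.predict` against that column.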
The
My general tips would be to avoid using redundant operators, like how […]. I run from IPython on the head node of a slurm cluster. Passing […]
Since I am running in IPython, I can just hit "q" to stop the job, tweak the hyperparameters, and then start the search again. Some things I try out to see if they help:
Very rarely I might also try tuning the mutation weights, the crossover probability, or the optimization parameters. I never use […]. For large datasets I usually just randomly sample ~1000 points or so. In case all the points matter, I might use […]. If I find the equations get very complex and I'm not sure if they are numerically precise, I might set […].
Hope this helps!
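The subsampling tip for large datasets can be sketched with NumPy alone. This is an illustration under assumptions: the helper name `subsample`, the seed, and the synthetic 50,000-row table with 9 features are all made up; only the target size of roughly 1000 points comes from the reply.

```python
import numpy as np


def subsample(X, y, n=1000, seed=0):
    """Return a random subset of at most n rows, drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = min(n, len(X))  # never ask for more rows than exist
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx], y[idx]


# Synthetic stand-in for a large table: 50,000 rows, 9 features.
rng = np.random.default_rng(42)
X_full = rng.normal(size=(50_000, 9))
y_full = X_full[:, 0] * X_full[:, 3] + rng.normal(scale=0.1, size=50_000)

X_small, y_small = subsample(X_full, y_full, n=1000)
print(X_small.shape, y_small.shape)
```

The reduced arrays would then be passed to the search in place of the full data. Fixing the seed keeps the subset reproducible between runs, which helps when comparing hyperparameter changes.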