Running SR on a distributed cluster #644

chrhck · 2024-06-07T09:21:45Z

We have been using SymbolicRegression.jl for our research but have hit a point where the equation search takes too long on a single compute node (with ~16 cores or so). We're now looking into using our distributed computing resources (we have both an MPI cluster and a Slurm HPC cluster available) and were wondering if you have used SR in such an environment before, or might know someone who has. We're hoping not to have to reinvent the wheel by writing the entire orchestration code ourselves. (Copied from a Slack DM.)

MilesCranmer · 2024-06-07T10:27:35Z

Maintainer

Hey Christian,

Yes, I use SR like this in my own work. Basically if you just set parallelism=:multiprocessing then you can either:

1. Pass the process objects explicitly to the procs parameter (whether those procs are on the same node, or multiple nodes, etc.)
2. Or set, for example, numprocs=num_nodes * num_cores and addprocs_function=addprocs_slurm, and SR.jl will try its best to set it up for you

I usually do the 2nd out of convenience.

Just be sure to launch SR only once, on a single node, from a single task in the Slurm job. ClusterManagers.jl will run srun internally for you.
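
For readers who drive the search from Python through PySR rather than from Julia, the second option might look roughly like the sketch below. It is not taken from this thread. It assumes that PySRRegressor accepts the keyword arguments parallelism, procs, and cluster_manager (the names used in recent PySR releases; older releases used multithreading=False instead of parallelism, so check your version), and that the Slurm variables SLURM_JOB_NUM_NODES and SLURM_CPUS_ON_NODE are set in the job environment. The helper name slurm_search_kwargs is hypothetical.

```python
import os


def slurm_search_kwargs(env=None):
    """Build keyword arguments for a multi-node search (hypothetical helper).

    Reads the node and per-node CPU counts from Slurm environment variables
    and falls back to 1 when they are missing (e.g. outside a Slurm job).
    """
    env = os.environ if env is None else env
    num_nodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    cores_per_node = int(env.get("SLURM_CPUS_ON_NODE", "1"))
    return {
        # Assumed PySRRegressor parameter names; verify against your version.
        "parallelism": "multiprocessing",
        "procs": num_nodes * cores_per_node,
        "cluster_manager": "slurm",
    }


if __name__ == "__main__":
    kwargs = slurm_search_kwargs()
    print(kwargs)
    # Run this script once, from a single task of the Slurm job:
    # from pysr import PySRRegressor
    # model = PySRRegressor(**kwargs)
    # model.fit(X, y)
```

As noted above, the script should be started only once per job; the worker processes on the other nodes are launched by the cluster manager, not by running the script on every node.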

4 replies

chrhck Jun 17, 2024
Author

Thanks Miles. A follow-up question: we've been struggling to scale up SR to our cluster. We see very long idle times on the worker nodes at startup (up to a few hours) and after loading a previously saved state (up to 17h).

We're running this on a Slurm cluster with 64 nodes (20 CPUs each).
Our dataset is very large (100k - 5000k datapoints). We've been experimenting with the batch size, the number of populations, and ncycles_per_iteration, but so far have not found a combination that reduces the idle times. Any suggestions on what we could do to improve our resource utilization?

MilesCranmer Jun 17, 2024
Maintainer

> Our dataset is very large (100k - 5000k datapoints)

This is much larger than is typically needed. Have you tried downsampling it? Symbolic regression is much less likely to overfit than other machine learning algorithms, as it is not as expressive, so you can work with far fewer datapoints.

In fact, the highest I will go in my own applications is only ~10k points. If my true dataset is 100k points, I will just randomly subsample it to 10k. (And you can try importance sampling if random subsampling does not capture all the types of behavior observed in your dataset.)
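
A minimal sketch of the random subsampling step, using only the standard library. The helper name subsample_indices and the fixed seed are illustrative assumptions; the returned indices can be used to select rows from whatever array type holds the data.

```python
import random


def subsample_indices(n_rows, n_keep=10_000, seed=0):
    """Return sorted row indices for a uniform random subsample (no replacement)."""
    if n_rows <= n_keep:
        # Nothing to drop: keep every row.
        return list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return sorted(rng.sample(range(n_rows), n_keep))


idx = subsample_indices(100_000)
# With NumPy arrays this would be used as: X_small, y_small = X[idx], y[idx]
```

Uniform sampling treats every row equally, so rare regimes in the data may end up with very few points; that is the case where the importance sampling mentioned above would be worth trying.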

chrhck Jun 17, 2024
Author

Our problem is 5D and the data are pretty noisy - we can revisit subsampling the data again, but previously we found that we get better results with the full dataset. If we have to remain with the full dataset - is there anything we can do to improve performance?

MilesCranmer Jun 17, 2024
Maintainer

If the data is noisy, one option that I've used in the past is to first fit a flexible ML model to the data with an MSE loss, which acts to average out the noise. Usually I will use a Gaussian Process with some flexible kernel, like the built-in one here:

PySR/pysr/denoising.py

Lines 9 to 28 in 89e991d

    
           def denoise( 
        
               X: ndarray, 
        
               y: ndarray, 
        
               Xresampled: Optional[ndarray] = None, 
        
               random_state: Optional[np.random.RandomState] = None, 
        
           ) -> Tuple[ndarray, ndarray]: 
        
               """Denoise the dataset using a Gaussian process.""" 
        
               from sklearn.gaussian_process import GaussianProcessRegressor 
        
               from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel 
        
               gp_kernel = RBF(np.ones(X.shape[1])) + WhiteKernel(1e-1) + ConstantKernel() 
        
               gpr = GaussianProcessRegressor( 
        
                   kernel=gp_kernel, n_restarts_optimizer=50, random_state=random_state 
        
               ) 
        
               gpr.fit(X, y) 
        
               if Xresampled is not None: 
        
                   return Xresampled, cast(ndarray, gpr.predict(Xresampled)) 
        
               return X, cast(ndarray, gpr.predict(X))

However, GPs scale poorly with the number of points (exact GP regression costs O(n³) time). Even 100k will be horrendously slow, so you might want to use a multi-layer perceptron or XGBoost¹ instead.

Fit those to the data first, then generate and evaluate on some grid of inputs. The result should have less noise (though there might be biases introduced here, so do this with care).

Then, fit PySR on those "noiseless" samples. With only 5 features you probably can get a decent result with <1k points.

You can then evaluate the resulting expression on the larger dataset to validate it.

¹ Speaking of which, maybe PySR's denoise option should default to XGBoost when there is too much data...
