Replies: 4 comments 2 replies
-
It depends on how complex you want the equation to be, the noise level, and how many operators you are searching over. But symbolic regression is fairly data-efficient: equations are not that expressive, so you can get away with very few datapoints. What I would basically do is a train/validation/test split. Train on the training data, then evaluate the model on the validation set. Are the predictions good or bad? If they are bad, you need more data. If validation performance is close to the performance on the training data, you probably have enough data. https://scikit-learn.org/stable/modules/cross_validation.html
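A minimal sketch of that check, assuming `np.polyfit` as a stand-in for the symbolic-regression model (swap in your real fit, e.g. a PySR model): compare the error on the training split against the error on a held-out split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny dataset: 12 noisy samples of a quadratic ground truth.
x = np.sort(rng.uniform(-2, 2, size=12))
y = 2.0 * x**2 - x + rng.normal(0, 0.1, size=x.size)

# Simple train/validation split: 8 points for fitting, 4 held out.
idx = rng.permutation(x.size)
train, val = idx[:8], idx[8:]

# Fit the "discovered" equation on the training split only.
coefs = np.polyfit(x[train], y[train], deg=2)

mse = lambda a, b: float(np.mean((a - b) ** 2))
train_mse = mse(np.polyval(coefs, x[train]), y[train])
val_mse = mse(np.polyval(coefs, x[val]), y[val])

print(train_mse, val_mse)
# If val_mse is much larger than train_mse, the fit is not trustworthy
# at this sample size and more datapoints are needed.
```

If the two errors are comparable, the sample size is probably adequate for an equation of that complexity; a large gap suggests overfitting with too few points.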
-
We are looking at something like this: 10-12 datapoints, around 2-3 features. Your package provides a loss that can be calculated on train/test, but there are no p-values, right?
-
After some work, I have realised that my question was off. Still, for a function like the above, with X = [x1,n] and Y: there are many functions that can have the same shape over this range. How do you give scientific/statistical validity to the equation?
-
In binary classification I have seen some people using Classifier Two-Sample Tests.
Binary classification is easier, as it naturally allows for hypothesis testing.
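A sketch of the idea behind a Classifier Two-Sample Test: label the two samples 0 and 1, train a classifier on one half, and test whether its held-out accuracy beats chance. Everything here is illustrative, the `c2st_pvalue` helper is hypothetical, and the deliberately simple nearest-centroid classifier stands in for the stronger models (logistic regression, neural nets) usually used in practice.

```python
import numpy as np
from math import comb

def c2st_pvalue(X, Y):
    """Toy Classifier Two-Sample Test of H0: X and Y share a distribution.

    Uses a nearest-centroid classifier fit on the first halves of X and Y,
    then an exact one-sided binomial test of held-out accuracy vs. chance.
    """
    n = min(len(X), len(Y))
    X, Y = X[:n], Y[:n]
    h = n // 2
    cx, cy = X[:h].mean(axis=0), Y[:h].mean(axis=0)  # class centroids

    # Held-out points: X-half labeled 0, Y-half labeled 1.
    test = np.concatenate([X[h:], Y[h:]])
    labels = np.concatenate([np.zeros(n - h), np.ones(n - h)])
    dx = np.linalg.norm(test - cx, axis=1)
    dy = np.linalg.norm(test - cy, axis=1)
    pred = (dy < dx).astype(float)  # predict 1 if closer to Y's centroid

    correct = int((pred == labels).sum())
    m = len(labels)
    # Exact one-sided binomial p-value for accuracy above 0.5 under H0.
    p = sum(comb(m, k) for k in range(correct, m + 1)) / 2**m
    return correct / m, p

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(1.0, 1.0, size=(200, 2))  # mean-shifted: should be detected
acc, p = c2st_pvalue(X, Y)
print(acc, p)
```

A small p-value says the classifier distinguishes the two samples better than chance, i.e. the distributions differ; that is the hypothesis-testing handle binary classification gives you.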
-
I am running an experiment that is quite expensive, so each datapoint costs both time and budget.
What is the minimum number of datapoints needed for some reliability?
Is there any documentation/references/previous discussion?