Cross-validation error as PySR objective function? #659

sbalasbas3 · 2024-07-04T08:42:04Z

sbalasbas3
Jul 4, 2024

Hello. As the title suggests, I am trying to write a custom objective function based on the cross-validation error. However, I am not well-versed in Julia. This is my attempt at a cross-validation error-based objective function:

from pysr import PySRRegressor
import numpy as np

# Writing an objective function based on 60/40 cross-validation error
objective = """
using SymbolicRegression

# Function to perform 60/40 cross-validation
function cross_validation_objective(tree, dataset::Dataset{T, L}, options)::L where {T, L}
    n = dataset.n
    train_size = Int(round(0.6 * n))
    train_idx = 1:train_size
    valid_idx = (train_size + 1):n

    train_idx = filter(x -> x <= n, train_idx)
    valid_idx = filter(x -> x <= n, valid_idx)

    train_data = Dataset(dataset.X[train_idx, :], dataset.y[train_idx])
    valid_data = Dataset(dataset.X[valid_idx, :], dataset.y[valid_idx])

    prediction_valid, flag_valid = eval_tree_array(tree, valid_data.X, options)

    if !flag_valid
        error = sum((prediction_valid .- valid_data.y) .^ 2) / length(valid_idx)
        return error
    else
        return L(Inf)
    end
end
"""

model = PySRRegressor(
    niterations=100,
    populations=20,
    loss_function=objective,
    binary_operators=["+", "-", "*", "/"],
    verbosity=1,
)

# Example dataset
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.rand(100)     # 100 target values

# Use PySR for model fitting
model.fit(X, y)

I get a Julia error when I run this code. It appears to be related to accessing invalid index ranges. Is there existing code for some kind of cross-validation error available here?

Answered by MilesCranmer

MilesCranmer · 2024-07-04T08:51:30Z

MilesCranmer
Jul 4, 2024
Maintainer

Maybe just do it from the Python side? It should be faster too, since you then do the split only once rather than on every single evaluation:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

model.fit(X_train, y_train)

train_loss = np.mean(np.square(y_train - model.predict(X_train, index=-1)))
test_loss = np.mean(np.square(y_test - model.predict(X_test, index=-1)))
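If a k-fold estimate is wanted rather than a single 60/40 split, the same Python-side idea can be repeated once per fold. A minimal sketch using only the standard library; `kfold_indices` is a hypothetical helper introduced here for illustration, not part of PySR or scikit-learn:

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, valid) lists of row indices for each of n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid


# Hypothetical usage with the X, y arrays and model settings from the question:
# fold_losses = []
# for train, valid in kfold_indices(len(y), n_folds=5):
#     model = PySRRegressor(niterations=100, binary_operators=["+", "-", "*", "/"])
#     model.fit(X[train], y[train])
#     fold_losses.append(np.mean(np.square(y[valid] - model.predict(X[valid]))))
# cv_error = np.mean(fold_losses)
```

Each fold runs a separate search, so the total cost grows roughly linearly with the number of folds.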

(And, by the way, Julia arrays are column-major and SymbolicRegression.jl stores X with shape (features, rows), so you would write the feature first, row second, like X[:, train_idx])
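For anyone who still wants the Julia-side objective, here is a sketch of how the function from the question might look with the indexing note applied. It follows the custom-objective pattern from the PySR documentation, but it is an untested assumption rather than an official recipe, and the name `holdout_objective` is made up here. Two other changes are assumptions about what was intended: the flag returned by `eval_tree_array` is treated as "evaluation completed", so `Inf` is returned when it is false (the question had this inverted), and the redundant `filter` calls and unused training `Dataset` are dropped.

```python
# Julia source for a hold-out objective, kept as a Python string so it can be
# passed to PySRRegressor. This is a sketch, not a verified implementation.
objective = """
function holdout_objective(tree, dataset::Dataset{T,L}, options)::L where {T,L}
    n = dataset.n
    train_size = round(Int, 0.6 * n)
    valid_idx = (train_size + 1):n

    # dataset.X has shape (features, rows): select validation rows by column.
    X_valid = dataset.X[:, valid_idx]
    y_valid = dataset.y[valid_idx]

    prediction, completed = eval_tree_array(tree, X_valid, options)

    # `completed` is false when evaluation failed (e.g. NaN or Inf), so penalise it.
    if !completed
        return L(Inf)
    end
    return sum((prediction .- y_valid) .^ 2) / length(valid_idx)
end
"""

# Hypothetical usage, mirroring the question:
# model = PySRRegressor(loss_function=objective, binary_operators=["+", "-", "*", "/"])
```

Note that, as in the original attempt, the first 60% of the rows is never used by this objective: candidates are scored only on the last 40%, so the search effectively fits that slice. That is one more reason the Python-side split is the simpler route.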

1 reply

sbalasbas3 Jul 16, 2024
Author

Thank you very much for this response. Now it is working as expected.
