Can PySR handle recursive functions? #540
-
I am very impressed by the capabilities of PySR. I am just starting with it and it appears very useful to me! But I have a question: can PySR discover recursive functions? If this is not possible directly, how could I formulate the problem so that PySR could solve it, perhaps even partially or approximately? The main goal is to show that, although the Mandelbrot set looks infinitely complex to a human observer (it reveals ever new and intricate structure no matter how deep you zoom in), its structure in reality has a very low (algorithmic) complexity, because its generating (recursive) function is so simple: z_{n+1} = z_n^2 + c.

I hope someone can help me overcome this hurdle!
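For concreteness, the entire generating rule fits in a couple of lines of Python (a tiny sketch just to illustrate the claim):

```python
def mandelbrot_step(z, c):
    """One step of z_{n+1} = z_n^2 + c. A point c is in the set iff
    iterating this from z = 0 keeps |z| bounded (conventionally <= 2)."""
    return z**2 + c
```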
-
Hi @jabogithub,

Really interesting question! I spent some time on this out of interest and managed to get it to work. This lets you rediscover the Mandelbrot set's recursive relation from random samples of the Mandelbrot set.

First, here's a Julia version:

```julia
using SymbolicRegression
using MLJBase: machine, fit!, predict
using Random: Random

# Mandelbrot set membership test (the ground truth we sample from)
function in_mandelbrot(c; max_iteration = 1000)
    z = zero(c)
    iteration = 0
    while abs(z) <= 2.0 && iteration < max_iteration
        z = z^2 + c
        iteration += 1
    end
    return (abs(z) <= 2.0 && iteration == max_iteration)
end

"""
    my_loss(tree, dataset, options)

Custom loss function for sets defined by recursive functions.

- `tree`: The expression tree defining the recursive function. The first feature
  will be used to store the output of the function.
- `dataset`: The dataset to evaluate the function on, of shape (n_features, n_samples).
- `options`: The options for the symbolic regression model.
"""
function my_loss(tree, dataset::Dataset{T,L}, options::Options) where {T,L}
    num_rows = size(dataset.X, 2)
    max_iteration = 1000
    boundary = 2.0
    X = copy(dataset.X)  # Copy dataset for modification
    for i in axes(X, 2)
        X[1, i] = zero(T)  # Initialize z to 0.0 + 0.0im
    end
    in_bounds = [true for _ = 1:num_rows]
    for iteration in 1:max_iteration
        # Only work on the subset of data that is in-bounds
        sliced_X = @view X[:, in_bounds]
        sliced_in_bounds = @view in_bounds[in_bounds]
        # (The @view macro is used to avoid copying the data)
        sliced_out, completed = eval_tree_array(tree, sliced_X, options)
        if !completed
            # Penalty if incomplete evaluation due to NaNs,
            # but reduce penalty if it got further along
            return L(1000000 * (1 + max_iteration - iteration))
        end
        # Update z
        for i in axes(sliced_X, 2)
            sliced_in_bounds[i] &= abs(sliced_out[i]) <= boundary
            if sliced_in_bounds[i]
                # Only update if still in bounds
                sliced_X[1, i] = sliced_out[i]
            end
        end
        if !any(in_bounds)
            break
        end
    end
    return L(sum(i -> abs(dataset.y[i] - in_bounds[i]), 1:num_rows) / num_rows)
end

N = 300
z = [zero(ComplexF32) for _ = 1:N]
c = let RNG = Random.MersenneTwister(0)
    [rand(RNG, ComplexF32) * 2 - 1 for _ = 1:N]
end
X = (; z, c)
y = ComplexF32.(in_mandelbrot.(c))
println("Proportion of dataset in Mandelbrot set: ", Float64(sum(y) / N))

model = SRRegressor(
    binary_operators=(+, -, *),
    unary_operators=(cos,),
    niterations=100,
    loss_function=my_loss,
)
mach = machine(model, X, y)
fit!(mach)
```

We can see that this recovers the correct Mandelbrot set relation! Note that this code is written to be pedagogical, so it does not use all the Julia tricks out there. Also, it is currently quite expensive to run for some reason; perhaps that is unavoidable, or maybe there is a better way to structure things. Disabling features you don't need, such as constant optimization, might speed things up. Finally, note that the order of the input features matters: the code assumes that the first feature is `z`, which stores the running output of the recursion.

The key part here is the definition of `my_loss`: it initializes `z` to zero, repeatedly evaluates the candidate expression, feeds each output back in as the first feature, tracks which points stay within the boundary, and finally scores the predicted membership against the labels.
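To make that logic easier to follow outside of SymbolicRegression.jl, here is a minimal plain-numpy sketch of the same loop (illustrative only: `f` stands in for the compiled candidate expression, and the NaN-penalty branch is omitted):

```python
import numpy as np

def recursive_set_loss(f, c, y, max_iteration=1000, boundary=2.0):
    """Numpy sketch of `my_loss`: iterate z -> f(z, c), track which
    points stay bounded, and score against the membership labels y."""
    z = np.zeros_like(c)
    in_bounds = np.ones(len(c), dtype=bool)
    for _ in range(max_iteration):
        idx = np.flatnonzero(in_bounds)  # points still iterating
        out = f(z[idx], c[idx])          # one recursive step
        ok = np.abs(out) <= boundary
        z[idx[ok]] = out[ok]             # update only in-bounds points
        in_bounds[idx[~ok]] = False      # the rest have escaped
        if not in_bounds.any():
            break
    # Mean absolute difference between labels and predicted membership
    return np.mean(np.abs(y - in_bounds))

# The true relation should give (near-)zero loss:
# recursive_set_loss(lambda z, c: z**2 + c, c, y)
```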
The PySR equivalent would be as follows:

```python
import numpy as np
import pandas as pd
from pysr import PySRRegressor

def in_mandelbrot(c, max_iteration=1000):
    z = 0.0 + 0.0j
    iteration = 0
    while abs(z) <= 2.0 and iteration < max_iteration:
        z = z**2 + c
        iteration += 1
    return abs(z) <= 2.0 and iteration == max_iteration

full_objective = """
function my_loss(tree, dataset::Dataset{T,L}, options::Options) where {T,L}
    num_rows = size(dataset.X, 2)
    max_iteration = 1000
    boundary = 2.0
    X = copy(dataset.X)  # Copy dataset for modification
    for i in axes(X, 2)
        X[1, i] = zero(T)  # Initialize z to 0.0 + 0.0im
    end
    in_bounds = [true for _ = 1:num_rows]
    for iteration = 1:max_iteration
        # Only work on the subset of data that is in-bounds
        sliced_X = @view X[:, in_bounds]
        sliced_in_bounds = @view in_bounds[in_bounds]
        # (The @view macro is used to avoid copying the data)
        sliced_out, completed = eval_tree_array(tree, sliced_X, options)
        if !completed
            # Penalty if incomplete evaluation due to NaNs,
            # but reduce penalty if it got further along
            return L(1000000 * (1 + max_iteration - iteration))
        end
        # Update z
        for i in axes(sliced_X, 2)
            sliced_in_bounds[i] &= abs(sliced_out[i]) <= boundary
            if sliced_in_bounds[i]
                # Only update if still in bounds
                sliced_X[1, i] = sliced_out[i]
            end
        end
        if !any(in_bounds)
            break
        end
    end
    return L(sum(i -> abs(dataset.y[i] - in_bounds[i]), 1:num_rows) / num_rows)
end
"""

N = 300
rstate = np.random.RandomState(0)
z = np.zeros(N, dtype=np.complex128)
c = rstate.uniform(-1, 1, N) + 1j * rstate.uniform(-1, 1, N)
X = pd.DataFrame({"z": z, "c": c})
y = (np.vectorize(in_mandelbrot)(c)).astype(np.complex128)

model = PySRRegressor(
    binary_operators=["+", "-", "*"],
    unary_operators=["cos"],
    niterations=100,
    full_objective=full_objective,
)
model.fit(X, y)
```

[Screenshots in the original post show the search part way through, then a little bit farther along (with the correct relation now showing up), and finally once it finds it.]

Cheers,
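P.S. If you want to sanity-check the result after fitting, something along these lines should work (a sketch: `equations_` is the DataFrame of discovered expressions that PySR populates after `fit`, and the `iterate_membership` helper here is just for illustration):

```python
# Inspect the Pareto front of discovered expressions:
print(model.equations_[["complexity", "loss", "equation"]])

# Iterate the recovered rule f(z, c) = z*z + c by hand and compare the
# resulting membership predictions against the ground-truth labels.
def iterate_membership(f, c, max_iteration=1000, boundary=2.0):
    z = 0.0 + 0.0j
    for _ in range(max_iteration):
        z = f(z, c)
        if abs(z) > boundary:
            return False
    return True

recovered = np.array([iterate_membership(lambda z, c: z * z + c, ci) for ci in c])
print("Agreement with labels:", np.mean(recovered == y.real.astype(bool)))
```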
-
Thank you for your answer! Amazing work, which goes to show the power of symbolic regression in general, and more specifically the power of PySR. This also proves that PySR is able to uncover the simplicity behind seemingly overly complex data! (Although we know the ground truth beforehand in this use case, namely the recursive function generating the Mandelbrot data, it remained to be seen whether PySR has the power to uncover it, which it does, according to your effort!) That was my main goal: finding out whether PySR is up to this challenging problem. It strengthens my trust that PySR can, in principle, solve difficult problems from raw (experimental) data alone. I still have to work through your code, because it is not as straightforward as it seems, at least to me as a totally inexperienced PySR practitioner. In particular, the construction of the loss function does not look trivial at first sight! I will study your solution and of course try it out next week. Very exciting, I must say. Perhaps I will come back to you here in this space.