# Split public and internal structure #68

Merged · 21 commits · May 17, 2024

## Commits
- `f9cdc3d` Move internal structure to a separate place than the public API (golmschenk, Apr 30, 2024)
- `5c435e3` Add required keyword parameters (golmschenk, May 2, 2024)
- `0c570c1` Add qodana (golmschenk, May 2, 2024)
- `d8f48fb` Add the default transforms to the public API and update the docs to u… (golmschenk, May 2, 2024)
- `ff5bf0f` Add the available transforms to the public API (golmschenk, May 2, 2024)
- `b3a97d9` Remove ramjet import in example scripts to the experimental package (golmschenk, May 2, 2024)
- `1321926` Add inputs of post_injection_transforms (golmschenk, May 3, 2024)
- `4251595` Add a docstring to all public API methods (golmschenk, May 6, 2024)
- `6c23e84` Add note about randomization (golmschenk, May 6, 2024)
- `8406a76` Add an infer case (golmschenk, May 6, 2024)
- `4dc9e1f` Add the infinite_datasets_test_session (golmschenk, May 6, 2024)
- `aa298a8` Make the default TESS SPOC light curve processed length be 3500 (golmschenk, May 8, 2024)
- `95a5b90` Remove examples from the main project repository to move them to a se… (golmschenk, May 8, 2024)
- `d8945e8` Add binary AUROC metric (golmschenk, May 9, 2024)
- `a1accd5` Remove torchmetrics requirement (golmschenk, May 9, 2024)
- `98c8b8c` Switch back to torchmetrics (golmschenk, May 10, 2024)
- `60bff1d` Add better name casing for acronyms (golmschenk, May 14, 2024)
- `250f2a6` Move transforms to transforms file (golmschenk, May 15, 2024)
- `63fcfc5` Remove randomization from make_uniform_length and leave that to a sep… (golmschenk, May 15, 2024)
- `176d9b0` Fix tutorial code (golmschenk, May 15, 2024)
- `117d200` Switch the tutorial to use the example project repository (golmschenk, May 17, 2024)
4 changes: 4 additions & 0 deletions docs/source/conf.py
@@ -26,6 +26,10 @@

```python
templates_path = ["_templates"]
exclude_patterns = []
source_suffix = [".rst", ".md"]
autodoc_class_signature = 'separated'
autodoc_default_options = {
    'special-members': None,
}
```

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
21 changes: 19 additions & 2 deletions docs/source/reference_index.md
@@ -1,6 +1,23 @@
# Reference

```{eval-rst}
.. autoclass:: qusi.data.LightCurve
:members: new
.. autoclass:: qusi.data.LightCurveCollection
:members: new
.. autoclass:: qusi.data.LightCurveDataset
:members: new
.. autoclass:: qusi.data.LightCurveObservationCollection
:members: new
.. autoclass:: qusi.data.FiniteStandardLightCurveDataset
:members: new
.. autoclass:: qusi.data.FiniteStandardLightCurveObservationDataset
:members: new
.. autoclass:: qusi.model.Hadryss
:members: new
.. autofunction:: qusi.session.get_device
.. autofunction:: qusi.session.infer_session
.. autofunction:: qusi.session.train_session
.. autoclass:: qusi.session.TrainHyperparameterConfiguration
:members: new
```
@@ -21,7 +21,7 @@ def get_positive_train_paths():

This function creates a `Path` object for the directory at `data/spoc_transit_experiment/train/positives`, obtains all the files ending with the `.fits` extension, puts them in a list, and returns that list. In particular, `qusi` expects a function that takes no input parameters and outputs a list of `Path`s.
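A minimal sketch of such a function might look like the following (the exact body in the example project may differ):

```python
from pathlib import Path


def get_positive_train_paths():
    # Collect every FITS file in the positive training data directory.
    return list(Path('data/spoc_transit_experiment/train/positives').glob('*.fits'))
```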

In our example code, we've split the data based on whether it's train, validation, or test data, and based on whether it's positive or negative data. We provide a function for each of the six permutations, each almost identical to the one above. You can see the above function and the five similar ones near the top of `scripts/transit_dataset.py`.

`qusi` is flexible in how the paths are provided, and this construction of having a separate function for each type of data is certainly not the only way of approaching this. Depending on your task, another option might serve better. In another tutorial, we will explore a few example alternatives. However, to better understand those alternatives, it's first useful to see the rest of this dataset construction.

@@ -35,7 +35,7 @@

```python
def load_times_and_fluxes_from_path(path):
    ...
    return light_curve.times, light_curve.fluxes
```

This uses a built-in class in `qusi` designed for loading light curves from TESS mission FITS files. The important thing is that your function takes a single `Path` object as input and returns two values: a NumPy array of the times and a NumPy array of the fluxes of your light curve. These `Path` objects will be ones returned from the functions in the previous section, but you can write any code you need to get from a `Path` to the two arrays that represent times and fluxes. For example, if your file is a simple CSV file, you could use Pandas to load it and extract the time column and the flux column as two arrays, which are then returned at the end of the function. You will see the above function in `scripts/transit_dataset.py`.
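For the CSV case just mentioned, a loader might look like the sketch below. The function name and the `time`/`flux` column names are hypothetical and would need to match your own files:

```python
import pandas as pd


def load_times_and_fluxes_from_csv_path(path):
    # Hypothetical CSV layout: a `time` column and a `flux` column.
    data_frame = pd.read_csv(path)
    return data_frame['time'].to_numpy(), data_frame['flux'].to_numpy()
```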

## Creating a function to provide a label for the data

@@ -49,44 +49,34 @@

```python
def negative_label_function(path):
    return 0
```

Note that `qusi` expects the label functions to take a `Path` object as input, even if it goes unused, because this allows for more flexible configurations. For example, in a different situation, the data might not be split into positive and negative directories; instead, the label data might be contained within the user's data file itself. In other cases, the label can be something other than 0 and 1. The label is whatever the NN is attempting to predict for the input light curve, but for our binary classification case, 0 and 1 are what we want to use. Once again, you can see these functions in `scripts/transit_dataset.py`.
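As an illustration of why the `Path` input enables flexibility, a single hypothetical label function could derive the label from the directory structure rather than using separate positive and negative functions (this is not the example project's approach, just a sketch):

```python
from pathlib import Path


def label_from_path(path):
    # Hypothetical alternative: infer the label from the parent directory name.
    return 1 if path.parent.name == 'positives' else 0
```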

## Creating a light curve collection

Now we're going to join the various functions we've just defined into `LightCurveObservationCollection`s. For the case of positive train light curves, this looks like:

```python
positive_train_light_curve_collection = LightCurveObservationCollection.new(
    get_paths_function=get_positive_train_paths,
    load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
    load_label_from_path_function=positive_label_function)
```

This defines a collection of labeled light curves where `qusi` knows how to obtain the paths, how to load the times and fluxes of the light curves, and how to load the labels. The `LightCurveObservationCollection.new(...)` function takes in the three pieces we just built. Note that you pass in the functions themselves, not the output of the functions: for the `get_paths_function` parameter, we pass `get_positive_train_paths`, not `get_positive_train_paths()` (notice the difference in parentheses). `qusi` will call these functions internally. However, the above bit of code does not appear by itself in `scripts/transit_dataset.py` as the rest of the code in this tutorial does, because `qusi` doesn't use this collection by itself; it uses it as part of a dataset. We will explain why there's this extra layer in a moment.
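The distinction between passing a function and passing its result can be sketched outside `qusi` with a toy stand-in (not the library's actual internals):

```python
class LazyPathSource:
    # Toy stand-in: store the function itself and call it only when needed,
    # mirroring how a collection defers calling its get_paths_function.
    def __init__(self, get_paths_function):
        self.get_paths_function = get_paths_function

    def paths(self):
        return self.get_paths_function()


def get_example_paths():
    return ['a.fits', 'b.fits']


source = LazyPathSource(get_example_paths)  # Note: no parentheses after the name.
```

Because the function is stored rather than called immediately, the collection can re-list the paths whenever it needs to, rather than being locked to a snapshot taken at construction time.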

## Creating a dataset

Finally, we build the dataset `qusi` uses to train the network. First, we'll take a look and then unpack it:

```python
def get_transit_train_dataset():
    positive_train_light_curve_collection = LightCurveObservationCollection.new(
        get_paths_function=get_positive_train_paths,
        load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
        load_label_from_path_function=positive_label_function)
    negative_train_light_curve_collection = LightCurveObservationCollection.new(
        get_paths_function=get_negative_train_paths,
        load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
        load_label_from_path_function=negative_label_function)
    train_light_curve_dataset = LightCurveDataset.new(
        standard_light_curve_collections=[positive_train_light_curve_collection,
                                          negative_train_light_curve_collection])
    return train_light_curve_dataset
```

This is the function which generates the training dataset we called in the {doc}`/tutorials/basic_transit_identification_with_prebuilt_components` tutorial. Its parts are as follows. First, we create the `positive_train_light_curve_collection`. This is exactly what we saw in the previous section. Next, we create a `negative_train_light_curve_collection`, which is almost identical to its positive counterpart, except we pass `get_negative_train_paths` and `negative_label_function` instead of the positive versions. Then the `train_light_curve_dataset = LightCurveDataset.new(...)` line creates a `qusi` dataset built from these two collections. The collections are kept separate because `LightCurveDataset` has several mechanisms working under the hood. Notably for this case, `LightCurveDataset` will balance the two light curve collections. We know of far more light curves without planet transits than with them; in the real-world case, it's at least thousands of times more. But for a NN, it's usually useful during training to show equal amounts of positives and negatives, and `LightCurveDataset` does this for us. You may have also noticed that we passed these collections in as the `standard_light_curve_collections` parameter. `LightCurveDataset` also allows for passing different types of collections. Notably, collections can be passed such that light curves from one collection are injected into another, which is useful for injecting synthetic signals into real telescope data. However, we'll save the injection options for another tutorial.
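The balancing idea can be illustrated with a toy generator (this is a conceptual sketch, not `qusi`'s actual mechanism):

```python
import itertools


def balanced_interleave(positives, negatives):
    # Cycle both collections and alternate between them, so the consumer
    # sees positives and negatives in equal proportion even when one
    # collection is far larger than the other.
    for positive, negative in zip(itertools.cycle(positives), itertools.cycle(negatives)):
        yield positive
        yield negative
```

Here the smaller collection simply repeats more often, which is one simple way to present a balanced stream from imbalanced sources.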

You can see the above `get_transit_train_dataset` dataset creation function in the `scripts/transit_dataset.py` file. The only parts of that file we haven't yet looked at in detail are the `get_transit_validation_dataset` and `get_transit_finite_test_dataset` functions. However, these are nearly identical to `get_transit_train_dataset`, except they use the validation and test path-obtaining functions instead of the train ones.

## Adjusting this for your own binary classification task
