[Proposal] for "Condition" arrays #418

Open
niksirbi opened this issue Feb 14, 2025 · 8 comments
Labels
question Further information is requested

Comments

@niksirbi
Member

We propose introducing a new type of data array in movement for storing boolean outcomes of various conditions. This idea evolved from discussions between me, @sfmig and @stellaprins about region-of-interest (RoI) occupancy but can be generalized to other use cases such as ethograms or social interaction detection.


1. Region-of-Interest (RoI) Occupancy

Consider a typical poses dataset ds with:

  • N time points
  • 2 spatial dimensions (x, y)
  • 6 tracked keypoints
  • 3 individuals (mice)

The position array ds.position has shape (N, 2, 6, 3).

We have two example RoIs:

nest = PolygonOfInterest(nest_exterior_coords)
feeder = PolygonOfInterest(feeder_exterior_coords)

Using PR #413, we can check if a given point is within each polygon at each time point:

is_in_nest = nest.contains_point(ds.position)    # Shape: (time, keypoints, individuals)
is_in_feeder = feeder.contains_point(ds.position) # Shape: (time, keypoints, individuals)

Each result is a boolean array with shape (N, 6, 3), effectively collapsing the spatial dimension. If we want multiple RoIs in a single structure, we could introduce a new dimension called conditions, resulting in an array of shape (time, keypoints, individuals, conditions)—for instance, (N, 6, 3, 2) if we have two polygons (nest and feeder).

A helper function might look like this:

import xarray as xr

def compute_region_occupancy(
    position: xr.DataArray,
    regions: list[PolygonOfInterest],
) -> xr.DataArray:
    # One boolean array per region, each with the spatial dimension collapsed
    region_occupancy_arrays = [
        region.contains_point(position)
        for region in regions
    ]
    # Stack the per-region arrays along a new "conditions" dimension
    condition_array = xr.concat(
        region_occupancy_arrays, dim="conditions"
    )
    # Optionally add region names as condition labels
    return condition_array
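The optional labelling step could look roughly like the sketch below. It is not the movement implementation: `RectRegion` is a hypothetical stand-in for `PolygonOfInterest`, and the only assumption carried over is that `contains_point` collapses the `space` dimension and returns a boolean `xr.DataArray`. Because `xr.concat` puts a new dimension first, the sketch transposes so that `conditions` comes last, matching the shape described above.

```python
import numpy as np
import xarray as xr


class RectRegion:
    """Hypothetical stand-in for PolygonOfInterest (axis-aligned rectangle)."""

    def __init__(self, name, xmin, xmax, ymin, ymax):
        self.name = name
        self.bounds = (xmin, xmax, ymin, ymax)

    def contains_point(self, position: xr.DataArray) -> xr.DataArray:
        xmin, xmax, ymin, ymax = self.bounds
        x = position.sel(space="x", drop=True)
        y = position.sel(space="y", drop=True)
        return (x >= xmin) & (x <= xmax) & (y >= ymin) & (y <= ymax)


def compute_region_occupancy(position, regions):
    occupancy = xr.concat(
        [region.contains_point(position) for region in regions],
        dim="conditions",
    )
    # Label each entry along the new dimension with its region's name
    occupancy = occupancy.assign_coords(conditions=[r.name for r in regions])
    # xr.concat places the new dimension first; move it to the end
    return occupancy.transpose(..., "conditions")


# Toy data: 4 time points, 2 spatial dims, 1 keypoint, 1 individual
xy = np.array([[0.5, 0.5], [1.5, 0.5], [5.5, 5.5], [9.0, 9.0]])
position = xr.DataArray(
    xy.reshape(4, 2, 1, 1),
    dims=("time", "space", "keypoints", "individuals"),
    coords={"space": ["x", "y"]},
)
regions = [RectRegion("nest", 0, 2, 0, 2), RectRegion("feeder", 5, 6, 5, 6)]
occupancy = compute_region_occupancy(position, regions)
```

With labels in place, a single region can be selected by name, e.g. `occupancy.sel(conditions="nest")`.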

From such an occupancy array, we can:

  • Collapse the keypoints dimension based on rules (e.g., all keypoints inside an RoI).
  • Calculate time spent in each RoI.
  • Count entries/exits.
  • Compute transition metrics (e.g., nest to feeder).
  • Extract or summarize trajectories of interest (e.g. extract all trajectories going from nest to feeder that are shorter than 20 seconds and highlight them in red in a trajectory plot).
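As a rough illustration of the first three items, here is a NumPy-only sketch on a toy occupancy array for a single region and individual. The frame rate and the "all keypoints inside" rule are assumptions chosen for the example, not decisions made in this proposal.

```python
import numpy as np

fps = 10.0  # assumed frame rate (frames per second)

# Toy occupancy for one region: shape (time, keypoints) = (6, 2)
occupancy = np.array(
    [[0, 0], [1, 0], [1, 1], [1, 1], [0, 1], [0, 0]], dtype=bool
)

# Rule: the animal counts as inside only when all keypoints are inside
inside = occupancy.all(axis=1)

# Time spent inside, in seconds
time_inside = inside.sum() / fps

# Entries are False -> True steps, exits are True -> False steps
steps = np.diff(inside.astype(int))
n_entries = int((steps == 1).sum())
n_exits = int((steps == -1).sum())
```

Note that an animal already inside the region at the first frame is not counted as an entry here; whether it should be is a design choice.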

2. General Utility of Boolean "Condition" Arrays

This concept extends beyond RoI occupancy. Any scenario where we want to track a set of boolean conditions over time (and possibly across individuals or keypoints) can benefit from such arrays.

2.1 Ethograms

An ethogram shows when an individual exhibits specific behaviors. A boolean array (time, individuals, conditions) could indicate whether each individual is performing behaviors like grooming, feeding, or sleeping. This structure simplifies identifying onsets, durations, and transitions between behaviors.
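To make "onsets and durations" concrete, the sketch below extracts bouts from one boolean behaviour trace. `find_bouts` is a hypothetical helper name, and the trace is made-up data for a single individual and behaviour.

```python
import numpy as np


def find_bouts(active: np.ndarray) -> list[tuple[int, int]]:
    """Return (onset_index, duration_in_frames) for each run of True values."""
    # Pad with False on both sides so runs touching the edges are closed
    padded = np.concatenate(([False], active, [False])).astype(int)
    edges = np.diff(padded)
    onsets = np.flatnonzero(edges == 1)
    offsets = np.flatnonzero(edges == -1)  # first index after each run
    return [(int(on), int(off - on)) for on, off in zip(onsets, offsets)]


grooming = np.array([0, 1, 1, 0, 0, 1, 1, 1], dtype=bool)
bouts = find_bouts(grooming)
```

Here `bouts` holds one `(onset, duration)` pair per grooming bout, in frames; dividing by the frame rate would give seconds.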

2.2 Social Interactions

See issue #225. For instance, detecting snout-to-snout or snout-to-tail interactions between two mice can be framed as checking whether specific distance-based conditions are fulfilled at each time point. Storing these results in a (time, conditions) boolean array makes it straightforward to analyze interaction onsets and offsets.
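A distance-based condition of this kind might be sketched as follows. The threshold, the keypoint trajectories and the helper `within` are all illustrative assumptions; the discussion above does not specify how contact should be defined.

```python
import numpy as np

threshold = 2.0  # assumed contact distance, in the same units as positions

# Toy trajectories for two mice, each of shape (time, space) = (4, 2)
snout_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
snout_b = np.array([[5.0, 0.0], [2.5, 0.0], [2.0, 1.0], [9.0, 0.0]])
tail_b = np.array([[6.0, 0.0], [4.5, 0.0], [4.0, 1.0], [3.0, 1.5]])


def within(a, b, max_dist):
    """True where the Euclidean distance between a and b is below max_dist."""
    return np.linalg.norm(a - b, axis=-1) < max_dist


# Stack conditions along a trailing axis: shape (time, conditions) = (4, 2)
# Condition 0: snout-to-snout, condition 1: snout-to-tail
interactions = np.stack(
    [within(snout_a, snout_b, threshold), within(snout_a, tail_b, threshold)],
    axis=-1,
)
```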

2.3 Edge Case: Collapsing time

In some cases, the time dimension might be collapsed. For example, to determine whether a region was always in an animal's field of view throughout an entire session, we might reduce (time, individuals, conditions) to (individuals, conditions) by requiring all time points to be True.

Another example of a time-less array would be one that answers the question "which individuals stayed in box A for >= 10 seconds?".

We can also label such time-less arrays as "Condition" arrays, but this is mostly a semantic choice.
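Both time-less examples can be sketched with plain reductions over the time axis. The sketch assumes that "stayed for >= 10 seconds" means one continuous stay (an interpretation, not something stated above), and uses a shortened 3-second threshold on toy data.

```python
import numpy as np

fps = 1.0           # assumed frame rate (frames per second)
min_duration = 3.0  # seconds; a shortened stand-in for the 10 s in the text

# Toy occupancy of box A, shape (time, individuals) = (6, 2)
in_box = np.array(
    [[1, 1], [1, 0], [1, 1], [1, 1], [1, 0], [1, 1]], dtype=bool
)


def longest_run(flags):
    """Length of the longest consecutive run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Collapse time by requiring every time point to be True
always_inside = in_box.all(axis=0)

# Collapse time by asking whether the longest continuous stay is long enough
longest = np.array([longest_run(in_box[:, i]) for i in range(in_box.shape[1])])
stayed_long_enough = longest / fps >= min_duration
```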

2.4 Benefits of a unified "Conditions" framework

Framing all the above problems as "Condition" array problems allows us to write general methods for post-processing them. For example: counting the entries to a region is conceptually (and computationally) identical to counting onsets of a specific behaviour in an ethogram; time spent in a given RoI is the same as time spent engaging in a specific type of social interaction; etc.

If we enforce a consistent dimension name, i.e. conditions (open to alternative names), our methods can do something meaningful when they encounter boolean arrays with that dimension.
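As a sketch of what such a general method could look like, the hypothetical `count_onsets` below relies only on the `time` dimension name, so the same call counts region entries and behaviour onsets. The function name and the toy arrays are assumptions for illustration.

```python
import numpy as np
import xarray as xr


def count_onsets(condition_array: xr.DataArray) -> xr.DataArray:
    """Count False -> True steps along time, for every other dimension."""
    steps = condition_array.astype(int).diff("time")
    return (steps == 1).sum("time")


occupancy = xr.DataArray(
    np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=bool),
    dims=("time", "conditions"),
    coords={"conditions": ["nest", "feeder"]},
)
ethogram = xr.DataArray(
    np.array([[0, 1], [1, 1], [1, 0], [0, 1]], dtype=bool),
    dims=("time", "conditions"),
    coords={"conditions": ["grooming", "feeding"]},
)

entries = count_onsets(occupancy)          # entries into each region
behaviour_onsets = count_onsets(ethogram)  # onsets of each behaviour
```

A condition that is already True at the first time point is not counted as an onset in this sketch.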


3. Could We Have a General Function?

A specialized function like compute_region_occupancy works for RoIs, but we could consider a more general approach:

from collections.abc import Callable

import xarray as xr

def compute_condition_array(
    data: xr.DataArray,
    condition: Callable[..., xr.DataArray]
) -> xr.DataArray:
    ...

The data here could be position, velocity, or any other variable that makes sense.
The callable condition function would implement custom logic, potentially collapsing (broadcasting over) different dimensions. This could even make good use of our recently acquired "broadcasting" decorators.

While this general approach might be powerful, it could also be overkill. We might be better off starting with specific functions (e.g., region occupancy) and later explore a more generalized approach.
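For reference, a minimal version of the general function might look like the sketch below. The `**kwargs` forwarding, the boolean dtype check and the `is_moving` condition are assumptions added for illustration; they are not part of the signature proposed above.

```python
from collections.abc import Callable

import numpy as np
import xarray as xr


def compute_condition_array(
    data: xr.DataArray,
    condition: Callable[..., xr.DataArray],
    **kwargs,
) -> xr.DataArray:
    """Apply a user-supplied condition and check that the result is boolean."""
    result = condition(data, **kwargs)
    if result.dtype != bool:
        raise TypeError("condition must return a boolean DataArray")
    return result


def is_moving(velocity: xr.DataArray, min_speed: float) -> xr.DataArray:
    # Hypothetical condition: speed above a threshold, collapsing "space"
    speed = np.sqrt((velocity**2).sum("space"))
    return speed > min_speed


velocity = xr.DataArray(
    np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0]]),
    dims=("time", "space"),
    coords={"space": ["x", "y"]},
)
moving = compute_condition_array(velocity, is_moving, min_speed=1.0)
```

The wrapper adds little beyond validation, which is one way to see the "overkill" concern: most of the work lives in the condition function itself.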


Feedback Requested
We’d appreciate input from @willGraham01 and others on the design and feasibility of introducing these boolean condition arrays into movement.

Thank you for reading and for any insights or suggestions!

@niksirbi added the "question" label (Further information is requested) on Feb 14, 2025
@niksirbi
Member Author

Could also be an interesting read for you @roaldarbol, though some aspects of this discussion are very xarray-centric.

@willGraham01
Contributor

willGraham01 commented Feb 17, 2025

I think 3 is definitely do-able, though the function call signature might end up a bit more complicated.

But I do echo the possible overkill concerns. To me, if a user is willing to go through the trouble of writing their own condition_function, then they are likely going to be comfortable doing

condition_array = condition_function(data)

That said, I agree that we should still have some wrappers for the most common conditions we expect.

Furthermore, having the ability to automatically combine them into one new "condition array" would be nice and does seem like a natural goal we want to reach eventually. Plus, running multiple conditions over a single (or even multiple) DataArrays would be cool:

def compute_condition_array(
    data: Sequence[xr.DataArray],
    conditions: Sequence[Callable[..., xr.DataArray]],
) -> xr.DataArray:
    # returns a DataArray with 2 extra dimensions; 'data' and 'condition',
    # whose lengths are equal to the lengths of the data and condition arguments respectively.
    # We can provide coordinates for the two new axes by using the individual names of the data / conditions,
    # and potentially even store (pointers to them) in the returned DataArray.attrs.
    ...

The final function signature of this method would likely be a bit more complex though (since we might need to pass keyword arguments to the conditions, etc).
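One way this could be sketched is shown below. It departs from the signature above by taking mappings instead of sequences, so that the keys can serve as coordinate labels for the two new dimensions; that choice, and the toy speed data, are assumptions. Keyword arguments for a condition could be bound beforehand with `functools.partial`.

```python
from collections.abc import Callable, Mapping

import numpy as np
import xarray as xr


def compute_condition_array(
    data: Mapping[str, xr.DataArray],
    conditions: Mapping[str, Callable[[xr.DataArray], xr.DataArray]],
) -> xr.DataArray:
    """Evaluate every condition on every data array and stack the results."""
    per_data = []
    for array in data.values():
        # Stack the results of all conditions for this data array
        per_condition = xr.concat(
            [cond(array) for cond in conditions.values()], dim="condition"
        ).assign_coords({"condition": list(conditions)})
        per_data.append(per_condition)
    # Stack across data arrays and label both new dimensions by name
    return xr.concat(per_data, dim="data").assign_coords({"data": list(data)})


speed_a = xr.DataArray(np.array([0.0, 2.0, 5.0]), dims=("time",))
speed_b = xr.DataArray(np.array([4.0, 0.5, 0.0]), dims=("time",))
result = compute_condition_array(
    {"mouse_a": speed_a, "mouse_b": speed_b},
    {"slow": lambda s: s < 1.0, "fast": lambda s: s > 3.0},
)
```

This only works when every condition returns arrays that can be concatenated, i.e. with matching remaining dimensions, which is part of why the real signature would probably need to be more complex.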

Maybe a starting point is a conditions submodule that (for now) contains the proposed compute_region_occupancy method? And we try to get this to support (potentially multiple) data and multiple regions to start with?

@niksirbi
Member Author

> But I do echo the possible overkill concerns. To me, if a user is willing to go through the trouble of writing their own condition_function, then they are likely going to be comfortable doing
>
> condition_array = condition_function(data)

You are absolutely right about that! Writing the condition function is the hard bit, calling that on data to get a boolean array is trivial.

> Furthermore, having the ability to automatically combine them into one new "condition array" would be nice and does seem like a natural goal we want to reach eventually. Plus, running multiple conditions over a single (or even multiple) DataArrays would be cool:

It is definitely cool, but I wouldn't start with that.

> Maybe a starting point is a conditions submodule that (for now) contains the proposed compute_region_occupancy method? And we try to get this to support (potentially multiple) data and multiple regions to start with?

Yes, we should absolutely start with the compute_region_occupancy method, and have it support at least multiple regions (supporting multiple data arrays is less important imo).
What I'm currently unsure about is whether a function like that should go in a conditions.py submodule. From a user's perspective, I'd expect to find compute_region_occupancy() grouped together with other roi stuff, and also together with the functions that operate on the resulting boolean array (e.g. functions for computing entries/exits in regions). So, perhaps it should go under movement.roi?

@willGraham01
Contributor

> What I'm currently unsure about is whether a function like that should go in a conditions.py submodule. From a user's perspective, I'd expect to find compute_region_occupancy() grouped together with other roi stuff, and also together with the functions that operate on the resulting boolean array (e.g. functions for computing entries/exits in regions). So, perhaps it should go under movement.roi?

Under movement.roi.conditions maybe? Annoyingly compute_region_occupancy can't be a classmethod (well it could, but I argue it is a bit of an eyesore), because we want to pass in multiple regions, and IMO something like

polygon.compute_region_occupancy(data, other_regions)

doesn't make that much sense (why is this a property of one of the regions we care about?) compared to

compute_region_occupancy(data, [polygon1, polygon2, ...])

I guess we could do

class PolygonOfInterest:

    @staticmethod
    def compute_region_occupancy(data, regions) -> xr.DataArray:
        ...

but again, IMO a standalone function is cleaner.

@niksirbi
Member Author

Yes, I meant as a standalone function, not inside the class. Perhaps put it in movement/roi/conditions.py, but make sure it can be imported as from movement.roi import compute_region_occupancy? Does that make sense?

@willGraham01
Contributor

I've made a start on this over on #421. Just needs some tests added, but I'm teaching the next couple of days. @stellaprins might be able to pick this up as she's back with us Thursday

@sfmig
Contributor

sfmig commented Feb 21, 2025

@lochhh found this xarray-events package that could be useful here

This library makes it possible to extend a Dataset by introducing events based on the data. Internally it works as an accessor to xarray that provides new methods to deal with new data in the form of events and also extends the existing ones already provided by it to add compatibility with this new kind of data.

@niksirbi
Member Author

xarray-events looks very neat and is exactly what we'd need to annotate time, but I'm noticing that the repo was last updated 5 years ago...

Projects
Status: 🤔 Triage
Development

No branches or pull requests

3 participants