Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize _remove_polygon_holes function for improved performance with complex geometries #1200

Closed
3 tasks done
wrignj08 opened this issue Aug 7, 2024 · 9 comments
Closed
3 tasks done

Comments

@wrignj08
Copy link
Contributor

wrignj08 commented Aug 7, 2024

Contributing guidelines

  • I understand the contributing guidelines

Documentation

  • My proposal is not addressed by the documentation or examples

Existing issues

  • Nothing similar appears in an existing issue

What problem does your feature proposal solve?

The proposed feature addresses a performance issue in the 'features._remove_polygon_holes' function. When dealing with complex outer polygons and numerous inner polygons, the current implementation can be extremely slow, sometimes taking hours to complete. By replacing the inner loop with a single difference operation using a multi-polygon, we can significantly improve the function's efficiency.

What is your proposed solution?

The proposed change is to optimize the '_remove_polygon_holes' function in OSMnx by replacing the current nested loop approach with a more efficient single-pass operation. The key modifications are:

  1. Replace the inner loop that iterates over each inner polygon with a single operation.
  2. Create a MultiPolygon from all inner polygons.
  3. Perform a single difference operation between the outer polygon and the MultiPolygon of inner polygons.

Implementation plan:

  1. Modify the existing '_remove_polygon_holes' function in the features module.
  2. Replace the inner for-loop with the following operation:
    outer = outer.difference(MultiPolygon(inner_polygons).buffer(0))
  3. Remove the individual containment checks, as the difference operation will handle this implicitly.

This change aims to significantly reduce processing time for complex polygons with many holes, while maintaining the function's current behavior and output.

What alternatives have you considered?

An alternative solution considered was to implement a spatial index (such as R-tree) for the inner polygons. This approach aimed to quickly identify which inner polygons potentially intersect with each outer polygon, theoretically reducing the number of containment checks required. However, upon testing, this method did not yield a noticeable performance improvement in practice. Additionally, implementing this solution would have introduced an extra dependency to the project.

Additional context

This example demonstrates how the current version of _remove_polygon_holes can be slow when processing complex polygons with many holes. By replacing it with the optimized version, which uses a single difference operation with a MultiPolygon, the performance is significantly improved.

Try running this it should take a very long time to complete, as it intersects a complex polygon.

features = ox.features_from_bbox(
 bbox=(23.0, 22.98, 32.98, 33.0),
 tags={"natural": "water"},
)

Now if we patch in the proposed solution it is much faster

def optimized_remove_polygon_holes(
    outer_polygons: list[Polygon],
    inner_polygons: list[Polygon],
) -> Polygon | MultiPolygon:
    """
    Subtract inner holes from outer polygons.

    This allows possible island polygons within a larger polygon's holes.

    Parameters
    ----------
    outer_polygons
        Polygons, including possible islands within a larger polygon's holes.
    inner_polygons
        Inner holes to subtract from the outer polygons that contain them.

    Returns
    -------
    geometry
    """
    if len(inner_polygons) == 0:
        # if there are no holes to remove, geom is the union of outer polygons
        geometry = unary_union(outer_polygons)
    else:
        # otherwise, remove from each outer poly each inner poly it contains
        polygons_with_holes = []
        for outer in outer_polygons:
            #  remove inner polygons from outer polygons in single pass
            outer = outer.difference(MultiPolygon(inner_polygons).buffer(0))
            polygons_with_holes.append(outer)
        geometry = unary_union(polygons_with_holes)

    # ensure returned geometry is a Polygon or MultiPolygon
    if isinstance(geometry, (Polygon, MultiPolygon)):
        return geometry
    return Polygon()

# Monkey patch the OSMnx function
ox.features._remove_polygon_holes = optimized_remove_polygon_holes


features = ox.features_from_bbox(
    bbox=(23.0, 22.98, 32.98, 33.0),
    tags={"natural": "water"},
)
@gboeing
Copy link
Owner

gboeing commented Aug 7, 2024

Thanks for this suggestion. Could you run some more comprehensive tests to 1) provide before/after timings, and 2) ensure identical geometric result across many feature geometries in several different places? For example you could just adapt the code from #1157 (comment)

@wrignj08
Copy link
Contributor Author

wrignj08 commented Aug 8, 2024

Thank you for considering this change. I've prepared a Colab notebook comparing the existing implementation with the suggested approach. The results show that:

  1. The suggested approach produces identical data to the existing implementation.
  2. On powerful systems, there's often no significant change in computation time for simpler polygons. However, on slower systems (like Colab), the performance difference becomes apparent even with basic geometry. This is evident when comparing the execution times of code cells 8 and 10 in the notebook.
  3. When complex polygons are encountered, the suggested approach is much faster.

This is illustrated in the lower half of the notebook, where we test locations containing complex lake polygons.
The notebook provides before/after timings and verifies identical geometric results across various feature geometries in several different places, as requested.

Colab example

@gboeing
Copy link
Owner

gboeing commented Aug 12, 2024

It doesn't look like your code can properly handle the island within a hole use case.

For example:

outer1 = Polygon(((0,0), (4,0), (4,4), (0,4)))
inner1 = Polygon(((1,1), (2,1), (2,3), (1,3)))
inner2 = Polygon(((2,1), (3,1), (3,3), (2,3)))
outer2 = Polygon(((1.5,1.5), (2.5,1.5), (2.5,2.5), (1.5,2.5)))
outer_polygons = [outer1, outer2]
inner_polygons = [inner1, inner2]
result1 = ox.features._remove_polygon_holes(outer_polygons, inner_polygons)
result2 = optimized_remove_polygon_holes(outer_polygons, inner_polygons)
assert result1.equals(result2)  # raises AssertionError

That's why that inner loop is there: to check which of the inner polygons are contained by which of the outer polygons. There may be an optimization opportunity here, but the current proposed implementation misses this use case.

@gboeing
Copy link
Owner

gboeing commented Aug 12, 2024

It's possible that preparing the outer polygon (to improve the performance of the contains function) and then performing the difference function just once at the end could speed things up. Something like:

def optimized_remove_polygon_holes(
    outer_polygons: list[Polygon],
    inner_polygons: list[Polygon],
) -> Polygon | MultiPolygon:
    if len(inner_polygons) == 0:
        # if there are no holes to remove, geom is the union of outer polygons
        geometry = unary_union(outer_polygons)
    else:
        # otherwise, remove from each outer poly each inner poly it contains
        polygons_with_holes = []
        for outer in outer_polygons:
            prepare(outer)
            holes = [inner for inner in inner_polygons if outer.contains(inner)]
            polygons_with_holes.append(outer.difference(unary_union(holes)))
        geometry = unary_union(polygons_with_holes)

    # ensure returned geometry is a Polygon or MultiPolygon
    if isinstance(geometry, (Polygon, MultiPolygon)):
        return geometry
    return Polygon()

Something along those lines may be worth testing for parity of timings and results.

EDIT: from my quick initial timing test, it looks like this is indeed much faster for your "lakes" example.

@gboeing
Copy link
Owner

gboeing commented Aug 12, 2024

One other minor performance gain (that will also avoid PerformanceWarnings) would be to change these lines to something like:

gdf = (
        gpd.GeoDataFrame(
            data=_process_features(elements, set(tags.keys())),
            geometry="geometry",
            crs=settings.default_crs,
        )
        .set_index(idx)
        .sort_index()
    )

@wrignj08
Copy link
Contributor Author

It doesn't look like your code can properly handle the island within a hole use case.

For example:

outer1 = Polygon(((0,0), (4,0), (4,4), (0,4)))
inner1 = Polygon(((1,1), (2,1), (2,3), (1,3)))
inner2 = Polygon(((2,1), (3,1), (3,3), (2,3)))
outer2 = Polygon(((1.5,1.5), (2.5,1.5), (2.5,2.5), (1.5,2.5)))
outer_polygons = [outer1, outer2]
inner_polygons = [inner1, inner2]
result1 = ox.features._remove_polygon_holes(outer_polygons, inner_polygons)
result2 = optimized_remove_polygon_holes(outer_polygons, inner_polygons)
assert result1.equals(result2)  # raises AssertionError

That's why that inner loop is there: to check which of the inner polygons are contained by which of the outer polygons. There may be an optimization opportunity here, but the current proposed implementation misses this use case.

Nice pickup, I believe you are correct, I had not realized this was a valid situation, sorry about that.

It's possible that preparing the outer polygon (to improve the performance of the contains function) and then performing the difference function just once at the end could speed things up. Something like:

def optimized_remove_polygon_holes(
    outer_polygons: list[Polygon],
    inner_polygons: list[Polygon],
) -> Polygon | MultiPolygon:
    if len(inner_polygons) == 0:
        # if there are no holes to remove, geom is the union of outer polygons
        geometry = unary_union(outer_polygons)
    else:
        # otherwise, remove from each outer poly each inner poly it contains
        polygons_with_holes = []
        for outer in outer_polygons:
            prepare(outer)
            holes = [inner for inner in inner_polygons if outer.contains(inner)]
            polygons_with_holes.append(outer.difference(unary_union(holes)))
        geometry = unary_union(polygons_with_holes)

    # ensure returned geometry is a Polygon or MultiPolygon
    if isinstance(geometry, (Polygon, MultiPolygon)):
        return geometry
    return Polygon()

Something along those lines may be worth testing for parity of timings and results.

EDIT: from my quick initial timing test, it looks like this is indeed much faster for your "lakes" example.

I like this solution, I have tested this locally and it appears to work well for my test locations (showing similar performance to my initial suggestion), while also producing the same output as the existing approach on all examples including your islands within a hole example.

@gboeing gboeing reopened this Aug 13, 2024
@gboeing
Copy link
Owner

gboeing commented Aug 13, 2024

Would you like to take these code snippets and open a PR?

@wrignj08
Copy link
Contributor Author

Sure, I would be happy to, I've implemented the changes as discussed and opened a new pull request (#1205) titled "Optimize _remove_polygon_holes and _create_gdf for performance".

The PR includes the following changes:

  1. Optimized the _remove_polygon_holes function using your suggested approach with shapely.prepared geometries.
  2. Updated the _create_gdf function to improve performance and avoid PerformanceWarnings, as you recommended.
  3. Added the necessary import for prepare from shapely.

I've tested these changes locally, and they appear to work well for various test cases, including the complex "lakes" example and the island-within-a-hole scenario you provided. The optimizations maintain the same output as the existing approach while significantly improving performance for complex geometries.

Regarding the mypy failure on osmnx/plot.py:856, I've noted in the PR description that this is a pre-existing issue unrelated to the changes in this PR. I used --no-verify to bypass the GitHub precommit hooks for this reason.

This is my first PR, so I apologize if I've done anything incorrectly.

@gboeing
Copy link
Owner

gboeing commented Aug 14, 2024

Closed by #1205

@gboeing gboeing closed this as completed Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants