Query population using city boundaries to avoid loading all country data #30

Open
Claudio9701 opened this issue Sep 20, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@Claudio9701
Collaborator

Current behaviour:

pop_search = up.download.search_hdx_dataset(country)
pop_country = up.download.get_hdx_dataset(pop_search, pop_index) # This takes too much time

Requested feature:

pop_search = up.download.search_hdx_dataset(country)
pop_country = up.download.get_hdx_dataset(pop_search, pop_index, mask=city_limits) # This is expected to be faster

Where city_limits represents either the city boundaries as a polygon or the city's total bounds as a bounding box.
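For illustration, a rough sketch of the two mask forms (hypothetical usage: the mask keyword is the requested feature and does not exist yet; the boundary file is made up):

    # Hypothetical sketch of the requested API; `mask` is the feature being requested.
    import geopandas as gpd
    import urbanpy as up
    from shapely.geometry import box

    city_limits = gpd.read_file("city_boundary.geojson")  # any city boundary GeoDataFrame

    pop_search = up.download.search_hdx_dataset(country)

    # Option A: pass the boundary polygon itself
    pop_city = up.download.get_hdx_dataset(pop_search, pop_index, mask=city_limits.geometry.unary_union)

    # Option B: pass only the city's bounding box
    pop_city = up.download.get_hdx_dataset(pop_search, pop_index, mask=box(*city_limits.total_bounds))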

@Claudio9701 Claudio9701 added the enhancement New feature or request label Sep 20, 2023
@jeronimoluza
Contributor

Hello!
I could take on this one, if no one is doing it.

Question: the mask parameter should always receive a polygon, right?

@Claudio9701
Collaborator Author

Hello @jeronimoluza , that's awesome. Thanks for the help. Feel free to make a PR 🚀.

@jeronimoluza
Contributor

Hi @Claudio9701!
I'm unable to run the current behavior.
The environment that conda env create -f environment.yml tries to build can no longer be created because Anaconda has dropped support for python=3.6*.
I was able to install Python 3.6.15 with pyenv, but could not install the required libraries because of dependency conflicts:

ERROR: Cannot install -r requirements.txt (line 1) because these package versions have conflicting dependencies.

Do you have any advice that can help me build the required workspace?

@Claudio9701
Collaborator Author

Claudio9701 commented Aug 21, 2024

Hi @jeronimoluza, thanks for your help. Could you try the following steps to set up the workspace:

Project Setup Instructions

  1. Create a project folder

  2. Create a virtual environment inside your folder

    conda create --name urbanpyEnv
  3. Activate the environment

    conda activate urbanpyEnv
  4. Install GeoPandas

    (urbanpyEnv) $ conda install geopandas descartes
  5. Install UrbanPy (last dev version)

    (urbanpyEnv) $ pip install urbanpy==0.2.2.dev1
  6. Install Docker (Optional)

    For Windows users, make sure to run the following command in PowerShell to avoid execution errors:

    Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope CurrentUser
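As an optional sanity check after step 5, this small Python snippet (run inside the activated environment) confirms that urbanpy imports and prints the installed version:

    # Optional check: urbanpy is imported only to verify the install resolves.
    import importlib.metadata
    import urbanpy
    print(importlib.metadata.version("urbanpy"))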

@jeronimoluza
Contributor

I'm able to create the environment using Python 3.12, but not using Python 3.6 – I get a lot of dependency conflicts.
When I try up.download.hdx_fb_population("uruguay", "full") with Python 3.12, it fails because of code compatibility issues:

Traceback (most recent call last):
  File "/Users/jeronimoluza/jl_repos/urbanpy/run.py", line 4, in <module>
    pop = up.download.hdx_fb_population("uruguay", "full")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jeronimoluza/jl_repos/urbanpy/urbanpy/download/download.py", line 432, in hdx_fb_population
    population = get_hdx_dataset(resources_df, dataset_ix)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jeronimoluza/jl_repos/urbanpy/urbanpy/download/download.py", line 390, in get_hdx_dataset
    return pd.read_csv(urls)
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/urbanpy/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/urbanpy/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/urbanpy/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/urbanpy/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/urbanpy/lib/python3.12/site-packages/pandas/io/common.py", line 719, in get_handle
    if _is_binary_mode(path_or_buf, mode) and "b" not in mode:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/urbanpy/lib/python3.12/site-packages/pandas/io/common.py", line 1181, in _is_binary_mode
    return isinstance(handle, _get_binary_io_classes()) or "b" in getattr(
                                                           ^^^^^^^^^^^^^^^
TypeError: argument of type 'method' is not iterable
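For what it's worth, one way this exact TypeError can appear is when pd.read_csv receives a pandas Series of URLs instead of a single URL string: pandas ends up checking "b" in handle.mode, and Series.mode is a method. This is only an assumption about the cause here, but a minimal sketch of the pattern:

    # Assumption about the failure mode, not a confirmed diagnosis.
    import pandas as pd

    urls = pd.Series(["https://example.com/data.csv"])  # hypothetical URL

    # pd.read_csv(urls)         # raises TypeError: argument of type 'method' is not iterable
    # pd.read_csv(urls.iloc[0]) # passing a single URL string avoids it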

@jeronimoluza
Contributor

Hi @Claudio9701,

I picked this up again after a week and realized the earlier error came from a bad function call inside one of my testing scripts 🥲.
I tried using spatial intersections to keep only the (lat, long) points from HDX that fall inside the polygon/multipolygon mask, but the HDX datasets are so large that it takes more than double the original time to return the dataset.

After playing around for a bit, I found that the following approach (which avoids generating shapely geometries for the lat, long points) is a possible solution:

    # Inside get_hdx_dataset; assumes `import pandas as pd` and
    # `from geopandas import GeoDataFrame` at module level.
    urls = resources_df.loc[ids, "url"]

    if isinstance(ids, list) and len(ids) > 1:
        df = pd.concat([pd.read_csv(url) for url in urls])
    else:
        # `urls` is a single URL string when `ids` is a scalar, or a
        # length-1 Series when `ids` is a one-element list.
        df = pd.read_csv(urls if isinstance(urls, str) else urls.iloc[0])

    if mask is not None:  # `if mask:` would raise for a GeoDataFrame (ambiguous truth value)
        if isinstance(mask, GeoDataFrame):
            mask = mask.unary_union  # dissolve to a single (multi)polygon
        minx, miny, maxx, maxy = mask.bounds

        # Cheap bounding-box filter on the raw lat/long columns,
        # without building shapely geometries for every point.
        df_filtered = df[
            (df["longitude"] >= minx)
            & (df["longitude"] <= maxx)
            & (df["latitude"] >= miny)
            & (df["latitude"] <= maxy)
        ]
        return df_filtered
    else:
        return df

The only problem is that it is not as precise as the point-intersection approach – the intersection method returns the points that intersect the mask, while this "bounds" method returns all the points inside the extent (bounding box) of the mask.


What do you think?

@Claudio9701
Collaborator Author

That's awesome, thanks! I like that solution: we can do the bounding-box filter and then apply a clip to the result.
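For illustration, a rough sketch of that combination (assuming the bounds-filtered DataFrame from the snippet above and a polygon mask; names are illustrative, not final API):

    # Sketch: cheap bounding-box filter first, then an exact clip against the polygon.
    import geopandas as gpd

    points = gpd.GeoDataFrame(
        df_filtered,
        geometry=gpd.points_from_xy(df_filtered["longitude"], df_filtered["latitude"]),
        crs="EPSG:4326",
    )
    pop_city = points.clip(mask)  # keeps only the points actually inside the mask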

Another option could be to read the GeoTIFF instead of the CSV from HDX. GeoTIFFs can be queried directly (for example, masked to a geometry) when reading with rasterio.
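For reference, a minimal sketch of that route (assuming a population GeoTIFF downloaded from HDX; the file name and city_polygon are placeholders):

    # Sketch: read only the pixels covered by the city geometry.
    import rasterio
    from rasterio.mask import mask as rio_mask

    with rasterio.open("population.tif") as src:  # hypothetical HDX GeoTIFF
        data, transform = rio_mask(src, [city_polygon], crop=True)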

I would say we close this issue with the first solution (bounds + clip) and decide later whether it's worth handling GeoTIFFs. I'm working on a function to bring FABDEM raster data into urbanpy.

@jeronimoluza
Copy link
Contributor

Just implemented and tested the bounds + clip solution! Sending a pull request now.
Exciting news about FABDEM! Let me know if I can help.
