
Data Service Abstraction to manage RAWR tiles #245

Closed
zerebubuth opened this issue Oct 4, 2017 · 5 comments

@zerebubuth
Member

@rmarianski said in a comment:

About the data fetcher itself, it looks like we're going to be creating a new one for each tile pyramid. Are we going to introduce a higher level concept, like a data service abstraction, that manages this given any tile? Who will fetch the data from s3 and hand that to the data fetcher?

Yup, that sounds like a good idea. I was thinking that it would parallel the database pool, but I think it would need to be different. Database connections can't be shared between threads/processes, so the pool would need to hand out one each, whereas the RAWR data fetcher (once indexed) is read-only and thread-safe.

Would this be the kind of interface you were thinking of?

class RAWRFetcherFactory(object):
    def __init__(self, config):
        self.min_z, self.max_z, self.storage = parse_rawr_config(config)
        self.layers, self.label_placement_layers = parse_layer_config(config)

    def __call__(self, tile):
        # fail fast before hitting S3 if the tile isn't a pyramid root
        assert tile.z == self.min_z
        tables = self.download_from_s3(tile)
        tile_pyramid = TilePyramid(tile.z, tile.x, tile.y, self.max_z)
        return make_rawr_data_fetcher(self.layers, tables, tile_pyramid,
                                      self.source, self.label_placement_layers)

And there'd be a similar DatabaseFetcherFactory (which ignores the tile parameter), and a third that switches between the two depending on tile.z?
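For concreteness, the third, zoom-switching factory might look roughly like this. This is only a sketch: SwitchingFetcherFactory and the stand-in factories are hypothetical names, not code from the repository.

```python
from collections import namedtuple


class SwitchingFetcherFactory(object):
    """Delegate to the RAWR factory at or above a cut-over zoom, and to
    the database factory below it."""

    def __init__(self, db_factory, rawr_factory, rawr_min_z):
        self.db_factory = db_factory
        self.rawr_factory = rawr_factory
        self.rawr_min_z = rawr_min_z

    def __call__(self, tile):
        if tile.z >= self.rawr_min_z:
            return self.rawr_factory(tile)
        return self.db_factory(tile)


# quick illustration with stand-in factories
Tile = namedtuple('Tile', 'z x y')
pick = SwitchingFetcherFactory(lambda t: 'db', lambda t: 'rawr', rawr_min_z=10)
# pick(Tile(9, 0, 0)) -> 'db'; pick(Tile(12, 0, 0)) -> 'rawr'
```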

@rmarianski
Member

Would this be the kind of interface you were thinking of?

Yup, pretty much exactly! It does assume that there's knowledge of the tile pyramid upstream, but maybe that's ok? It also implies knowing that a tile represents the whole pyramid at some zooms but not at others, though.
Alternatively, we could modify the signature to take a list of tiles, and just assert in the RAWR implementation for z10+ that they all share the same parent.
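The shared-parent assertion is cheap to compute, since the z10 ancestor of a tile is just its x/y coordinates shifted down by the zoom difference. A minimal sketch (Tile and both helper names are illustrative, not from the codebase):

```python
from collections import namedtuple

Tile = namedtuple('Tile', 'z x y')


def parent_at(tile, parent_z):
    """Coordinate of the ancestor of `tile` at zoom `parent_z`."""
    assert tile.z >= parent_z
    shift = tile.z - parent_z
    return Tile(parent_z, tile.x >> shift, tile.y >> shift)


def common_parent(tiles, parent_z=10):
    """Shared parent of all tiles at `parent_z`, or AssertionError if
    the tiles span more than one RAWR pyramid."""
    parents = set(parent_at(t, parent_z) for t in tiles)
    assert len(parents) == 1, 'tiles span multiple RAWR pyramids'
    return parents.pop()
```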

Database connections can't be shared between threads/processes, so the pool would need to hand out one each, whereas the RAWR data fetcher (once indexed) is read-only and thread-safe.

Were you thinking of creating the data for the tile pyramid (the data fetcher part) concurrently?

@zerebubuth
Member Author

Were you thinking of creating the data for the tile pyramid (the data fetcher part) concurrently?

I was just thinking that the same object could suffice for all the worker threads, and remove the need to download and process it multiple times.

Reading through the source code, I'm wondering if perhaps it makes sense to hide the tile_pyramid knowledge internally within the RAWR DataFetcher. Currently, the workers look like they share an instance of the DataFetcher via the DataFetch object, and the job handling is "upstream" of that. Rather than have a major refactor, how would you feel about making the RAWR DataFetcher contain a cached pool of tile pyramids (and indexes)?

That would mean loading the RAWR tiles on-the-fly and handling the locking manually, which adds complication. But might be worth it, rather than moving big chunks of code around.

What do you think?
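The cached pool described above might look roughly like this. All names here are hypothetical, and a simple FIFO eviction stands in for a real MRU scheme; it shows the locking shape, where each indexed pyramid is built once and then shared read-only across worker threads.

```python
import threading


class CachedPyramidPool(object):
    """Cache of indexed RAWR pyramids, keyed by parent coordinate."""

    def __init__(self, load_and_index, max_entries=16):
        self.load_and_index = load_and_index  # downloads + indexes one pyramid
        self.max_entries = max_entries
        self.cache = {}    # parent coord -> indexed, read-only pyramid
        self.order = []    # FIFO eviction; a real MRU scheme needs more care
        self.lock = threading.Lock()

    def get(self, parent_coord):
        with self.lock:
            if parent_coord in self.cache:
                return self.cache[parent_coord]
        # build outside the lock so other workers aren't blocked on S3;
        # two threads may race to build the same pyramid, in which case
        # the first result wins and the second is discarded.
        pyramid = self.load_and_index(parent_coord)
        with self.lock:
            if parent_coord not in self.cache:
                self.cache[parent_coord] = pyramid
                self.order.append(parent_coord)
                if len(self.order) > self.max_entries:
                    del self.cache[self.order.pop(0)]
            return self.cache[parent_coord]
```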

@rmarianski
Member

Hhhmm, or maybe we introduce a new step in the chain that gets lifted up into a top level abstraction? Even with the existing postgresql query fetch, we have a step that converts the result of the query into data that is suitable for processing, ie from per table results into per layer groupings. We could mirror that with the rawr tiles pipeline, where we have a step that converts the results of the query, ie the payload from s3, into individual tiles for processing.

It seems like this conversion would need to grow an awareness that it gets passed a list of tile coordinates, but that seems like it's probably ok to me. On the plus side, it already happens in "processing space" in terms of the system.

It also feels like we have the different parts of the system having the right concerns this way, ie the data fetch just knows that it needs to get data for a particular region, and then we have downstream processing that extracts out smaller portions of that for data processing. This might prove useful if we ever want to add buffering back into the mix (maybe, hard to tell without having working code for it already).

More specifically, the data fetcher step for rawr tiles could just yield the s3 specific data. And then this new step, which we could call something like DataTransformer, DataConverter, or DataProcessingPreparer, is what would happen first in the processing chain. It would take the result of the data fetch, and then be responsible for creating the data shape that processing understands. For rawr tiles, this step is what would create the index first, and then extract out the individual tiles for processing.
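The transformer step described above could be sketched like this, assuming the fetcher just yields the raw S3 payload. RawrDataTransformer and both injected helpers are hypothetical names chosen for illustration:

```python
class RawrDataTransformer(object):
    """First step of the processing chain: turn a raw S3 payload into
    per-tile data shapes that processing understands."""

    def __init__(self, build_index, extract_tile):
        self.build_index = build_index      # payload -> indexed pyramid
        self.extract_tile = extract_tile    # (index, coord) -> per-layer data

    def __call__(self, payload, coords):
        # index once per fetch, then pull out each requested tile
        index = self.build_index(payload)
        for coord in coords:
            yield coord, self.extract_tile(index, coord)
```

A DataTransformer for the postgresql path would have the same shape, converting per-table query results into per-layer groupings instead.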

Anyway, I think it's worth doing the code refactoring, because the existing data conversion step was more like temporary glue to prepare for rawr tiles, and rather than add a new layer of that to hold everything together, with some caching mru scheme that might have its own set of issues, altering the pipeline to handle it more directly seems like it would be a better option longer term.

@nvkelso
Member

nvkelso commented Mar 13, 2018

@zerebubuth Is this complete as part of the RAWR tiles sprint from 2017Q4?

@zerebubuth
Member Author

Yes, close enough. In query/__init__.py there's an abstraction to create data fetchers from the database or RAWR tiles on S3. It's a bit hacky at the moment, but it'll do.
