-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-2-overview.Rmd
138 lines (79 loc) · 12.3 KB
/
01-2-overview.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
# Hierarchical spatial data {#hierarchical}
There is a strong relationship between the hierarchical forms of data structure used in geo-spatial analysis and in the grammars of data analysis and graphics. Currently the translation between geo-spatial forms and the grammars is disjointed and sometimes awkward. Here we present a way of classifying and structuring geo-spatial forms that is consistent and complementary to those used in the grammars, and allows the technical and specialist requirements of geo-spatial to be modularized appropriately, while allowing for more general and extensible model development.
GIS-based vector data provides a complex set of data structures, and the [modern R form](https://cran.r-project.org/package=sf) is compliant with the [simple features standard](https://en.wikipedia.org/wiki/Simple_Features). In R these are stored in *nested lists* where the physical structure matches the logical structures in the data.
R can deal with a much wider variety of complex data types than the simple features standard, with key examples in the `ggplot2` family, `igraph` networks, `rgl` indexed mesh models, `rhdf5` and `ncdf4` arrays, and many "tracking" models such as `adehabitatLT`, `trajectories` and `move`. There is a need for a common language for translating and storing data between these specialized forms, and while no single file format or class is sufficiently flexible there are well-established database techniques for dealing with arbitrarily complex models. From this point of view, traditional geo-spatial forms are a special case of a more general capability - very important for compliance with external systems, but as a domain-specific optimization suited to a particular scope.
A `Spatial` object (`sp`) is a complex nested list of coordinates stored in matrices with S4 classes, with each top-level item linked by ID to rows in a data frame. A newer form to replace `Spatial` that handles more kinds of data sets is simple features in `sf`. Here the coordinates are in nested lists of matrices or vectors. The `sf` package stores each object with its data frame row rather than by a remote link ID, and the package is compliant with the simple features standard which removes some ambiguities that existed in `sp`. The simple features approach is more aligned with the tidyverse principles, but does this by way of a formal API to switch from lists to data frames in pre-specified ways.
Nested lists of data can be inverted and stored in an *inside-out* way as normal data frames. Various packages provide conversions between nested and table forms, but there is no overall approach that works in the general case and no categorization of the common conversions that are most useful. Worse, there are fragmented approaches spread across dozens of implementations without one central framework or vision. There is room for extension and improvement to the handling of data structures with conversion tools, and ideally a central form-converter framework. Here we focus on GIS vector data to show the limitations of the nested structures, and how being able to flip between representations provides much added power and provides a clear pathway forward for many complex problems that don't fit in the traditional GIS model.
We can define an *always data frame* analog to these data structures in two quite different ways.
* nested data frames stored in one single data frame
* multiple data frames without nesting
The first approach is good because there is a single lowest level data frame object and there's no need to deal with more than one object.
The latter approach encodes the nested relations by way of database ID techniques, and because each structure is a single table this is easily transferred from and to databases. This approach also provides extra flexibility for spatial data structures, that don't fit in the simple features framework and that are generally precluded by nested structures.
## Terminology
The types of entities that we deal with are:
* **Objects** (some GIS call these "features", for `ggplot2` this is "metadata" which controls aesthetics like "fill", "color")
* **Branches** - these are the parts (pieces), connected topologies like a linear path
* **Coordinates** - all geometry in a single table, identity in multiple column with object, branch, order required by the set in the table
* **Vertices** - purely geometry, with vertex-ID
* **Primitive-table** - tables of primitives, with primitive-ID
* **Link-table** - link tables provide many-to-one relations for branches and primitives
Use "Branch" as a general term for "part" or "piece" of an individual Object (rather than "ring", "linestring", "coordinate" in the specific cases). Branch really means an isolated connected topology, i.e. "path" as in a line-string path, or a polygon ring path - but can refer to a single point in the case of "multi points".
Use "Primitive" for the general case, 0D-Primitive is single-vertex coordinate, 1D-Primitive is two-vertex line segment, a 2D Primitive is a triangle.
There might be a separate table for each kind of primitive, or in wide form they can be kept together in a single table. There are long- and wide- forms for the primitives link tables, but in wide form they must have only one type (a column per vertex ID). Branch-link tables must always be long because they have varying numbers of coordinates per branch.
## Nested data frames
Using nesting for data frames involves the use of list columns, and provides a "structural hierarchy" where a many-to-one relationship between tables can be stored in one object. This has long been possible to create, since data frames are themselves recursive list vectors but had no wide support until the `tidyr` package provided consistent printing and subsetting.
A nested data frame sitting in a parent row doesn't require an explicit value for the parent, and un-nesting will automatically replicate the parent rows an appropriate number of times for each child geometry.
### There are two kinds of nesting for Spatial / sf
1. Nest once, so each row has its own Coordinates table. In this form the geometry is exactly like the output of `ggplot2::fortify`, you could row bind each of these tables together with their parent's ID and use `geom_polygon` directly.
2. Nest twice so each row has a Branch table, which in turn has its Vertex table. In this form, the branches become proper entities, and values like "hole status", or "ring parent" can be stored here. The Vertex table must keep track of the line/poly path order however, or guarantee that it will always match the structural order.[^1]
Tidy nesting is a robust way to store structures analogous to the `Spatial` and `sf` classes, so that row-apply operations may be done on the geometry within the single table list column. It's not clear if these are useful structures in their own right, but they are helpful for explaining the complexity in recursive structures and as intermediates for converting between different forms.
## Relational view
"Relations" are tables, referring to the task of keeping related properties together in the same row. In this view we use a scheme that will be familiar to users of databases, where a given table stores entity data. The entities are objects, branches, vertices and one-to-many links. Nesting approaches avoid the need for links, but also preclude the ability to remove duplications. Removing duplications is a key requirement for topological operations and for providing extensibility to spatial data.
### Entities
Row-bind all entity tables together, in single-nesting this is the fortify approach used in `ggplot2` for `sp`, in double-nesting this gives us three tables.
We name these inside-out, multiple-table approaches:
* Fortify - two tables, Object and Branch-Coordinate.
* Branch - three tables, Object, Branch, Coordinate
### Primitives
Once in Branch form, we can go further by converting into Primitives form, which provides data structures and techniques not supported by simple features or any general GIS standard.
The steps to convert to Primitives are:
* de-duplicate vertices in geometry-space[^2]
* convert from from path to line segment model (a Planar Straight Line Graph)
In this form, we have a primitives model that is able to store all the objects of simple features in a decomposed form. This excludes triangulations, but in simple features those are stored like explicit triangle branches, not a dense mesh of vertices. In the next stage we are able to have 2D primitives with efficient use of a shared vertex pool.
## Beyond standard spatial forms
The line-segment model, or Planar Straight Line Graph (PSLG) can be used to generate two more forms:
* identify 3-way vertex-segment relations to generate arc-node model ('TopoJSON')
* generate 2D primitives from line segments via modified-Delaunay (`RTriangle`), or ear-clipping (`rgl`)
(The doubly-connected edge list is another data structure enabled by the PSLG approach. )
Now we have the following main types of representation
* Bespoke hierarchical (nested lists of things)
* Tidy hierarchical (nested data frames, single or double)
* Fortify (two tables, geometry and object metadata)
* Branch (three tables, coordinates, branches, objects)
* Primitives (usually four tables, object, primitives, links, vertices)
Each of these forms has direct applications for a variety of tasks, either for transferring between forms or for applications that are more efficient in a given form.
### Advantages
* in the branch model parts are identifiable and track-able - i.e. size of rings, length of line strings - in simple features we need to explode an object, badge every part with an ID and track those
* higher dimensional topological forms are provided naturally, the 2D primitives approach fits naturally as an extension to 1D primitives model
* entity tables provide unlimited room for extra information in the right places, i.e. length, area, duration, name can be stored on branches or primitives as needed and used for aesthetics in visualizations. For a triangulated surface with a Z geometry, this belongs on the unique vertex table and not on the link-instances-of-coordinates. For GPS data, we can densify the X-Y planar coordinates (i.e intersecting tracks at a depot) while keeping individual track time, measurement information on the link-instance table.
* The branch and primitives models may be combined, so that the branch table records the way the original simple features are constructed by a branch-link-primitives table - so we can have a perfect record of the original data, recreatable if needed - or completely reworked by operations on primitives that when recombined provide a modified object.
### Key examples
* multi level features as custom types
* trip data as grouped points with measurements, also indexed as lines
* Level-3 bins for ocean colour
* BGM doubly-connected edge-list model
* continuous variation across polygon surfaces
* vector data fusion
* wireframe curvilinear rasters
Multi-level custom types
* state and county sharing vertices, note rules on internal rings (discard them, leave one level for simple features)
* pre-computed topologically sound resolutions (i.e. district precision for a state boundary vs. country precision)
* a voyage track, intervals within it, fine-resolution underway measures, specific stations
* a fused surface, a combination of contour lines with a constant Z fused with polygons without elevation
## Existing implementations
The multiple-table approach described here is used in the following packages.
* **rbgm** - [Atlantis Box Geometry Model](https://github.com/AustralianAntarcticDivision/rbgm), a "doubly-connected edge-list" form of linked faces and boxes in a spatially-explicit 3D ecosystem model
* **rangl** - [Primitives for Spatial data](https://github.com/r-gris/rangl), a generalization of GIS forms with simple 3D plotting
* **spbabel** - [Translators for R Spatial](https://github.com/mdsumner/spbabel), tools to convert from and to spatial forms, provides the general decomposition framework for branches, used by `rangl`
[^1]: *Nesting three times* for simple features multipolygons is a possibility but doesn't neessarily add clarity to the model? Does a hole need to be stored recursively inside its parent ring? Should a second line path be stored as a child of the first?
[^2]: De-duplication may be in any dimensional geometric space, but for simple features this would rarely make sense to not be X,Y.