Spatial clustering of flow data #27

Hussein-Mahfouz · 2024-01-31T17:17:16Z

Clustering flow data can show where demand is concentrated. This can be overlaid on PT supply to identify gaps. Try:

First pass

- Spatial clustering using planar flow data
- Independent clustering for car and PT flows

Second (more ambitious) pass

- Bivariate clustering to look at car and PT flows simultaneously
- Construct density domains (High-Car High-Bus, Low-Car High-Bus, High-Car Low-Bus, Low-Car Low-Bus)
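As a rough illustration of the four domains, here is a minimal sketch in base R. The column names (`car`, `bus`), the toy values, and the median split are all assumptions made for illustration; the paper builds its domains from flow densities, so a simple threshold on raw counts is only a stand-in.

```r
# Toy flows with hypothetical columns: car and bus trip counts per OD pair
flows <- data.frame(
  id  = 1:6,
  car = c(120, 15, 200, 8, 95, 30),
  bus = c(40, 60, 5, 3, 70, 2)
)

# Assumption: "High" means at or above the median for that mode
car_level <- ifelse(flows$car >= median(flows$car), "High-Car", "Low-Car")
bus_level <- ifelse(flows$bus >= median(flows$bus), "High-Bus", "Low-Bus")

# Combine the two labels into one of the four domains
flows$domain <- paste(car_level, bus_level)
table(flows$domain)
```

The High-Car Low-Bus group is probably the most interesting one for identifying gaps in PT supply.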

Limitations to mention

- No temporal aspect (there are methods for it)
- Check Notion doc

Robinlovelace · 2024-01-31T20:00:40Z

Also: you could try just aggregating origins or destinations to make it simpler.
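A minimal sketch of that simplification in base R, assuming a long-format OD table with hypothetical columns `origin`, `destination`, `car` and `pt`:

```r
# Toy OD table in long format (one row per OD pair); values are made up
od <- data.frame(
  origin      = c("A", "A", "B", "B", "C"),
  destination = c("B", "C", "A", "C", "A"),
  car         = c(10, 20, 5, 15, 30),
  pt          = c(2, 4, 6, 8, 10)
)

# Total trips leaving each origin zone, by mode
by_origin <- aggregate(cbind(car, pt) ~ origin, data = od, FUN = sum)

# Total trips arriving at each destination zone, by mode
by_destination <- aggregate(cbind(car, pt) ~ destination, data = od, FUN = sum)
```

The aggregated totals are point data, so ordinary point-based clustering could be applied to the zone centroids without building a flow-to-flow distance matrix.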

Hussein-Mahfouz · 2024-02-06T10:20:00Z

Robinlovelace · 2024-02-06T10:22:00Z

This looks great to me Hussein, no other comment at this stage...

Hussein-Mahfouz · 2024-02-06T13:04:56Z

It seems that I cannot create large matrices using pivot_wider in tidyr (see issue 1097). This is a problem for point [1] above, as it prevents me from going down to the OA (or custom) hexagon level. I need to find another package that allows this.
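One possible workaround, sketched in base R, is to fill a pre-allocated matrix by index instead of pivoting. The column names (`flow_i`, `flow_j`, `dist`) and values are hypothetical. This avoids pivot_wider but not the underlying memory cost: a dense numeric matrix for n flows needs about n^2 * 8 bytes, so 50,000 flows would take roughly 20 GB.

```r
# Long-format pairwise distances (hypothetical column names and values)
pairs <- data.frame(
  flow_i = c("f1", "f1", "f2"),
  flow_j = c("f2", "f3", "f3"),
  dist   = c(1.5, 3.0, 2.2)
)

# Pre-allocate a square matrix of NAs, one row and column per flow
ids <- sort(unique(c(pairs$flow_i, pairs$flow_j)))
n <- length(ids)
dist_mat <- matrix(NA_real_, n, n, dimnames = list(ids, ids))

# Fill both triangles using two-column matrix indexing
i <- match(pairs$flow_i, ids)
j <- match(pairs$flow_j, ids)
dist_mat[cbind(i, j)] <- pairs$dist
dist_mat[cbind(j, i)] <- pairs$dist
diag(dist_mat) <- 0
```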

Robinlovelace · 2024-02-06T14:23:27Z

od should be able to help. See Robinlovelace/simodels#33 for some tests. If you can provide a reproducible example to get moving, that would help. The key question is: how many OD pairs are you looking at?

And what's the max distance you need?

Hussein-Mahfouz · 2024-02-06T18:41:53Z

@Robinlovelace I am looking at all OD pairs with commuting flows. I am getting the "distance" between each pair of desire lines using the metric in eq 1 here. I'm not sure how I would use the function you linked me to, so I need to think about it. The clustering algorithm takes in a square matrix, so I still need to create a full matrix even if it is full of NAs (that is where I run into the memory issue).
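The metric itself is not reproduced in this thread, so the sketch below assumes a flow distance of the form sqrt(alpha * d_O^2 + beta * d_D^2), where d_O and d_D are the Euclidean distances between the two flows' origins and destinations. The actual eq 1 may differ (for example, it may normalise by flow length). Column names and coordinates are made up.

```r
# Toy flows: origin (ox, oy) and destination (dx, dy) in projected coordinates
flows <- data.frame(
  ox = c(0, 0.1, 5), oy = c(0, 0, 5),
  dx = c(1, 1.1, 9), dy = c(1, 1, 9)
)

# Assumed weights for the origin and destination terms
alpha <- 1
beta  <- 1

# Pairwise distances between origins and between destinations, for ALL pairs
d_o <- as.matrix(dist(flows[, c("ox", "oy")]))
d_d <- as.matrix(dist(flows[, c("dx", "dy")]))

# Assumed flow distance: square n x n matrix, one row and column per flow
flow_dist <- sqrt(alpha * d_o^2 + beta * d_d^2)
```

Flows 1 and 2 start and end close together, so their distance is small; flow 3 is far from both.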

I am now able to use the weights parameter in DBSCAN, see here:

drt-potential/code/demand_cluster_flows.R

Lines 229 to 232 in 57d8470

    
    cluster_dbscan = dbscan::dbscan(dist_mat,
                                    minPts = 250,
                                    eps = 1.2,
                                    weights = w_vec)
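For readers unfamiliar with the `weights` argument, here is a small self-contained toy example. The points, weights, and parameter values are invented and unrelated to the real data. With weights, a point counts as a core point when the summed weight in its eps-neighbourhood reaches `minPts`, which is why `minPts` can be much larger than the number of rows.

```r
library(dbscan)

set.seed(1)
# Two tight groups of 20 points each, plus one isolated point
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.2), ncol = 2),
  c(20, 20)
)
dist_mat <- dist(pts)

# Hypothetical weights, e.g. number of commuters on each flow
w_vec <- c(rep(10, 40), 1)

# The neighbourhood weight must sum to at least minPts for a core point
fit <- dbscan(dist_mat, eps = 1, minPts = 50, weights = w_vec)
table(fit$cluster)
```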

The results are still not good, and I think this is mainly because I am using MSOA centroids. I will try odjitter to distribute the flows spatially and see how that affects the results.

Robinlovelace · 2024-02-06T20:03:20Z

Sounds good. The {od} package may be useful: I was suggesting that the code in that PR goes there. Never tried the DBSCAN algorithm, but it seems good for this use case.

Hussein-Mahfouz · 2024-02-07T12:13:42Z

Do you recommend using the od_jitter() function in the od package, or should I stick to the odjitter rust package used in atumie?

Robinlovelace · 2024-02-07T13:21:18Z

> Do you recommend using the od_jitter() function in the od package, or should I stick to the odjitter rust package used in atumie?

Good question. I recommend the odjitter package, especially for large datasets. Not used or tested od_jitter() in a while and it's slow.

Hussein-Mahfouz · 2024-02-08T17:35:47Z

Results are a bit better after (a) jittering and (b) weighting the flows in DBSCAN: flows from different origins OR destinations are being clustered together. I still don't understand why most lines are in the big cluster 0.
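One thing that may explain the big cluster 0: in the dbscan package, a cluster id of 0 marks noise points, not a real cluster. The share of noise therefore depends directly on `eps` and `minPts`. The toy sketch below (invented data, arbitrary parameter grid) sweeps `eps` to show how the noise share changes, and computes sorted k-nearest-neighbour distances, which are commonly plotted to look for a "knee" when choosing `eps`.

```r
library(dbscan)

set.seed(42)
# 30 tightly grouped points plus 20 scattered background points
pts <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(runif(40, min = -10, max = 10), ncol = 2)
)

# Share of points labelled 0 (noise) for a few eps values
eps_grid <- c(0.1, 0.5, 1, 2)
noise_share <- sapply(eps_grid, function(e) {
  fit <- dbscan(pts, eps = e, minPts = 5)
  mean(fit$cluster == 0)
})
data.frame(eps = eps_grid, noise_share = noise_share)

# Sorted distance to the 4th nearest neighbour (k = minPts - 1)
knn_d <- sort(kNNdist(pts, k = 4))
# plot(knn_d, type = "l")  # a sharp bend suggests a candidate eps
```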

TODO:

- Understand how to set epsilon
- Split flows by distance and then cluster: this is based on the preprocessing done in the Bivariate Flow Clustering paper
  - Option 1: same as the paper. Calculate the local L function to see if clustering is happening at different scales. I need to figure out how to calculate the local L
  - Option 2: split flows into (a) n distance groups using cut() or (b) overlapping window groups (to avoid arbitrary thresholds)
- Look at the effect of distances that weight either the origin or destination higher (see alpha and beta here)
- Bivariate flow clustering (as in the paper linked above):
  - Get flows by mode
  - Check how quadrants are created in the paper
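Option 2 of the distance split could look roughly like the sketch below. The `length_km` column, the break points, and the window bounds are all hypothetical choices for illustration, not values from the paper.

```r
# Toy flows with a hypothetical length column (km)
flows <- data.frame(
  id        = 1:8,
  length_km = c(0.8, 1.5, 2.4, 4.9, 5.2, 9.7, 12.3, 18.0)
)

# (a) Non-overlapping distance groups using cut()
flows$dist_group <- cut(
  flows$length_km,
  breaks = c(0, 2, 5, 10, Inf),
  labels = c("0-2", "2-5", "5-10", "10+")
)
groups_a <- split(flows, flows$dist_group)

# (b) Overlapping windows: a flow near a boundary lands in two groups
windows <- data.frame(lo = c(0, 1, 4, 8), hi = c(2, 5, 10, Inf))
groups_b <- lapply(seq_len(nrow(windows)), function(k) {
  in_window <- flows$length_km >= windows$lo[k] & flows$length_km < windows$hi[k]
  flows[in_window, ]
})
```

Each element of `groups_a` or `groups_b` could then be clustered separately, with `eps` tuned per group.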

Robinlovelace · 2024-02-08T17:45:23Z

This is great to see and I can think of applications in other projects. Great work Hussein in figuring out spatial clustering of OD data.

Hussein-Mahfouz · 2024-02-09T11:54:03Z

The last commit fixes a mistake in the distance matrix calculation. I was only calculating distances between flows from the same origin zone, so I wasn't using distances for flows that have a different origin zone. Results now show clusters with flows that start in different zones. I still need to work on the points mentioned above.

Hussein-Mahfouz added the demand demand analysis label Jan 31, 2024

Hussein-Mahfouz mentioned this issue Jan 31, 2024

spatial clustering of flow data #28

Merged

Hussein-Mahfouz linked a pull request Jan 31, 2024 that will close this issue

spatial clustering of flow data #28

Merged

Hussein-Mahfouz added a commit that referenced this issue Feb 6, 2024

weighted cluster using DBSCAN, ref #27

57d8470

Hussein-Mahfouz added a commit that referenced this issue Feb 9, 2024

fix dist_mat calculation, ref #27

c0bc27e

Hussein-Mahfouz added a commit that referenced this issue Feb 14, 2024

first attempt at clusters with mode, ref #27

ace397f

Hussein-Mahfouz added a commit that referenced this issue Feb 20, 2024

dbscan sensitivity plots, ref #27

06f5e64

Hussein-Mahfouz added a commit that referenced this issue Feb 21, 2024

filter od pairs based on poor supply. Jitter not working now. ref #27

793752c

Hussein-Mahfouz closed this as completed in #28 May 23, 2024
