Spatial clustering of flow data #27

Closed
Hussein-Mahfouz opened this issue Jan 31, 2024 · 12 comments · Fixed by #28
Labels
demand demand analysis

Comments

@Hussein-Mahfouz
Owner

Clustering flow data can show where demand is concentrated. This can be overlaid on PT supply to identify gaps. Try:

First pass

  • spatial clustering using planar flow data
  • independent clustering for car and PT flows

Second (more ambitious) pass

  • Bivariate clustering to look at car and PT flows simultaneously
  • Construct density domain (High-Car High-Bus, Low-Car High-Bus, High-Car Low-Bus, Low-Car Low-Bus)

Limitations to mention

  • No temporal aspect (there are methods for it)
  • check notion doc
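
As a rough illustration of the four density domains listed above, the sketch below labels each flow as High or Low for car and bus. The median split and the `car` and `bus` columns are assumptions made for the example; the bivariate flow clustering literature may define the quadrants differently.

```r
# Hypothetical commuter counts per OD pair, by mode
flows <- data.frame(
  car = c(120, 15, 200, 8),
  bus = c(40, 35, 2, 1)
)

# Assumed rule: "High" means above the median of that mode
level <- function(x) ifelse(x > median(x), "High", "Low")

flows$domain <- paste0(level(flows$car), "-Car ", level(flows$bus), "-Bus")
flows
```

A threshold on flow density rather than raw counts would probably be closer to the idea of a density domain, but it needs the clustering step first.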
@Hussein-Mahfouz Hussein-Mahfouz added the demand demand analysis label Jan 31, 2024
@Hussein-Mahfouz Hussein-Mahfouz linked a pull request Jan 31, 2024 that will close this issue
@Robinlovelace
Collaborator

Also: you could try just aggregating origins or destinations to make it simpler.

@Hussein-Mahfouz
Owner Author

Hussein-Mahfouz commented Feb 6, 2024

I have made a first pass at using HDBSCAN to cluster lines with a custom distance matrix. The distance between lines is based on equation 3 here. Initial results with minPts = 50:

cluster

This is just proof that the distance function works, but there are many issues:

  • I'm using the MSOA OD matrix and each line is treated equally. It is not weighted by the number of people commuting
  • The OD matrix is between MSOA centroids. Clusters in the outer MSOAs are grouped by their endpoints because these MSOAs are bigger than the ones in the center: the larger distance between their centroids and neighbouring centroids makes it more likely that their flows end up in their own cluster
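
The equation referred to above is not reproduced in this thread, so the following is only a sketch of how a custom flow distance matrix could be built. It assumes a commonly used form, d(i, j) = sqrt(alpha * dO(i, j)^2 + beta * dD(i, j)^2), where dO and dD are the distances between the two origins and the two destinations. The function name, the column names and the `alpha` and `beta` defaults are hypothetical.

```r
# Hypothetical helper: pairwise dissimilarity between desire lines.
# Assumed form (not necessarily the equation linked in the thread):
#   d(i, j) = sqrt(alpha * dO(i, j)^2 + beta * dD(i, j)^2)
flow_distance_matrix <- function(ox, oy, dx, dy, alpha = 1, beta = 1) {
  d_origin <- as.matrix(dist(cbind(ox, oy)))  # origin-to-origin distances
  d_dest   <- as.matrix(dist(cbind(dx, dy)))  # destination-to-destination distances
  sqrt(alpha * d_origin^2 + beta * d_dest^2)
}

# Three toy flows in projected (planar) coordinates
flows <- data.frame(
  ox = c(0, 0, 10), oy = c(0, 1, 10),
  dx = c(5, 5, 20), dy = c(5, 6, 20)
)
dist_mat <- flow_distance_matrix(flows$ox, flows$oy, flows$dx, flows$dy)
dist_mat
```

Raising `alpha` relative to `beta` makes flows with nearby origins count as more similar, and vice versa. Every flow contributes equally here, which is the weighting limitation noted in the first bullet.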

To Do:

  • [1] handle MAUP due to variable zone sizes. Options:

    • use OA level data (MVP)
    • use uniform hexagon grid
    • use OD jitter to distribute points in each zone based on population distribution (same as atumie)
  • [2] Accounting for weights (no. of people between OD pair)

    • HDBSCAN
      • Find an R implementation that uses weights. I don't think there is one
      • Duplicate each flow n times based on the number of commuters. This would create a huge distance matrix. For Leeds, 107 MSOAs give $107^2$ = 11,449 flows, so the distance matrix between all pairs of desire lines is 11,449 × 11,449. If each individual commuter is treated separately, the matrix becomes 142,201 × 142,201. The HDBSCAN function is already slow with more than 3,000 rows, so I don't think this will work
    • DBSCAN
      • How do I use the weights parameter in DBSCAN? (this is possibly the best option)
    • Other algorithms that account for weights?
      • OPTICS?: DBSCAN is a poor fit for clusters of varying density because epsilon is fixed. OPTICS apparently addresses this
  • [3] Bivariate flow clustering to determine clusters with [LOW PT, HIGH CAR], [LOW PT, LOW CAR] etc

    • extend DBSCAN
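
To put the matrix sizes in item [2] into memory terms, here is a back-of-the-envelope sketch. It assumes a dense matrix of 8-byte doubles; the helper name is hypothetical.

```r
# Size in gigabytes of a dense n x n matrix of doubles (8 bytes per cell)
dense_matrix_gb <- function(n) n^2 * 8 / 1e9

n_flows <- 107^2                 # 11,449 MSOA-to-MSOA desire lines for Leeds
gb_flows <- dense_matrix_gb(n_flows)
gb_commuters <- dense_matrix_gb(142201)

round(c(per_flow = gb_flows, per_commuter = gb_commuters), 2)
```

That works out at about 1 GB for one row per flow and about 160 GB for one row per commuter. A `dist` object stores only the lower triangle, which roughly halves both figures but does not change the conclusion for the per-commuter case.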

@Robinlovelace
Collaborator

This looks great to me Hussein, no other comment at this stage...

@Hussein-Mahfouz
Owner Author

Hussein-Mahfouz commented Feb 6, 2024

It seems like I cannot create big matrices using pivot_wider in tidyr (see issue 1097). This is a problem for point [1] above, as it prevents me from going down to the OA (or custom hexagon) level. I need to find another package that allows this.
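
One possible workaround that avoids pivot_wider is to fill a preallocated base R matrix with two-column index matrices. This is a sketch under assumed column names (`flow_i`, `flow_j`, `distance`), and it has not been checked against the tidyr issue mentioned above. It also does nothing about the n × n memory footprint itself.

```r
# Long-format pairwise distances: one row per (flow_i, flow_j) pair
pairs <- data.frame(
  flow_i   = c("a", "a", "b"),
  flow_j   = c("b", "c", "c"),
  distance = c(1.5, 4.0, 2.5)
)

ids <- sort(unique(c(pairs$flow_i, pairs$flow_j)))
dist_mat <- matrix(NA_real_, nrow = length(ids), ncol = length(ids),
                   dimnames = list(ids, ids))

# Fill both triangles using (row, column) matrix indexing
i <- match(pairs$flow_i, ids)
j <- match(pairs$flow_j, ids)
dist_mat[cbind(i, j)] <- pairs$distance
dist_mat[cbind(j, i)] <- pairs$distance
diag(dist_mat) <- 0
dist_mat
```

Pairs that are missing from the long table stay `NA`, so they would need a fill value before the matrix is converted with `as.dist()`.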

@Robinlovelace
Collaborator

od should be able to help. See Robinlovelace/simodels#33 for some tests. If you can provide a reproducible example to get moving, that would help. The key question is: how many OD pairs are you looking at?

And what's the max distance you need?

@Hussein-Mahfouz
Owner Author

@Robinlovelace I am looking at all OD pairs with commuting flows. I am getting the "distance" between each pair of desire lines using the metric in eq 1 here. I'm not sure how I would use the function you linked me to. I need to think about it. The clustering algorithm takes in a square matrix, so I still need to create a full matrix even if it is full of NAs (that is where I run into memory issues).

I am now able to use the weights parameter in DBSCAN, see here:

# dist_mat: distances between desire lines; w_vec: one weight per flow
cluster_dbscan <- dbscan::dbscan(
  dist_mat,
  minPts = 250,
  eps = 1.2,
  weights = w_vec
)

The results are still not good, and I think this is mainly because I am using MSOA centroids. I will try odjitter to distribute the flows spatially and see how that affects the results.
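
For reference, a small self-contained sketch of weighted DBSCAN on toy data. It assumes the dbscan package is installed and that, as its documentation describes, `x` can be a `dist` object and `weights` makes the core-point test compare the summed weight in the eps-neighbourhood against `minPts`. The points and weights are invented.

```r
library(dbscan)

# Toy data: two tight groups of three points plus one isolated point
pts <- rbind(
  cbind(c(0, 0.1, 0.2), c(0, 0.1, 0)),
  cbind(c(5, 5.1, 5.2), c(5, 5.1, 5)),
  c(20, 20)
)

# Pass a dist object so the values are read as distances, not coordinates
dist_mat <- dist(pts)

# Hypothetical weights: heavy flows in the first group, light ones elsewhere
w_vec <- c(10, 10, 10, 1, 1, 1, 1)

res <- dbscan::dbscan(dist_mat, eps = 0.5, minPts = 20, weights = w_vec)
res$cluster
```

Under that assumption, only the heavily weighted group should be dense enough to form a cluster. Cluster id 0 in the result marks noise points rather than a cluster. A plain square matrix passed as `x` is treated as coordinates, so a precomputed distance matrix would need `as.dist()` first.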

@Robinlovelace
Collaborator

Sounds good, the {od} package may be useful, I was suggesting that the code in that PR goes there. Never tried the DBSCAN algorithm, seems good for this use case.

@Hussein-Mahfouz
Owner Author

Do you recommend using the od_jitter() function in the od package, or should I stick to the odjitter rust package used in atumie?

@Robinlovelace
Collaborator

> Do you recommend using the od_jitter() function in the od package, or should I stick to the odjitter rust package used in atumie?

Good question. I recommend the odjitter package, especially for large datasets. Not used or tested od_jitter() in a while and it's slow.

@Hussein-Mahfouz
Owner Author

Hussein-Mahfouz commented Feb 8, 2024

Results are a bit better after (a) jittering and (b) weighting the flows in DBSCAN: flows from different origins OR destinations are being clustered together. I still don't understand why most lines are in the big cluster 0.

image

TODO:

  1. Understand how to set epsilon
  2. split flows by distance and then cluster: this is based on the preprocessing done in the Bivariate Flow Clustering paper
    • option 1: same as the paper. Calculate the local L-function to see if clustering is happening at different scales. I need to figure out how to calculate the local L-function
    • option 2: split flows into (a) n distance groups using cut() or (b) overlapping window groups (to avoid arbitrary thresholds)
  3. Look at effect of distances that weight either the origin or destination higher (see alpha and beta here)
  4. Bivariate flow clustering (as in the paper linked above):
    • get flows by mode
    • check how quadrants are created in the paper

@Robinlovelace
Copy link
Collaborator

This is great to see and I can think of applications in other projects. Great work Hussein in figuring out spatial clustering of OD data.

@Hussein-Mahfouz
Owner Author

The last commit fixes a mistake in the distance matrix calculation. I was only calculating distances between flows from the same origin zone, so distances between flows with different origin zones were missing. Results now show clusters with flows that start in different zones. I still need to work on the points mentioned above.

image

2 participants