This project shows how to use kd-trees with Apache Spark and Scala for the analysis of location-based / geometric data.
The example data consists of records of "check-ins" of users with a mobile phone. For each check-in the user id, the date and time, and the location coordinates (latitude and longitude) are recorded.
[user] [check-in time] [latitude] [longitude] [location id]
196514 2010-07-24T13:45:06Z 53.3648119 -2.2723465833 145064
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017 1275991
The example data set is called "loc-Gowalla" and is available from the Stanford Large Network Dataset Collection.
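The record layout above maps naturally onto a small case class. The following is a minimal sketch of such a parser; the class and method names are illustrative and not taken from the project's source.

```scala
// Hypothetical representation of one check-in record in the format
// [user] [check-in time] [latitude] [longitude] [location id].
case class CheckIn(userId: Long, time: String, lat: Double, lon: Double, locationId: Long)

object CheckIn {
  // Parse one whitespace-separated line of the Gowalla data set.
  def parse(line: String): CheckIn = {
    val f = line.trim.split("\\s+")
    CheckIn(f(0).toLong, f(1), f(2).toDouble, f(3).toDouble, f(4).toLong)
  }
}
```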
With this app we can calculate, for every user at some specified point in time, the number of users that are checked in within his/her "neighborhood" (a radius specified in kilometers).
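Testing whether a user lies within such a neighborhood requires a geographic distance on latitude/longitude pairs. A common choice is the haversine (great-circle) distance; the sketch below is one way to compute it and is not necessarily the distance function used in the project.

```scala
// Great-circle (haversine) distance in kilometers between two
// (latitude, longitude) points given in degrees.
object Geo {
  val EarthRadiusKm = 6371.0

  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    import math._
    val dLat = toRadians(lat2 - lat1)
    val dLon = toRadians(lon2 - lon1)
    val a = pow(sin(dLat / 2), 2) +
      cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
    2 * EarthRadiusKm * asin(sqrt(a))
  }
}
```

With this metric the two sample check-ins above are well under a kilometer apart, so they would count as neighbors for a 5 km radius.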
Tools used:
- Apache Hadoop, Apache Spark, Scala, SBT, ScalaTest, Kryo.
Tools needed:
- Scala, SBT
- If you edit the SBT build file you can use the standalone version of Spark, so a Spark cluster is not needed
Checkout the source
git clone https://github.com/jdinkla/location-based-nearest-neighbours.git
Change into the directory
cd location-based-nearest-neighbours
Create an assembly
$ sbt assembly
Check the file cluster.properties and adapt it to your situation. If you want to use a local filesystem, see local.properties. You'll also have to change the Spark hostname in scripts/submit.sh (or .cmd if you are a Windows user).
Download the example data, prepare the data, and find the neighbors within a 5 km range for every user on the 6th of October 2009. Make sure that you use the correct Hadoop user name; otherwise set the environment variable HADOOP_USER_NAME.
$ scripts/submit.sh download
$ scripts/submit.sh sort-by-user
$ scripts/submit.sh neighbors 20091006 5
The output will be a file num_neighbors_20091006_5.0.csv that contains data like the following:
CustomerId;number of neighbors
4904;34
5812;3
6520;0
31744;0
6248;33
The current state is a first proof of concept. The app works, but it is not performance-optimized; the partitioning features of Apache Spark are ignored at the moment.
- Partitioning in Spark
- Testing with larger data
- Writing a random data generator or "multiplying" the current data
- Use other geometric data structures
  - Range trees
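To illustrate the core data structure, here is a minimal 2-d kd-tree sketch with a radius-count query. It is not the project's implementation: it uses plain Euclidean distance on the raw coordinates rather than a geographic metric, and the names are illustrative.

```scala
// A node of a 2-d kd-tree: the splitting point, the splitting axis
// (0 = x, 1 = y), and the two subtrees.
case class KdNode(point: (Double, Double), axis: Int,
                  left: Option[KdNode], right: Option[KdNode])

object KdTree {
  // Build a balanced kd-tree by recursively splitting at the median
  // along alternating axes.
  def build(pts: Seq[(Double, Double)], depth: Int = 0): Option[KdNode] =
    if (pts.isEmpty) None
    else {
      val axis = depth % 2
      val sorted = pts.sortBy(p => if (axis == 0) p._1 else p._2)
      val mid = sorted.length / 2
      Some(KdNode(sorted(mid), axis,
        build(sorted.take(mid), depth + 1),
        build(sorted.drop(mid + 1), depth + 1)))
    }

  // Count the points within Euclidean distance r of the query point q,
  // pruning subtrees that cannot intersect the query disk.
  def countWithin(node: Option[KdNode], q: (Double, Double), r: Double): Int =
    node match {
      case None => 0
      case Some(n) =>
        val (px, py) = n.point
        val self = if (math.hypot(px - q._1, py - q._2) <= r) 1 else 0
        val qc = if (n.axis == 0) q._1 else q._2
        val pc = if (n.axis == 0) px else py
        self +
          (if (qc - r <= pc) countWithin(n.left, q, r) else 0) +
          (if (qc + r >= pc) countWithin(n.right, q, r) else 0)
    }
}
```

A range tree would support the same kind of count query with different asymptotic trade-offs, which is why it is listed above as an alternative.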
Keywords: kd tree, kd-tree, k-d-tree, range queries, range query, latitude, longitude
Written by Jörn Dinkla. See my homepage for further information.