This project shows how to use kd-trees with Apache Spark and Scala for the analysis of location-based / geometric data.
The example data consists of records of "check-ins" of users with a mobile phone. For each check-in the user id, the date and time, and the location coordinates (latitude and longitude) are recorded.
[user] [check-in time] [latitude] [longitude] [location id]
196514 2010-07-24T13:45:06Z 53.3648119 -2.2723465833 145064
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017 1275991
The example data set is called "loc-Gowalla" and is available from the Stanford Large Network Dataset Collection.
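The record layout above maps naturally onto a small case class. The following is a minimal sketch of such a parser; the class and method names are illustrative and not taken from the project's source.

```scala
// Hypothetical representation of one check-in record in the format
// [user] [check-in time] [latitude] [longitude] [location id].
case class CheckIn(userId: Long, time: String, lat: Double, lon: Double, locationId: Long)

object CheckIn {
  // Parse one whitespace-separated line of the Gowalla data set.
  def parse(line: String): CheckIn = {
    val f = line.trim.split("\\s+")
    CheckIn(f(0).toLong, f(1), f(2).toDouble, f(3).toDouble, f(4).toLong)
  }
}
```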
With this app we can calculate, for every user at some specified point in time, the number of users that are checked in within his/her "neighborhood" (a radius specified in kilometers).
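Testing whether a user lies within such a neighborhood requires a geographic distance on latitude/longitude pairs. A common choice is the haversine (great-circle) distance; the sketch below is one way to compute it and is not necessarily the distance function used in the project.

```scala
// Great-circle (haversine) distance in kilometers between two
// (latitude, longitude) points given in degrees.
object Geo {
  val EarthRadiusKm = 6371.0

  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    import math._
    val dLat = toRadians(lat2 - lat1)
    val dLon = toRadians(lon2 - lon1)
    val a = pow(sin(dLat / 2), 2) +
      cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
    2 * EarthRadiusKm * asin(sqrt(a))
  }
}
```

With this metric the two sample check-ins above are well under a kilometer apart, so they would count as neighbors for a 5 km radius.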
Tools used:
- Apache Hadoop, Apache Spark, Scala, SBT, ScalaTest, Kryo.
Tools needed:
- Scala, SBT
- If you edit the SBT build file you can use the standalone version of Spark, so a Spark cluster is not needed
Checkout the source
git clone https://github.com/jdinkla/location-based-nearest-neighbours.git
Change into the directory
cd location-based-nearest-neighbours
Create an assembly
$ sbt assembly
Check the file cluster.properties and adapt it to your situation. If you want to use a local filesystem, see local.properties. You'll also have to change the Spark hostname in scripts/submit.sh (or .cmd if you are a Windows user).
Download the example data, prepare the data, and find the neighbors within a 5 km range for every user on the 6th of October 2009. Make sure that you use the correct Hadoop user name; otherwise set the environment variable HADOOP_USER_NAME.
$ scripts/submit.sh download
$ scripts/submit.sh sort-by-user
$ scripts/submit.sh neighbors 20091006 5
The output will be a file num_neighbors_20091006_5.0.csv that contains data like the following:
CustomerId;number of neighbors
4904;34
5812;3
6520;0
31744;0
6248;33
The current state is a first proof of concept. The app works, but it is not performance-optimized; the partitioning features of Apache Spark are ignored at the moment.
- Partitioning in Spark
- Testing with larger data
- Writing a random data generator or "multiplying" the current data
- Use other geometric data structures
  - Range trees
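To illustrate the core data structure, here is a minimal 2-d kd-tree sketch with a radius-count query. It is not the project's implementation: it uses plain Euclidean distance on the raw coordinates rather than a geographic metric, and the names are illustrative.

```scala
// A node of a 2-d kd-tree: the splitting point, the splitting axis
// (0 = x, 1 = y), and the two subtrees.
case class KdNode(point: (Double, Double), axis: Int,
                  left: Option[KdNode], right: Option[KdNode])

object KdTree {
  // Build a balanced kd-tree by recursively splitting at the median
  // along alternating axes.
  def build(pts: Seq[(Double, Double)], depth: Int = 0): Option[KdNode] =
    if (pts.isEmpty) None
    else {
      val axis = depth % 2
      val sorted = pts.sortBy(p => if (axis == 0) p._1 else p._2)
      val mid = sorted.length / 2
      Some(KdNode(sorted(mid), axis,
        build(sorted.take(mid), depth + 1),
        build(sorted.drop(mid + 1), depth + 1)))
    }

  // Count the points within Euclidean distance r of the query point q,
  // pruning subtrees that cannot intersect the query disk.
  def countWithin(node: Option[KdNode], q: (Double, Double), r: Double): Int =
    node match {
      case None => 0
      case Some(n) =>
        val (px, py) = n.point
        val self = if (math.hypot(px - q._1, py - q._2) <= r) 1 else 0
        val qc = if (n.axis == 0) q._1 else q._2
        val pc = if (n.axis == 0) px else py
        self +
          (if (qc - r <= pc) countWithin(n.left, q, r) else 0) +
          (if (qc + r >= pc) countWithin(n.right, q, r) else 0)
    }
}
```

A range tree would support the same kind of count query with different asymptotic trade-offs, which is why it is listed above as an alternative.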
Keywords: kd tree, kd-tree, k-d-tree, range queries, range query, latitude, longitude
Written by Jörn Dinkla. See my homepage for further information.