---
title: "Exploring YELP API"
output:
  html_document:
    keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
**Please note: if you don't already have the `data` directory, the script will need to download the data from Yelp.
To do so, go to [yelp developers](https://www.yelp.com/developers/manage_api_keys) and create API credentials,
add them to the `credentials.R.template` file, and rename it to `credentials.R`.**
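For reference, here is a minimal sketch of what `credentials.R` is expected to define; the variable names are the ones used by `yelp_query()` below, and the values are placeholders:
```{r, eval=FALSE}
# credentials.R -- placeholders only; fill in the values generated at
# https://www.yelp.com/developers/manage_api_keys
consumerKey    <- "YOUR_CONSUMER_KEY"
consumerSecret <- "YOUR_CONSUMER_SECRET"
token          <- "YOUR_TOKEN"
token_secret   <- "YOUR_TOKEN_SECRET"
```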
```{r, include=FALSE}
library(httr)
library(httpuv)
library(jsonlite)
library(base64enc)
library(geosphere)
library(plyr)
library(ggplot2)
library(ggmap)
```
```{r, echo=FALSE}
# Define constants:
# Number of pages to retrieve from YELP.
# Each page contains 20 rows
NUM_OF_PAGES <- 50
```
```{r, echo=FALSE}
wd <- '/Users/hagai_lvi/tmp/data_scientist/assignment_2'
setwd(wd)
source('credentials.R')
```
First, we gather the data. The code we use is:
```{r, echo=TRUE}
yelp_search <- function(term, location, category_filter, sort=0, offset=0, limit=20) {
  # Search term and location go in the query string.
  path <- "/v2/search/"
  query_args <- list(term=term,
                     location=location,
                     category_filter=category_filter,
                     sort=sort,
                     offset=offset,
                     limit=limit)
  # Make request.
  results <- yelp_query(path, query_args)
  return(results)
}

yelp_query <- function(path, query_args) {
  # Use OAuth to authorize the request.
  myapp <- oauth_app("YELP", key=consumerKey, secret=consumerSecret)
  sig <- sign_oauth1.0(myapp, token=token, token_secret=token_secret)
  # Build Yelp API URL.
  scheme <- "https"
  host <- "api.yelp.com"
  yelpurl <- paste0(scheme, "://", host, path)
  # Make request.
  results <- GET(yelpurl, sig, query=query_args)
  # If status is not success, print some debugging output.
  HTTP_SUCCESS <- 200
  if (status_code(results) != HTTP_SUCCESS) {
    print(results)
  }
  return(results)
}

# Create the data dir if it doesn't exist
if (!file.exists('data')) {
  dir.create('data')
}

# Great-circle distance (in meters) between row i of `data` and a fixed
# reference point near the city center; NA if coordinates are missing.
calc_dist <- function(i, data) {
  if (is.na(data[i, 'latitude']) || is.na(data[i, 'longitude'])) {
    NA
  } else {
    distm(c(data[i, 'longitude'], data[i, 'latitude']), c(-122.405766, 37.790244))
  }
}

if (!file.exists('./data/business.csv')) {
  businesses <- NULL
  # The API serves a small number of rows per request, so we iterate
  # page by page and build one big data frame
  for (offset in 0:(NUM_OF_PAGES-1)) {
    yelp_search_result <- yelp_search(term="food", category_filter="food", location="San Francisco, CA", sort=0, offset = offset*20)
    # Parse the http response body into a JSON-backed list
    yelp_data <- content(yelp_search_result)
    yelp_data_list <- jsonlite::fromJSON(toJSON(yelp_data, auto_unbox = TRUE))
    # Clean and rename the dataframe
    tmp <- yelp_data_list$businesses
    tmp <- data.frame(tmp$name, tmp$rating, tmp$review_count, tmp$location$coordinate$latitude, tmp$location$coordinate$longitude)
    tmp <- rename(tmp, c("tmp.name"="name", "tmp.rating"="rating", "tmp.review_count"="review_count",
                         "tmp.location.coordinate.latitude"="latitude", "tmp.location.coordinate.longitude"="longitude"))
    # Merge with the previous pages, for one big dataset
    businesses <- rbind(businesses, tmp)
  }
  # Add a distance column: great-circle distance from the center, via geosphere::distm
  businesses$dist <- sapply(1:nrow(businesses), calc_dist, businesses)
  # Write the data to csv
  write.csv(businesses, file='./data/business.csv')
} else {
  businesses <- read.csv('./data/business.csv', header = TRUE)
}
cat(sprintf("Number of rows: %i", nrow(businesses)))
```
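As a side note, the per-row `sapply` over `calc_dist` could probably be collapsed into one vectorized call, since `geosphere::distHaversine` (the default metric behind `distm`) accepts a whole matrix of coordinates. A sketch, not evaluated here:
```{r, eval=FALSE}
# Hypothetical vectorized alternative to the sapply/calc_dist loop above.
# distHaversine should propagate NA coordinates as NA distances.
center <- c(-122.405766, 37.790244)
businesses$dist <- distHaversine(
  as.matrix(businesses[, c("longitude", "latitude")]), center)
```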
Note that we have `r nrow(businesses)` rows.
The available features are:
```{r, echo=FALSE}
names(businesses)
```
Head of the table:
```{r, echo=FALSE}
head(businesses)
```
## Now, some plots:
First, box plots of the relevant features:
```{r}
par(mfrow=c(1,3))
boxplot(businesses$rating, main='Rating')
boxplot(businesses$review_count, main='# of reviews')
boxplot(businesses$dist, main='Distance(meter)')
```
- We can see that most of the ratings fall in the range [4.0, 4.5].
- We can also see that most of the restaurants have a rather small number of reviews.
- Last but not least, we can see that most of the restaurants lie within roughly 3 km, which sounds like a reasonable radius
for a central area in a large city (a quick numeric check follows).
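These readings can be made precise with a few base R summaries; a sketch, not evaluated here:
```{r, eval=FALSE}
# Quartiles behind the box plots above
quantile(businesses$rating, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
summary(businesses$review_count)
quantile(businesses$dist, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```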
-----
```{r}
with(businesses, plot(x = review_count, y = rating, main='Rating as function of # of reviews', xlab= '# of reviews',
                      ylab='rating'))
# Add trend line (rating modeled as a function of review count)
with(businesses, abline(lm(rating ~ review_count), col="red"))
```
We thought we might see some trend in this graph, but we couldn't find any.
We can also use `cor(businesses$review_count, businesses$rating)` and see that the correlation is `r cor(businesses$review_count, businesses$rating)`,
which is negligible.
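To check that this near-zero correlation is not merely noise hiding a real effect, one could also run a significance test; a sketch using base R's `cor.test`:
```{r, eval=FALSE}
# Pearson correlation test; a large p-value is consistent with
# "no linear relationship" between review count and rating
cor.test(businesses$review_count, businesses$rating)
```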
-----------------
```{r}
plot(x=businesses$dist, y=businesses$rating, main='Rating as function of distance from the center', xlab="Distance from the center (meter)",
     ylab="Rating")
# Add trend line (rating modeled as a function of distance)
abline(lm(businesses$rating ~ businesses$dist), col="red")
```
We thought that the rating might be affected by the distance from the center, but there seems to be no trend in this graph.
Using `cor(...)` we see that here, too, the correlation is `r cor(businesses$dist, businesses$rating, use="complete")`, which is negligible.
------------
```{r}
plot(businesses$dist, businesses$review_count, main='# of reviews as a function of distance from the center',
     xlab='Distance from the center (meter)', ylab='# of reviews')
# Add trend line (review count modeled as a function of distance)
abline(lm(businesses$review_count ~ businesses$dist), col="red")
```
We expected the restaurants closest to the center to have the most reviews, because that is where most people visit.
The correlation is `r cor(businesses$dist, businesses$review_count, use="complete")`, which, again, is negligible.
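For a look beyond the raw correlation, a simple linear model would expose the slope and its significance; a sketch:
```{r, eval=FALSE}
# Review count modeled as a function of distance; summary() reports the
# slope, its standard error and p-value. lm() drops rows with NA distances.
fit <- lm(review_count ~ dist, data = businesses)
summary(fit)
```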
--------------------
```{r}
hist(businesses$dist, main = 'Histogram of amount of restaurants as\na function of distance from the center', xlab = 'Distance(meter)')
rug(businesses$dist)
```
The histogram shows that the central area, within a radius of roughly 2-3 km, does contain the most restaurants, and that the number
of restaurants shrinks as we move further from the center.
This is probably also related to the fact that the area is surrounded by water (as can be seen in the maps below).
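The same reading can be checked numerically by counting restaurants per 1 km distance band (a sketch; the 10 km upper bound is an assumption chosen to comfortably cover the data):
```{r, eval=FALSE}
# Count restaurants in 1 km rings around the chosen center point
bands <- cut(businesses$dist, breaks = seq(0, 10000, by = 1000))
table(bands)
```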
-----------
Relevant maps:
```{r, include=FALSE, warning=FALSE}
# Retrieve a map
map <- get_map(location = c(lon = -122.4250, lat = 37.7550), zoom = 12, maptype = "hybrid", scale = 2)
```
```{r, echo=FALSE, warning=FALSE, results='hide'}
# Show all the restaurants on a map:
ggmap(map, extent = 'device') + geom_point(aes(x=longitude, y=latitude), data=businesses) +
geom_point(data = data.frame(lon = -122.405766, lat = 37.790244), colour = "red")
```
We can see on a map the locations of the restaurants that we retrieved from Yelp.
*The red dot marks the center. We picked it by approximation, as it seemed like the most crowded spot.
This might not be the best choice, but we thought it was good enough for now.*
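Since the center was picked by eye (and the recommendations at the end suggest trying a different one), a simple data-driven alternative would be the centroid of the retrieved restaurants; a sketch, assuming a plain coordinate mean is acceptable at city scale:
```{r, eval=FALSE}
# Mean longitude/latitude of the sample as an alternative "center"
alt_center <- c(lon = mean(businesses$longitude, na.rm = TRUE),
                lat = mean(businesses$latitude, na.rm = TRUE))
alt_center
```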
----
```{r, echo=FALSE, warning=FALSE}
# A heat map that highlights the most "dense" areas
ggmap(map, extent='device') +
  geom_density2d(aes(x=longitude, y=latitude), data = businesses) +
  stat_density2d(data=businesses, aes(x=longitude, y=latitude, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red", guide = FALSE) +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
```
This heat map shows the central area.
We can clearly see that there is indeed a central area in which most of the restaurants are grouped, just as the histogram showed.
----
## Summary, conclusions and recommendations
### Summary
In this assignment we gained experience with accessing a web API from R, gathering and cleaning data, and deriving some
basic insights from the data.
We have also visualized the data on a map (including individual locations and a heat map).
### Conclusions
The most interesting graph, in our opinion, is the histogram. It first shows an increase in the number of restaurants
(because as the radius grows, the enclosed area grows with the square of the radius), and then, past a certain
threshold, we are far enough from the center that the number of restaurants drops.
We also think that the heat map is really interesting, because it shows very clearly which zone in San Francisco
is "the hottest", and how the "heat" fades gradually.
This mini research can help tourists learn about a destination before even visiting it.
### Recomendations
- Try other cities around the world.
- Check other Yelp categories, such as Nightlife and Shopping.
- Try aggregating results from multiple categories together.
- Check other features, such as food categories, to see how they affect the results.
- Check more advanced topics, for example how a business's website affects its popularity.
- There seems to be a slight trend in the `# of reviews as a function of distance from the center` graph;
  it might be interesting to select a different center point and see whether that improves the results and makes
  the trend line less ambiguous.