In the title column there are 'Reasons/Departments' specified before the title code. These are EMS, Fire, and Traffic. Use .apply() with a custom lambda expression to create a new column called "Reason" that contains this string value.
For example, if the title column value is EMS: BACK PAINS/INJURY, the Reason column value should be EMS.
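A minimal sketch of one way to do this, assuming every title string has the department prefix followed by a colon:
# Split on the colon and keep the department prefix (EMS, Fire, or Traffic)
df['Reason'] = df['title'].apply(lambda title: title.split(':')[0])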
Now let us begin to focus on time information. What is the data type of the objects in the timeStamp column?
print(df['timeStamp'].dtype)
object
type(df['timeStamp'][0])
str
Use pd.to_datetime to convert the column from strings to DateTime objects.
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
df['timeStamp'].dtype
dtype('<M8[ns]')
type(df['timeStamp'][0])
pandas._libs.tslibs.timestamps.Timestamp
You can now grab specific attributes from a DateTime object by calling them. For example:
time = df['timeStamp'].iloc[0]
time.hour
We can use Jupyter's tab completion to explore the various attributes that we can call. Now that the timeStamp column contains actual DateTime objects, use .apply() to create 3 new columns called Hour, Month, and Day of week. You will create these columns based off of the timeStamp column.
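A sketch of how these columns could be created; the mapping of weekday integers to the abbreviated names seen later (Mon, Tue, ...) is an assumption about how the Day of week strings were produced:
df['Hour'] = df['timeStamp'].apply(lambda time: time.hour)
df['Month'] = df['timeStamp'].apply(lambda time: time.month)
df['Day of week'] = df['timeStamp'].apply(lambda time: time.dayofweek)
# Map the integer day of week (0 = Monday) to a string name
dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
df['Day of week'] = df['Day of week'].map(dmap)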
You should have noticed that some months (Sep, Oct, and Nov) are missing. Let's see if we can fill in that information by plotting it in another way, possibly a simple line plot that fills in the missing months. In order to do this, we need to work with pandas again.
Now create a groupby object called byMonth, where you group the DataFrame by the Month column and use the count() method for aggregation. Use the head() method on the returned DataFrame.
byMonth = df.groupby('Month').count()
byMonth
| Month | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Day of week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13205 | 13205 | 13205 | 11527 | 13205 | 13205 | 13203 | 13096 | 13205 | 13205 | 13205 | 13205 |
| 2 | 11467 | 11467 | 11467 | 9930 | 11467 | 11467 | 11465 | 11396 | 11467 | 11467 | 11467 | 11467 |
| 3 | 11101 | 11101 | 11101 | 9755 | 11101 | 11101 | 11092 | 11059 | 11101 | 11101 | 11101 | 11101 |
| 4 | 11326 | 11326 | 11326 | 9895 | 11326 | 11326 | 11323 | 11283 | 11326 | 11326 | 11326 | 11326 |
| 5 | 11423 | 11423 | 11423 | 9946 | 11423 | 11423 | 11420 | 11378 | 11423 | 11423 | 11423 | 11423 |
| 6 | 11786 | 11786 | 11786 | 10212 | 11786 | 11786 | 11777 | 11732 | 11786 | 11786 | 11786 | 11786 |
| 7 | 12137 | 12137 | 12137 | 10633 | 12137 | 12137 | 12133 | 12088 | 12137 | 12137 | 12137 | 12137 |
| 8 | 9078 | 9078 | 9078 | 7832 | 9078 | 9078 | 9073 | 9025 | 9078 | 9078 | 9078 | 9078 |
| 12 | 7969 | 7969 | 7969 | 6907 | 7969 | 7969 | 7963 | 7916 | 7969 | 7969 | 7969 | 7969 |
Now create a simple plot of the DataFrame indicating the count of calls per month.
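One possible sketch: since count() gives the same number of calls per month in every column, plotting any single column works; 'twp' is used here only as an example.
byMonth['twp'].plot()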
What we see here is that the number of calls comes down from January, goes up again during the summer months, and falls off drastically toward the October and December months.
This also produces an estimated value for the number of calls in the missing months.
Now see if you can use seaborn's lmplot to create a linear fit on the number of calls per month. Keep in mind you may need to reset the index to a column.
byMonth.reset_index(inplace=True)
byMonth
|   | Month | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Day of week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13205 | 13205 | 13205 | 11527 | 13205 | 13205 | 13203 | 13096 | 13205 | 13205 | 13205 | 13205 |
| 1 | 2 | 11467 | 11467 | 11467 | 9930 | 11467 | 11467 | 11465 | 11396 | 11467 | 11467 | 11467 | 11467 |
| 2 | 3 | 11101 | 11101 | 11101 | 9755 | 11101 | 11101 | 11092 | 11059 | 11101 | 11101 | 11101 | 11101 |
| 3 | 4 | 11326 | 11326 | 11326 | 9895 | 11326 | 11326 | 11323 | 11283 | 11326 | 11326 | 11326 | 11326 |
| 4 | 5 | 11423 | 11423 | 11423 | 9946 | 11423 | 11423 | 11420 | 11378 | 11423 | 11423 | 11423 | 11423 |
| 5 | 6 | 11786 | 11786 | 11786 | 10212 | 11786 | 11786 | 11777 | 11732 | 11786 | 11786 | 11786 | 11786 |
| 6 | 7 | 12137 | 12137 | 12137 | 10633 | 12137 | 12137 | 12133 | 12088 | 12137 | 12137 | 12137 | 12137 |
| 7 | 8 | 9078 | 9078 | 9078 | 7832 | 9078 | 9078 | 9073 | 9025 | 9078 | 9078 | 9078 | 9078 |
| 8 | 12 | 7969 | 7969 | 7969 | 6907 | 7969 | 7969 | 7963 | 7916 | 7969 | 7969 | 7969 | 7969 |
sns.lmplot(data=byMonth, x='Month', y='e', height=4, scatter_kws={'s': 10}, line_kws={'lw': 2})  # scatter_kws 's' sets the marker size
The shaded region represents the error.
What we see is that, as we go from January to December, the regression line shows the number of calls per month going lower and lower.
However, if you look at the scattered points, the count goes down, then goes up again before it crashes in August and December.
That's why a much larger margin of error shows up there in the shaded portion.
The actual data points for October could lie anywhere between roughly 8500 and 12000, so the linear fit is not a good one here because of this high error.
Create a new column called 'Date' that contains the date from the timeStamp column. You will need to use apply along with the .date() method.
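A sketch of the Date column and the calls-per-day plot discussed next; the groupby on Date is an assumption about how the by-date totals were produced:
df['Date'] = df['timeStamp'].apply(lambda t: t.date())
# Aggregate the calls per day and draw the daily totals as a line plot
df.groupby('Date').count()['e'].plot()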
The data actually starts in 2015 and then goes until the 8th month, i.e., August of 2016.
In the beginning I predicted that we might have missing data in the year 2016; rather, this graph shows that the data starts sometime in the month of December 2015 and continues only into 2016.
The data runs from the 10th of December 2015 to the 24th of August 2016.
The number of calls is lower in August and December because we do not have data for those complete months.
If you analyse this data, this is the total calls by Date; on any individual day it looks like the total number of calls varies between 300 and 450, with spikes on certain days.
Now recreate the plot, but create 3 separate plots, with each plot representing a Reason for the call.
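A sketch of one of these plots, filtering on a single Reason; repeat the same line with 'Fire' and 'EMS' for the other two:
# Daily call counts for Traffic calls only
df[df['Reason'] == 'Traffic'].groupby('Date').count()['e'].plot(title='Traffic')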
If you see the graph for the Fire reason, the number of calls on each day varies between 30 and 80 on normal days, with spikes on certain days.
Here it looks like the calls vary between 150 and 225, with spikes in the downward direction at the end of April or the beginning of May.
Now let's move on to creating heatmaps with seaborn and our data. We'll first need to restructure the DataFrame so that the columns become the Hours and the index becomes the Day of the Week. There are lots of ways to do this, but I would recommend trying to combine groupby with an unstack method.
df.head()
|   | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of week | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire | 17 | 12 | Thu | 2015-12-10 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
The easiest way to do this is using a pivot table. If you check the DataFrame, we have added the new columns Reason, Hour, Month, Day of week, and Date. What I want here is 'Day of week' to be the index and the Hour values to become the columns; the hour varies between 0 and 23.
So I want 24 columns, one for each hour of the day, and seven rows, one for each day of the week.
The easiest way to do it is using the pivot table.
df.pivot_table(index='Day of week', columns='Hour', values='e', aggfunc='sum')
| Day of week \ Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474 |
| Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325 |
| Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467 |
| Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330 |
| Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354 |
| Tue | 269 | 240 | 186 | 170 | 209 | 239 | 415 | 655 | 889 | 880 | ... | 943 | 938 | 1026 | 1019 | 905 | 731 | 647 | 571 | 462 | 274 |
| Wed | 250 | 216 | 189 | 209 | 156 | 255 | 410 | 701 | 875 | 808 | ... | 904 | 867 | 990 | 1037 | 894 | 686 | 668 | 575 | 490 | 335 |

7 rows × 24 columns
# Getting the same results using the unstack method
# Here we use the groupby method and, rather than grouping by a single column, we pass in a list of two columns;
# that will produce a multi-index DataFrame
df.groupby(['Day of week', 'Hour']).count()
| Day of week | Hour | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Month | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 0 | 275 | 275 | 275 | 248 | 275 | 275 | 275 | 275 | 275 | 275 | 275 | 275 |
| Fri | 1 | 235 | 235 | 235 | 200 | 235 | 235 | 235 | 232 | 235 | 235 | 235 | 235 |
| Fri | 2 | 191 | 191 | 191 | 165 | 191 | 191 | 191 | 191 | 191 | 191 | 191 | 191 |
| Fri | 3 | 175 | 175 | 175 | 164 | 175 | 175 | 175 | 175 | 175 | 175 | 175 | 175 |
| Fri | 4 | 201 | 201 | 201 | 184 | 201 | 201 | 201 | 201 | 201 | 201 | 201 | 201 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Wed | 19 | 686 | 686 | 686 | 590 | 686 | 686 | 686 | 682 | 686 | 686 | 686 | 686 |
| Wed | 20 | 668 | 668 | 668 | 597 | 668 | 668 | 668 | 662 | 668 | 668 | 668 | 668 |
| Wed | 21 | 575 | 575 | 575 | 508 | 575 | 575 | 574 | 572 | 575 | 575 | 575 | 575 |
| Wed | 22 | 490 | 490 | 490 | 432 | 490 | 490 | 490 | 485 | 490 | 490 | 490 | 490 |
| Wed | 23 | 335 | 335 | 335 | 294 | 335 | 335 | 334 | 335 | 335 | 335 | 335 | 335 |

168 rows × 12 columns
If we observe the DataFrame, we can see two index levels here: 'Day of week' and, nested within each day of the week, 'Hour'.
# Let's use unstack on it
byDayHour = df.groupby(['Day of week', 'Hour']).count()['e'].unstack()
byDayHour
| Day of week \ Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474 |
| Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325 |
| Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467 |
| Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330 |
| Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354 |
| Tue | 269 | 240 | 186 | 170 | 209 | 239 | 415 | 655 | 889 | 880 | ... | 943 | 938 | 1026 | 1019 | 905 | 731 | 647 | 571 | 462 | 274 |
| Wed | 250 | 216 | 189 | 209 | 156 | 255 | 410 | 701 | 875 | 808 | ... | 904 | 867 | 990 | 1037 | 894 | 686 | 668 | 575 | 490 | 335 |

7 rows × 24 columns
This is a 7 × 24 DataFrame: seven days of the week as rows and 24 hours of the day, ranging from 0 to 23, as columns, where each cell holds the count of the 'e' column (which has the value 1 on every row).
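The heatmap described below isn't shown as code above; a minimal sketch, assuming a plain seaborn call (the colormap and figure size used originally are not shown, so none are set here):
import seaborn as sns
sns.heatmap(byDayHour)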
If you see the color bar, the darker color has the higher value and the lighter color has the lower value.
We have the days of the week on the y axis and hours of the day on the x axis.
Late at night, i.e., between about 1 A.M. and 5 A.M., the values are lower; as you go through the day from morning to afternoon and evening, the colors become darker and darker, and they start getting lighter again as you reach 9, 10, and 11 P.M.
If you look at the y axis, Sat and Sun have lighter colors compared to the other days of the week.
The number of calls increases as you go through the day.
Most of the calls happen around three to seven in the evening, and similarly most of the calls happen on weekdays rather than on weekends.
Let's repeat these same plots and operations for a DataFrame that shows the Month as the column.
byDayMonth = df.groupby(['Day of week', 'Month']).count()['e'].unstack()
byDayMonth
| Day of week \ Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 12 |
|---|---|---|---|---|---|---|---|---|---|
| Fri | 1970 | 1581 | 1525 | 1958 | 1730 | 1649 | 2045 | 1310 | 1065 |
| Mon | 1727 | 1964 | 1535 | 1598 | 1779 | 1617 | 1692 | 1511 | 1257 |
| Sat | 2291 | 1441 | 1266 | 1734 | 1444 | 1388 | 1695 | 1099 | 978 |
| Sun | 1960 | 1229 | 1102 | 1488 | 1424 | 1333 | 1672 | 1021 | 907 |
| Thu | 1584 | 1596 | 1900 | 1601 | 1590 | 2065 | 1646 | 1230 | 1266 |
| Tue | 1973 | 1753 | 1884 | 1430 | 1918 | 1676 | 1670 | 1612 | 1234 |
| Wed | 1700 | 1903 | 1889 | 1517 | 1538 | 2058 | 1717 | 1295 | 1262 |
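The heatmap can then be repeated on this Month-based matrix; a sketch under the same assumptions as before:
sns.heatmap(byDayMonth)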
We have our days on the row index and months on the columns, and the total number of calls is given in the form of a matrix where each value relates to its row and column index.
In this project it is seen that more calls are made on weekdays than on weekends.
Also, more calls were made in the late afternoon and evening hours compared to the late night and morning hours.
That kind of makes sense, as people are more active during these times, so there would be more Traffic calls or maybe EMS calls.
However, no clear pattern was observed in the month data, partly due to the missing data for the fall and winter months.
If the missing data were found, it could be analysed for the different reasons across summer and winter months, as there could be more Fire incidents in the summer and more Traffic-related calls in the winter months.
Author - Thanveer Ahmed Shaik
This project is part of my portfolio, showcasing the Python skills essential for data analyst roles. If you have any questions, feedback, or would like to collaborate, feel free to get in touch!