-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path05-r-and-databases.html
773 lines (693 loc) · 40.6 KB
/
05-r-and-databases.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Data Carpentry contributors" />
<title>SQL databases and R</title>
<script src="site_libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="site_libs/jqueryui-1.11.4/jquery-ui.min.js"></script>
<link href="site_libs/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" />
<script src="site_libs/tocify-1.9.1/jquery.tocify.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<link href="site_libs/font-awesome-4.5.0/css/font-awesome.min.css" rel="stylesheet" />
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<link rel="stylesheet" href="style.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<style type="text/css">
/* padding for bootstrap navbar */
body {
padding-top: 51px;
padding-bottom: 40px;
}
/* offset scroll position for anchor links (for fixed navbar) */
.section h1 {
padding-top: 56px;
margin-top: -56px;
}
.section h2 {
padding-top: 56px;
margin-top: -56px;
}
.section h3 {
padding-top: 56px;
margin-top: -56px;
}
.section h4 {
padding-top: 56px;
margin-top: -56px;
}
.section h5 {
padding-top: 56px;
margin-top: -56px;
}
.section h6 {
padding-top: 56px;
margin-top: -56px;
}
</style>
<script>
// manage active state of menu based on current page
$(document).ready(function () {
// active menu anchor
href = window.location.pathname
href = href.substr(href.lastIndexOf('/') + 1)
if (href === "")
href = "index.html";
var menuAnchor = $('a[href="' + href + '"]');
// mark it active
menuAnchor.parent().addClass('active');
// if it's got a parent navbar menu mark it active as well
menuAnchor.closest('li.dropdown').addClass('active');
});
</script>
<div class="container-fluid main-container">
<!-- tabsets -->
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<script>
$(document).ready(function () {
// move toc-ignore selectors from section div to header
$('div.section.toc-ignore')
.removeClass('toc-ignore')
.children('h1,h2,h3,h4,h5').addClass('toc-ignore');
// establish options
var options = {
selectors: "h1,h2,h3",
theme: "bootstrap3",
context: '.toc-content',
hashGenerator: function (text) {
return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase();
},
ignoreSelector: ".toc-ignore",
scrollTo: 0
};
options.showAndHide = false;
options.smoothScroll = true;
// tocify
var toc = $("#TOC").tocify(options).data("toc-tocify");
});
</script>
<style type="text/css">
#TOC {
margin: 25px 0px 20px 0px;
}
@media (max-width: 768px) {
#TOC {
position: relative;
width: 100%;
}
}
.toc-content {
padding-left: 30px;
padding-right: 40px;
}
div.main-container {
max-width: 1200px;
}
div.tocify {
width: 20%;
max-width: 260px;
max-height: 85%;
}
@media (min-width: 768px) and (max-width: 991px) {
div.tocify {
width: 25%;
}
}
@media (max-width: 767px) {
div.tocify {
width: 100%;
max-width: none;
}
}
.tocify ul, .tocify li {
line-height: 20px;
}
.tocify-subheader .tocify-item {
font-size: 0.90em;
padding-left: 25px;
text-indent: 0;
}
.tocify .list-group-item {
border-radius: 0px;
}
.tocify-subheader {
display: inline;
}
.tocify-subheader .tocify-item {
font-size: 0.95em;
}
</style>
<!-- setup 3col/9col grid for toc_float and main content -->
<div class="row-fluid">
<div class="col-xs-12 col-sm-4 col-md-3">
<div id="TOC" class="tocify">
</div>
</div>
<div class="toc-content col-xs-12 col-sm-8 col-md-9">
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html"></a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="index.html">Home</a>
</li>
<li>
<a href="00-before-we-start.html">Before we start</a>
</li>
<li>
<a href="01-intro-to-r.html">Intro to R</a>
</li>
<li>
<a href="02-starting-with-data.html">Starting with data</a>
</li>
<li>
<a href="03-dplyr.html">Manipulating data frames</a>
</li>
<li>
<a href="04-visualization-ggplot2.html">Visualizing data</a>
</li>
<li>
<a href="05-r-and-databases.html">R and SQL</a>
</li>
<li>
<a href="code-handout.R">Code Handout</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="https://github.com/data-lessons/R-nhcdata-lesson">
<span class="fa fa-github fa-lg"></span>
</a>
</li>
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">SQL databases and R</h1>
<h4 class="author"><em>Data Carpentry contributors</em></h4>
</div>
<hr />
<blockquote>
<h3 id="learning-objectives">Learning Objectives</h3>
<ul>
<li>Access a database from R.</li>
<li>Run SQL queries in R using <strong><code>RSQLite</code></strong> and <strong><code>dplyr</code></strong>.</li>
<li>Create an SQLite database from existing .csv files.</li>
</ul>
</blockquote>
<hr />
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about datasets that are too large for your computer to handle as a whole? In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the chunks needed for the current analysis.</p>
<p>Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.</p>
<p>R can connect to almost any existing database type. Most common database types have R packages that allow you to connect to them (e.g., <strong><code>RSQLite</code></strong>, RMySQL, etc). Furthermore, the <a href="https://cran.r-project.org/web/packages/dplyr/index.html"><strong><code>dplyr</code></strong></a> package you used in the previous chapter, in conjunction with <a href="https://cran.r-project.org/package=dbplyr"><strong><code>dbplyr</code></strong></a> supports connecting to the widely-used open source databases <a href="https://sqlite.org/">sqlite</a>, <a href="https://www.mysql.com/">mysql</a> and <a href="https://www.postgresql.org/">postgresql</a>, as well as Google’s <a href="https://cloud.google.com/bigquery/">bigquery</a>, and it can also be extended to other database types (a <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html">vignette</a> in the <strong><code>dplyr</code></strong> package explains how to do it).</p>
<p>Interfacing with databases using <strong><code>dplyr</code></strong> focuses on retrieving and analyzing datasets by generating <code>SELECT</code> SQL statements, but it doesn’t modify the database itself. <strong><code>dplyr</code></strong> does not offer functions to <code>UPDATE</code> or <code>DELETE</code> entries. If you need these functionalities, you will need to use additional R packages (e.g., <strong><code>RSQLite</code></strong>). Here we will demonstrate how to interact with a database using <strong><code>dplyr</code></strong>, using both the <strong><code>dplyr</code></strong>’s verb syntax and the SQL syntax.</p>
<div id="the-idigbio_rodents-database" class="section level3">
<h3>The idigbio_rodents database</h3>
<p>We will continue to explore the <code>specimens</code> data you are already familiar with from previous lessons. First, we are going to install the <strong><code>dbplyr</code></strong> package:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">install.packages</span>(<span class="st">"dbplyr"</span>)
<span class="kw">library</span>(dbplyr)</code></pre></div>
<p>The SQLite database is contained in a single file <code>idigbio_rodents.sqlite</code> that you generated during the SQL lesson. If you don’t have it, you can download it from Figshare into the <code>data</code> subdirectory using:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dir.create</span>(<span class="st">"data"</span>, <span class="dt">showWarnings =</span> <span class="ot">FALSE</span>)
<span class="kw">download.file</span>(<span class="dt">url =</span> <span class="st">"https://ndownloader.figshare.com/files/9583582"</span>,
<span class="dt">destfile =</span> <span class="st">"data/idigbio_rodents.sqlite"</span>, <span class="dt">mode =</span> <span class="st">"wb"</span>)</code></pre></div>
</div>
</div>
<div id="connecting-to-databases-with-dplyr" class="section level2">
<h2>Connecting to databases with <strong><code>dplyr</code></strong></h2>
<p>We can point R to this database with <strong><code>dplyr</code></strong>’s <code>src_sqlite()</code> command.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)</code></pre></div>
<pre><code>#>
#> Attaching package: 'dplyr'</code></pre>
<pre><code>#> The following objects are masked from 'package:stats':
#>
#> filter, lag</code></pre>
<pre><code>#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">rodents <-<span class="st"> </span><span class="kw">src_sqlite</span>(<span class="st">"data/idigbio_rodents.sqlite"</span>)</code></pre></div>
<p>The <code>src_sqlite()</code> command does not load the data into the R session (as the <code>read.csv()</code> function did). Instead, it merely instructs R to connect to the <code>SQLite</code> database contained in the <code>idigbio_rodents.sqlite</code> file.</p>
<p>(You can use the <code>src_mysql()</code>, <code>src_postgres()</code> and <code>src_bigquery()</code> to connect to the other database types supported by <strong><code>dplyr</code></strong>.)</p>
<p>Let’s take a closer look at the <code>rodents</code> database we just connected to:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">rodents</code></pre></div>
<pre><code>#> src: sqlite 3.19.3 [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> tbls: localities, species, specimens</code></pre>
<p>Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case three of them are listed in the <code>tbls</code> row in the output above:</p>
<ul>
<li>localities</li>
<li>species</li>
<li>specimens</li>
</ul>
<p>Now that we know we can connect to the database, let’s explore how to get the data from its tables into R.</p>
<div id="querying-the-database-with-the-sql-syntax" class="section level3">
<h3>Querying the database with the SQL syntax</h3>
<p>To connect to tables within a database, you can use the <code>tbl()</code> function from <strong><code>dplyr</code></strong>. This function can be used to send SQL queries to the database. To demonstrate this functionality, let’s select the columns “year”, “speciesID”, and “localityID” from the <code>specimens</code> table:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">tbl</span>(rodents, <span class="kw">sql</span>(<span class="st">"SELECT year, speciesID, localityID FROM specimens"</span>))</code></pre></div>
<p>With this approach you can use any of the SQL queries we have seen in the database lesson.</p>
</div>
<div id="querying-the-database-with-the-dplyr-syntax" class="section level3">
<h3>Querying the database with the dplyr syntax</h3>
<p>One of the strengths of <strong><code>dplyr</code></strong> is that the same operation can be done using <strong><code>dplyr</code></strong>’s verbs instead of writing SQL. First, we select the table on which to do the operations by creating the <code>specimens</code> object, and then we use the standard <strong><code>dplyr</code></strong> syntax as if it were a data frame:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">specimens <-<span class="st"> </span><span class="kw">tbl</span>(rodents, <span class="st">"specimens"</span>)
specimens %>%
<span class="st"> </span><span class="kw">select</span>(year, speciesID, localityID)</code></pre></div>
<p>In this case, the <code>specimens</code> object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. For instance, the <code>head()</code> function can be used to check the first 10 rows of the table:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(specimens, <span class="dt">n =</span> <span class="dv">10</span>)</code></pre></div>
<pre><code>#> # Source: lazy query [?? x 14]
#> # Database: sqlite 3.19.3
#> # [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> uuid speciesID localityID
#> <chr> <int> <int>
#> 1 060380ea-7b06-474e-8d2e-b6e4a8c21e1a 411735100 887181682
#> 2 0fb17a79-a8ce-45b6-b57a-2f640e8cccb6 411735100 1668152736
#> 3 1a69c8ad-0ac3-4612-9dc0-6867e8b9a218 411735100 1668152736
#> 4 1a9932b4-beab-4472-bec1-a7e68c4b9e6e 411735100 1668152736
#> 5 1f3b8aea-fbae-46d1-91c8-274924b40c9f 411735100 1668152736
#> 6 203f0531-9b46-403f-ac09-3acab5be977c 596319155 1294942448
#> 7 21ceb6f0-3905-4e9a-92ae-8a1cd2af4a3a 1323405193 1811586400
#> 8 23d3f0d9-0075-472a-96c4-5d9f01c78d0f 2033031125 1797513247
#> 9 244bbe9f-81aa-42b9-bc92-ffef2b530fc4 1390967007 1073686650
#> 10 2682aa08-3d8f-496a-a487-a76575566b13 597275825 944367018
#> # ... with more rows, and 11 more variables: institutionCode <chr>,
#> # collectionCode <chr>, catalogNumber <chr>, recordedBy <chr>,
#> # eventDate <chr>, year <int>, month <int>, day <int>, weight <dbl>,
#> # length <dbl>, sex <chr></code></pre>
<p>This output of the <code>head</code> command looks just like a regular <code>data.frame</code>: The table has 9 columns and the <code>head()</code> command shows us the first 10 rows. Note that the columns <code>stateProvince</code>, <code>country</code>, <code>genus</code>, and <code>scientificName</code> are missing. These are now located in the tables <code>localities</code> and <code>species</code> which we will join together in a moment.</p>
<p>However, some functions don’t work quite as expected. For instance, let’s check how many rows there are in total using <code>nrow()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">nrow</span>(tbl)</code></pre></div>
<pre><code>#> NULL</code></pre>
<p>That’s strange - R doesn’t know how many rows the <code>species</code> table contains - it returns <code>NULL</code> instead. You might have already noticed that the first line of the <code>head()</code> output included <code>??</code> indicating that the number of rows wasn’t known.</p>
<p>The reason for this behavior highlights a key difference between using <strong><code>dplyr</code></strong> on datasets in memory (e.g. loaded into your R session via <code>read.csv()</code>) and those provided by a database. To understand it, we take a closer look at how <strong><code>dplyr</code></strong> communicates with our SQLite database.</p>
</div>
<div id="sql-translation" class="section level3">
<h3>SQL translation</h3>
<p>Relational databases typically use a special-purpose language, <a href="https://en.wikipedia.org/wiki/SQL">Structured Query Language (SQL)</a>, to manage and query data.</p>
<p>For example, the following SQL query returns the first 10 rows from the <code>specimens</code> table:</p>
<div class="sourceCode"><pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> *
<span class="kw">FROM</span> `specimens`
<span class="kw">LIMIT</span> <span class="dv">10</span></code></pre></div>
<p>Behind the scenes, <strong><code>dplyr</code></strong>:</p>
<ol style="list-style-type: decimal">
<li>translates your R code into SQL</li>
<li>submits it to the database</li>
<li>translates the database’s response into an R data frame</li>
</ol>
<p>To lift the curtain, we can use <strong><code>dplyr</code></strong>’s <code>show_query()</code> function to show which SQL commands are actually sent to the database:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">show_query</span>(<span class="kw">head</span>(specimens, <span class="dt">n =</span> <span class="dv">10</span>))</code></pre></div>
<pre><code>#> <SQL>
#> SELECT *
#> FROM `specimens`
#> LIMIT 10</code></pre>
<p>The output shows the actual SQL query sent to the database; it matches our manually constructed <code>SELECT</code> statement above.</p>
<p>Instead of having to formulate the SQL query ourselves - and having to mentally switch back and forth between R and SQL syntax - we can delegate this translation to <strong><code>dplyr</code></strong>. (You don’t even need to know SQL to interact with a database via <strong><code>dplyr</code></strong>!)</p>
<p><strong><code>dplyr</code></strong>, in turn, doesn’t do the real work of subsetting the table, either. Instead, it merely sends the query to the database, waits for its response and returns it to us.</p>
<p>That way, R never gets to see the full <code>specimens</code> table - and that’s why it could not tell us how many rows it contains. On the bright side, this allows us to work with large datasets - even too large to fit into our computer’s memory.</p>
<p><strong><code>dplyr</code></strong> can translate many different query types into SQL allowing us to, e.g., <code>select()</code> specific columns, <code>filter()</code> rows, or join tables.</p>
<p>To see this in action, let’s compose a few queries with <strong><code>dplyr</code></strong>.</p>
</div>
</div>
<div id="simple-database-queries" class="section level2">
<h2>Simple database queries</h2>
<p>First, let’s only request rows of the <code>specimens</code> table in which <code>weight</code> is less than 5 and keep only the speciesID, sex, and weight columns.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">specimens %>%
<span class="st"> </span><span class="kw">filter</span>(weight <<span class="st"> </span><span class="dv">50</span>) %>%
<span class="st"> </span><span class="kw">select</span>(speciesID, sex, weight)</code></pre></div>
<pre><code>#> # Source: lazy query [?? x 3]
#> # Database: sqlite 3.19.3
#> # [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> speciesID sex weight
#> <int> <chr> <dbl>
#> 1 411735100 male 30.5
#> 2 411735100 female 22.0
#> 3 411735100 male 49.0
#> 4 411735100 female 33.0
#> 5 411735100 female 29.0
#> 6 596319155 female 23.5
#> 7 1323405193 24.0
#> 8 2033031125 female 27.0
#> 9 597275825 male 12.0
#> 10 621625598 female 33.5
#> # ... with more rows</code></pre>
<p>Executing this command will return a table with 10 rows and the requested <code>speciesID</code>, <code>sex</code> and <code>weight</code> columns. Great!</p>
<p>… but wait, why are there only 10 rows?</p>
<p>The last line:</p>
<pre><code># ... with more rows</code></pre>
<p>indicates that there are more results that fit our filtering criterion. Why was R lazy and only retrieved 10 of them?</p>
</div>
<div id="laziness" class="section level2">
<h2>Laziness</h2>
<p>Hadley Wickham, the author of <strong><code>dplyr</code></strong> <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html">explains</a>:</p>
<blockquote>
<p>When working with databases, <strong><code>dplyr</code></strong> tries to be as lazy as possible:</p>
<ul>
<li>It never pulls data into R unless you explicitly ask for it.</li>
<li>It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.</li>
</ul>
</blockquote>
<p>When you construct a <strong><code>dplyr</code></strong> query, you can connect multiple verbs into a single pipeline. For example, we combined the <code>filter()</code> and <code>select()</code> verbs using the <code>%>%</code> pipe.</p>
<p>If we wanted to, we could add on even more steps, e.g. remove the <code>sex</code> column in an additional <code>select</code> call:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_subset <-<span class="st"> </span>specimens %>%
<span class="st"> </span><span class="kw">filter</span>(weight <<span class="st"> </span><span class="dv">50</span>) %>%
<span class="st"> </span><span class="kw">select</span>(speciesID, sex, weight)
data_subset %>%
<span class="st"> </span><span class="kw">select</span>(-sex)</code></pre></div>
<pre><code>#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.19.3
#> # [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> speciesID weight
#> <int> <dbl>
#> 1 411735100 30.5
#> 2 411735100 22.0
#> 3 411735100 49.0
#> 4 411735100 33.0
#> 5 411735100 29.0
#> 6 596319155 23.5
#> 7 1323405193 24.0
#> 8 2033031125 27.0
#> 9 597275825 12.0
#> 10 621625598 33.5
#> # ... with more rows</code></pre>
<p>Just like the first <code>select(speciesID, sex, weight)</code> call, the <code>select(-sex)</code> command is not executed by R. It is sent to the database instead. Only the <em>final</em> result is retrieved and displayed to you.</p>
<p>Of course, we could always add on more steps, e.g., we could filter by <code>speciesID</code> or minimum <code>weight</code>. That’s why R doesn’t retrieve the full set of results - instead it only retrieves the first 10 results from the database by default. (After all, you might want to add an additional step and get the database to do more work…)</p>
<p>To instruct R to stop being lazy, e.g. to retrieve all of the query results from the database, we add the <code>collect()</code> command to our pipe. It indicates that our database query is finished: time to get the <em>final</em> results and load them into the R session.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_subset <-<span class="st"> </span>specimens %>%
<span class="st"> </span><span class="kw">filter</span>(weight <<span class="st"> </span><span class="dv">50</span>) %>%
<span class="st"> </span><span class="kw">select</span>(speciesID, sex, weight) %>%
<span class="st"> </span><span class="kw">collect</span>()</code></pre></div>
<p>Now we have all 7157 rows that match our query in a <code>data.frame</code> and can continue to work with them exclusively in R, without communicating with the database.</p>
</div>
<div id="complex-database-queries" class="section level2">
<h2>Complex database queries</h2>
<p><strong><code>dplyr</code></strong> enables database queries across one or multiple database tables, using the same single- and multiple-table verbs you encountered previously. This means you can use the same commands regardless of whether you interact with a remote database or local dataset! This is a really useful feature if you work with large datasets: you can first prototype your code on a small subset that fits into memory, and when your code is ready, you can change the input dataset to your full database without having to change the syntax.</p>
<p>On the other hand, being able use SQL queries directly can be useful if your collaborators have already put together complex queries to prepare the dataset that you need for your analysis.</p>
<p>To illustrate how to use <strong><code>dplyr</code></strong> with these complex queries, we are going to join the <code>localities</code> and <code>specimens</code> tables. The <code>localities</code> table in the database contains information about the different localities surveyed by the researchers. To access it, we point the <code>tbl()</code> command to it:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">localities <-<span class="st"> </span><span class="kw">tbl</span>(rodents, <span class="st">"localities"</span>)
localities</code></pre></div>
<pre><code>#> Warning in rsqlite_fetch(res@ptr, n = n): Column `decimalLatitude`: mixed
#> type, first seen values of type real, coercing other values of type string</code></pre>
<pre><code>#> Warning in rsqlite_fetch(res@ptr, n = n): Column `decimalLongitude`: mixed
#> type, first seen values of type real, coercing other values of type string</code></pre>
<pre><code>#> # Source: table<localities> [?? x 6]
#> # Database: sqlite 3.19.3
#> # [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> localityID countryCode stateProvince county decimalLatitude
#> <int> <chr> <chr> <chr> <dbl>
#> 1 1811586400 usa oklahoma le flore 34.53903
#> 2 1814588800 usa oklahoma mcintosh 0.00000
#> 3 543639200 usa washington snohomish county 47.78533
#> 4 1234201200 usa new mexico sandoval county 35.80743
#> 5 1133867400 usa california mariposa county 37.66312
#> 6 914916400 usa new mexico grant county 32.89984
#> 7 1740702200 usa idaho bear lake 42.19194
#> 8 599691200 usa oklahoma tillman 34.55083
#> 9 612348800 usa new mexico sierra county 32.98070
#> 10 1197391000 usa california del norte county 41.82547
#> # ... with more rows, and 1 more variables: decimalLongitude <dbl></code></pre>
<p>The <code>localityID</code> column also features in the <code>specimens</code> table:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">specimens</code></pre></div>
<pre><code>#> # Source: table<specimens> [?? x 14]
#> # Database: sqlite 3.19.3
#> # [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> uuid speciesID localityID
#> <chr> <int> <int>
#> 1 060380ea-7b06-474e-8d2e-b6e4a8c21e1a 411735100 887181682
#> 2 0fb17a79-a8ce-45b6-b57a-2f640e8cccb6 411735100 1668152736
#> 3 1a69c8ad-0ac3-4612-9dc0-6867e8b9a218 411735100 1668152736
#> 4 1a9932b4-beab-4472-bec1-a7e68c4b9e6e 411735100 1668152736
#> 5 1f3b8aea-fbae-46d1-91c8-274924b40c9f 411735100 1668152736
#> 6 203f0531-9b46-403f-ac09-3acab5be977c 596319155 1294942448
#> 7 21ceb6f0-3905-4e9a-92ae-8a1cd2af4a3a 1323405193 1811586400
#> 8 23d3f0d9-0075-472a-96c4-5d9f01c78d0f 2033031125 1797513247
#> 9 244bbe9f-81aa-42b9-bc92-ffef2b530fc4 1390967007 1073686650
#> 10 2682aa08-3d8f-496a-a487-a76575566b13 597275825 944367018
#> # ... with more rows, and 11 more variables: institutionCode <chr>,
#> # collectionCode <chr>, catalogNumber <chr>, recordedBy <chr>,
#> # eventDate <chr>, year <int>, month <int>, day <int>, weight <dbl>,
#> # length <dbl>, sex <chr></code></pre>
<p>Because <code>localityID</code> is listed in both tables, we can use it to look up matching records, and join the two tables.</p>
<div class="figure">
<img src="img/joins.svg" alt="diagram illustrating inner and left joins" />
<p class="caption">diagram illustrating inner and left joins</p>
</div>
<p>For example, to extract all specimens for the first plot, which has <code>localityID</code> 1, we can do:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">localities %>%
<span class="st"> </span><span class="kw">filter</span>(localityID ==<span class="st"> </span><span class="dv">18620641</span>) %>%
<span class="st"> </span><span class="kw">inner_join</span>(specimens) %>%
<span class="st"> </span><span class="kw">collect</span>()</code></pre></div>
<pre><code>#> Joining, by = "localityID"</code></pre>
<pre><code>#> # A tibble: 34 x 19
#> localityID countryCode stateProvince county decimalLatitude
#> <int> <chr> <chr> <chr> <dbl>
#> 1 18620641 usa california tuolumne county 37.91355
#> 2 18620641 usa california tuolumne county 37.91355
#> 3 18620641 usa california tuolumne county 37.91355
#> 4 18620641 usa california tuolumne county 37.91355
#> 5 18620641 usa california tuolumne county 37.91355
#> 6 18620641 usa california tuolumne county 37.91355
#> 7 18620641 usa california tuolumne county 37.91355
#> 8 18620641 usa california tuolumne county 37.91355
#> 9 18620641 usa california tuolumne county 37.91355
#> 10 18620641 usa california tuolumne county 37.91355
#> # ... with 24 more rows, and 14 more variables: decimalLongitude <dbl>,
#> # uuid <chr>, speciesID <int>, institutionCode <chr>,
#> # collectionCode <chr>, catalogNumber <chr>, recordedBy <chr>,
#> # eventDate <chr>, year <int>, month <int>, day <int>, weight <dbl>,
#> # length <dbl>, sex <chr></code></pre>
<p><strong>Important Note:</strong> Without the <code>collect()</code> statement, only the first 10 matching rows are returned. By adding <code>collect()</code>, the full set of 34 is retrieved.</p>
<blockquote>
<h3 id="challenge">Challenge</h3>
<p>Write a query that returns the number of dipodomys at each institution in each year.</p>
<p>Hint: Connect to the species table and write a query that joins the species and specimen tables together to exclude all non-dipodomys genera. The query should return counts of specimens by year.</p>
<p>Optional: Write a query in SQL that will produce the same result. You can join multiple tables together using the following syntax where foreign key refers to your unique id (e.g., <code>speciesID</code>):</p>
<div class="sourceCode"><pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> table.col, table.col
<span class="kw">FROM</span> table1 <span class="kw">JOIN</span> table2
<span class="kw">ON</span> table1.key = table2.key
<span class="kw">JOIN</span> table3 <span class="kw">ON</span> table2.key = table3.key</code></pre></div>
</blockquote>
<!-- answer
```r
## with dplyr syntax
species <- tbl(rodents, "species")
left_join(specimens, species) %>%
filter(genus == "dipodomys") %>%
group_by(institutionCode, year) %>%
tally %>%
collect()
```
```
#> Joining, by = "speciesID"
```
```r
## with SQL syntax
query <- paste("
SELECT a.year, b.genus, count(*) as count
FROM specimens a
JOIN species b
ON a.speciesID = b.speciesID
AND b.genus = 'dipodomys'
GROUP BY a.year, a.institutionCode",
sep = "" )
tbl(rodents, sql(query))
```
--->
<blockquote>
<h3 id="challenge-1">Challenge</h3>
<p>Write a query that returns the total number of rodents in each genus collected in the different states.</p>
<p>Hint: Write a query that joins the species, locality, and specimens tables together. The query should return counts of genus by state.</p>
</blockquote>
<!-- answer
```r
genus_counts <- left_join(specimens, localities) %>%
left_join(species) %>%
group_by(stateProvince, genus) %>%
tally %>%
collect()
```
```
#> Joining, by = "localityID"
```
```
#> Joining, by = "speciesID"
```
--->
<p>This is useful if we are interested in estimating the number of specimens belonging to each genus found in each state. But what if we were interested in the number of scientific names found in each state? Using <code>tally()</code> gives the number of individuals, instead we need to use <code>n_distinct()</code> to count the number of unique values found in a column.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">unique_names <-<span class="st"> </span><span class="kw">left_join</span>(specimens, localities) %>%
<span class="st"> </span><span class="kw">left_join</span>(species) %>%
<span class="st"> </span><span class="kw">group_by</span>(stateProvince) %>%
<span class="st"> </span><span class="kw">summarize</span>(
<span class="dt">n_scientificNames =</span> <span class="kw">n_distinct</span>(scientificName)
) %>%
<span class="st"> </span><span class="kw">collect</span>()</code></pre></div>
<pre><code>#> Joining, by = "localityID"</code></pre>
<pre><code>#> Joining, by = "speciesID"</code></pre>
<p><code>n_distinct</code>, like the other <strong><code>dplyr</code></strong> functions we have used in this lesson, works not only on database connections but also on regular data frames.</p>
</div>
<div id="creating-a-new-sqlite-database" class="section level2">
<h2>Creating a new SQLite database</h2>
<p>So far, we have used a previously prepared SQLite database. But we can also use R to create a new database, e.g. from existing <code>csv</code> files. Let’s recreate the rodents database that we’ve been working with, in R. First let’s read in the <code>csv</code> files.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">species <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/species.csv"</span>)
specimens <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/specimens.csv"</span>)
localities <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/localities.csv"</span>)</code></pre></div>
<p>Creating a new SQLite database with <strong><code>dplyr</code></strong> is easy. You can re-use the same command we used above to open an existing <code>.sqlite</code> file. The <code>create = TRUE</code> argument instructs R to create a new, empty database instead.</p>
<p><strong>Caution:</strong> When <code>create = TRUE</code> is added, any existing database at the same location is overwritten <em>without warning</em>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">my_db_file <-<span class="st"> "my_rodent_database.sqlite"</span>
my_db <-<span class="st"> </span><span class="kw">src_sqlite</span>(my_db_file, <span class="dt">create =</span> <span class="ot">TRUE</span>)</code></pre></div>
<p>Currently, our new database is empty, it doesn’t contain any tables:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">my_db</code></pre></div>
<pre><code>#> src: sqlite 3.19.3 [my_rodent_database.sqlite]
#> tbls:</code></pre>
<p>To add tables, we copy the existing data.frames into the database one by one:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">copy_to</span>(my_db, specimens)
<span class="kw">copy_to</span>(my_db, localities)
my_db</code></pre></div>
<p>If you check the location of our database you’ll see that data is automatically being written to disk. R and <strong><code>dplyr</code></strong> not only provide easy ways to query existing databases, they also allows you to easily create your own databases from flat files!</p>
<blockquote>
<h3 id="challenge-2">Challenge</h3>
<p>Add the remaining species table to the <code>my_db</code> database and run some of your queries from earlier in the lesson to verify that you have faithfully recreated the rodents database.</p>
</blockquote>
<p><strong>Note:</strong> In this example, we first loaded all of the data into the R session by reading the three <code>csv</code> files. Because all the data has to flow through R, this is not suitable for very large datasets.</p>
<p style="text-align: right; font-size: small;">
Page build on: 2017-10-25 10:37:05
</p>
</div>
<hr/>
<p><a href="http://datacarpentry.org/">Data Carpentry</a>,
2017. <a href="LICENSE.html">License</a>. <a href="CONTRIBUTING.html">Contributing</a>. <br/>
Questions? Feedback?
Please <a href="https://github.com/data-lessons/R-nhcdata-lesson/issues/new">file
an issue on GitHub</a>. <br/> On
Twitter: <a href="https://twitter.com/datacarpentry">@datacarpentry</a></p>
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
bootstrapStylePandocTables();
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>