Commit
ericayhayes committed Jun 13, 2019
1 parent 9649497, commit 3a9bb63
Showing 249 changed files with 47,456 additions and 0 deletions.
Binary file added (+7.27 MB): Bringing Historical Maps into GIS/Bringing Historical Maps into GIS.pdf
Binary file added (+14.1 MB): ...Maps into GIS/Map of North and South Carolina /NC_1860_eyhayes_workshoptest.jpg
Bringing Historical Maps into GIS/Map of North and South Carolina /NC_1860_metadata.txt (31 changes: 31 additions & 0 deletions)
@@ -0,0 +1,31 @@
Author: Johnson, A.J.
Date: 1860
Short Title: North And South Carolina.
Publisher: Johnson and Browning
Publisher Location: New York
Type: Atlas Map
Obj Height cm: 44
Obj Width cm: 61
Scale 1: 1,584,000
Note: In full color by county. Inset map of the city of Charleston.
Reference: P6140-26-27.
State/Province: North Carolina
State/Province: South Carolina
Full Title: Johnson's North And South Carolina By Johnson & Browning. No. 26-27.
List No: 2905.017
Series No: 21
Publication Author: Johnson, A.J.
Pub Date: 1860
Pub Title: Johnson's New Illustrated (Steel Plate) Family Atlas, With Descriptions, Geographical, Statistical, And Historical. Compiled, Drawn, and Engraved Under The Supervision Of J.H. Colton And A.J. Johnson. New York: Johnson And Browning, Formerly (Successors To J.H. Colton And Company,) No. 133 Nassau Street. 1860. Entered ... One Thousand Eight Hundred and Sixty, by Johnson & Browning ... Virginia.
Pub Reference: P6140.
Pub Note: 1st Edition, 1st issue. Most of the maps come from Colton's 1859 edition of the General Atlas, published by Johnson and Browning, indicating the Johnson connection; some do not come from this atlas, and their sources are: the New England maps (scale 1" = 9 miles) come from Colton's map of New England and then the sub-maps of Vermont and New Hampshire, Mass/Conn/R.I.; the Ohio/Indiana is still a mystery; all the 1" = 24 miles maps (Iowa, Kentucky, etc.) come from Colton's Map of the United States and the Canadas, originally published by J. Calvin Smith in 1843 (see W. Heckrotte's copies and his list of editions); and the Colton General Atlas maps used by Johnson come from Colton's Travellers Series of maps - see our copies of Penn., Indiana. Colton mentions "The National Atlas of the United States, constructed from the Public Surveys..large Folio" as in preparation in his 1855 catalogue; this may be the embryonic Johnson Atlas. Colton used his wall maps "cut up" for pocket maps and Atlases. Johnson's maps of S. America, Europe, Africa, and (in the first edition, first issue, only) China, East Indies etc., all come from D. Griffing Johnson's Map of the World, 1847. These atlas maps are updated (esp. Africa). Colton took over the publication of the World Map in 1849, issued editions to 1868 (Ristow p318). Also, Johnson's N. America map is the inset N. America in Smith's Map of the U.S., the Canadas, etc. This first issue of Johnson's Family Atlas differs from the later 1860 edition in a small N.Y. (from the Colton U.S. map), small Texas, and many of the maps have fewer views or no views or different configurations. Clearly, this was a first attempt that was refined later in the year. Another issue of this same edition was published in Richmond, Virginia, the home town of Browning (I.L.). The California map originates with Johnson's New Illustrated and Embellished County Map of the Republics of North America, 1859 (1859 our copy, 1st ed. 1856), by D.G. and A.J. Joh
Pub List No: 2905.000
Pub Type: World Atlas
Pub Maps: 55
Pub Height cm: 47
Pub Width cm: 38
Image No: 2905017
Download 1: <a href=http://www.davidrumsey.com/rumsey/download.pl?image=/D0031/2905017.sid target=_blank>Full Image Download in MrSID Format</a>
Download 2: <a href="https://www.extensis.com/support/geoviewer-9" target="_blank">GeoViewer for JP2 and SID files</a>
Authors: Johnson, A.J.
Collection: Rumsey Collection
Binary file added (+3.06 MB): ...Historical Maps into GIS/Workshop Activity_ Georeferencing and Publishing Maps Online.pdf
Binary file added (+16.9 MB): Diving Deeper into Text Analysis with R/Diving Deeper into Text Analysis with R Slides.pdf
Binary file added (+7.09 MB): Diving Deeper into Text Analysis with R/Diving Deeper into Text Analysis with R Slides.pptx
@@ -0,0 +1,275 @@
## Diving Deeper into Text Analysis with R
## Markus Wust, Alison Blaine, and Erica Hayes
## Activity: Analyze Presidential State of the Union Addresses


# Section 1. Load data from multiple text files, create a corpus, and create a wordcloud.

# 1. Install packages
install.packages("tm") # Text mining
install.packages("readtext") # Reading text data from various filetypes
install.packages("dplyr") # Data manipulation (from tidyverse)
install.packages("magrittr") # Allows you to chain functions together (from tidyverse)
install.packages("ggplot2") # Data visualization (from tidyverse)
install.packages("stringr") # String manipulation (from tidyverse)
install.packages("forcats") # Factor reordering (from tidyverse)
install.packages("wordcloud") # Creating wordclouds
install.packages("RColorBrewer") # Creating color palettes
install.packages("topicmodels") # Topic modeling
install.packages("tidytext") # Tidying text data for text analysis
install.packages("SentimentAnalysis") # Sentiment analyzer

# 2. Load libraries
library(tm)
library(readtext)
library(dplyr)
library(magrittr)
library(ggplot2)
library(stringr)
library(forcats)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(tidytext)
library(SentimentAnalysis)

# 3. Get the text from the files in your working directory into R as a dataframe.
# NOTE: We got the data files from the internet using the sotu package and the following 2 lines of code:
# directory <- getwd()
# sotu_dir(dir = directory) # loads all of the SOTU address files into your working directory
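# (sotu_dir() comes from the sotu package, which is not in the install list
# above; if you want to re-download the speech files yourself, also run
# install.packages("sotu") and library(sotu) before the two lines above.)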

docs <- readtext("*.txt",
                 docvarsfrom = "filenames",
                 docvarnames = c("filename"),
                 encoding = "utf8")

head(names(docs)) # see column names

# 4. Create new fields for year and name in your dataset.
docs <- docs %>%
  mutate(year = str_sub(docs$filename, -5)) %>% # create a year column from the last 5 characters of the filename
  mutate(name = str_sub(docs$filename, 1, -6)) # create a name column

# 5. Clean the fields by stripping out unwanted characters and removing the filename column.
docs$year <- docs$year %>%
  str_replace("[-ab]", "") # remove unwanted characters from the year column

docs$name <- docs$name %>%
  str_replace_all("-", " ") # remove unwanted characters from the name column

docs <- select(docs, -filename)

# 6. Turn your data frame into a corpus object for text analysis.
docs_source <- DataframeSource(docs) # interprets each row of docs as a document
docs_corpus <- SimpleCorpus(docs_source) # creates a corpus of documents

# 7. Now do some text processing.

# A. Change all words to lower case
docs_corpus <- tm_map(docs_corpus, content_transformer(tolower)) # tolower() is a base R function

# B. Remove common English stopwords (e.g., "and", "the", etc.).
# If you want to define your own list of words, you can pass those words in a character vector, e.g.
# docs_corpus <- tm_map(docs_corpus, removeWords, c("dogs", "cats"))

docs_corpus <- tm_map(docs_corpus, removeWords, stopwords("english")) # removeWords is a tm function

# C. Remove punctuation
docs_corpus <- tm_map(docs_corpus, removePunctuation) # removePunctuation is a tm function

# D. Remove numbers
docs_corpus <- tm_map(docs_corpus, removeNumbers) # removeNumbers is a tm function

# E. Remove white space
docs_corpus <- tm_map(docs_corpus, stripWhitespace) # stripWhitespace is a tm function

# see an example document
content(docs_corpus[[100]])

# 8. Create a term-document matrix (shows frequency of terms in a document collection)
termdocmatrix <- TermDocumentMatrix(docs_corpus)
matrix <- as.matrix(termdocmatrix)

# 9. Get counts across all documents
counts <- rowSums(matrix)
sorted_counts <- sort(counts, decreasing = TRUE)

# 10. Convert matrix to dataframe
df <- data.frame(word = names(sorted_counts), freq = sorted_counts, row.names = NULL)

# see the top 20 terms
head(df, 20)

# Uncomment the next line if you want the same wordcloud on every run
# set.seed(1234)

# 11. Generate the word cloud
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 500, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Set2"))

# Section 2. Graphing & More with Term Frequencies

# 12. Create a bar plot of the top 20 terms.
barplot(sorted_counts[1:20], col = "blue", las = 2, ylim = range(pretty(c(0, sorted_counts)))) # bars appear in descending order because the counts are sorted

# The previous bar plot was created with base R. This code creates the bar plot with the package ggplot2.
df$word <- as.factor(df$word) # first, turn "word" into a factor so you can order the bars by count

freq_plot <- df %>%
  top_n(15) %>% # select top 15 terms
  ggplot(., aes(x = fct_reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Frequent Terms", x = NULL, y = "frequency") +
  theme_classic()

freq_plot

# 13. Some more term frequency functions.

# Find frequent terms in a term-document matrix. You can set a range.
findFreqTerms(termdocmatrix, lowfreq = 1000, highfreq = Inf)

# See the most frequent terms by document.
findMostFreqTerms(termdocmatrix)

# Section 3. Term Frequency - Inverse Document Frequency (tf-idf) and corpus filtering

# 14. Construct a tf-idf weighting on the term-document matrix to determine a document's most distinctive words
tf_idf <- weightTfIdf(termdocmatrix, normalize = TRUE)
tf_idf_mat <- as.matrix(tf_idf)

# See most distinctive words by document.
findMostFreqTerms(tf_idf)

# 15. Use meta() and logical subsetting to filter documents out of the corpus. Create a set of documents from 1989-2016.

find_docs_8916 <- meta(docs_corpus, "year") >= 1989 & meta(docs_corpus, "year") <= 2016
docs_8916 <- docs_corpus[find_docs_8916]

docs_8916

# 16. Create a tdm of the 1989-2016 corpus. Find term frequencies by document using that tdm.

tdm_8916 <- TermDocumentMatrix(docs_8916)

findMostFreqTerms(tdm_8916)

# 17. Practice for Sections 2 & 3.

# A. Create a new subsetted corpus of documents based on President name and/or year. Use step #15 as a guide.
# Example filtering on name:
# nixon <- meta(docs_corpus, "name") == "richard m nixon"
# nixon_docs <- docs_corpus[nixon]

# B. Create a term document matrix from that subset.
# Example solution: nixon_tdm <- TermDocumentMatrix(nixon_docs)

# C. Find the most frequent terms in that matrix.
# Example solution: nixon_mostfreq <- findMostFreqTerms(nixon_tdm)

# D. Create tf-idf weighting for that matrix. See step #14 as a guide.
# Example solution:
# nixon_tf_idf <- weightTfIdf(nixon_tdm, normalize = TRUE)
# nixon_tf_idf_mat <- as.matrix(nixon_tf_idf) # see it as a matrix

# E. See the most distinctive terms by document in your tf-idf matrix. See #14 as a guide.
# findMostFreqTerms(nixon_tf_idf)

# Section 4. Topic Modeling using Latent Dirichlet Allocation (LDA)

# 18. Create a Document Term Matrix (as opposed to tdm)
# from the subset of speeches from 1989-2016 (GHW Bush to Obama years).
dtm_8916 <- DocumentTermMatrix(docs_8916)

# 19. Run the LDA model on the dtm.
docs_8916_lda <- LDA(dtm_8916, k = 10, control = list(seed = 1234))
docs_8916_lda

docs8916_topics <- tidy(docs_8916_lda, matrix = "beta") # tidy() comes from the tidytext library
docs8916_topics

# 20. Sort topics by group in descending order.
topics_8916 <- docs8916_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# 21. Graph the topics using ggplot2.
topics_8916 %>%
  mutate(term = reorder(term, beta)) %>% # order terms by beta
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") + # break into subplots based on topic
  coord_flip() # flip x and y coordinates for horizontal bar chart

# 22. Filter out common words from the topics and graph again.

common_words <- c("will", "people", "america")

topics_8916 %>%
  subset(., !term %in% common_words) %>% # remove common words before graphing
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

# 23. Practice. Repeat the LDA model steps using the obama_docs dataset.

# A. Create a Document Term Matrix for obama_docs named dtm_obama_docs.

# B. Run the LDA model on dtm_obama_docs.

# C. Sort topics by group in descending order.

# D. Graph the topics. (A possible solution sketch follows below.)

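# One possible solution sketch for #23, following the same pattern as steps
# 15 and 18-21. It assumes the "name" docvar for Obama's speeches is
# "barack obama"; check unique(meta(docs_corpus, "name")) and adjust if needed.
# obama <- meta(docs_corpus, "name") == "barack obama"
# obama_docs <- docs_corpus[obama]
# dtm_obama_docs <- DocumentTermMatrix(obama_docs) # A
# obama_docs_lda <- LDA(dtm_obama_docs, k = 10, control = list(seed = 1234)) # B
# obama_topics <- tidy(obama_docs_lda, matrix = "beta") %>% # C
#   group_by(topic) %>%
#   top_n(10, beta) %>%
#   ungroup() %>%
#   arrange(topic, -beta)
# obama_topics %>% # D
#   mutate(term = reorder(term, beta)) %>%
#   ggplot(aes(term, beta, fill = factor(topic))) +
#   geom_col(show.legend = FALSE) +
#   facet_wrap(~ topic, scales = "free") +
#   coord_flip()
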
# Section 5. Sentiment Analysis

# 24. Use analyzeSentiment() on the 1989-2016 term-document matrix to get a sentiment rating per document.
sentiment_8916 <- analyzeSentiment(tdm_8916)

sentiment_8916

plotSentiment(sentiment_8916)

# 25. Try a tidytext approach to determining sentiment within documents.

# A. Create a subset of the docs data set (which is a data frame, not a corpus) containing only the George W. Bush documents.

gw_docs <- docs %>%
  filter(name == "george w bush")

# B. Tokenize the text into one word per row using the unnest_tokens() function.

gw_docs <- gw_docs %>%
  unnest_tokens(word, text)

head(gw_docs, 4)

# 26. Join the tokens with the AFINN sentiment lexicon, sum the scores by year, and plot sentiment over time.

gw_docs_sent <- gw_docs %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(year) %>%
  summarise(sentiment = sum(score, na.rm = TRUE)) %>%
  ggplot(aes(year, sentiment)) + geom_line(group = 1)

gw_docs_sent

# 27. Let's look at the 2003 speech more closely. It got a pretty negative score overall.

gw_docs_2003 <- gw_docs %>%
  filter(year == 2003) %>%
  inner_join(get_sentiments("afinn")) %>% # attaches an AFINN sentiment score to each word
  arrange(score) # sort from most negative to most positive term

gw_docs_2003