content(cms): create Resource "corpus-analysis-with-spacy/index" #1135

Open
wants to merge 3 commits into
base: main
49 changes: 49 additions & 0 deletions content/posts/corpus-analysis-with-spacy/index.mdx
@@ -0,0 +1,49 @@
---
title: Corpus Analysis with spaCy
lang: en
date: 2024-07-11T15:20:06.548Z
version: 1.0.0
authors:
- s-kane-megan
editors:
- ladd-john-r
tags:
- python
- big-data
categories:
- programming-historian
featuredImage: images/11238783706_e7ca6c0c35_o.jpg
abstract: This lesson demonstrates how to use the Python library spaCy for
analysis of large collections of texts. This lesson details the process of
using spaCy to enrich a corpus via lemmatization, part-of-speech tagging,
dependency parsing, and named entity recognition. Readers will learn how the
linguistic annotations produced by spaCy can be analyzed to help researchers
explore meaningful trends in language patterns across a set of texts.
domain: Social Sciences and Humanities
targetGroup: Domain researchers
type: training-module
remote:
date: 2023-11-02T16:31:00.000Z
url: https://doi.org/10.46430/phen0113
publisher: ProgHist Ltd
licence: ccby-4.0
toc: false
draft: false
uuid: E7Hh84XHeikiofOoQpNW2
---
Say you have a big collection of texts. Maybe you’ve gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus… but where should you start?

One possible way to begin is with spaCy, an industrial-strength library for Natural Language Processing (NLP) in Python. spaCy can process large corpora, generate linguistic annotations such as part-of-speech tags and named entities, and prepare texts for further machine classification. This lesson is a ‘spaCy 101’ of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions.

#### Reviewed by:
- Maria Antoniak
- William Mattingly

## Learning outcomes
After completing this lesson, you will be able to:
- Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory)
- Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging, dependency parsing and chunking, and named entity recognition
- Conduct frequency analyses using part-of-speech tags and named entities
- Download an enriched dataset for use in future NLP analyses
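
To give a rough sense of what these outcomes look like in practice, here is a minimal sketch rather than the lesson's own code. It assumes spaCy and its small English model `en_core_web_sm` are installed; the helper names `annotate` and `pos_frequencies` and the row layout are hypothetical, chosen only for illustration.

```python
from collections import Counter


def annotate(texts, model_name="en_core_web_sm"):
    """Run each text through a spaCy pipeline and collect per-token annotations.

    Assumption: the model named in `model_name` has already been downloaded.
    """
    import spacy  # imported here so the counting helper below works without spaCy

    nlp = spacy.load(model_name)
    rows = []
    # nlp.pipe processes texts as a stream, which suits larger corpora
    for doc_id, doc in enumerate(nlp.pipe(texts)):
        for token in doc:
            rows.append(
                {
                    "doc_id": doc_id,
                    "text": token.text,     # the token as written
                    "lemma": token.lemma_,  # dictionary form
                    "pos": token.pos_,      # coarse part-of-speech tag
                    "dep": token.dep_,      # dependency relation to the head
                }
            )
    return rows


def pos_frequencies(rows):
    """Count how often each part-of-speech tag occurs in the annotated rows."""
    return Counter(row["pos"] for row in rows)
```

Calling `pos_frequencies(annotate(["Some text.", "Another text."]))` would return a `Counter` of tag frequencies across the corpus. Named entities could be gathered in a similar way by iterating over `doc.ents` and reading `ent.text` and `ent.label_`.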

<ExternalResource title="Interested in learning more?" subtitle="Check out this lesson on Programming Historian's website" url="https://doi.org/10.46430/phen0113" />