Cody Kingham, cak47[put "at-sign" here].cam.ac.uk
Much of this course material is directly adapted from the Python for Text Analysis Course at the Vrije Universiteit Amsterdam. I take care to indicate those materials which are directly copied from that course. A special thanks to Chantal van Son, Evan Miltenburg, Marten Postma, Filip Ilievski, Pia Sommerauer, and the Computational Lexicology & Terminology Lab at the VU.
- Intro - description of the course
- Course Schedule - course schedule for Spring 2020
- Directory (Folder) Structure - structure of this directory (i.e. folder)
- Getting Started – how to download this repository; instructions to install Anaconda; Bring Your Own Text (BYOT) for exercises
- Learning Strategies - advice for learning how to code in Python
- Zen of Python – how to write good Python code
- Bibliography - resources used to develop this course
Welcome to the Python for Linguists and Humanists course! In this course you will learn the basics of Python and how to begin using it to address corpus-driven, quantitative research questions in your field. The course emphasizes a Bring Your Own Text (BYOT) approach: many of the assignments work from a plain-text file of a text you are interested in. While there are many off-the-shelf tools for English texts, humanists often work with non-English texts that are comparatively resource-poor. Many Python courses use dummy problems for their exercises; here, I have worked to relate many of the exercises to a worthwhile concept in corpus linguistics.

Another distinctive feature of this course is that Pandas DataFrames are introduced early on. Pandas is a Python package that provides data containers such as DataFrames. A DataFrame is a table of rows and columns (compare matrices) that holds numerical or categorical data. This data structure comes standard in R, for instance, because it is so central to statistics and data science. Knowledge of Pandas DataFrames is likewise critical if you'd like to expand into machine learning later on.
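As a rough illustration of what a DataFrame looks like, the sketch below builds a small table of word counts with Pandas. The sentence and the column names (`word`, `count`) are invented for this example, and it assumes Pandas is installed, as it is in the Anaconda distribution.

```python
from collections import Counter

import pandas as pd

# A made-up sentence standing in for your own text.
text = "the cat saw the dog and the dog saw the cat"

# Count how often each word occurs.
counts = Counter(text.split())

# Build a DataFrame: one row per word, one column per attribute.
df = pd.DataFrame(
    {"word": list(counts.keys()), "count": list(counts.values())}
)

# Sort so that the most frequent words come first.
df = df.sort_values("count", ascending=False).reset_index(drop=True)

print(df)
```

Each row describes one word and each column holds one kind of information about it, which is the shape most statistical tools expect.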
The course runs for 8 weeks. The first 2 weeks focus on the bare basics of Python. In week 3 we dive into Pandas DataFrames. Week 4 focuses on reading and writing various data formats. Week 5 provides a basic introduction to Matplotlib for data visualization. In weeks 6 and 7 we review methods and tools for quantitative linguistics: we survey quantitative linguistic methods through the work of Natalia Levshina, Stefan Gries and Anatol Stefanowitsch, and Nick Ellis, and we look at off-the-shelf tools such as Text-Fabric, the Natural Language Toolkit, and spaCy. The 8th and final week is dedicated to the final project, in which you will formulate a research question or hypothesis and design a quantitative experiment to test it on your own text. As the culmination of that experiment, you will upload your work to GitHub and archive it in Zenodo.
27 March – 15 May (2020)
Cody will teach the chapters. Held via Zoom (see Slack for more info).
Date | Time | Chapters | Topics | Assignment |
---|---|---|---|---|
27.03.2020 | 15:00–16:00 UTC | 1–4 | variables, values, integers, floats, strings, booleans, conditionals | ASSIGNMENT_1.ipynb |
03.04.2020 | 15:00–16:00 UTC | 5–11 | containers, loops, functions | TBD |
10.04.2020 | 15:00–16:00 UTC | TBD | Pandas DataFrames | TBD |
17.04.2020 | 15:00–16:00 UTC | TBD | importing, text files, data formats | TBD |
24.04.2020 | 15:00–16:00 UTC | TBD | matplotlib basics | TBD |
01.05.2020 | 15:00–16:00 UTC | TBD | methods in quantitative linguistics | TBD |
08.05.2020 | 15:00–16:00 UTC | TBD | Text-Fabric, Natural Language Toolkit, spaCy | TBD |
15.05.2020 | 15:00–16:00 UTC | project | submit final project |
Relaxed working session where students can ask questions and get help on assignments. Held via Zoom (see Slack for more info).
Date | Time |
---|---|
01.04.2020 | 15:00–16:30 UTC |
08.04.2020 | 15:00–16:30 UTC |
15.04.2020 | 15:00–16:30 UTC |
22.04.2020 | 15:00–16:30 UTC |
29.04.2020 | 15:00–16:30 UTC |
06.05.2020 | 15:00–16:30 UTC |
13.05.2020 | 15:00–16:30 UTC |
A directory is another word for a "folder". This directory contains the following "sub"-directories. They are explained below in order of importance.
- chapters – contains the Jupyter notebooks from which I'll teach each lesson
- assignments - contains the Jupyter notebook assignments which you can submit for optional evaluation
- BYOT – put the `.txt` file you want to use for the assignments here
- data/texts – ready-made `.txt` files to put in BYOT if you don't want to use your own
- data - data for the various assignments will go here
- images – these are just images for displaying content throughout the directory
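Once you have a local copy, a short check like the following can confirm that the folders listed above are present. This is only a sketch: the function name `missing_subdirectories` is made up for illustration, and the folder names are taken from the list above.

```python
from pathlib import Path

# Folder names taken from the list above.
EXPECTED = ["chapters", "assignments", "BYOT", "data", "data/texts", "images"]


def missing_subdirectories(repo_root, expected=EXPECTED):
    """Return the expected folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # "." assumes you run this from the top of your copy of the repository.
    print(missing_subdirectories("."))
```

An empty list means every expected folder was found.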
The page you're reading now is part of what's called a GitHub repository. A "repository" is just another way of saying "folder" or "project". GitHub gives us a way to store and share code openly online.
You will need a copy of this repository on your own machine for the course. You can download a copy by clicking the green `Clone or download` button.
Or, if you are familiar with the command line and have the developer tools installed (Mac), run the following in a directory of your choice:
git clone https://github.com/codykingham/pyling
For this course we rely heavily on packages and tools that come prepackaged in the Anaconda distribution of Python. Even if you already have a version of Python installed, it is best to install a parallel Anaconda version to avoid potential problems.
Follow these steps to install and launch Python:
1. Go to https://www.anaconda.com/distribution/, scroll down, then download and install Anaconda for Python 3.7 (be sure to select the Python 3.7 version). See the Anaconda cheatsheet for additional information about installing.
2. After installation, open the Anaconda Navigator, which should have appeared in your applications area. From the launcher, click on the Jupyter Notebook application.
The Jupyter interface will open in your web browser. Note that Jupyter only uses your web browser as an interface; it runs locally on your machine and therefore does not need an internet connection to launch. You can now navigate within the Jupyter interface to a folder of your choice. Click `New` at the upper right-hand corner. You will see `Notebook: Python 3`. Click it. This will launch you into your first Jupyter notebook!
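To confirm that the new notebook is running the Python you just installed, you could paste something like the following into the first cell and run it. This is a minimal sketch; the helper name `python_is_recent_enough` is hypothetical, and the 3.7 threshold simply mirrors the version recommended above.

```python
import sys


def python_is_recent_enough(minimum=(3, 7)):
    """Return True if the running Python is at least the given version."""
    return sys.version_info[:2] >= minimum


# Show which Python the notebook is using and where it lives.
print(sys.version)
print(sys.executable)
print("Recent enough for this course:", python_is_recent_enough())
```

If the path printed by `sys.executable` does not mention Anaconda, the notebook may be using a different Python installation than the one you intended.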
Next, try to open the first Jupyter notebook lesson for this course. Navigate within the Jupyter file navigator to your local copy of this repository. Under the `lessons/` folder you will find a set of Jupyter notebooks that are already pre-loaded with code and content. This is how we will begin the course!
For this course, you should bring your own plain-text file which the exercises will automatically load. There are a few guidelines for the text that you choose:
- any language is fine
- free of any markup or tags
- the text should be plain text saved with a `.txt` extension, i.e. NOT Microsoft Word or equivalent, NOT rich text (`.rtf`).
- ~700 KB or larger in size (i.e. a sizable corpus). This is a loose number; slightly lower is fine.
- has some kind of meta-data/introductory text at the beginning, and some indicator at the end of the file that text has ended.
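The guidelines above can be turned into a rough automatic check. The sketch below is an illustration under stated assumptions, not part of the course code: the function name `check_text_file`, the exact size threshold, and the simple regular expression used to spot markup are all choices made for this example.

```python
import re
from pathlib import Path

# Loose lower bound from the guidelines (about 700 KB).
MIN_BYTES = 700 * 1000

# Very rough pattern for HTML/XML-style tags such as <p> or </div>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")


def check_text_file(path, min_bytes=MIN_BYTES):
    """Return a list of warnings about a candidate BYOT file."""
    path = Path(path)
    warnings = []

    if path.suffix.lower() != ".txt":
        warnings.append("file does not have a .txt extension")

    if path.stat().st_size < min_bytes:
        warnings.append("file is smaller than the suggested size")

    # errors="replace" keeps the check from crashing on odd encodings.
    text = path.read_text(encoding="utf-8", errors="replace")
    if TAG_PATTERN.search(text):
        warnings.append("file seems to contain markup tags")

    return warnings
```

Because the size guideline is deliberately loose, the warnings are best read as hints rather than hard failures.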
A really great place to get texts like this is Project Gutenberg, where books can be downloaded as a `.txt` file. You might need to right-click and select "Download Linked File As..." to download the `.txt` file directly.
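Project Gutenberg files usually wrap the book itself in introductory and license text, marked off by lines beginning with `***`. As a hedged sketch of how you might isolate the body later on, the example below looks for those marker lines. The exact wording of the markers has varied over time, so the patterns are assumptions, and `strip_gutenberg_boilerplate` is a name invented for this example.

```python
import re

# Marker lines of the kind commonly found in Project Gutenberg files.
# The exact wording has varied, so treat these patterns as assumptions.
START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the start and end markers, if both exist."""
    start = START.search(raw)
    end = END.search(raw)
    if start is None or end is None or end.start() < start.end():
        # Markers not found: return the text unchanged rather than guess.
        return raw
    return raw[start.end():end.start()].strip()


# A tiny invented sample in the same shape as a Gutenberg file.
sample = (
    "Title: An Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "License text follows here.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```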
If you'd prefer to simply use a ready-made plain-text file, you may pick one under `data/texts/`.
After you've found the `.txt` you want to use, place it in the `BYOT` folder. The assignments will automatically pull the `.txt` file placed in this folder.
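The assignment notebooks handle the loading for you, and their actual code may differ. Purely as an illustration of the idea, picking up the first `.txt` file in a folder might look like this; `load_byot_text` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
from pathlib import Path


def load_byot_text(folder="BYOT", encoding="utf-8"):
    """Read the first .txt file (alphabetically) found in the folder."""
    candidates = sorted(Path(folder).glob("*.txt"))
    if not candidates:
        raise FileNotFoundError(f"no .txt file found in {folder!r}")
    return candidates[0].read_text(encoding=encoding)
```

Calling `load_byot_text()` from the top of the repository would then return the whole text as a single string, ready to be split into words or lines.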
Here is some great advice on learning to code, taken from the Python for Text Analysis course at the VU.
When you are just learning how to program, it sometimes happens that you get stuck and you don't know what to do next. This is normal and even happens to very experienced programmers. Please try to follow these strategies when you get stuck:
- If you get error messages, read them carefully - they are informative! In particular, check the line in which the error occurs. If you don't understand what it says, try to google it (you will most likely find some explanation on Stack Overflow).
- Try to take a step back. Sometimes, you lose sight of the bigger picture when dealing with complicated code. Try to break down the problem into smaller problems without writing actual code (pen and paper can be quite helpful).
- Check the class material for solutions (the chapters treated in the assignment are usually a good start).
- Explain the problem to someone else (e.g. a classmate). Go through the code line by line and explain what it does (see pair programming and rubber duck debugging).
- Finally, take a break! Very often, just having a fresh look at the code helps!
- If none of these steps helped, please ask us for help (see assignment notebooks for contact details).
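To make the first strategy concrete, here is a small, invented example of reading an error message. The function `average_word_length` and its input are made up for illustration; the point is that the last line of a traceback names the error type, and the lines above it point to where the error happened.

```python
import traceback


def average_word_length(words):
    """Return the mean length of the words in a list."""
    total = sum(len(word) for word in words)
    return total / len(words)  # fails if the list is empty


try:
    average_word_length([])
except ZeroDivisionError as error:
    # Print the traceback, much like the report Python shows
    # when an error is not caught.
    traceback.print_exc()
    error_name = type(error).__name__
    print("Error type:", error_name)
```

Reading the traceback from the bottom up, the error type says what went wrong (a division by zero) and the quoted line shows where, which suggests checking for an empty list before dividing.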
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
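These lines ship with Python itself: running `import this` in a notebook cell prints them. As a small aside, the sketch below also reads the text programmatically. It relies on the `this` module storing the text ROT13-encoded in an attribute named `s`, which is an implementation detail of CPython rather than a documented API, so treat it as a curiosity.

```python
import codecs
import this  # importing the module prints the Zen of Python

# The module keeps its text ROT13-encoded in `this.s` (implementation detail).
zen = codecs.decode(this.s, "rot13")
lines = zen.splitlines()

# Show how many lines the decoded text has, including the title line.
print(len(lines))
```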
Chantal van Son, Evan Miltenburg, Marten Postma, Filip Ilievski, Pia Sommerauer. Python for Text Analysis course. Computational Lexicology and Terminology Lab, Vrije Universiteit Amsterdam.
Natalia Levshina. How to do Linguistics with R. Amsterdam: John Benjamins, 2015.