Authors:
- Mitchell Fogelson
- Chris Dare
- Xinkai Chen
- Tony Dong
Date: 11-30-2020
Description:
- This project was a course project for the Introduction to Deep Learning (11-785) course at CMU, Fall 2020.
Goals:
- The goal of this project was to create a novel deep learning model for poetry generation.
Constraints:
- We decided to constrain the problem to the poetry form of limericks.
- Limericks are five-line rhyming poems with the rhyme scheme AABBA
Model Architecture:
- We used the GPT2 117M architecture, based on the code from nshepperd
- We fine-tuned on top of a model pretrained on general poetry by gwern
Hardware:
- We trained the model on an NVIDIA Tesla V100
Dataset:
- We used a corpus of ~90,000 Limericks thanks to sballas8
Preprocessing:
- We removed all punctuation
- We converted all numbers to text
- We removed all poems that did not conform to the structure above
- We added an <|endoftext|> token to the end of each poem (a sketch of these cleaning steps follows below)
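A minimal sketch of the cleaning steps above, assuming the num2words package for spelling out numbers; the helper names are illustrative and not the project's exact code (see preprocesser.ipynb for the real pipeline):

import re
import string
from num2words import num2words  # assumed helper package for number-to-text conversion

# Characters kept during punctuation removal: apostrophes survive in the
# processed example ("cap'n"), and hyphens keep spelled-out numbers intact.
KEEP = "'-"
DROP = "".join(c for c in string.punctuation if c not in KEEP)

def clean_line(line):
    """Lowercase a poem line, spell out digits, and strip punctuation."""
    line = line.lower()
    line = re.sub(r"\d+", lambda m: num2words(int(m.group())), line)  # e.g. "42" -> "forty-two"
    line = line.translate(str.maketrans("", "", DROP))
    return " ".join(line.split())

def clean_poem(lines):
    """Clean each line and terminate the poem with the GPT2 end-of-text token."""
    return [clean_line(l) for l in lines] + ["<|endoftext|>"]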
Training Time:
- The model was trained for 24 GPU-hours
- The final loss was ~0.90
Evaluation Metrics:
- We implemented a Rhyming evaluation
- We implemented a Coreference evaluation
- We implemented a Nonsense word evaluation
- We also set up a website HERE where we had humans evaluate poems generated by our model against poems from the training dataset
- This is the best way we can evaluate the success of our system
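A hedged sketch of how the rhyming and nonsense-word checks could look. Using the pronouncing package (CMU Pronouncing Dictionary) and NLTK's word list is an assumption about implementation details, not necessarily what rhyming_evaluation.ipynb does:

import pronouncing                           # pip install pronouncing
from nltk.corpus import words as nltk_words  # requires nltk.download("words") once

ENGLISH_WORDS = set(w.lower() for w in nltk_words.words())

def last_word(line):
    return line.strip().split()[-1]

def lines_rhyme(line_a, line_b):
    """True if the final words of two lines rhyme per the CMU dictionary."""
    return last_word(line_b) in pronouncing.rhymes(last_word(line_a))

def follows_aabba(poem_lines):
    """Check the limerick rhyme scheme A A B B A on a five-line poem."""
    a1, a2, b1, b2, a3 = poem_lines
    return lines_rhyme(a1, a2) and lines_rhyme(a1, a3) and lines_rhyme(b1, b2)

def nonsense_words(poem_lines):
    """Return generated tokens that do not appear in the reference vocabulary."""
    # Contractions like "who's" may need extra handling in practice.
    tokens = " ".join(poem_lines).split()
    return [t for t in tokens if t not in ENGLISH_WORDS]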
Downsampling:
- Of the 8,000 unconditionally generated poems, 1,000 scored well enough on the three metrics described above and were used for user testing (a filtering sketch follows)
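A minimal sketch of this filtering step, building on the evaluation helpers sketched above; coreference_ok is a hypothetical placeholder for the coreference evaluation:

def coreference_ok(poem_lines):
    """Hypothetical stand-in for the coreference evaluation; always passes here."""
    return True

def passes_all_checks(poem_lines):
    """A poem is kept only if it passes the rhyme, nonsense-word, and coreference checks."""
    return (follows_aabba(poem_lines)
            and not nonsense_words(poem_lines)
            and coreference_ok(poem_lines))

def downsample(generated_poems):
    """Filter the ~8000 unconditional samples down to those fit for user testing."""
    return [p for p in generated_poems if passes_all_checks(p)]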
-
Gwern Blog Poetry Learning with GPT2:
- Teaches how to run GPT2
- Suggestions for improvements
- Experiments
-
- Good documentation for running GPT2
- How GPT2 Works
-
Cole Peterson Master's Thesis:
- Useful information about poetry datasets
- Other methods for learning Poetry
-
Ng Wai Foong's Medium Article:
- Step-by-step guide to training a GPT2 model
-
117M-Clean (Gwern Model): https://mega.nz/#!2PhghaZD!_IJPpErXIRIDwRI0ktq2UKUZClDEoY7z8UpF28_qme8
-
117M-Clean-Lym (Note: model is too large to store on GitHub; contact Mitch to share)
- Train time: 21hrs
- Loss: 0.09
-
117M-AA (Note: model is too large to store on GitHub; contact Mitch to share)
- Train time: 40hrs
- Loss: 0.11
-
117M-AABB (Note: model is too large to store on GitHub; contact Mitch to share)
- Train time: 40hrs
- Loss: 0.1
-
117M-limerick (Note: model is too large to store on GitHub; contact Mitch to share)
- Train time: 40hrs
- Loss: 0.26
Sample generated poems:
1:
caboyola's a genus of weeds
that grows near the shore and seeds seeds
or these shrubs found beside
are quite furry each side
<|endoftext|>
2:
a person who's often so rude
takes a tack of a beach that's subdued
in a business the lad
is more childish than bad
<|endoftext|>
3:
an episcopal practice i'm told
is quite certain to fight for our gold
to get gold from the king
to be saved from the thing
<|endoftext|>
4:
this is all about grandma who's proud
of her years in society's crowd
she has got a big raise
in those fungal-type ways
<|endoftext|>
Create corpus of limericks:
[A |$| A]
[B |$| B]
[END]
Limerick Definition:
- 5 Line Rhyming Poem
- Rhyming Structure: A A B B A
Raw Data Example:
cap'n jack was washed over the side.
his crew searched but found not hair nor hide.
no longer the helm,
but the deep benthic realm,
is where jack will forever reside.
Processed Data Output:
["cap'n jack was washed over the side|$|his crew searched but found not hair nor hide"]
['no longer the helm|$|but the deep benthic realm']
['<|endoftext|>']
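A minimal sketch of producing the paired training format shown above. The |$| separator and <|endoftext|> token come from the example output; the function itself is illustrative, not the exact preprocesser.ipynb code (note that, as in the example, only the first two rhyming pairs are kept):

def to_training_format(limerick_lines):
    """Pair the two A lines and the two B lines with a |$| separator."""
    a1, a2, b1, b2 = limerick_lines[:4]  # lines 1-2 rhyme (A), lines 3-4 rhyme (B)
    return [
        "{}|$|{}".format(a1, a2),
        "{}|$|{}".format(b1, b2),
        "<|endoftext|>",
    ]

raw = [
    "cap'n jack was washed over the side",
    "his crew searched but found not hair nor hide",
    "no longer the helm",
    "but the deep benthic realm",
    "is where jack will forever reside",
]
print(to_training_format(raw))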
-
preprocesser.ipynb -> Jupyter Notebook for preprocessing raw data
-
rhyming_evaluation.ipynb -> Jupyter Notebook to evaluate rhyming success of output samples
Step 1: Create a .env file with the required variables. See the .env.sample template for pointers; AWS credentials will be on our Slack. We will add a script for public download later on.
Step 2: Run the setup script:
bash setup.sh
-
Better preprocessing of the data now that we know how GPT2 matches the structure of the input - Xinkai
-
Finding ways to evaluate outputs quantitatively
- Rhyming - Chris
- Non-sense words - Mitch
- Pronoun reference - Tony
- Action reference
-
Change GPT2
- Loss Function
- NOTE: Cannot be done until non-human quantitative evaluation methods are in place
Update:
Alright, I just figured out that some steps were not needed at all. Should be really simple.
Some details:
-
Launch instances
- Amazon Linux 2 AMI (HVM), SSD Volume Type - ami-03657b56516ab7912 (64-bit x86) / ami-023b120e01f4779c1 (64-bit Arm)
(The first one in the free tier group)
(Note that the username is ec2-user instead of ubuntu)
(Not recommended because many libraries (including pip, flask) need manual installation)
(Install pip:)
$ curl -O https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py --user
(We will probably need to install python3 later for our project)
-
On the Configure Security Group page
- Add Rule "Custom TCP Rule", where the Port Range must cover the port number used by our web app
- The source IP can be set to "0.0.0.0/0, ::/0" just for now
-
In app.py
- The listening host should be set to "0.0.0.0" so the app accepts connections on all network interfaces (a minimal sketch follows)
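A minimal sketch of what this looks like in a Flask app.py; the route and port are placeholders, not necessarily the project's actual code:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "limerick generator is up"

if __name__ == "__main__":
    # host="0.0.0.0" listens on every network interface; the port must match
    # the Custom TCP Rule opened in the security group.
    app.run(host="0.0.0.0", port=5000)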