GitHub - tonytrill/project1: CS 5970 Project 1

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.idea		.idea
project1		project1
tests		tests
README		README
main.py		main.py
project1.db		project1.db
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Repository files navigation

Project 1
CS 5970
Author: Tony Silva
Email: [email protected]

Introduction:

The project that was developed in this package is for the University of Oklahoma CS 5970 - Introduction to Text Analytics. This is the first phase of Project 1, Data Extraction and Database Population.
The goal of this project was to extract the text of 7 different latin texts from www.thelatinlibrary.com and store the textual data into a sqlite3 database. The second phase of the project will be added to this project as it is developed.

Attached with this project submission is a version of the database that was created from running the project. It will have all of the data stored into the database. The database was renamed for the sake of not messing with any runs of the program.
Please note: the project folder was turned in inside of another folder named "project1_silva6928", this folder contains the project1 folder and the project1_submit database.
Please see "Bugs/Edge Cases" section below if you have any questions.

Project 1 Phase 1:
In the first phase of the project, multiple activities were performed. First, a database called "project1" was created with the schema as required by the project details. Next, the data was extracted from The Latin Library website.
The Latin Library provides numerous of documents by different philosophers/authors/etc. all in Latin. I was tasked with selecting 7 different instances from the Latin Library, to extract its text, and store that text into a sqlite3 database.
The seven instances I chose were as follows: Magna Carta, Christian Creeds, Roman Epitaphs, Novatian, Alfonsi, Bonaventure, and Gregorius Magnus.
The tricky part of the first phase of this project was that every instance on the Latin Library had differing document formatting, and different HTML formatting. For each of the different instances, I created a new way to extract the text from the associated webpage.
The way the text was extracted was through the act of "screen scraping". The screen scraping was performed by utilizing the python package Beautiful Soup 4. Beautiful Soup offers a great way to extract data from html tags on a web page.
For each of the different extracts, Beautiful Soup was utilized along with Regular Expressions in order to format the data correctly. For each of the extract functions in the project, the data is extracted as a list of list following the exact schema of the sqlite3 database.
This made it easier to insert the data into the project 1 database. However, due to the different kinds of instances/documents that were delt with in the project, assumptions had to be made regarding the how the data was to be stored.
For example: The Magna Carta is one document. Therefore, it does not necessarily have multiple "books" associated with the document, therefore an assumption had to be made regarding this latin instance.
Those assumptions can be seen below in the section title "Project 1 Text Assumptions". The data were inserted into the database through 7 different insert functions that made calls to the extraction functions to get the associated schema list.
Then through the sqlite3 connection, SQL code was called through Python in each insert function, to insert the data into the table for each of the 7 instances.

Project 1 Phase 2:

In the second phase of the project, multiple activities were performed. First, an FTS table was created to perform full text searches against the project1 database. The search will pull each instance in the database that contains the search term.
Additionally, a Command Line User Interface was created to facilitate the interaction between the database and the user. The user interface allows the user to request to search for a Latin Term, translate an English term into Latin, and create a visualization of the number of occurences the term appears in a title.
If the term does not appear in a title then there will not be a bar within the graph. The translation services utilizes a REST API provided by mymemory.translated.net. The translation will take an English Term and translated it to the Latin term and print the Latin translated term. The system will provide the passages and links to where the search term appears.
for both Latin and the English to Latin translation.
The searching will provide each snippet the term appears in as well as the link to The Latin Library.

In order to get the visualization to work on the GPEL machines please use the command "ssh -X" when attempting to SSH into one of the machines

Project 1 Text Assumptions:

Magna Carta (http://www.thelatinlibrary.com/magnacarta.html):
Assume there is one chapter, and each new paragraph in the document is a new verse. (Depending on what Phase two is I may want to reconsider this approach)

Christian Creeds (http://www.thelatinlibrary.com/creeds.html):
Assume each new Creed is a new book. Assume One chapter for each book, and each new line in the Creed is a verse.

Roman Epitaphs (http://www.thelatinlibrary.com/epitaphs.html):
Assume that if the text beings with "B" it is in book "B" if the text begins with "CIL", then the book is "CIL"
Assume the number next too the book designator is the chapter. A new line is the verse of the chapter.

Novatian (http://www.thelatinlibrary.com/novatian.html):
Assume there is only one book titled "NOVATIANI DE TRINITATE". Assume a new paragraph is a new chapter. A new verse
is designated by the number in from of a sentence. Split the paragraph up using a regular expression catching all the
verses in a paragraph.

Alfonsi (http://www.thelatinlibrary.com/alfonsi.disciplina.html):
Assume one book, each bold title is a new chapter and every new paragraph under the bold title is a verse.

Gregorius Magnus (http://www.thelatinlibrary.com/greg.html):
Assume, one title and one book. Assume, a chapter is a new paragraph. Assume, a new sentence line is a verse.

Bonaventure (http://www.thelatinlibrary.com/bonaventura.itinerarium.html):
Assume there is one title, and one book. Assume that a new chapter is designated by bold header.
Assume that a new verse begins with a number followed by a period.

Language: Python 3

Requirements:
Please see the requirements.txt file in the project folder in order to know what packages are needed to run this package
You can utilize a virtual environment to download certain packaging requirements and run the program while the virtual environment is activated.
Please make sure you have internet connection.

How to run:
Please note - the project1 project folder is contained in another directory called project1_silv6928. This was done for submission purposes.
First make sure all of the requirements, in the requirements.txt file, (see above) are satisfied. This can be done through a virtual environment.
Next make sure the the project1 project folder is saved to your home directory (~/).

Utilize the following command in the Linux Command Terminal.

python3 ./project1/main.py

This will run the program utilizing python3 that you have loaded in your virtual environment (make sure virtual env is activated).
The program will create the project1 database in your home directory. Additionally, the user interface will prompt you in the command line to perform actions you want to perform.
Entering 1 will allow for a Latin term search
Entering 2 will allow for a English Translation and term search
Entering 3 will allow for you to create a visualization of a Latin or English(Latin translated) frequency chart.
In order to get the visualization to work on the GPEL machines please use the command "ssh -X" when attempting to SSH into one of the machines
If you enter 0 you will close the interface.
Please see section on Bugs/Edge Cases on multiple runs.

Unit Tests:
Unit tests were created for each of the functions in the project1.py file. The unit tests contain two different types of tests. First, was the tests for extraction. The tests for extractions tested that the resulting list generated by an extraction function was not null.
This ensures the data was downloaded from each web page and nothing failed for each of the 7 extract functions.
Second, were the tests for database population. Each of the database population tests tested a SQL query on the project1.db database to see if the generated data from the database was not null for each of the 7 latin texts.
This ensures the data was populated into the database and that nothing failed when population occurs for each of the 7 population functions
If all unit tests pass we know that the data was extracted and populated correctly.
If all unit tests pass we know that the FTS table works properly and so does the translation services

How to run unit tests:
First make sure you have your virtual environment activated
You can run the project by utilizing tests to make sure that the project ran properly.
Navigate to inside the project1 directory.
Load the contents of the directory into editable mode. (pip3 install -e .) (Not sure if step is actually needed)
Next use the command: python3 setup.py test
All unit tests should pass.
Please note, an additional .cfg file was added to the project folder based on information provided by the pytest documentation
In order to run the unit tests please make sure if you have ran the project before that you delete the database that was created. (See Bugs/Edge Cases).

Bugs/Edge Cases:
The program does not automatically handle multiple runs. If multiple runs are performed the database will just append the same data into the "latin" table so there will be duplicate records.
If you want to run for a second time, please make sure the project1.db database is deleted.