Compared keyword analysis
This is a Python 3 package that extracts the most important key-phrases (or key-words) from a document based on a corpus. It reads one script file (script.txt) and 3 transcript files (transcript1,2,3.txt) and:
- computes the most important key-words (a key-word can be one to three words long) in the script and transcripts;
- selects the top 10 keywords and the top 5 bigrams and trigrams for visualization and comparison;
- visualizes, in pie charts, how often these top n-grams occur in each text and overall.
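The pie-chart step could look roughly like the sketch below. The function names (`occurrence_shares`, `plot_pie`) are hypothetical and not taken from the package, and the percentages are computed relative to the selected top items only, which is an assumption about how the shares are normalized. matplotlib is imported inside the plotting function so the percentage helper can be used without it.

```python
from collections import Counter


def occurrence_shares(counts, top_n=10):
    """Return (label, percentage) pairs for the top_n most frequent items.

    Assumption: percentages are relative to the sum of the selected
    top_n counts, not to every word in the text.
    """
    top = Counter(counts).most_common(top_n)
    total = sum(count for _, count in top)
    if total == 0:
        return []
    return [(label, 100.0 * count / total) for label, count in top]


def plot_pie(counts, title, top_n=10):
    """Draw the shares as a pie chart (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily, only when plotting

    shares = occurrence_shares(counts, top_n)
    if not shares:
        return
    labels = [label for label, _ in shares]
    sizes = [pct for _, pct in shares]
    fig, ax = plt.subplots()
    ax.pie(sizes, labels=labels, autopct="%1.1f%%")
    ax.set_title(title)
    plt.show()
```

Calling `plot_pie(keyword_counts, "script.txt")` with a dictionary of keyword counts would then show one chart per text; the same helper can be reused for bigram and trigram counts with `top_n=5`.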
The texts are first cleaned using a list of stopwords. Differences due to capital letters and singular/plural nouns are disregarded. The top 10 keywords and the top 5 bigrams and trigrams give a simplified but meaningful picture of the keyword distribution.
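As an illustration of this cleaning step, here is a minimal sketch using only the standard library. The stopword sample, the function names, and the naive plural rule are all assumptions made for the example; the package itself lists inflect among its dependencies, which may be what it uses for singular/plural handling.

```python
import re

# Hypothetical stopword sample; the package keeps its own list in a file.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "to"}


def singularize(word):
    """Crude stand-in for plural handling: strip a single trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase the text, tokenize it, drop stopwords, merge plurals."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [singularize(token) for token in tokens if token not in stopwords]


print(clean_text("The Models and the model are in a Class"))
# ['model', 'model', 'class']
```

The suffix rule will mishandle irregular plurals and words such as "analysis", so a proper inflection library is the better choice outside a quick sketch.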
The code can be run through Functions include:
- (count key-words in a text)
- (get the frequency of any n-gram (composition of n-words))
- (remove stopwords from a text)
- (visualize percentage of occurrence in a pie chart)
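The counting and n-gram functions listed above can be sketched with `Counter`, `tee`, and `islice`. The names below (`ngrams`, `ngram_counts`, `top_ngrams`) are hypothetical, and the code is one common way to build n-grams from a token list, not necessarily how the package does it.

```python
from collections import Counter
from itertools import islice, tee


def ngrams(tokens, n):
    """Yield successive n-token tuples from a sequence of tokens."""
    # Make n independent iterators and advance the i-th one by i positions,
    # so zipping them lines up each token with its n-1 successors.
    iterators = tee(tokens, n)
    staggered = [islice(it, offset, None) for offset, it in enumerate(iterators)]
    return zip(*staggered)


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs, keyed by the joined phrase."""
    return Counter(" ".join(gram) for gram in ngrams(tokens, n))


def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams with their counts."""
    return ngram_counts(tokens, n).most_common(k)


tokens = ["key", "word", "count", "key", "word"]
print(top_ngrams(tokens, 2, 1))
# [('key word', 2)]
```

With `n=1` the same functions count single key-words, so the top 10 keywords and the top 5 bigrams and trigrams can all come from `top_ngrams` with different arguments.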
Stopwords are stored in stopwords.txt.
Dependencies: codecs, re, Counter (from collections), inflect, tee and islice (from itertools), matplotlib.