- Expand Contractions
  - Converts words like "haven't" → "have not".
  - Library: `contractions`
- Lowercase Conversion
  - Converts all words to lowercase so that "love" and "Love" count as the same word.
  - Library: Python built-in string methods.
- Remove Punctuation
  - Removes punctuation and special characters, since we only care about the words.
  - Library: Python's `string` module.
- Tokenization
  - Splits the text of each letter into individual words so we can iterate through them.
  - Library: `spacy`
- Remove Stop Words
  - Filters out common stop words like "and," "the," and "is."
  - Library: `spacy`
- Lemmatization
  - Converts words to their base/root form, like "running" → "run".
  - Library: `spacy`
- Save Cleaned Data
  - Outputs the cleaned text as an array of JSON objects, where each object has the form `{post_id(string): [array, of, words,...]}`.
  - Each word array is a bag of words with duplicates kept.
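The steps above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the function names (`normalize`, `clean_post`, `to_records`, `run`) and the output file name `cleaned_posts.json` are hypothetical, and the input is assumed to be a dict mapping post IDs to raw text. It relies on `contractions.fix`, `spacy.load`, and the spaCy token attributes `lemma_`, `is_stop`, and `is_space`.

```python
import json
import string

# Deletes ASCII punctuation only; curly quotes and other Unicode marks pass through.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text, then strip ASCII punctuation (standard library only)."""
    return text.lower().translate(_PUNCT_TABLE)


def clean_post(text, nlp, expand=None):
    """Apply the cleaning steps to one post and return its words, duplicates kept.

    nlp    -- a callable that tokenizes text, e.g. a loaded spaCy pipeline
    expand -- optional callable that expands contractions, e.g. contractions.fix
    """
    # Contractions are expanded first, while the apostrophes are still present.
    if expand is not None:
        text = expand(text)
    doc = nlp(normalize(text))
    # Tokenize, drop stop words and whitespace tokens, keep the lemma of the rest.
    return [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space]


def to_records(cleaned):
    """Turn {post_id: words} into the array-of-objects shape described above."""
    return [{post_id: words} for post_id, words in cleaned.items()]


def run(raw_posts, out_path="cleaned_posts.json"):
    """Hypothetical driver: clean every post and write the records as JSON."""
    # Imported lazily so the helpers above work without the third-party packages.
    import contractions
    import spacy

    nlp = spacy.load("en_core_web_sm")
    cleaned = {
        post_id: clean_post(text, nlp, contractions.fix)
        for post_id, text in raw_posts.items()
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(to_records(cleaned), fh, indent=2)
```

With the packages installed, calling `run({"post_1": "I haven't stopped running!"})` would write a single-record file. Because this sketch follows the step order listed above, the text is lowercased and stripped of punctuation before spaCy sees it, which may make lemmatization slightly less accurate than running spaCy on the raw text.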
Use `pip` to install the necessary Python packages:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Run `npm run start` to see the project on `localhost:3000`!