Graduation Project for Sprints' AI & ML Course: building an LLM (Large Language Model) from scratch.
The transformer is expected to classify the given data as either toxic or non-toxic, based on the binarized `toxic` label.
Checklist of operations that have been, and have yet to be, performed on the data:
- Load the data
- Explore data:
  - Read inside.
  - Overview the data.
  - Explain what is to be done.
  - Define patterns to be replaced.
- Cleaning:
  - Collapsing `toxic` derivatives.
  - Dropping `id` and the collapsed columns.
  - Remove Newlines.
  - Remove Special Characters.
  - Remove URLs.
  - Remove IPs.
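The text-cleaning steps above can be sketched with regular expressions. The patterns below are illustrative stand-ins for the "patterns to be replaced" defined during exploration; the project's actual patterns may differ.

```python
import re

# Assumed patterns for illustration; adjust to the patterns defined in exploration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SPECIAL_RE = re.compile(r"[^a-zA-Z0-9\s]")

def clean_comment(text: str) -> str:
    """Apply the cleaning steps: newlines, URLs, IPs, special characters."""
    text = text.replace("\n", " ")              # Remove newlines
    text = URL_RE.sub(" ", text)                # Remove URLs (before stripping punctuation)
    text = IP_RE.sub(" ", text)                 # Remove IPs (before stripping dots)
    text = SPECIAL_RE.sub(" ", text)            # Remove special characters
    return re.sub(r"\s+", " ", text).strip()    # Collapse leftover whitespace
```

URLs and IPs are removed before special characters on purpose: stripping punctuation first would break the URL/IP patterns.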
- NLP Preprocessing:
  - Tokenization.
  - Removal of Stopwords.
  - Lowercasing.
  - Removing Non-English Words.
  - Lemmatisation/Stemming.
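A minimal sketch of this preprocessing chain is below. It uses deliberately simple stand-ins (whitespace tokenization, a tiny assumed stopword set, a crude suffix-stripping stemmer) so it runs without downloads; the project itself would presumably use NLTK's `word_tokenize`, `stopwords`, and `WordNetLemmatizer`.

```python
# Tiny assumed stopword subset, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "this", "to", "and"}

def preprocess(text: str) -> list[str]:
    tokens = text.split()                                         # Tokenization (whitespace stand-in)
    tokens = [t.lower() for t in tokens]                          # Lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]            # Stopword removal
    tokens = [t for t in tokens if t.isascii() and t.isalpha()]   # Drop non-English-alphabet tokens
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]   # Crude stemming stand-in
    return tokens
```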
- Transformer:
  - Positional Encoding.
  - Multi-Head Attention.
  - Feed-Forward Neural Network.
  - Encoder Block:
    - Source Input.
    - Embed.
    - Normalise.
    - Feed-Forward.
  - Decoder Block:
    - Target Input.
    - Masking.
    - Normalise.
    - Cross-Attention.
  - Combining Blocks.
  - Classification Transformer:
    - Encoder.
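Two of the building blocks above can be sketched in NumPy: the sinusoidal positional encoding and the scaled dot-product attention at the core of each attention head. This is a from-scratch illustration, not the project's exact implementation.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(q, k, v, mask=None):
    """Core of each attention head: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)         # Decoder-style masking
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # Row-wise softmax
    return weights @ v
```

For the classification variant, only the encoder stack is needed: the encoded sequence is pooled and fed to a binary classification head, so the decoder block (target input, masking, cross-attention) is unused.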
- Functional Modularity (Preprocessing Only):
  - Binarize `toxic`.
  - Basic Text Preprocessing:
    - Remove Newlines.
    - Remove Special Characters.
    - Remove URLs.
    - Remove IPs.
  - NLP Functions:
    - Tokenization.
    - Stopword Removal.
    - Lowercasing.
    - Non-English Word Removal.
    - Lemmatise/Stem.
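The `Binarize toxic` step, combined with "collapsing `toxic` derivatives", can be sketched as below. The column names are an assumption (Jigsaw-style toxic-comment labels); substitute the project's actual columns.

```python
# Hypothetical derivative label columns (Jigsaw-style); the real CSV may differ.
DERIVATIVES = ["severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def binarize_toxic(row: dict) -> int:
    """Collapse the `toxic` flag and its derivatives into a single 0/1 label."""
    if row.get("toxic", 0) > 0:
        return 1
    return int(any(row.get(col, 0) > 0 for col in DERIVATIVES))
```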