# Final-Sprint

Graduation project for Sprints' AI & ML Course: building an LLM (Large Language Model) from scratch.

The transformer is expected to classify the data as toxic or non-toxic (labelled 1 and 0 respectively), a laborious task within a timeframe of 10 days (starting 13/09/2023 and due 23/09/2023).

## Tasks

Checklist of the operations that have been applied to the data and those that remain.

- Load the data.
- Explore the data:
  - Read a sample of the rows.
  - Overview the data.
  - Explain what is to be done.
  - Define the patterns to be replaced.
- Cleaning:
  - Collapse the toxic-derivative columns into a single label.
  - Drop the `id` and collapsed columns.
  - Remove newlines.
  - Remove special characters.
  - Remove URLs.
  - Remove IPs.
- NLP preprocessing:
  - Tokenisation.
  - Stopword removal.
  - Lowercasing.
  - Removal of non-English words.
  - Lemmatisation/stemming.
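
The cleaning and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the regex patterns and the stopword set are assumptions (a real run would typically use NLTK's stopword corpus, `word_tokenize`, and `WordNetLemmatizer`).

```python
import re

# Illustrative subset only; the project would use a full stopword corpus.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "from"}

def clean_text(text: str) -> str:
    """Basic cleaning: URLs, IPs, newlines, special characters, lowercasing."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # remove URLs
    text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", " ", text)  # remove IPv4 addresses
    text = text.replace("\n", " ")                            # remove newlines
    text = re.sub(r"[^A-Za-z\s]", " ", text)                  # remove special characters
    return re.sub(r"\s+", " ", text).strip().lower()          # collapse spaces, lowercase

def preprocess(text: str) -> list[str]:
    """Whitespace tokenisation followed by stopword removal."""
    tokens = clean_text(text).split()
    return [t for t in tokens if t not in STOPWORDS]
```

Lemmatisation/stemming and non-English-word filtering would follow the same pattern, mapping each surviving token through, e.g., a lemmatiser.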

- Transformer:
  - Positional encoding.
  - Multi-head attention.
  - Feed-forward neural network.
  - Encoder block:
    - Source input.
    - Embed.
    - Normalise.
    - Feed-forward.
  - Decoder block:
    - Target input.
    - Masking.
    - Normalise.
    - Cross-attention.
  - Combining blocks.
- Classification transformer:
  - Encoder.
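
As an example of one component above, the standard sinusoidal positional encoding (PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))) can be sketched in pure Python; the actual implementation would more likely build the same table as a framework tensor.

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Even dimensions get sin, odd dimensions get cos, with wavelengths
    forming a geometric progression so each position has a unique code.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

The table is added element-wise to the token embeddings before the first encoder block, giving the attention layers access to word order.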

- Functional modularity (preprocessing only):
  - Binarise `toxic`.
  - Basic text preprocessing:
    - Remove newlines.
    - Remove special characters.
    - Remove URLs.
    - Remove IPs.
  - NLP functions:
    - Tokenisation.
    - Stopword removal.
    - Lowercasing.
    - Non-English word removal.
    - Lemmatisation/stemming.
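
The "binarise `toxic`" step, collapsing the per-category toxicity flags into a single 0/1 label, could look like the sketch below. The column names are hypothetical (chosen to resemble a Jigsaw-style toxic-comment dataset); the notebook would more likely apply the same logic over pandas DataFrame columns.

```python
def binarise_toxic(row: dict, label_cols: list[str]) -> int:
    """Collapse per-category toxicity flags into one binary label.

    A comment is labelled toxic (1) if any of its flag columns is set,
    and non-toxic (0) otherwise.
    """
    return int(any(row.get(col, 0) for col in label_cols))

# Hypothetical flag columns for illustration.
cols = ["toxic", "severe_toxic", "obscene", "insult"]
```

Factoring each cleaning and NLP step into a function like this is what makes the pipeline reusable across the exploration and training notebooks.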