# Final-Sprint

Graduation project for Sprints' AI & ML Course: building an LLM (Large Language Model) from scratch.

The transformer is expected to classify the data as toxic or non-toxic (labelled 1 and 0 respectively), a laborious task within a timeframe of 10 days (starting 13/09/2023 and due 23/09/2023).

## Tasks

Checklist of the operations that have been applied to the data and those that remain.

- Load the data.
- Explore the data:
  - Read a sample of the rows.
  - Overview the data.
  - Explain what is to be done.
  - Define the patterns to be replaced.
- Cleaning:
  - Collapse the toxic-derivative columns into a single label.
  - Drop the `id` and collapsed columns.
  - Remove newlines.
  - Remove special characters.
  - Remove URLs.
  - Remove IPs.
- NLP preprocessing:
  - Tokenisation.
  - Stopword removal.
  - Lowercasing.
  - Removal of non-English words.
  - Lemmatisation/stemming.
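
The cleaning and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the regex patterns and the stopword set are assumptions (a real run would typically use NLTK's stopword corpus, `word_tokenize`, and `WordNetLemmatizer`).

```python
import re

# Illustrative subset only; the project would use a full stopword corpus.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "from"}

def clean_text(text: str) -> str:
    """Basic cleaning: URLs, IPs, newlines, special characters, lowercasing."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # remove URLs
    text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", " ", text)  # remove IPv4 addresses
    text = text.replace("\n", " ")                            # remove newlines
    text = re.sub(r"[^A-Za-z\s]", " ", text)                  # remove special characters
    return re.sub(r"\s+", " ", text).strip().lower()          # collapse spaces, lowercase

def preprocess(text: str) -> list[str]:
    """Whitespace tokenisation followed by stopword removal."""
    tokens = clean_text(text).split()
    return [t for t in tokens if t not in STOPWORDS]
```

Lemmatisation/stemming and non-English-word filtering would follow the same pattern, mapping each surviving token through, e.g., a lemmatiser.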

- Transformer:
  - Positional encoding.
  - Multi-head attention.
  - Feed-forward neural network.
  - Encoder block:
    - Source input.
    - Embed.
    - Normalise.
    - Feed-forward.
  - Decoder block:
    - Target input.
    - Masking.
    - Normalise.
    - Cross-attention.
  - Combining blocks.
- Classification transformer:
  - Encoder.
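
As an example of one component above, the standard sinusoidal positional encoding (PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))) can be sketched in pure Python; the actual implementation would more likely build the same table as a framework tensor.

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Even dimensions get sin, odd dimensions get cos, with wavelengths
    forming a geometric progression so each position has a unique code.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

The table is added element-wise to the token embeddings before the first encoder block, giving the attention layers access to word order.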

- Functional modularity (preprocessing only):
  - Binarise `toxic`.
  - Basic text preprocessing:
    - Remove newlines.
    - Remove special characters.
    - Remove URLs.
    - Remove IPs.
  - NLP functions:
    - Tokenisation.
    - Stopword removal.
    - Lowercasing.
    - Non-English word removal.
    - Lemmatisation/stemming.
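
The "binarise `toxic`" step, collapsing the per-category toxicity flags into a single 0/1 label, could look like the sketch below. The column names are hypothetical (chosen to resemble a Jigsaw-style toxic-comment dataset); the notebook would more likely apply the same logic over pandas DataFrame columns.

```python
def binarise_toxic(row: dict, label_cols: list[str]) -> int:
    """Collapse per-category toxicity flags into one binary label.

    A comment is labelled toxic (1) if any of its flag columns is set,
    and non-toxic (0) otherwise.
    """
    return int(any(row.get(col, 0) for col in label_cols))

# Hypothetical flag columns for illustration.
cols = ["toxic", "severe_toxic", "obscene", "insult"]
```

Factoring each cleaning and NLP step into a function like this is what makes the pipeline reusable across the exploration and training notebooks.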