This repository contains a Transformer model implemented in PyTorch, leveraging the "English-Chinese Basic Sentences" dataset from the Kurohashi-Kawahara Laboratory at Kyoto University.
I created this repository as part of my research project at Wevnal, where I am currently working on vision-language model (VLM) investigations. To deepen my understanding of Large Language Models (LLMs), and the Transformer architecture in particular, I implemented the entire model from scratch in PyTorch. Rather than relying on pre-built modules or functions, I coded each component manually to gain a thorough understanding of the underlying mechanisms. This project is intended to serve as both a learning resource and a practical implementation reference.
The "English-Chinese Basic Sentences" dataset is used as the primary training and evaluation data. Each sentence pair contains a basic English sentence and its corresponding translation in Chinese, making it suitable for training language models for translation tasks.
The implemented Transformer follows the traditional encoder-decoder architecture with separate multi-head attention layers for encoding and decoding. The encoder and decoder structures are built to capture complex word relationships and contextual information from input sequences. A high-level summary of the architecture is given below:
The encoder is composed of multiple stacked layers, each designed to extract sequential and contextual information from the input sentences. Each layer consists of the following sub-components:
- Token Embedding Layer: converts each token into a fixed-size embedding vector, capturing the semantic representation of the word.
  `self.embedding = nn.Embedding(vocab_size, embed_dim)`
- Positional Encoding: since the Transformer has no inherent notion of sequential order, a positional encoding is added to the embedding vector to capture each token's position within the sentence.
  `self.positional_encoding = PositionalEncoding(embed_dim)`
- Multi-Head Self-Attention: computes relationships between tokens, allowing the model to focus on the relevant parts of the input sentence.
  `attn_output, _ = self.self_attention(query, key, value)`
- Residual Connection and Layer Normalization: applies skip connections to mitigate vanishing gradients, followed by layer normalization to stabilize training.
  `self.norm1 = nn.LayerNorm(embed_dim)`
- Feed-Forward Network: two fully connected layers with a ReLU activation in between to introduce non-linearity.
  `self.feed_forward = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim))`
- Repeat Above Layers: the above structure is stacked 6 times to progressively refine the representations (a minimal sketch of a single encoder layer is given after this list).
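The code fragments in the list above come from the repository; the classes below are only a rough, illustrative reconstruction of how they might fit together in one encoder layer. For brevity this sketch uses `nn.MultiheadAttention`, which shares the `attn_output, _ = self.self_attention(query, key, value)` call shape shown above, whereas the repository implements the attention math by hand; `PositionalEncoding` is a standard sinusoidal implementation, and the hyperparameter names (`embed_dim`, `num_heads`, `ff_dim`) follow the snippets.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding added to the token embeddings."""
    def __init__(self, embed_dim, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)           # (max_len, 1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float)
                             * (-math.log(10000.0) / embed_dim))
        pe = torch.zeros(max_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]

class EncoderLayer(nn.Module):
    """One encoder block: self-attention and feed-forward, each with residual + LayerNorm."""
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x, padding_mask=None):
        attn_output, _ = self.self_attention(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + attn_output)            # residual connection + layer norm
        x = self.norm2(x + self.feed_forward(x))   # residual connection + layer norm
        return x
```

Stacking six such layers, with the embedding and positional encoding applied once at the input, gives the full encoder described above.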
The decoder is structured similarly to the encoder but includes an additional attention layer to incorporate information from the encoder outputs. The components include:
- Token Embedding Layer: as in the encoder, the decoder first converts its input tokens into embedding vectors.
- Positional Encoding: positional encodings are added to preserve the order of tokens in the target sequence.
- Masked Multi-Head Self-Attention: prevents the decoder from attending to future positions by masking out future tokens.
  `attn_output, _ = self.self_attention(query, key, value, mask=mask)`
- Residual Connection and Layer Normalization
- Source-Target Multi-Head Attention: attends to the encoder outputs to gather source-side context for each position in the target sequence.
  `attn_output, _ = self.cross_attention(query, encoder_outputs, encoder_outputs)`
- Residual Connection and Layer Normalization
- Feed-Forward Network
- Repeat Above Layers: the decoder also stacks 6 such layers (a minimal sketch of a single decoder layer is given after this list).
- Output Linear Layer: a final linear layer maps the transformed embeddings back to the vocabulary size.
  `self.output_layer = nn.Linear(embed_dim, vocab_size)`
- Softmax: applies a softmax to convert the logits into a probability distribution over the vocabulary.
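As with the encoder, the following is a minimal, illustrative sketch of one decoder layer under the same assumptions (an `nn.MultiheadAttention`-style interface standing in for the repository's hand-written attention). The `causal_mask` helper is a hypothetical name for the future-token mask used by the masked self-attention sub-layer.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len):
    """Upper-triangular boolean mask; True entries block attention to future tokens."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

class DecoderLayer(nn.Module):
    """One decoder block: masked self-attention, cross-attention over the encoder
    outputs, and a feed-forward network, each followed by residual + LayerNorm."""
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )

    def forward(self, x, encoder_outputs, tgt_mask=None):
        # Masked self-attention: the causal mask blocks attention to future positions.
        attn_output, _ = self.self_attention(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + attn_output)
        # Source-target (cross) attention over the encoder outputs.
        attn_output, _ = self.cross_attention(x, encoder_outputs, encoder_outputs)
        x = self.norm2(x + attn_output)
        # Position-wise feed-forward network.
        x = self.norm3(x + self.feed_forward(x))
        return x
```

After the final decoder layer, the `self.output_layer = nn.Linear(embed_dim, vocab_size)` projection from the list produces logits; during training the softmax is usually folded into the cross-entropy loss, while at inference time an explicit softmax yields the probability distribution over the vocabulary.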
To train and evaluate the model, follow these steps:
- Install Dependencies: make sure the required libraries are installed.
  `pip install torch numpy`
- Prepare the Dataset: download the "English-Chinese Basic Sentences" dataset from the official source and preprocess it into input-output sentence pairs for training.
- Train the Model: run the training script with the appropriate hyperparameters (see the training-loop sketch after this list).
  `python train.py --epochs 20 --batch_size 64 --learning_rate 1e-4`
- Evaluate the Model: use the evaluation script to measure translation performance on the test set.
  `python evaluate.py --model_path saved_model.pth --test_data test_data.txt`
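The internals of train.py are not reproduced here; the function below is a minimal, hypothetical sketch of the kind of teacher-forcing loop such a script might run. It assumes a seq2seq model whose forward pass takes padded source and (shifted) target index tensors and returns per-token logits, plus a DataLoader yielding such pairs; the names `train` and `pad_id` and the default hyperparameters (which mirror the command-line flags above) are illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, pad_id=0, epochs=20, lr=1e-4, device="cpu"):
    """Minimal teacher-forcing training loop for a seq2seq Transformer.

    Assumes model(src_ids, tgt_in_ids) returns logits of shape
    (batch, tgt_len, vocab_size) and train_loader yields padded
    (src_ids, tgt_ids) index-tensor pairs.
    """
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)       # skip padding positions
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for src_ids, tgt_ids in train_loader:
            src_ids, tgt_ids = src_ids.to(device), tgt_ids.to(device)
            tgt_in, tgt_out = tgt_ids[:, :-1], tgt_ids[:, 1:]  # shift target by one token
            logits = model(src_ids, tgt_in)
            loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total_loss / max(len(train_loader), 1):.4f}")
```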
The model produces promising translations on the given dataset, suggesting that the encoder-decoder structure captures the correspondence between English source sentences and their Chinese translations.
- Vaswani et al., 2017. "Attention is All You Need"
- Kurohashi-Kawahara Laboratory: "English-Chinese Basic Sentences"