---
layout: post
title: "Reading Roundup"
categories: []
year: 2023
type: paper
---

## The Reversal Curse: A Stark Reflection on LLMs' Limitations in Reasoning

The paper on the "Reversal Curse" in auto-regressive Large Language Models (LLMs) casts a sobering light on the overestimated generalization capabilities of these models. It is a striking revelation that LLMs, long assumed to be adept at extrapolating and reasoning beyond their training data, fall significantly short in this regard. The failure is exemplified by the Reversal Curse itself: models trained on statements of the form "A is B" fail to infer the logical reverse, "B is A".

This limitation was starkly demonstrated through two types of experiments. In the first, models were fine-tuned on descriptions of fictitious celebrities presented in one direction. When asked for the reverse relation, the models faltered, unable to apply this simple inversion of the information they had just learned. The second experiment involved real-world knowledge, specifically the relationships between celebrities and their parents. Here again, the models showed a marked inability to reason in reverse: they could identify a celebrity's parent but struggled with the opposite direction, naming the child given the parent. This points to a deeper, more systemic issue in how LLMs process and generalize information, challenging the previously held belief in their expansive reasoning capabilities.

## RoFormer and Rotary Positional Embedding: A New Paradigm in Position Encoding

Moving to the second paper, "RoFormer: Enhanced Transformer with Rotary Position Embedding," we encounter a groundbreaking development in position encoding within transformer models. The introduction of Rotary Positional Embedding (RoPE) represents a significant leap, unifying the absolute and relative approaches in a novel and effective manner.

RoPE ingeniously uses rotations in the complex plane to encode positional information. It treats each pair of embedding dimensions as a complex number and rotates it by an angle proportional to the token's position. When the query and key positions in the self-attention mechanism are both shifted by the same amount, the rotations applied to them change identically, leaving the angle between the two vectors, and hence their dot product, unaltered. The attention score therefore depends only on the relative position, not on the absolute one. RoPE's design elegantly preserves relative positional information while disregarding absolute position, solving a longstanding challenge in transformer design. By leveraging this property of rotations, RoFormer, the transformer model enhanced with RoPE, not only improves upon existing models but also provides a new framework for understanding and implementing position encoding in language models.

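To make the rotation concrete, here is a minimal PyTorch sketch of the idea (my own illustration under the description above, not the RoFormer reference code; the function name `rotary_embed` is mine): each pair of channels is rotated by a position-dependent angle, and the dot product of two rotated vectors depends only on the difference of their positions.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a rotary position embedding to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    # One rotation frequency per channel pair, decreasing geometrically with the pair index.
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]  # treat each pair (x1, x2) as a complex number x1 + i*x2
    # 2-D rotation of each pair by its position-dependent angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(start_dim=-2)

# Shifting query and key positions by the same offset rotates both by the same extra
# angle, so the angle between them, and thus the attention dot product, is unchanged.
q = rotary_embed(torch.randn(8, 64))
k = rotary_embed(torch.randn(8, 64))
scores = q @ k.T
```
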
## Refining Sampling Methods in Large Language Models

I've also read a post from r/LocalLlama that offers a critical examination of the standard sampling methods used in Large Language Models (LLMs).

### Top P Sampling: A Popular but Flawed Method

The author points out that Top P sampling, despite its popularity (it is used by OpenAI's API), has inherent flaws. The method keeps the smallest set of tokens whose cumulative probability reaches the threshold P, which can inadvertently include many low-probability options. This may lead the model to consider irrelevant choices, potentially contributing to hallucinations. The author argues that this happens especially when the model's confidence is spread thinly across several options, causing it to sample from a wide tail of unlikely tokens.

### Top K Sampling: Limitations in Linearity

Top K sampling, the other standard method discussed in the post, takes a more rigid approach than Top P: it only ever considers a fixed number of top tokens (e.g., the five most likely tokens for Top K 5). This can be overly restrictive, as the cutoff never adapts to the shape of the distribution and may discard viable options that fall just outside the top selections.

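As a rough illustration of the two filters just described (a sketch of my own, not code from the post; function names are mine), this is how Top P and Top K truncation are commonly applied to a vector of logits:

```python
import torch

def top_p_filter(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability reaches p. 1-D logits."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop tokens that only contribute after the cumulative threshold p is already reached.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum()

def top_k_filter(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Keep only the k most probable tokens, however flat the distribution is."""
    probs = torch.softmax(logits, dim=-1)
    kth_value = probs.topk(k).values[-1]
    filtered = torch.where(probs >= kth_value, probs, torch.zeros_like(probs))
    return filtered / filtered.sum()
```
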
### Introducing Min P: A Balanced Approach

The post introduces Min P, a sampling method created by the author to address the limitations of Top P and Top K. Min P sets a minimum probability threshold for a token to be considered, scaled by the confidence of the most likely token. For instance, with a Min P of 0.1, only tokens at least 1/10th as probable as the top token are kept. According to the author's experiments, this not only improves the model's output but also allows for more diverse choices than Top P, striking a better balance between considering too many and too few tokens.

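Following the rule as stated above, a Min P filter can be sketched in a few lines (again my own illustration, not the author's implementation):

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.1) -> torch.Tensor:
    """Keep tokens at least `min_p` times as probable as the most likely token."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max()  # the cutoff scales with the top token's confidence
    filtered = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    return filtered / filtered.sum()
```
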
### The Impact of Min P on Model Performance

The author asserts that Min P particularly enhances model performance at higher temperatures, where it admits more diverse choices without falling into the trap of considering irrelevant low-probability options. It is posited as a more balanced approach, enabling more nuanced and effective sampling that adapts to the confidence of the top choices.

---
layout: post
title: "LoRA: Low-Rank Adaptation of Large Language Models"
categories: [NLP]
year: 2021
type: paper
author: Hu
exturl: https://arxiv.org/pdf/2106.09685.pdf
---

As models continue to grow, interest in task-specific use-cases grows with them. A lot of people want to train models on their own data. Unfortunately, models are quite incompetent until we get into the billion-parameter range. For a user looking to use the latest foundational model for their downstream task, this means re-training billions of parameters every few months, whenever a new and better model is released. One can quickly see how infeasible this becomes. Additionally, if I want to use models for several downstream tasks, I need to re-train **multiple** instances of the foundational model...

Several attempts have been made to mitigate this issue by adapting only some parameters or by learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters on top of the pre-trained model for each task, boosting operational efficiency when deployed. This is similar to using pre-trained backbones in Computer Vision, such as ResNet. However, these existing techniques often introduce inference latency, a critical concern for deployed LLMs.

The authors propose Low-Rank Adaptation (LoRA), a fine-tuning approach that is more efficient than existing techniques, merges completely with the frozen pre-trained weights (inducing zero inference latency), and makes task-switching overhead significantly lower. LoRA builds on the hypothesis that the adjustments needed to adapt (think fine-tune) a large language model to a specific task can be effectively represented by a small number of parameters. LoRA tunes the dense layers of a pre-trained network indirectly, by optimizing rank decomposition matrices of the change those layers undergo during adaptation. Optimizing rank decomposition matrices means finding a compact, low-rank representation of the changes: the weight adjustment is decomposed into two smaller matrices that, when multiplied together, approximate the necessary change to the model's large weight matrix.

LoRA possesses several key advantages:

- A pre-trained model can be shared and used to build many small LoRA modules for different tasks. The modules themselves can be shared and swapped in and out on top of the frozen weights with very little overhead.
- Fine-tuning becomes more efficient, lowering the hardware barrier to entry by up to 3x when using adaptive optimizers, since gradients and optimizer states do not need to be maintained for the frozen parameters.
- A simple linear design enables merging the LoRA modules with the frozen weights, introducing zero inference latency.

## Low-rank update

For a pre-trained network, in this case of the Transformer type, $W$ or $W_0$ refers to a pre-trained weight matrix and $\Delta W$ to its accumulated update during adaptation. A forward pass through a fine-tuned network is then formalized as

$$ h = W_0 x + \Delta W x $$

LoRA constrains the update by representing $\Delta W$ with a low-rank decomposition

$$ h = W_0 x + BAx $$

where $B$ and $A$ have significantly lower rank than $W$. During training, $W_0$ is frozen, while $A$ and $B$ contain the trainable parameters. $A$ is initialized from a random Gaussian distribution and $B$ is initialized to zero, so that $BA = 0$ and training starts from the unmodified pre-trained model.

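A minimal sketch of such a layer in PyTorch (my own illustration of the formulation above, not the reference implementation; the class name is mine, and the $\alpha/r$ scaling constant follows the paper's description):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear map W0 plus a trainable low-rank update BA."""

    def __init__(self, pretrained_weight: torch.Tensor, r: int = 4, alpha: float = 1.0):
        super().__init__()
        out_features, in_features = pretrained_weight.shape
        self.weight = nn.Parameter(pretrained_weight, requires_grad=False)  # W0 stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)      # Gaussian init
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))            # zero init, so BA = 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return x @ self.weight.T + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)
```
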
When deployed in production, you can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. What's especially neat is that $W_0$ and $BA$ have the same dimensions: if you need to switch to another downstream task, you simply recover $W_0$ by subtracting $BA$ and then add a different $B'A'$.

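In code, merging and task-switching amount to a couple of matrix additions (hypothetical helper functions continuing the sketch above):

```python
import torch

def merge(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Fold the low-rank update into the base weights for zero-latency inference."""
    return W0 + B @ A

def switch_task(W_merged: torch.Tensor,
                A_old: torch.Tensor, B_old: torch.Tensor,
                A_new: torch.Tensor, B_new: torch.Tensor) -> torch.Tensor:
    """Recover W0 by subtracting the old update, then apply the new task's update."""
    return W_merged - B_old @ A_old + B_new @ A_new
```
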
## LoRA applied to Transformers

In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. To concretize its use-case, the authors present a LoRA implementation for the Transformer architecture, limited to adapting only the attention weights for downstream tasks while completely freezing the MLP modules. When evaluated in a GPT-3 175B setting with $r = 4$, VRAM consumption during training drops from 1.2TB to 350GB. It is important to remember that $r$ needs to be significantly smaller than $d_{model}$ for the benefits to be reaped. Additionally, the authors are able to store fine-tuning checkpoints using roughly 10,000x less memory (from 350GB to 35MB), thanks to the fact that the LoRA modules can be saved separately from the pre-trained weights. Finally, switching between tasks while deployed comes at a much lower cost, as we're only swapping the LoRA weights (35MB) as opposed to all the parameters.

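As a back-of-the-envelope illustration of where these savings come from (my own numbers, not the paper's exact accounting): for a single $d \times d$ attention projection with $d_{model} = 12288$, full fine-tuning touches $d^2 \approx 1.5 \times 10^8$ parameters, whereas LoRA with $r = 4$ trains only $2rd = 98{,}304$, a reduction of roughly three orders of magnitude per adapted matrix.
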
## Targeting Weight Matrices for Maximum Impact

A crucial element of LoRA's strategy is determining which weight matrices in a pre-trained Transformer are most beneficial to adapt. This decision is vital for optimizing performance on downstream tasks. By focusing on the most impactful parameters within a limited parameter budget, LoRA ensures efficient fine-tuning without overwhelming computational demands. At the heart of LoRA's efficiency is its use of a rank-deficient adaptation matrix, ∆W. Investigating the optimal rank of ∆W is essential to balance parameter reduction and learning capability. This rank-deficiency enables LoRA to maintain model performance while significantly cutting down the number of parameters required for adaptation.

## Understanding the ∆W and W Relationship

A fascinating aspect is the analysis of the relationship between the adaptation matrix ∆W and the original weight matrix W. Exploring how closely ∆W correlates with W, and how large it is relative to W, sheds light on how LoRA's updates leverage and complement the pre-existing structure of the model, explaining why LoRA can adapt large models effectively with minimal retraining.

## Striking a Balance Between Efficiency and Interpretability

The technical explorations around LoRA highlight a balance between computational efficiency and model interpretability. Its low-rank structure not only makes adaptation computationally feasible but also enhances our understanding of the interaction between new updates and pre-trained models. This blend of efficiency and clarity in understanding model adaptations is a significant advancement in the field.

In essence, LoRA's approach to adapting large language models, as unveiled through these studies, showcases a path to harnessing the power of advanced AI models in a resource-efficient and transparent manner.