attempt to make t-SNE more intuitive
qualiaMachine authored Dec 5, 2024
1 parent 4d51000 commit 60deb97
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions _episodes/06-dimensionality-reduction.md
@@ -185,10 +185,16 @@ PCA has done a valiant effort to reduce the dimensionality of our problem from 6
It's worth noting that PCA does not handle outliers well, primarily because it tries to preserve global structure, so we will now look at a more complex form of learning that we can apply to this problem.
### t-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a powerful example of *manifold learning* — a non-deterministic, non-linear approach to dimensionality reduction. A **manifold** is a way to think about complex, high-dimensional data as if it exists on a simpler, lower-dimensional shape within that space. Imagine a crumpled piece of paper: while it exists in 3D space when crumpled, the surface of the paper itself is inherently 2D. Similarly, in many datasets, the meaningful patterns and relationships lie along such "lower-dimensional surfaces" within the high-dimensional space. For example, in image data like MNIST, the raw pixels (hundreds of dimensions) may seem high-dimensional, but the actual structure (the shapes of digits) is much simpler, often following a lower-dimensional manifold. Manifold learning techniques like t-SNE aim to "uncrumple" the data and flatten it into a lower-dimensional space, while preserving the relationships between points as much as possible.
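To make the "uncrumpling" idea concrete, here is a minimal sketch (not part of the episode's own code) using scikit-learn's swiss roll generator — a 2D sheet rolled up into 3D — assuming scikit-learn and matplotlib are installed:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE

# The swiss roll: a 2D sheet "crumpled" (rolled up) into 3D space.
# `position` records where each point sits along the sheet.
X, position = make_swiss_roll(n_samples=1000, random_state=0)

# Ask t-SNE to "uncrumple" the 3D points back into 2 dimensions.
X_flat = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Colouring by position along the roll shows that neighbouring points on
# the sheet tend to stay neighbours in the flattened layout.
plt.scatter(X_flat[:, 0], X_flat[:, 1], c=position, cmap="viridis", s=5)
plt.show()
```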
### Intuition for t-SNE
t-SNE (**t-distributed Stochastic Neighbor Embedding**) is a method for visualizing high-dimensional data by mapping it into a low-dimensional space, typically 2D or 3D, while emphasizing *local relationships*. It focuses on keeping nearby points close together, helping to reveal clusters or patterns that may be hidden in the original high-dimensional space.
Its versatility in transforming underlying structural information into lower-dimensional projections makes t-SNE applicable to a wide range of research domains.
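As a minimal sketch of what this looks like in practice (assuming scikit-learn and matplotlib, and using the small built-in digits dataset as a stand-in for the episode's MNIST data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()              # 1797 8x8 images: 64 pixel dimensions each
X, y = digits.data, digits.target

# Map the 64-dimensional pixel space down to 2D, emphasising local
# neighbourhoods. perplexity is roughly the number of neighbours each
# point "expects" to have.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Points sharing a digit label tend to form tight local clusters.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```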
**An analogy**: Imagine moving a group of friends from a large, crowded park into a much smaller garden while trying to keep people who are chatting with each other close. You won’t care much about preserving the exact distances between groups from the original park—your main goal is to keep friends near each other in the smaller space. Similarly, t-SNE prioritizes preserving these *local connections*, while global distances between clusters may be distorted or not reflect their true relationships. This distortion happens because t-SNE sacrifices global structure to accurately capture local neighborhoods. For example:
- Two clusters that appear far apart in the t-SNE plot may actually be closer in the original high-dimensional space.
- Similarly, clusters that appear close together in the plot might not actually be neighbors in the original space.
As a result, while t-SNE is excellent for discovering *local patterns* (e.g., clusters, subgroups), you should be cautious about interpreting the relative distances between clusters. These are less reliable and are influenced by how the algorithm optimizes its layout in the reduced space. It's best to use t-SNE as a tool to find groupings and then validate those findings with additional analysis. One quick sanity check, sketched below, is to rerun t-SNE with different random seeds and observe how the between-cluster distances shift even while the local clusters themselves persist.
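A small sketch of that sanity check (again using the built-in digits dataset as a stand-in); the printed distances typically differ across seeds even though each layout shows similar local clusters:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data, digits.target

# Embed the same data with three different seeds and measure the distance
# between the centroids of the "0" and "1" clusters in each layout.
# init="random" so the seed genuinely changes the starting layout.
for seed in (0, 1, 2):
    emb = TSNE(n_components=2, perplexity=30, init="random",
               random_state=seed).fit_transform(X)
    d = np.linalg.norm(emb[y == 0].mean(axis=0) - emb[y == 1].mean(axis=0))
    print(f"seed {seed}: distance between '0' and '1' centroids = {d:.1f}")
```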
For more in-depth explanations of t-SNE and manifold learning, please see the following links, which also contain some very nice visual examples of manifold learning in action:
* [https://thedatafrog.com/en/articles/visualizing-datasets/](https://thedatafrog.com/en/articles/visualizing-datasets/)
