Cosine similarity can measure the semantic similarity between two words, sentences, or documents when they are represented as vectors or embeddings.
The idea of the cosine metric when comparing embeddings is to calculate the cosine of the angle between the vectors (in n-dimensional space). This means that the two word vectors you want to compare must have the same shape/dimensionality.
Let's say we have the words Village and Waiter, represented by the vectors v and w of the same size, and we want to use cosine to calculate the similarity of v and w. First you take the dot product of the two vectors:
Reminder: the dot product is defined as:
$$\sum\limits_{i=1}^Nv_iw_i=v_1w_1+v_2w_2+...+v_Nw_N$$
The dot product acts as a similarity metric because it tends to be high precisely when the two vectors have large values in the same dimensions. Vectors that have zeros in different dimensions (orthogonal vectors) have a dot product of 0, representing their strong dissimilarity.
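As a quick illustration, here is a minimal sketch in Python using NumPy (the vectors and their values are made up purely for this example, not real embeddings):

```python
import numpy as np

# Two illustrative vectors of the same dimensionality (made-up values).
v = np.array([2.0, 0.0, 3.0, 1.0])   # standing in for "village"
w = np.array([1.0, 4.0, 2.0, 0.0])   # standing in for "waiter"

# Dot product: multiply element-wise, then sum.
print(np.dot(v, w))          # 2*1 + 0*4 + 3*2 + 1*0 = 8.0

# Orthogonal vectors (non-zero values in different dimensions)
# have a dot product of 0.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(np.dot(a, b))          # 0.0
```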
This raw dot product favours long vectors, because the dot product is higher when a vector is longer (with higher values in each dimension). More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them.
This is not good, because it would mean that longer vectors are more similar, which doesn't make sense. Longer here does not refer to the number of items in the vector; it is the length of the vector from the origin of the n-dimensional space to the 'final location' of the vector. See below:
The vector length is written as $|\textbf{v}|$ (with those two vertical bars) and is defined as:$$|\textbf{v}| = \sqrt{\sum\limits^{N}_{i=1}{v^2_i}}$$ So you square each element of the vector, sum those squares, and then take the square root of that sum.
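Continuing the sketch above (same made-up example vector), the length can be computed either directly from the definition or with NumPy's built-in norm:

```python
import numpy as np

v = np.array([2.0, 0.0, 3.0, 1.0])   # same made-up vector as before

# From the definition: square each element, sum, take the square root.
length_manual = np.sqrt(np.sum(v ** 2))

# NumPy's built-in equivalent (the L2 norm).
length_norm = np.linalg.norm(v)

print(length_manual, length_norm)    # both ~3.7417
```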
We would like a similarity metric that tells us how similar two words are regardless of their frequency. To overcome this issue, we modify the dot product to normalize for vector length by dividing the dot product by the product of the lengths of the two vectors.
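Putting the two pieces together, here is a minimal sketch of this normalized (cosine) similarity, again with the same made-up vectors:

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product divided by the lengths of both vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([2.0, 0.0, 3.0, 1.0])   # made-up "village" vector
w = np.array([1.0, 4.0, 2.0, 0.0])   # made-up "waiter" vector

print(cosine_similarity(v, w))        # ~0.467

# Scaling a vector (e.g. a more frequent word with larger counts)
# no longer changes the score, unlike the raw dot product.
print(cosine_similarity(v, 10 * w))   # same ~0.467
```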