https://d2l.ai/chapter_natural-language-processing-pretraining/similarity-analogy.html
Is there any theoretical result that explains why word embedding algorithms can find synonyms and analogies using the pretrained vectors? Is there any interpretability for this problem?
If you look at the likelihood function in word embedding models such as word2vec, the exponent is the dot product of the center word vector and each of its context word vectors. If two center words are interchangeable (e.g., synonyms), they appear with the same context words, so maximizing the likelihood gives both of them large dot products with the same context vectors, which pushes their own vectors close together.
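To make the dot-product argument concrete, here is a minimal NumPy sketch of the skip-gram conditional probability, P(o | c) proportional to exp(u_o · v_c). It is only an illustration, not the d2l implementation: the vectors are random, and the names `context_probs`, `cosine`, `U`, `v_a`, `v_b`, `v_c` are made up for this example. It shows the relationship the likelihood relies on: two center vectors that are nearly the same predict nearly the same context distribution, so words that must predict the same contexts are pushed toward similar vectors during training.

```python
import numpy as np

def context_probs(v_center, U_context):
    """Skip-gram softmax: P(o | c) is proportional to exp(u_o . v_c)."""
    scores = U_context @ v_center      # one dot product per context word
    scores = scores - scores.max()     # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))             # 5 hypothetical context words, dimension 4
v_a = rng.normal(size=4)                # center word "a"
v_b = v_a + 0.01 * rng.normal(size=4)   # stand-in for a synonym of "a"
v_c = rng.normal(size=4)                # unrelated center word

p_a, p_b, p_c = (context_probs(v, U) for v in (v_a, v_b, v_c))

# L1 distance between the predicted context distributions
print("a vs b:", np.abs(p_a - p_b).sum())
print("a vs c:", np.abs(p_a - p_c).sum())
# Cosine similarity between the center vectors themselves
print("cos(a, b):", cosine(v_a, v_b))
print("cos(a, c):", cosine(v_a, v_c))
```

Note that this only demonstrates one direction (similar vectors give similar context predictions). The claim that training actually drives synonyms together is the informal argument above, not something this snippet proves.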