Word Embedding (word2vec)

astonzhang · June 29, 2020, 10:38pm

https://d2l.ai/chapter_natural-language-processing-pretraining/word2vec.html

skywalker_H · August 2, 2020, 3:49am

I'm not entirely clear on the role of the two submodels in word2vec. When we train a word2vec model, where each word is mapped to one real-valued vector, should we use both submodels, or just pick one of them?

astonzhang · September 17, 2020, 6:52am

We usually just use either CBOW or skip-gram.
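
To make the difference between the two submodels concrete, here is a minimal sketch of how each one turns the same sentence into training examples. The function names, the example sentence, and the window size are illustrative assumptions, not taken from the book's code.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: one (context_words, center) example per center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


tokens = ["the", "man", "loves", "his", "son"]  # hypothetical example sentence
sg = skipgram_pairs(tokens, window=2)
cb = cbow_examples(tokens, window=2)
print(len(sg), "skip-gram pairs;", len(cb), "CBOW examples")
```

Either set of examples is enough to train one vector per word, which is why only one of the two models is normally trained.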

BH_L · May 19, 2021, 4:19am

How do we usually choose the vector dimension d? Is it equal to the number of words in the dictionary, or are there other rules for deciding the value of d?

BH_L · May 19, 2021, 4:42am

Regarding question 1, I think that, as the paper [Mikolov et al., 2013b] mentions, we initially need to consider every word vector in the dictionary, but with hierarchical softmax we can get an efficient approximation by considering only about log2(N) nodes, where N is the number of words in the dictionary.
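
As a rough illustration of the log2(N) figure, the sketch below compares the number of terms in a full softmax with the path length in a perfectly balanced binary tree. The balanced tree and the helper name `balanced_tree_depth` are simplifying assumptions; a Huffman tree gives frequent words shorter paths, so actual path lengths vary.

```python
import math


def balanced_tree_depth(vocab_size):
    """Nodes visited on a root-to-leaf path in a balanced binary tree."""
    return math.ceil(math.log2(vocab_size))


for n in (10_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}: full softmax sums {n:,} terms, "
          f"tree path visits about {balanced_tree_depth(n)} nodes")
```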

For question 3, similar words tend to have similar context words, which means the training data for similar words is nearly the same, so their word vectors will also be similar.

pandalabme · September 27, 2023, 8:21am

My solutions to the exercises: 15.1

yoderj · January 22, 2024, 4:12pm

Can either word2vec or CBOW be shown to be equivalent to a simple neural network, such as a fully-connected layer (applied to one-hot encoding) followed by a softmax layer?
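
One way to look at this, sketched below for skip-gram under stated assumptions: multiplying a one-hot vector by the center-word matrix just selects one row, so the model can be read as two bias-free linear layers with no activation in between, followed by a softmax. The matrix names `V` and `U`, the sizes, and the random initialization are illustrative choices, not the book's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 3
V = rng.normal(size=(vocab_size, dim))  # center-word vectors (first layer, no bias)
U = rng.normal(size=(vocab_size, dim))  # context-word vectors (second layer, no bias)


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def skipgram_as_network(center_idx):
    """One-hot input -> linear layer -> linear layer -> softmax."""
    one_hot = np.zeros(vocab_size)
    one_hot[center_idx] = 1.0
    hidden = one_hot @ V  # selects row center_idx of V
    return softmax(U @ hidden)


def skipgram_direct(center_idx):
    """Embedding lookup followed by dot products with all context vectors."""
    return softmax(U @ V[center_idx])


print(np.allclose(skipgram_as_network(2), skipgram_direct(2)))
```

Under the same reading, CBOW would feed the average of several one-hot vectors into the first layer instead of a single one.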

arplusman · April 9, 2024, 8:05am

Equation (15.1.8) is equal to $Uy - U\hat{y} = U(y - \hat{y})$, in which $U$ is the weight matrix from the hidden layer to the output layer. For the equation $U(y - \hat{y}) = 0$, one trivial solution is $y = \hat{y}$. Is there any non-trivial solution to this equation?
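
The identity itself can be checked numerically. The sketch below assumes $U$ has shape (d, |V|) with the context vectors as columns, $y$ is the one-hot vector of the true context word, and $\hat{y}$ is the softmax output; all sizes and values are made up for illustration. It compares $U(y - \hat{y})$ with a finite-difference gradient of $\log P(w_o \mid w_c)$ with respect to $v_c$, and it does not by itself settle whether a non-trivial solution exists.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size = 3, 5
U = rng.normal(size=(dim, vocab_size))  # columns are context vectors u_j (assumed layout)
v_c = rng.normal(size=dim)              # center-word vector
o = 1                                   # index of the true context word (arbitrary)


def log_prob(v):
    """log P(w_o | w_c) under the softmax over u_j^T v."""
    scores = U.T @ v
    scores = scores - scores.max()
    return scores[o] - np.log(np.exp(scores).sum())


# Softmax prediction y_hat and one-hot target y
scores = U.T @ v_c
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()
y = np.zeros(vocab_size)
y[o] = 1.0

analytic = U @ (y - y_hat)  # u_o - sum_j P(w_j | w_c) u_j

# Central finite differences along each coordinate of v_c
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
    for e in np.eye(dim)
])

print(np.allclose(analytic, numeric, atol=1e-5))
```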

JH.Lam · July 30, 2024, 6:43am

Great. I think it’s like the ‘autoregressive’ case in RNNs.

JH.Lam · July 30, 2024, 6:45am

So is CBOW more efficient than skip-gram, since the former has more information (many-to-one) in its conditional probability?

JH.Lam · August 21, 2024, 7:28am

One more question:
Is CBOW more accurate than skip-gram, since the former supplies more information (the context words) to predict a single token (the center word)?