https://d2l.ai/chapter_natural-language-processing-pretraining/word2vec.html

I am not clear about the role of the two submodels in word2vec. When we train a word2vec model, where each word is mapped to one real-valued vector, should we use both submodels or just pick one of them?

We usually just use either CBOW or skip-gram.

How do we usually choose the vector dimension d? Is it equal to the number of words in the dictionary, or are there other rules for deciding the value of d?

Regarding question 1: as the paper [Mikolov et al., 2013b] mentions, the full softmax has to consider every word vector in the dictionary, but hierarchical softmax gives an efficient approximation by considering only about log2(N) nodes, where N is the number of words in the dictionary.
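To put rough numbers on that comparison, here is a small sketch. The vocabulary sizes are made up for illustration, and the helper `nodes_per_prediction` is a hypothetical name. It assumes a balanced binary tree, so every word's path has about log2(N) internal nodes; with a Huffman tree, as in [Mikolov et al., 2013b], frequent words get shorter paths.

```python
import math

def nodes_per_prediction(vocab_size):
    """Return (terms in a full softmax, approx. nodes visited by hierarchical softmax).

    Assumes a balanced binary tree over the vocabulary (an illustrative simplification).
    """
    full = vocab_size
    hierarchical = math.ceil(math.log2(vocab_size))
    return full, hierarchical

# Hypothetical vocabulary sizes, only to show how the two costs scale.
for n in (10_000, 100_000, 1_000_000):
    full, hier = nodes_per_prediction(n)
    print(f"N={n}: full softmax {full} terms, hierarchical softmax ~{hier} nodes")
```

The gap grows quickly with N, which is why the approximation matters most for large dictionaries.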

For question 3: similar words tend to have similar context words, so the training data for similar words is nearly the same, and their word vectors end up similar as well.

Can either skip-gram or CBOW be shown to be equivalent to a simple neural network, such as a fully-connected layer (applied to one-hot encoding) followed by a softmax layer?
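One way to look at this, as a sketch rather than a proof: for skip-gram with a full softmax, multiplying a one-hot input by the center-word embedding matrix just selects one row, and the scores over the vocabulary are a second bias-free linear layer followed by softmax. The sizes `N` and `d`, the random weights, and the function name `skipgram_as_mlp` below are all made up for illustration. CBOW would differ only in averaging several selected rows before the second layer.

```python
import math
import random

random.seed(0)
N, d = 5, 3  # toy vocabulary size and embedding dimension (hypothetical values)

# V: center-word embeddings, U: context-word embeddings; both stored as N rows of length d.
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
U = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def skipgram_as_mlp(center):
    x = one_hot(center, N)
    # Layer 1: linear, no bias, no activation. With a one-hot x this picks row `center` of V.
    h = [sum(x[i] * V[i][k] for i in range(N)) for k in range(d)]
    # Layer 2: linear, no bias. Score j is the dot product u_j . h; softmax turns scores into P(w_j | w_c).
    scores = [sum(U[j][k] * h[k] for k in range(d)) for j in range(N)]
    return h, softmax(scores)

h, probs = skipgram_as_mlp(2)
print("hidden layer equals embedding row:", all(abs(a - b) < 1e-12 for a, b in zip(h, V[2])))
print("probabilities sum to:", sum(probs))
```

Note that this covers only the full-softmax formulation; negative sampling and hierarchical softmax replace the output layer with something else.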

Equation (15.1.8) is equal to $Uy - U\hat{y} = U(y - \hat{y})$, where $U$ is the weight matrix from the hidden layer to the output layer. For the equation $U(y - \hat{y}) = 0$, one trivial solution is $y = \hat{y}$. Is there any non-trivial solution?
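The identity itself can be sanity-checked numerically. The sketch below compares $U(y - \hat{y})$ with a finite-difference gradient of $\log P(w_o \mid w_c)$ with respect to the center vector. The sizes, the random values, and the function names are assumptions for illustration only. Here the rows of `U` are the output vectors $u_j$, so the matrix product is written as a sum over rows.

```python
import math
import random

random.seed(1)
N, d = 6, 4  # toy sizes, chosen only for illustration
U = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]  # row j is u_j
v_c = [random.gauss(0, 1) for _ in range(d)]                    # center-word vector
o = 3                                                           # index of the observed context word

def scores_for(v):
    return [sum(U[j][k] * v[k] for k in range(d)) for j in range(N)]

def log_prob(v):
    """log P(w_o | w_c) = u_o . v - log sum_j exp(u_j . v), computed stably."""
    s = scores_for(v)
    m = max(s)
    return s[o] - (m + math.log(sum(math.exp(x - m) for x in s)))

def analytic_grad(v):
    """U (y - y_hat), with y one-hot at index o and y_hat the softmax output."""
    s = scores_for(v)
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    y_hat = [x / z for x in e]
    y = [1.0 if j == o else 0.0 for j in range(N)]
    return [sum(U[j][k] * (y[j] - y_hat[j]) for j in range(N)) for k in range(d)]

def numeric_grad(v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(d):
        vp, vm = list(v), list(v)
        vp[k] += eps
        vm[k] -= eps
        grad.append((log_prob(vp) - log_prob(vm)) / (2 * eps))
    return grad

max_diff = max(abs(a - b) for a, b in zip(analytic_grad(v_c), numeric_grad(v_c)))
print("max abs difference between analytic and numeric gradient:", max_diff)
```

This only confirms the form of the gradient. It is also worth keeping in mind that softmax outputs are strictly positive, so with finite vectors $\hat{y}$ can get close to a one-hot $y$ but never equals it exactly.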

Great. I think it is like the ‘autoregressive’ case in RNNs.

So is CBOW more efficient than skip-gram, since the former has more information (many-to-one) in its conditional probability?