10 replies
Aug '20

skywalker_H

I don't quite understand the roles of the two submodels in word2vec. When we train a word2vec model, where each word is mapped to one real vector, should we leverage both submodels, or just select one of them?

1 reply
Sep '20 ▶ skywalker_H

astonzhang

We usually just use either CBOW or skip-gram.
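To make the difference between the two choices concrete, here is a small sketch of the training examples each submodel generates (the function names and toy sentence are illustrative, not from the book's code):

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1  # assumed toy context window size

def skip_gram_pairs(tokens, window):
    """Skip-gram: each (center, context) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """CBOW: all context words jointly predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

print(skip_gram_pairs(sentence, window)[:2])  # [('the', 'quick'), ('quick', 'the')]
```

Either generator alone yields a complete training set, which is one way to see why a single submodel suffices.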

1 reply
May '21

BH_L

How do we usually choose the vector dimension d? Is it equal to the number of words in the dictionary, or are there other rules for deciding the value of d?

1 reply
May '21 ▶ BH_L

BH_L

Regarding question 1, I think, as the paper [Mikolov et al., 2013b] mentioned, we initially need to consider each word vector in the dictionary, but with hierarchical softmax we can get an efficient approximation by considering log2(N) nodes, where N is the number of words in the dictionary.

For question 3, similar words tend to have similar context words, which means the training data for similar words are nearly the same, so their word vectors will also end up similar.
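To put rough numbers on the log2(N) point (the vocabulary size below is just an assumed example, not from the book):

```python
import math

N = 100_000  # assumed vocabulary size for illustration

# Full softmax: the normalizing sum touches every word in the dictionary.
full_softmax_terms = N

# Hierarchical softmax: only the nodes on one root-to-leaf path of a
# balanced binary tree over the vocabulary are evaluated.
hier_softmax_terms = math.ceil(math.log2(N))

print(full_softmax_terms, hier_softmax_terms)  # 100000 17
```

So for a 100k-word vocabulary, each probability evaluation drops from ~100,000 terms to ~17 node decisions.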

Sep '23

pandalabme

My solutions to the exercises: 15.1

Jan '24

yoderj

Can either word2vec or CBOW be shown to be equivalent to a simple neural network, such as a fully-connected layer (applied to one-hot encoding) followed by a softmax layer?

1 reply
Apr '24

arplusman

Equation (15.1.8) is equal to $Uy - U\hat{y} = U(y - \hat{y})$, where $U$ is the weight matrix from the hidden layer to the output layer. Solving $U(y - \hat{y}) = 0$, one trivial solution is $y = \hat{y}$. Is there a non-trivial solution to this equation?
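One way to look at this numerically (toy sizes are my assumption): when the embedding dimension $d$ is smaller than the vocabulary size $V$, the matrix $U \in \mathbb{R}^{d \times V}$ has a non-trivial null space, so non-zero vectors $z$ with $Uz = 0$ certainly exist; whether such a $z$ can actually arise as a difference $y - \hat{y}$ of valid probability vectors is a separate question.

```python
import numpy as np

rng = np.random.default_rng(42)
d, V = 3, 5                      # assumed toy sizes with d < V
U = rng.normal(size=(d, V))      # stand-in for the hidden-to-output weights

# The last right singular vectors of U span its null space when d < V.
_, _, vh = np.linalg.svd(U)
z = vh[-1]                       # a non-zero vector with U z ~ 0

assert np.allclose(U @ z, 0, atol=1e-10)  # non-trivial solution of U z = 0
assert np.linalg.norm(z) > 0              # and z is not the zero vector
```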

Jul '24 ▶ yoderj

JH.Lam

Great. I think it's analogous to the 'autoregressive' case in RNNs.

Jul '24 ▶ astonzhang

JH.Lam

So is CBOW more efficient than skip-gram, since the former has more information (many-to-one) in its conditional probability?

Aug '24

JH.Lam

One more question:
Is CBOW more accurate than skip-gram, since the former supplies more information (context words) when predicting one token (the center word)?