10 replies
Aug '20

skywalker_H

I don't quite understand the roles of the two submodels in word2vec. When we train a word2vec model, where each word is mapped to one real vector, should we leverage both submodels, or just select one of them?

1 reply
Sep '20 ▶ skywalker_H

astonzhang

We usually just use either CBOW or skip-gram.
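To make the difference between the two choices concrete, here is a small sketch of the training examples each submodel generates (the function names and toy sentence are illustrative, not from the book's code):

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1  # assumed toy context window size

def skip_gram_pairs(tokens, window):
    """Skip-gram: each (center, context) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """CBOW: all context words jointly predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

print(skip_gram_pairs(sentence, window)[:2])  # [('the', 'quick'), ('quick', 'the')]
```

Either generator alone yields a complete training set, which is one way to see why a single submodel suffices.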

1 reply
May '21

BH_L

How do we usually choose the vector dimension d? Is it equal to the number of words in the dictionary, or are there other rules for deciding the value of d?

1 reply
May '21 ▶ BH_L

BH_L

Regarding question 1, I think, as the paper [Mikolov et al., 2013b] mentioned, we initially need to consider each word vector in the dictionary, but with hierarchical softmax we can get an efficient approximation by considering log2(N) nodes, where N is the number of words in the dictionary.

For question 3, similar words tend to have similar context words, which means the training data for similar words are nearly the same, so their word vectors will also end up similar.
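To put rough numbers on the log2(N) point (the vocabulary size below is just an assumed example, not from the book):

```python
import math

N = 100_000  # assumed vocabulary size for illustration

# Full softmax: the normalizing sum touches every word in the dictionary.
full_softmax_terms = N

# Hierarchical softmax: only the nodes on one root-to-leaf path of a
# balanced binary tree over the vocabulary are evaluated.
hier_softmax_terms = math.ceil(math.log2(N))

print(full_softmax_terms, hier_softmax_terms)  # 100000 17
```

So for a 100k-word vocabulary, each probability evaluation drops from ~100,000 terms to ~17 node decisions.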

Sep '23

pandalabme

My solutions to the exercises: 15.1

Jan '24

yoderj

Can either word2vec or CBOW be shown to be equivalent to a simple neural network, such as a fully-connected layer (applied to one-hot encoding) followed by a softmax layer?

1 reply
Apr '24

arplusman

Equation (15.1.8) is equal to $Uy - U\hat{y} = U(y - \hat{y})$, where $U$ is the weight matrix from the hidden layer to the output layer. Solving $U(y - \hat{y}) = 0$, one trivial solution is $y = \hat{y}$. Is there a non-trivial solution to this equation?
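One way to look at this numerically (toy sizes are my assumption): when the embedding dimension $d$ is smaller than the vocabulary size $V$, the matrix $U \in \mathbb{R}^{d \times V}$ has a non-trivial null space, so non-zero vectors $z$ with $Uz = 0$ certainly exist; whether such a $z$ can actually arise as a difference $y - \hat{y}$ of valid probability vectors is a separate question.

```python
import numpy as np

rng = np.random.default_rng(42)
d, V = 3, 5                      # assumed toy sizes with d < V
U = rng.normal(size=(d, V))      # stand-in for the hidden-to-output weights

# The last right singular vectors of U span its null space when d < V.
_, _, vh = np.linalg.svd(U)
z = vh[-1]                       # a non-zero vector with U z ~ 0

assert np.allclose(U @ z, 0, atol=1e-10)  # non-trivial solution of U z = 0
assert np.linalg.norm(z) > 0              # and z is not the zero vector
```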

Jul '24 ▶ yoderj

JH.Lam

Great. I think it's analogous to the 'autoregressive' case in RNNs.

Jul '24 ▶ astonzhang

JH.Lam

So is CBOW more efficient than skip-gram, since the former has more information (many-to-one) in its conditional probability?

Aug '24

JH.Lam

One more question:
Is CBOW more accurate than skip-gram, since the former supplies more information (context words) when predicting one token (the center word)?