One thing is not clear to me about the role of the two submodels in word2vec. When we train a word2vec model that maps each word to one real-valued vector, should we use both submodels, or just pick one of them?
We usually just use either CBOW or skip-gram.
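To make the difference between the two concrete, here is a minimal sketch of how each submodel would slice the same sentence into training examples. The function names `skipgram_pairs` and `cbow_examples`, the `window` parameter, and the sample sentence are made up for illustration; this is not how any particular library does it internally.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram view: the center word predicts each context word separately."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW view: all context words together predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical example sentence
sentence = "the man loves his son".split()
print(skipgram_pairs(sentence, window=1)[:3])
print(cbow_examples(sentence, window=1)[:2])
```

Both views are built from the same text, so training one of them is enough to get one vector per word.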
How do we usually choose the vector dimension d? Is it equal to the number of words in the dictionary, or are there other rules for deciding the value of d?
Regarding question 1: I think, as the paper [Mikolov et al., 2013b] mentions, the full softmax has to consider the vector of every word in the dictionary, but hierarchical softmax gives an efficient approximation that considers only about log2(N) nodes, where N is the number of words in the dictionary.
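A rough way to see the size of the saving is to compare N with log2(N) for a few dictionary sizes. This sketch assumes a balanced binary tree over the words, which is a simplification (the original word2vec code builds a Huffman tree, so frequent words get even shorter paths). The function names are hypothetical.

```python
import math


def full_softmax_cost(vocab_size):
    # Full softmax: the normalizer sums over every word in the dictionary.
    return vocab_size


def hierarchical_softmax_cost(vocab_size):
    # Assumption: balanced binary tree with vocab_size leaves, so the path
    # from the root to a word visits about ceil(log2(N)) inner nodes.
    return math.ceil(math.log2(vocab_size))


for n in (1_000, 100_000, 1_000_000):
    print(n, full_softmax_cost(n), hierarchical_softmax_cost(n))
```

So even for a dictionary of a million words, each prediction touches on the order of 20 nodes instead of a million word vectors.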
For question 3: similar words tend to have similar context words, which means the training data for similar words is nearly the same, so their word vectors end up similar as well.
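The "similar contexts" part can be checked on a toy corpus without training anything. The sketch below does not run word2vec; it just counts context words per target word and compares the counts with cosine similarity, as a stand-in for the intuition. The corpus and the helper names `context_counts` and `cosine` are invented for the example.

```python
import math
from collections import Counter


def context_counts(sentences, target, window=1):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus
corpus = [
    "i drink hot coffee daily".split(),
    "i drink hot tea daily".split(),
    "we park the car outside".split(),
]
coffee = context_counts(corpus, "coffee")
tea = context_counts(corpus, "tea")
car = context_counts(corpus, "car")
print(cosine(coffee, tea))  # close to 1: identical contexts
print(cosine(coffee, car))  # 0.0: no shared context words
```

Here "coffee" and "tea" see the same neighbors, so a model trained to predict those neighbors gets nearly the same training signal for both words.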