http://zh-v2.d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html
The relevant passage from the paper for the first exercise:
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant). We set K = 2·10^6 below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
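To make the hashing step concrete, here is a minimal Python sketch of how an n-gram could be bucketed with FNV-1a as the passage describes (the 32-bit FNV constants are the standard ones; the function names and the exact 1-to-K offset are my own assumptions, not taken from the fastText source):

```python
def fnv1a_32(data: bytes) -> int:
    # Standard 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime.
    h = 0x811C9DC5                      # FNV-1a offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) % 2**32    # FNV prime, truncated to 32 bits
    return h

K = 2 * 10**6  # number of buckets, K = 2*10^6 as in the paper

def ngram_bucket(ngram: str) -> int:
    # Map an n-gram to an integer in 1..K; the +1 offset is an assumption
    # to match the paper's "integers in 1 to K".
    return fnv1a_32(ngram.encode("utf-8")) % K + 1

print(ngram_bucket("<wh"))  # e.g. the first 3-gram of "<where>"
```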
A couple of questions:
a. In fastText, subwords are only extracted at certain specified lengths. Does that mean the vocabulary ends up containing several different types of subwords? (A sketch of what I mean is below, after question b.)
b. So does that mean this subword embedding cannot be used for Asian languages, e.g. Chinese or Japanese?
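For question a, here is a quick illustration of the specified lengths: fastText extracts all character n-grams of lengths 3 through 6 from each word, after padding it with the boundary markers < and > (these defaults come from the paper; the function name is mine):

```python
def subwords(word: str, min_n: int = 3, max_n: int = 6) -> list:
    # All character n-grams of length min_n..max_n from "<word>".
    token = "<" + word + ">"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]

print(subwords("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ..., '<where', 'where>']
```

So the vocabulary mixes n-grams of several different lengths (plus the whole word itself as a special sequence), which is what I mean by "different types".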