用于预训练词嵌入的数据集

xiaotinghe · December 7, 2021, 6:27pm

http://zh-v2.d2l.ai/chapter_natural-language-processing-pretraining/word-embedding-dataset.html

jokertail · December 9, 2021, 3:37pm

Can’t pickle local object ‘load_data ptb.<locals>.PTB Dataset’

我在运行“把所有东西放一起”小节的代码块时，出现如上错误。

猜测可能是
class PTBDataset 的定义放在 load_data_ptb 中导致这个问题的出现。

cy_x · March 9, 2022, 1:03am

我按你的操作把class PTBDataset放在外面了，运行到for batch in data_iter: 是提示另外一个错误：

Can't pickle <class '__main__.PTBDataset'>: attribute lookup PTBDataset on __main__ failed

cy_x · March 9, 2022, 4:56am

上网搜了一下，把num_workers改为0就可以运行，不知道为什么

pyao · March 17, 2022, 8:01am

看网上把说num_workers 改为 0就不启用多进程，不会预加载多批次数据进入内存。
d2l里默认值是4。

def get_dataloader_workers():
    """Use 4 processes to read the data."""
    return 4

不清楚为什么在我的台式机上也是改成0才行，试过1、2、4都不行。。。

Csk · March 21, 2022, 10:20am

多线程问题，改 num_workers 也行，或者将代码重构，如图

def load_data_ptb(batch_size, max_window_size, num_noise_words):
    """下载PTB数据集，然后将其加载到内存中"""
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled, counter = subsample(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(
        all_contexts, vocab, counter, num_noise_words)
    
    return all_centers, all_contexts, all_negatives, vocab

class PTBDataset(torch.utils.data.Dataset):
    def __init__(self, centers, contexts, negatives):
        assert len(centers) == len(contexts) == len(negatives)
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives

    def __getitem__(self, index):
        return (self.centers[index], self.contexts[index],
                self.negatives[index])

    def __len__(self):
        return len(self.centers)

wplf · April 4, 2022, 4:32am

谢谢大佬，问题已经解决，请问这个问题出现的原因是什么呢

winson_huang · June 7, 2022, 8:59am

负采样时使用的词频分布是用下采样之前的词频生成的，为什么不用下采样之后的词频生成这个分布？

Xia_Wei · September 18, 2022, 1:53am

负采样的时候应该尽可能选一个高频词，因为一个高频词是负例比一个稀有词是负例更常见、更重要。而下采样是剔除了高频词，和负采样略微冲突。

mollysadventure · November 1, 2022, 1:08pm

一开始 AttributeError: module ‘d2l.torch’ has no attribute ‘show_list_len_pair_hist’ 更新d2l至d2l 1.0.0a0 以后
AttributeError: module ‘d2l.torch’ has no attribute d2l.count_corpus了肿么办呀
快点看到我哦！在线等挺急的拜托拜托

xiaotinghe · November 1, 2022, 1:42pm

pip install d2l==0.17.5 -I
这样行不行？

charlesgao2101024 · March 8, 2023, 11:21am

tiny_dataset = [list(range(7)), list(range(7, 10))]
print('数据集', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('中心词', center, '的上下文词是', context)

我想知道这段代码的运行结果中为什么0的上下文词不是[1, 2]

faye · March 23, 2023, 3:26am

这个function在第RNN那一章text preprocessing。你跑一次它就存在d2l里里

heavenkiller2018 · May 3, 2023, 12:16am

中心词和上下文词的提取时, 随机采样1到max_window_size之间的整数作为上下文窗口, 为什么不直接用max_window_size而要用随机❓用随机是基于什么原理和考虑?

adjustself · June 15, 2023, 7:50am

num_workers改成0也还是不行

blueblued · August 21, 2023, 8:16am

Answer from ChatGPT:
在Word2Vec中，上下文窗口是指在训练过程中，用来确定中心词周围哪些词作为上下文进行预测的窗口大小。这个窗口的大小可以影响到模型的性能和训练效果。

使用随机采样一个介于1到max_window_size之间的整数作为上下文窗口，而不是固定使用max_window_size，有一些原理和考虑：

Diversification（多样性）：随机采样窗口大小可以引入更多的多样性，使得模型不仅仅关注固定大小的上下文窗口。这样可以更好地捕捉不同距离的关系，避免过于集中在一个特定的上下文范围内。
平衡计算复杂度：较大的上下文窗口可能会引入更多的噪声，因为与中心词距离较远的词可能与中心词的语义关联较弱。同时，较大的窗口可能会增加计算的复杂度，因为要考虑更多的上下文词。通过随机采样窗口大小，可以在捕捉不同距离关系的同时，平衡计算的复杂度。
训练稳定性：随机窗口大小可以提高模型的训练稳定性。如果总是使用固定的max_window_size，可能会使模型对于固定距离范围内的关系过于敏感，从而导致训练不稳定。
对罕见词的处理：对于罕见词，较大的上下文窗口可能会导致模型将与中心词无关的词考虑在内，从而影响预测的准确性。随机窗口大小可以在一定程度上缓解这个问题。

综合考虑以上因素，使用随机采样一个介于1到max_window_size之间的窗口大小可以帮助模型更好地学习单词之间的关系，提高训练的效果和稳定性。

jasonhuxz · November 6, 2023, 2:56am

碰到相同情况，原因还是d2l包的版本不匹配造成，不能用最新的版本，需要使用书里推荐的版本

pip install d2l==0.17.6

Physics-Lee · March 28, 2024, 3:08am

在load_data_ptb函数里把 num_workers 改为0，运行成功。

abc_121 · July 13, 2024, 7:30am

在执行for batch in data_iter:时，还是会报AttributeError

Lynnzake · August 14, 2024, 2:29am

#@save
def subsample(sentences, vocab):
    """下采样高频词"""
    # 排除未知词元'<unk>'
    sentences = [[token for token in line if vocab[token] != vocab.unk]
                 for line in sentences]
    counter = d2l.count_corpus(sentences)
    num_tokens = sum(counter.values())

    # 如果在下采样期间保留词元，则返回True
    def keep(token):
        return(random.uniform(0, 1) <
               math.sqrt(1e-4 / counter[token] * num_tokens))

    return ([[token for token in line if keep(token)] for line in sentences],
            counter)

subsampled, counter = subsample(sentences, vocab)

这里keep的公式计算是不是错了，f(w_i)是w_i的词数与数据集中总次数的比率，这里实现的时候是做了乘法