Language Models and the Dataset

Unlike numpy.random.randint(a, b), Python’s standard random.randint(a, b) generates integers in the closed range [a, b], that is, b is included. So I think the code

corpus = corpus[random.randint(0, num_steps):]

should be

corpus = corpus[random.randint(0, num_steps - 1):]
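For anyone who wants to check the endpoint behaviour, here is a minimal sketch using only the standard library. The value `num_steps = 5` and the fixed seed are assumptions for illustration; `random.randrange` stands in for the half-open behaviour of `numpy.random.randint`.

```python
import random

random.seed(0)  # fixed seed only so the run is repeatable
num_steps = 5   # illustrative value, not taken from the book's code

# random.randint(a, b) includes both endpoints, so num_steps can be drawn.
randint_values = {random.randint(0, num_steps) for _ in range(10_000)}

# random.randrange(a, b) excludes b, like numpy.random.randint(a, b).
randrange_values = {random.randrange(0, num_steps) for _ in range(10_000)}

print(sorted(randint_values))    # [0, 1, 2, 3, 4, 5]
print(sorted(randrange_values))  # [0, 1, 2, 3, 4]
```

So `random.randint(0, num_steps - 1)` and `random.randrange(num_steps)` draw from the same set of offsets.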

You are right! Fixing:

When training our neural network, a minibatch of such subsequences will be fed into the model. Suppose that the network processes a subsequence of 𝑛 time steps at a time. Fig. 8.3.1 shows all the different ways to obtain subsequences from an original text sequence, where 𝑛=5.

Here, what does n signify: the number of blocks, or the number of characters in each block?

Why are we subtracting 1 in the following?
num_subseqs = (len(corpus) - 1) // num_steps
My guess is that it can be avoided.

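We need to reserve one token for the label sequence, I guess: the labels are the inputs shifted by one position, so the last token of the corpus can only ever be a label.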
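A small sketch of what goes wrong without the `- 1`, assuming a toy corpus of 10 token indices and no random offset. The helper `make_pairs` is hypothetical and only mimics the slicing idea, not the book's actual iterator.

```python
corpus = list(range(10))  # toy corpus of 10 token indices (assumption)
num_steps = 5

def make_pairs(num_subseqs):
    """Build (input, label) pairs; labels are inputs shifted by one token."""
    pairs = []
    for i in range(num_subseqs):
        start = i * num_steps
        X = corpus[start:start + num_steps]
        Y = corpus[start + 1:start + 1 + num_steps]
        pairs.append((X, Y))
    return pairs

with_minus_one = make_pairs((len(corpus) - 1) // num_steps)
without_minus_one = make_pairs(len(corpus) // num_steps)

print(with_minus_one)         # one pair, labels have full length
print(without_minus_one[-1])  # last pair has a label sequence that is one token short
```

When the corpus length is an exact multiple of `num_steps`, dropping the `- 1` produces a final subsequence whose label sequence runs off the end of the corpus.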

Similarly, for sequential partitioning, the upper bound of the offset should be num_steps - 1:
def seq_data_iter_sequential(corpus, batch_size, num_steps):  #@save
    """Generate a minibatch of subsequences using sequential partitioning."""
    # Start with a random offset to partition a sequence
    offset = random.randint(0, num_steps)  # should be num_steps - 1
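To see why an offset equal to num_steps adds nothing new, here is a rough sketch. The helper `block_starts` and the values `corpus_len = 23`, `num_steps = 5` are made up for illustration, and batching is ignored; it only lists where each full input block would start for a given offset.

```python
def block_starts(corpus_len, offset, num_steps):
    """Start indices of the full input blocks for one offset (hypothetical helper)."""
    # One token is reserved for the shifted label sequence.
    num_subseqs = (corpus_len - offset - 1) // num_steps
    return [offset + i * num_steps for i in range(num_subseqs)]

corpus_len, num_steps = 23, 5  # illustrative values
starts_by_offset = {
    offset: block_starts(corpus_len, offset, num_steps)
    for offset in range(num_steps + 1)  # 0..num_steps, as random.randint would draw
}

for offset, starts in starts_by_offset.items():
    print(offset, starts)
```

With these values, offset 5 gives the same block boundaries as offset 0 with the first block dropped, so offsets 0 to num_steps - 1 already cover every distinct alignment.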