# Language Models and the Dataset

Unlike `numpy.random.randint(a, b)`, Python's standard random generator `random.randint(a, b)` generates integers in the closed range [a, b], that is, with b inclusive. So I think the code

```python
corpus = corpus[random.randint(0, num_steps):]
```

should be

```python
corpus = corpus[random.randint(0, num_steps - 1):]
```
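A quick self-contained check of the difference (with `num_steps = 5` as an illustrative value):

```python
import random

import numpy as np

random.seed(0)
np.random.seed(0)

num_steps = 5
# Python's random.randint(a, b) includes both endpoints: values land in [0, num_steps].
py_draws = {random.randint(0, num_steps) for _ in range(10_000)}
# NumPy's randint(low, high) excludes high: values land in [0, num_steps - 1].
np_draws = {int(np.random.randint(0, num_steps)) for _ in range(10_000)}

print(sorted(py_draws))  # num_steps itself appears
print(sorted(np_draws))  # num_steps never appears
```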

You are right! Fixing: https://github.com/d2l-ai/d2l-en/pull/1541

`When training our neural network, a minibatch of such subsequences will be fed into the model. Suppose that the network processes a subsequence of n time steps at a time. Fig. 8.3.1 shows all the different ways to obtain subsequences from an original text sequence, where n=5.`

Over here, what does n signify: the number of blocks, or the number of characters in each block?

Why are we subtracting 1 in the following?
`num_subseqs = (len(corpus) - 1) // num_steps`
My guess is that it can be avoided.

We need to reserve 1 token for the label sequence, I guess.
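A plain-list sketch of the random-partitioning iterator (my reconstruction with Python lists rather than the book's tensors) makes the reservation visible: Y is X shifted right by one token, so the last subsequence must leave one extra token for its final label.

```python
import random

def seq_data_iter_random(corpus, batch_size, num_steps):
    """Yield (X, Y) minibatches using random partitioning (plain-list sketch)."""
    # Random offset so different epochs see different subsequence boundaries.
    corpus = corpus[random.randint(0, num_steps - 1):]
    # Subtract 1: Y is shifted one token to the right of X, so the last
    # subsequence must leave one extra token for its final label.
    num_subseqs = (len(corpus) - 1) // num_steps
    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))
    random.shuffle(initial_indices)
    num_batches = num_subseqs // batch_size
    for i in range(0, num_batches * batch_size, batch_size):
        batch_indices = initial_indices[i:i + batch_size]
        X = [corpus[j:j + num_steps] for j in batch_indices]
        Y = [corpus[j + 1:j + 1 + num_steps] for j in batch_indices]
        yield X, Y

for X, Y in seq_data_iter_random(list(range(35)), batch_size=2, num_steps=5):
    print('X:', X)
    print('Y:', Y)  # each Y row is the matching X row shifted by one
```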

Similarly, for sequential partitioning, the upper bound needs to be num_steps - 1:

```python
def seq_data_iter_sequential(corpus, batch_size, num_steps):  #@save
    """Generate a minibatch of subsequences using sequential partitioning."""
    offset = random.randint(0, num_steps)  # should be num_steps - 1
```
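Here is a runnable plain-list version of sequential partitioning with that bound fixed (again my reconstruction, not the book's exact tensor code):

```python
import random

def seq_data_iter_sequential(corpus, batch_size, num_steps):
    """Yield (X, Y) minibatches whose rows continue across minibatches."""
    offset = random.randint(0, num_steps - 1)  # inclusive bound, hence - 1
    # Tokens available for inputs after the offset (one reserved for labels),
    # trimmed so they divide evenly into batch_size rows.
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    row_len = num_tokens // batch_size
    # Split the corpus into batch_size long contiguous rows; Ys shifted by one.
    Xs = [corpus[offset + r * row_len: offset + (r + 1) * row_len]
          for r in range(batch_size)]
    Ys = [corpus[offset + 1 + r * row_len: offset + 1 + (r + 1) * row_len]
          for r in range(batch_size)]
    for i in range(0, row_len // num_steps * num_steps, num_steps):
        X = [row[i:i + num_steps] for row in Xs]
        Y = [row[i:i + num_steps] for row in Ys]
        yield X, Y

for X, Y in seq_data_iter_sequential(list(range(35)), batch_size=2, num_steps=5):
    print('X:', X)
```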

### Exercises and my errant answers

1. Suppose there are 100,000 words in the training dataset. How much word frequency and multi-word adjacent frequency does a four-gram need to store?

What does multi-word adjacent frequency mean? I reckon about 99,997 to 100,000.
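If I understand it, "multi-word adjacent frequency" means the counts of adjacent bigrams, trigrams, and four-grams. A worst-case tally (assuming every n-gram in a 100,000-token corpus is distinct; in practice far fewer, since n-grams repeat):

```python
# In a corpus of N tokens there are N unigrams, N-1 adjacent bigrams,
# N-2 trigrams, and N-3 four-grams, so a four-gram model stores at most
# that many frequency entries.
N = 100_000
counts = [N - k for k in range(4)]
print(counts)        # [100000, 99999, 99998, 99997]
print(sum(counts))   # 399994 entries in the worst case
```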
2. How would you model a dialogue?

By taking out the speaker's name and putting it in as a stop word; the rest should be the same.

3. Estimate the exponent of Zipf's law for unigrams, bigrams, and trigrams.

Okay: I got two decaying functions, but was not able to make the unigram one.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0053227
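One way to estimate the exponent (a sketch; the linked paper discusses more careful estimators such as maximum likelihood) is a least-squares fit on the log-log rank-frequency curve:

```python
from collections import Counter

import numpy as np

def zipf_exponent(tokens, n):
    """Estimate the Zipf exponent for n-grams via a log-log linear fit."""
    ngrams = zip(*[tokens[i:] for i in range(n)])  # adjacent n-grams
    freqs = np.array(sorted(Counter(ngrams).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    # Zipf's law: freq ~ rank^(-alpha); the slope of the log-log fit is -alpha.
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy corpus just to exercise the function; a real estimate needs real text.
tokens = ('the cat sat on the mat the cat ran ' * 50).split()
for n in (1, 2, 3):
    print(n, zipf_exponent(tokens, n))
```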

4. What other methods can you think of for reading long sequence data?

Maybe storing them in a dict; for oft-repeated words, memory can be saved.

5. Consider the random offset that we use for reading long sequences.

1. Why is it a good idea to have a random offset?

Since the input prompt would be random, the model would be able to handle it better; otherwise the model might just learn from adjacent word sets.

2. Does it really lead to a perfectly uniform distribution over the sequences on the document?

It should.

3. What would you have to do to make things even more uniform?

Maybe take subsequences based on some distribution.
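A small simulation (my sketch; the corpus length and num_steps are arbitrary choices) shows the resulting start-position distribution is close to, but not perfectly, uniform: starts are always of the form offset + k * num_steps, so every residue class modulo num_steps is reachable across epochs, but the last num_steps positions of the corpus can never start a subsequence.

```python
import random
from collections import Counter

num_steps, corpus_len, trials = 5, 35, 20_000
start_counts = Counter()
for _ in range(trials):
    # Same scheme as the book: random offset, then starts every num_steps tokens.
    offset = random.randint(0, num_steps - 1)
    num_subseqs = (corpus_len - offset - 1) // num_steps
    for k in range(num_subseqs):
        start_counts[offset + k * num_steps] += 1

# Positions in the final num_steps tokens never appear as starts.
for pos in sorted(start_counts):
    print(pos, start_counts[pos])
```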

6. If we want a sequence example to be a complete sentence, what kind of problem does this
introduce in minibatch sampling? How can we fix the problem?

All the sentences have different numbers of words, so we won't be able to get a minibatch that is uniform.

Quick questions about the `seq_data_iter_sequential` code:

(1)
I noticed you generated the set of starting indices incrementing by the step size

```python
indices = list(range(0, num_subsequences * step_size, step_size))
```

to ensure you do not go out of bounds when computing the subsequences from the offset corpus in each batch:

```python
X = [corpus_offset[index:index + step_size] for index in indices_subset]
```

Doesn't this limit the number of choices each sequence can have and skew the randomness to select tokens from that structured set of indices? Notice your indices are always

```
{ step_size * i : i = 0, 1, ..., len(offset) - 1 }
```

which means some indices will start a sequence more often than others will. I think in the (list(range(35)), 2, 5) example, indices of the form 5*k ± {0, 1, 2} will be more likely to start a sequence.

(2)
You shuffle the indices and take sliding windows of length batch_size, shifting them by batch_size as determined by the for-loop.

```python
for i in range(0, batch_size * num_batches, batch_size):
    indices_subset = indices[i:i + batch_size]  # sliding window shifted by batch size
```

Why do you choose to shift them by batch size instead of looping from 0 to num_batches? Does that choice matter?
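As far as I can tell the two loops are equivalent; a quick check (with made-up sizes):

```python
# Two equivalent ways to chunk shuffled indices into minibatches.
indices = list(range(12))
batch_size, num_batches = 3, 4

# Style 1: advance the start position by batch_size, as in the book's loop.
by_offset = [indices[i:i + batch_size]
             for i in range(0, batch_size * num_batches, batch_size)]

# Style 2: loop over batch numbers 0..num_batches-1 directly.
by_batch = [indices[b * batch_size:(b + 1) * batch_size]
            for b in range(num_batches)]

print(by_offset == by_batch)  # True -- the choice is purely stylistic
```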

For the sequential partitioning, why do we not keep the consecutive subsequences in the same minibatch? For example, if we check what we get with the `my_seq` example, the current implementation could return

```
X:  tensor([[ 0,  1,  2,  3,  4], [17, 18, 19, 20, 21]])
```

in the first minibatch and

```
X:  tensor([[ 5,  6,  7,  8,  9], [22, 23, 24, 25, 26]])
```

in the second minibatch.

Is there a reason we don't use the following structure:

```
X:  tensor([[ 0,  1,  2,  3,  4], [ 5,  6,  7,  8,  9]])  # first minibatch
X:  tensor([[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]])  # second minibatch
```
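One reason for the book's layout: the i-th row of each sequential minibatch continues the i-th row of the previous minibatch, which is what lets an RNN carry its per-row hidden state across minibatches; the packed layout above would break that continuity. A quick plain-list check (my sketch, fixing the random offset to 0 for reproducibility):

```python
my_seq = list(range(35))
offset, batch_size, num_steps = 0, 2, 5  # fixed offset for reproducibility

# Split the sequence into batch_size long contiguous rows, then slice each
# row into num_steps-wide chunks; chunk t of every row forms minibatch t.
num_tokens = ((len(my_seq) - offset - 1) // batch_size) * batch_size
row_len = num_tokens // batch_size
rows = [my_seq[offset + r * row_len: offset + (r + 1) * row_len]
        for r in range(batch_size)]
batches = [[row[i:i + num_steps] for row in rows]
           for i in range(0, row_len // num_steps * num_steps, num_steps)]

print(batches[0])  # [[0, 1, 2, 3, 4], [17, 18, 19, 20, 21]]
print(batches[1])  # [[5, 6, 7, 8, 9], [22, 23, 24, 25, 26]]
```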