Synthetic Regression Data

https://d2l.ai/chapter_linear-regression/synthetic-regression-data.html

My answers to the exercises
ex.1
I ran ?torch.utils.data.DataLoader,
found the parameter “drop_last”,
and set it to True in the definition of d2l.DataModule.get_tensorloader(), like this:

@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train, drop_last=True)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

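and tested with

len(data.train_dataloader())

The result changed from 32 to 31: with the default 1000 training examples and batch_size=32 there are 31 full batches plus one partial batch of 8, and drop_last=True discards the partial one.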
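The change from 32 to 31 can be reproduced without d2l. This is a minimal sketch, assuming 1000 examples and a batch size of 32; the helper name num_batches is made up for illustration and is not part of d2l or PyTorch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def num_batches(num_examples, batch_size, drop_last):
    """Count the batches a DataLoader yields for a dummy dataset (hypothetical helper)."""
    X = torch.zeros(num_examples, 2)
    y = torch.zeros(num_examples, 1)
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size,
                        shuffle=True, drop_last=drop_last)
    return len(loader)

print(num_batches(1000, 32, drop_last=False))  # 32, i.e. ceil(1000 / 32)
print(num_batches(1000, 32, drop_last=True))   # 31, i.e. floor(1000 / 32)
```

The last partial batch holds 1000 - 31 * 32 = 8 examples, which is what drop_last=True throws away in each epoch.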

ex.2
This one is too much for me now.

ex.3

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))
    def get_dataloader(self):
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

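test = SyntheticRegressionData_onTheFly(w=torch.tensor([1., -2.]), b=3.)
# Note: each get_dataloader() call draws a fresh batch, so the X and y
# printed here come from two different batches and are not paired.
print(test.get_dataloader()[0],'\n',test.get_dataloader()[1])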

result:
tensor([[ 0.7405, -0.8744],
[-1.6136, 0.6811],
[ 0.3348, -1.2086],
[-0.6661, 0.9301],
[ 0.8505, -0.2203],
[ 0.9009, -0.3271],
[ 0.7607, -0.2932],
[ 0.1139, -0.7248]])
tensor([[ 1.2106],
[ 1.9595],
[ 2.4213],
[-0.6816],
[ 5.0581],
[ 5.3575],
[ 0.1735],
[-1.9671]])
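Since every call to an on-the-fly generator draws new samples, X and y need to be unpacked from a single call to stay paired. A rough sketch of that check follows; make_batch is a hypothetical stand-alone version of the method, not the poster's class.

```python
import torch

def make_batch(w, b, noise=0.01, batch_size=8):
    """Draw one fresh minibatch (X, y) with y = Xw + b + noise (hypothetical helper)."""
    w = w.reshape((-1, 1))
    X = torch.randn(batch_size, len(w))
    y = torch.matmul(X, w) + b + torch.randn(batch_size, 1) * noise
    return X, y

w, b = torch.tensor([1., -2.]), 3.
X, y = make_batch(w, b)  # one call, so X and y belong together
residual = y - (torch.matmul(X, w.reshape((-1, 1))) + b)
print(residual.abs().max())  # small, on the order of the noise scale 0.01
```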

ex.4
Set num_train to any number and num_val = batch_size. The validation loader is not shuffled and then holds exactly one batch, so it returns the same batch each time.

data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2, num_train=1, num_val=8, batch_size = 8)
X, y = next(iter(data.val_dataloader()))
print(X)
print(y)
X_, y_ = next(iter(data.val_dataloader()))
print(X_)
print(y_)

ex.4
Add “torch.manual_seed(2)”
before
“class SyntheticRegressionData(d2l.DataModule): #@save”


Thanks a lot to @DReidiano. I added “torch.manual_seed()” for ex.4,
and the new code is

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))
    def get_dataloader(self, seed):
        torch.manual_seed(seed)
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y
data = SyntheticRegressionData_onTheFly(w=torch.tensor([2, -3.4]), b=4.2, batch_size=8)
for i in range(2):
    # The same seed reproduces the same batch on every call
    X, y = data.get_dataloader(seed=1)
    print(torch.cat((X, y), 1))

The result is:

tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
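Calling torch.manual_seed resets the global random state, which also affects every other random draw in the program. One alternative, sketched here as an assumption rather than as what the post above did, is to pass a dedicated torch.Generator to torch.randn. The function name seeded_batch is hypothetical.

```python
import torch

def seeded_batch(w, b, seed, noise=0.01, batch_size=8):
    """Draw a reproducible minibatch without touching the global RNG state."""
    g = torch.Generator().manual_seed(seed)  # local generator, seeded per call
    w = w.reshape((-1, 1))
    X = torch.randn(batch_size, len(w), generator=g)
    eps = torch.randn(batch_size, 1, generator=g) * noise
    return X, torch.matmul(X, w) + b + eps

w = torch.tensor([2, -3.4])
X1, y1 = seeded_batch(w, 4.2, seed=1)
X2, y2 = seeded_batch(w, 4.2, seed=1)
print(torch.equal(X1, X2), torch.equal(y1, y2))  # True True
```

A different seed gives a different batch, so storing one seed per batch or per epoch is enough to replay the data.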

In section 3.3.2 of the page:

X, y = next(iter(data.get_dataloader()))

But I think it is missing the parameter ‘train’.

My suggestions:
X, y = next(iter(data.get_dataloader(train=True)))
or
def get_dataloader(self, train=True)

I think it should be changed to one of these.

Why is the data loader in 3.3.3 considered more efficient? self.X and self.y are held in memory just as with the previous data loader.


Can anybody clarify what is needed in Exercise 3.3.5.2.2? You cannot shuffle a dataset stored on disk unless you open the file. When the data is in memory you can use torch.utils.data.DataLoader with shuffle=True.

I do not know how this can be, but the get_dataloader that is added to the SyntheticRegressionData class in this chapter is not the method that is called when we run X, y = next(iter(data.train_dataloader())).

The train_dataloader method calls get_dataloader, but the get_dataloader that runs is the one already in d2l.SyntheticRegressionData, and it calls get_tensorloader.

So when we try to add get_dataloader, the new definition does not replace the one that is already there.

You can check it by running d2l.SyntheticRegressionData?? after attempted addition.

When we call next(iter(data.train_dataloader())), it calls train_dataloader in SyntheticRegressionData, which is inherited from DataModule. This train_dataloader then calls the get_dataloader we defined, not the one from DataModule. When we add this method to SyntheticRegressionData, it replaces the original one.

You can check this by modifying it, for example by adding a print call, and calling it again.
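One possible reading of the two posts above is that they refer to different classes: the decorator only changes the class object it is given. The sketch below uses plain Python stand-ins; add_to_class is modeled on d2l.add_to_class, and the class names LibraryData and NotebookData are hypothetical.

```python
def add_to_class(Class):
    """Register a function as a method of an existing class (modeled on d2l.add_to_class)."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

class DataModule:
    def get_dataloader(self, train):
        return "base"
    def train_dataloader(self):
        # Looked up on the instance's class at call time
        return self.get_dataloader(train=True)

class LibraryData(DataModule):    # stand-in for d2l.SyntheticRegressionData
    def get_dataloader(self, train):
        return "library"

class NotebookData(DataModule):   # stand-in for the class defined in the notebook
    pass

@add_to_class(NotebookData)
def get_dataloader(self, train):
    return "notebook"

print(NotebookData().train_dataloader())  # notebook
print(LibraryData().train_dataloader())   # library
```

Under this assumption, an object built from the notebook class uses the new method, while an object built from the class shipped in the library keeps the old one, so it matters which class data was created from.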

In the get_tensorloader of 3.3.3 we can actually avoid creating a tuple for tensors, which is an unnecessary allocation. We can keep the generator expression, and the unpacking operator (*tensors) does the rest when we create the TensorDataset.

@d2l.torch.add_to_class(d2l.torch.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = (a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)
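One caveat with the generator version is that a generator can be unpacked only once. The sketch below uses plain lists in place of tensors, and collect is a hypothetical stand-in for TensorDataset that just returns its arguments.

```python
def collect(*items):
    """Stand-in for TensorDataset(*tensors): returns the unpacked arguments."""
    return items

data = ([1, 2, 3, 4], [10, 20, 30, 40])
indices = slice(0, 2)

gen = (a[indices] for a in data)
print(collect(*gen))  # ([1, 2], [10, 20])
print(collect(*gen))  # () because the generator is already exhausted
```

Inside get_tensorloader the generator is consumed exactly once, so this does not cause a problem there; it would only matter if tensors were reused later in the function.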
  1. If num_examples % batch_size != 0, the final batch will be smaller than batch_size. To ignore this last batch and move on to the next epoch, set drop_last=True when instantiating torch.utils.data.DataLoader.
  2. If we cannot hold all the data in memory, we have to lazily load/evaluate it on the fly in the dataset’s __getitem__ method (maybe with chunked/queued reads from disk for efficiency). As the hint suggests, one could use a pseudorandom permutation generator to generate new indices on the fly without storing a full permutation table - you’d only have to store a new seed for each epoch, and find a strategy for reading from disk based on these indices.
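The idea of generating indices from a seed without a permutation table can be sketched with an affine map modulo n, which visits every index exactly once when the multiplier is coprime to n. This is only one simple construction chosen for illustration (it mixes far less thoroughly than a real shuffle), and the name permuted_indices is hypothetical.

```python
import math
import random

def permuted_indices(n, seed):
    """Yield 0..n-1 in a seed-dependent order using O(1) memory (requires n > 1)."""
    rng = random.Random(seed)
    a = rng.randrange(1, n)
    while math.gcd(a, n) != 1:   # a must be coprime to n for the map to be a bijection
        a = rng.randrange(1, n)
    c = rng.randrange(n)
    for i in range(n):
        yield (a * i + c) % n

print(list(permuted_indices(10, seed=0)))
```

Only the seed has to be stored per epoch; each yielded index can then be turned into a file offset for reading the corresponding record from disk.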

For question 3:

def lazy_dataset():
    while True:
        yield torch.randn(32, 2), torch.randn(32)

For question 4:

def same_dataset():
    data = (torch.randn(32, 2), torch.randn(32))
    while True:
        yield data

My ex 2 solution:

  1. It’ll raise an error: RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can’t allocate memory: you tried to allocate 80000000000000 bytes. Error code 12 (Cannot allocate memory)
  2. Using a pseudorandom permutation generator we can generate an order for the data and then index the data through it. (For example, where the original code reads data[i], we now read data[order[i]].)

and my Ex 3 solution:

class MySyntheticOnFlyDataloader(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self,w,b,noise=0.01,batch_size=32):
        super().__init__()
        self.save_hyperparameters()
    def get_dataloader(self, train):
        while True:
            X=torch.randn(self.batch_size,len(self.w))
            noise=self.noise*torch.randn(self.batch_size,1)
            y=torch.matmul(X,self.w.reshape((-1,1)))+self.b+noise
            yield X,y

data = MySyntheticOnFlyDataloader(w=torch.tensor([2, -3.4]), b=4.2)
dataloader=data.train_dataloader()
X,y=next(dataloader)
print(X[0],y[0])
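A generator written with while True never raises StopIteration, so a plain for loop over it would not end. One way to bound an "epoch" is itertools.islice. The sketch below uses plain lists instead of tensors to stay dependency-free, and endless_batches is a hypothetical stand-in for the method above.

```python
import itertools
import random

def endless_batches(batch_size, w, b, noise=0.01, seed=None):
    """Yield (X, y) minibatches forever, with y = Xw + b + Gaussian noise."""
    rng = random.Random(seed)
    while True:
        X = [[rng.gauss(0, 1) for _ in w] for _ in range(batch_size)]
        y = [sum(wi * xi for wi, xi in zip(w, row)) + b + rng.gauss(0, noise)
             for row in X]
        yield X, y

# Take exactly 3 batches, then stop
for X, y in itertools.islice(endless_batches(4, [2, -3.4], 4.2), 3):
    print(len(X), len(y))  # 4 4
```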

ex 4
Just need to fix the random seed.

import random
def random_generator(seed=42):
    random.seed(seed)
    return random.random()
random_generator()
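Reseeding inside the function means every call returns the same single number. If a repeatable sequence is wanted instead, one option is to seed a random.Random instance once and draw from it several times; this is a sketch of that variation, not a correction of the answer above.

```python
import random

def random_generator(seed=42):
    random.seed(seed)          # reseeds the global generator on every call
    return random.random()

print(random_generator() == random_generator())  # True

# Seed once, draw several values: the whole sequence is reproducible
rng = random.Random(42)
first = [rng.random() for _ in range(3)]
rng = random.Random(42)
second = [rng.random() for _ in range(3)]
print(first == second)  # True
```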