Synthetic Regression Data

https://d2l.ai/chapter_linear-regression/synthetic-regression-data.html

My opinions for the exercises
ex.1
I ran ?torch.utils.data.DataLoader,
found the parameter “drop_last”,
and set it to True in the definition of d2l.DataModule.get_tensorloader(), like:

@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train, drop_last=True)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

and tested with

len(data.train_dataloader())

The result changed from 32 to 31: 1000 training examples with a batch size of 32 leave a final partial batch of 8 examples, which drop_last=True now discards.

ex.2
This one is too much for me now.

ex.3

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))
    def get_dataloader(self):
        # Generate a fresh random batch every time this is called
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

test = SyntheticRegressionData_onTheFly(w=torch.tensor([1., -2.]), b=3.)
X, y = test.get_dataloader()  # call once so the printed X and y belong to the same batch
print(X, '\n', y)

result:
tensor([[ 0.7405, -0.8744],
        [-1.6136,  0.6811],
        [ 0.3348, -1.2086],
        [-0.6661,  0.9301],
        [ 0.8505, -0.2203],
        [ 0.9009, -0.3271],
        [ 0.7607, -0.2932],
        [ 0.1139, -0.7248]])
tensor([[ 1.2106],
        [ 1.9595],
        [ 2.4213],
        [-0.6816],
        [ 5.0581],
        [ 5.3575],
        [ 0.1735],
        [-1.9671]])

ex.4
Set num_train to any number and num_val equal to batch_size, so the validation loader returns the same single batch each time it is called:

data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2, num_train=1, num_val=8, batch_size = 8)
X, y = next(iter(data.val_dataloader()))
print(X)
print(y)
X_, y_ = next(iter(data.val_dataloader()))
print(X_)
print(y_)

ex.4
Add “torch.manual_seed(2)”
before
“class SyntheticRegressionData(d2l.DataModule):  #@save”
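For example, a minimal sketch of how that looks, reusing d2l.SyntheticRegressionData (the seed value 2 is arbitrary):

import torch
from d2l import torch as d2l

torch.manual_seed(2)  # seed set before the data is generated
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)

torch.manual_seed(2)  # resetting the same seed reproduces the same draws
data2 = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)

print(torch.equal(data.X, data2.X), torch.equal(data.y, data2.y))  # True True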


Thanks a lot to @DReidiano. I added “torch.manual_seed()” for ex.4,
and the new code is:

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))
    def get_dataloader(self, seed):
        torch.manual_seed(seed)  # the same seed reproduces the same batch on every call
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

data = SyntheticRegressionData_onTheFly(w=torch.tensor([2, -3.4]), b=4.2, batch_size=8)
for i in range(2):
    X, y = data.get_dataloader(seed=1)
    print(torch.cat((X, y), 1))

The result is:

tensor([[-1.5256, -0.7502,  3.6893],
        [-0.6540, -1.6095,  8.3587],
        [-0.1002, -0.6092,  6.0620],
        [-0.9798, -1.6091,  7.7108],
        [-0.7121,  0.3037,  1.7411],
        [-0.7773, -0.2515,  3.4907],
        [-0.2223,  1.6871, -1.9765],
        [ 0.2284,  0.4676,  3.0696]])
tensor([[-1.5256, -0.7502,  3.6893],
        [-0.6540, -1.6095,  8.3587],
        [-0.1002, -0.6092,  6.0620],
        [-0.9798, -1.6091,  7.7108],
        [-0.7121,  0.3037,  1.7411],
        [-0.7773, -0.2515,  3.4907],
        [-0.2223,  1.6871, -1.9765],
        [ 0.2284,  0.4676,  3.0696]])

In Section 3.3.2 of the page:

X, y = next(iter(data.get_dataloader()))

But I think it is missing the parameter ‘train’.

My opinions:
X, y = next(iter(data.get_dataloader(train=True)))
or
def get_dataloader(self, train=True)

I think it should be modified
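
As a minimal sketch of the second option (just the chapter's get_dataloader with a default value added), both call forms then work:

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train=True):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

X, y = next(iter(data.get_dataloader()))             # defaults to the training split
X, y = next(iter(data.get_dataloader(train=False)))  # validation split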

Why is the data loader in 3.3.3 considered to be more efficient? self.X and self.y are still in memory, just as with the previous data loader.


Can anybody clarify what is needed in Exercise 3.3.5.2.2? You cannot shuffle a dataset stored on disk unless you open the file. When the file is in memory, you can use torch.utils.data.DataLoader with shuffle=True.

I do not know how this can be, but the get_dataloader that is added to the SyntheticRegressionData class in this chapter is not the method that is called when we run X, y = next(iter(data.train_dataloader())).

The train_dataloader method calls get_dataloader, but the get_dataloader that runs is the one already in d2l.SyntheticRegressionData, and it calls get_tensorloader.

So when we try to add get_dataloader, the new definition does not replace the one that is already there.

You can check it by running d2l.SyntheticRegressionData?? after the attempted addition.

When we call next(iter(data.train_dataloader())), it calls train_dataloader in SyntheticRegressionData, which is inherited from DataModule. This train_dataloader then calls the get_dataloader we defined, not the one from DataModule. When we add this method to SyntheticRegressionData, it replaces the original one.

You can try it by adding a print call or something to the method and calling it again.
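
For instance, a quick check along those lines, assuming data is the SyntheticRegressionData instance from the chapter (the print call is only a marker to show which version runs):

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    print('re-added get_dataloader was called')  # marker
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

X, y = next(iter(data.train_dataloader()))  # the marker is printed, so the new method is the one in use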

In the get_tensorloader of 3.3.3 we can actually avoid creating a tuple of the sliced tensors, which is a small but unnecessary intermediate allocation. We can just keep the generator expression; the unpacking operator (*tensors) will do the rest when we create the TensorDataset.

@d2l.torch.add_to_class(d2l.torch.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = (a[indices] for a in tensors)  # generator expression instead of an intermediate tuple
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)

My exercise solutions.

Question 1

On its final iteration, the DataLoader would return a batch with fewer rows. To prevent this, you can pass drop_last=True.

Question 2

  1. You’ll get an out-of-memory error.
  2. I would (as the book suggests) likely use a pseudorandom permutation generator to index into the data; a simple Fisher-Yates shuffle would work (see the sketch below).
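
A rough sketch of that idea, assuming the features sit in a fixed-width binary file on disk (the file name and shapes are made up for illustration): visit the rows through a pseudorandom permutation instead of shuffling the file itself.

import numpy as np
import torch

num_examples, num_features = 1_000_000, 2
# Memory-map the file so rows are read from disk on demand instead of loading everything at once.
# 'features.bin' is a hypothetical file of float32 rows written beforehand.
X_disk = np.memmap('features.bin', dtype=np.float32, mode='r',
                   shape=(num_examples, num_features))

def shuffled_batches(batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)       # pseudorandom permutation of row indices
    for i in range(0, num_examples, batch_size):
        idx = np.sort(perm[i:i + batch_size])  # sort within a batch for friendlier disk reads
        yield torch.from_numpy(np.asarray(X_disk[idx]))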

Question 3

Maybe I’m misinterpreting this question? Naively:

def generator():
    while True:
        yield torch.randn(2)
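
If the question is instead read as asking for regression batches generated on the fly, a slightly fuller sketch (the choices of w, b, noise, and batch size here are arbitrary):

import torch

def regression_batches(w=torch.tensor([2., -3.4]), b=4.2, noise=0.01, batch_size=32):
    w = w.reshape(-1, 1)
    while True:
        X = torch.randn(batch_size, w.shape[0])  # fresh features every batch
        y = torch.matmul(X, w) + b + noise * torch.randn(batch_size, 1)
        yield X, y

X, y = next(regression_batches())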

Question 4

Again, I may be misinterpreting. I’m assuming this question is asking for each generator instance to have random data, but for a given instance to return the same data each time.

def generator():
    a = torch.randn(2)
    while True:
        yield a