Synthetic Regression Data

https://d2l.ai/chapter_linear-regression/synthetic-regression-data.html

My solutions for the exercises

ex.1
I ran ?torch.utils.data.DataLoader, found the parameter “drop_last”,
and set it to True in the definition of d2l.DataModule.get_tensorloader(), like:

import torch
from d2l import torch as d2l

@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    # drop_last=True discards the final batch when it is smaller than batch_size
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train, drop_last=True)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

and tested with

len(data.train_dataloader())

The result changed from 32 to 31: with the default 1000 training examples and batch_size=32, the last incomplete batch of 8 examples is now dropped.
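
For a quick standalone check of what drop_last does, here is a minimal sketch, independent of d2l, using a made-up 100-example dataset:

import torch

# 100 examples with batch_size=32: 3 full batches plus one partial batch of 4
dataset = torch.utils.data.TensorDataset(torch.randn(100, 2))
print(len(torch.utils.data.DataLoader(dataset, batch_size=32)))                  # 4
print(len(torch.utils.data.DataLoader(dataset, batch_size=32, drop_last=True)))  # 3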

ex.2
This one is too much for me now.

ex.3

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    """Generate a fresh batch on every call instead of pre-generating the dataset."""
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))

    def get_dataloader(self):
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

test = SyntheticRegressionData_onTheFly(w=torch.tensor([1., -2.]), b=3.)
X, y = test.get_dataloader()  # one call, so the printed X and y belong to the same batch
print(X, '\n', y)

result:

tensor([[ 0.7405, -0.8744],
        [-1.6136,  0.6811],
        [ 0.3348, -1.2086],
        [-0.6661,  0.9301],
        [ 0.8505, -0.2203],
        [ 0.9009, -0.3271],
        [ 0.7607, -0.2932],
        [ 0.1139, -0.7248]])
tensor([[ 1.2106],
        [ 1.9595],
        [ 2.4213],
        [-0.6816],
        [ 5.0581],
        [ 5.3575],
        [ 0.1735],
        [-1.9671]])
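
As a follow-up, if you want on-the-fly batches to work with the usual for X, y in loader loop, one option (my own sketch, not from the book; the class name is made up) is to wrap the generator in a torch.utils.data.IterableDataset:

import torch

class OnTheFlyDataset(torch.utils.data.IterableDataset):
    """Yields a freshly generated (X, y) batch num_batches times."""
    def __init__(self, w, b, noise=0.01, batch_size=8, num_batches=4):
        self.w, self.b, self.noise = w.reshape((-1, 1)), b, noise
        self.batch_size, self.num_batches = batch_size, num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            X = torch.randn(self.batch_size, len(self.w))
            y = torch.matmul(X, self.w) + self.b + torch.randn(self.batch_size, 1) * self.noise
            yield X, y

# batch_size=None disables auto-batching and passes the pre-batched tensors through
loader = torch.utils.data.DataLoader(OnTheFlyDataset(torch.tensor([1., -2.]), 3.), batch_size=None)
for X, y in loader:
    print(X.shape, y.shape)  # torch.Size([8, 2]) torch.Size([8, 1])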

ex.4
Set num_train to any number and num_val = batch_size. Since the validation loader is created with shuffle=train (i.e., False), every fresh iterator over it returns the same validation batch:

data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2, num_train=1, num_val=8, batch_size = 8)
X, y = next(iter(data.val_dataloader()))
print(X)
print(y)
X_, y_ = next(iter(data.val_dataloader()))
print(X_)
print(y_)

ex.4
Add “torch.manual_seed(2)” before

class SyntheticRegressionData(d2l.DataModule):  #@save


Thanks a lot to @DReidiano! I added “torch.manual_seed()” for ex.4, and the new code is:

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))

    def get_dataloader(self, seed):
        # Re-seeding with the same value makes every batch identical and reproducible
        torch.manual_seed(seed)
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y
data = SyntheticRegressionData_onTheFly(w=torch.tensor([2, -3.4]), b=4.2, batch_size=8)
for i in range(2):
    X, y = data.get_dataloader(seed=1)
    print(torch.cat((X, y), 1))

The result is (both batches are identical, since the same seed is set before every call):

tensor([[-1.5256, -0.7502,  3.6893],
        [-0.6540, -1.6095,  8.3587],
        [-0.1002, -0.6092,  6.0620],
        [-0.9798, -1.6091,  7.7108],
        [-0.7121,  0.3037,  1.7411],
        [-0.7773, -0.2515,  3.4907],
        [-0.2223,  1.6871, -1.9765],
        [ 0.2284,  0.4676,  3.0696]])
tensor([[-1.5256, -0.7502,  3.6893],
        [-0.6540, -1.6095,  8.3587],
        [-0.1002, -0.6092,  6.0620],
        [-0.9798, -1.6091,  7.7108],
        [-0.7121,  0.3037,  1.7411],
        [-0.7773, -0.2515,  3.4907],
        [-0.2223,  1.6871, -1.9765],
        [ 0.2284,  0.4676,  3.0696]])

In Section 3.3.2 the book runs:

X, y = next(iter(data.get_dataloader()))

But I think it is missing the parameter ‘train’.

My suggestion: either pass the argument at the call site,

X, y = next(iter(data.get_dataloader(train=True)))

or give the parameter a default value,

def get_dataloader(self, train=True):

I think it should be fixed.
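
Putting the second option together (just a sketch of the fix, with the method body copied from the chapter):

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train=True):  # default value, so calls without arguments still work
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)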

Why is the data loader in 3.3.3 considered more efficient? self.X and self.y are held in memory just as with the previous data loader.


Can anybody clarify what is needed in Exercise 3.3.5.2.2? You cannot shuffle a dataset stored on disk unless you open the file. When the data is in memory, you can use torch.utils.data.DataLoader with shuffle=True.
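
For what it is worth, my reading of the exercise (an assumption, not an official answer) is that you shuffle the order in which examples are read, not the file itself: keep only a permutation of indices in memory and seek to each row on demand. A sketch, assuming a made-up binary file data.bin of float32 rows:

import torch

class OnDiskDataset(torch.utils.data.Dataset):
    """Assumes data.bin holds float32 rows: num_features values for X plus one for y."""
    def __init__(self, path, num_examples, num_features):
        self.path, self.num_examples = path, num_examples
        self.row = num_features + 1

    def __len__(self):
        return self.num_examples

    def __getitem__(self, idx):
        # Seek straight to row idx, so only one example is ever in memory
        with open(self.path, 'rb') as f:
            f.seek(idx * self.row * 4)  # 4 bytes per float32
            vals = torch.frombuffer(bytearray(f.read(self.row * 4)), dtype=torch.float32)
        return vals[:-1], vals[-1:]

# shuffle=True permutes only the indices; the file on disk is never reordered
loader = torch.utils.data.DataLoader(OnDiskDataset('data.bin', 1000, 2),
                                     batch_size=32, shuffle=True)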

I do not know how this can be, but the get_dataloader that is added to the SyntheticRegressionData class in this chapter is not the method that is called when we run X, y = next(iter(data.train_dataloader())).

The train_dataloader method calls get_dataloader, but the get_dataloader that actually runs is the one already defined in d2l.SyntheticRegressionData, and it calls get_tensorloader.

So when we try to add get_dataloader, the new definition does not replace the one that is already there.

You can check this by running d2l.SyntheticRegressionData?? after the attempted addition.
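
If that is what is happening, one thing to try (my guess, based on the observation above) is to patch the class inside the d2l package itself, since that is the class whose method ends up being called, and then check which function is bound:

@d2l.add_to_class(d2l.SyntheticRegressionData)
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

print(d2l.SyntheticRegressionData.get_dataloader)  # shows which function will run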