22 replies
Jun '20

Andreas_Terzis

Nit: There is a small typo: “diagramtically” -> “diagrammatically”

1 reply
Jun '20

Andreas_Terzis

Should the equation (4.1.5) be H_1 = \sigma(X W_1 + b_1) to match the definition of the weight and input matrices defined in Section 3.4.1.3?

Jun '20 ▶ Andreas_Terzis

goldpiggy

Hi @Andreas_Terzis, sharp eyes! Fixed here https://github.com/d2l-ai/d2l-en/pull/1050/files

As for your second question, sorry for the inconsistency. They both work, but the matrices are in “transposed” form relative to each other.

Let me know if that is clear enough.
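For illustration (my own sketch, not from the original reply), a quick NumPy check that the two conventions agree once the matrices are transposed:

```python
import numpy as np

np.random.seed(0)
n, d, h = 4, 3, 5            # batch size, inputs, hidden units
X = np.random.randn(n, d)     # one example per row
W1 = np.random.randn(d, h)
b1 = np.random.randn(h)

def sigma(z):                 # any elementwise activation, e.g. ReLU
    return np.maximum(z, 0)

# Row-major convention: H1 = sigma(X W1 + b1), shape (n, h)
H1 = sigma(X @ W1 + b1)

# "Transposed" convention: examples as columns, weights transposed
H1_t = sigma(W1.T @ X.T + b1[:, None])   # shape (h, n)

assert np.allclose(H1, H1_t.T)
```

Since the activation is applied elementwise, it commutes with transposition, so the two forms describe the same computation.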

Jun '20

Andreas_Terzis

Thanks for the quick reply and for clarifying the differences in matrix dimensions.

You could consider explicitly mentioning the dimensions of \mathbf{W}_1 and \mathbf{X} in (4.1.5) to avoid confusion with the previous definitions. Doing so would also help readers who go through the book sequentially.

Best

1 reply
Jun '20 ▶ Andreas_Terzis

goldpiggy

Hi @Andreas_Terzis, great feedback! We will consider your suggestions and fix them asap.

Jul '20

Kushagra_Chaturvedy

What would be the explanation for the last question? As far as I can tell, it makes little difference if we apply the activation function row-wise (which I’m guessing refers to applying the activation function to each instance of the batch one by one) or apply the function to the whole batch. Won’t both ways yield a similar result?

1 reply
Jul '20 ▶ Kushagra_Chaturvedy

goldpiggy

Hi @Kushagra_Chaturvedy, a minibatch may not be as representative as the whole dataset. As a result, parameters learned from the (small) minibatch may receive noisy gradients, making it harder for the model to converge.

1 reply
Jul '20 ▶ goldpiggy

Kushagra_Chaturvedy

Got it. But isn’t the question talking about activation functions? How will applying the activation function row-wise or batch-wise affect the learning? Also if we keep on applying the activation function row-wise for batch_size number of rows, won’t it give the same result as applying the activation function batch-wise for a single batch?
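A quick NumPy check of that intuition (my own sketch, not from the thread): an elementwise activation gives identical results whether applied one row at a time or to the whole minibatch at once:

```python
import numpy as np

np.random.seed(1)
H = np.random.randn(256, 8)    # pre-activations for a minibatch of 256

def relu(z):
    return np.maximum(z, 0)

whole = relu(H)                                 # applied to the whole minibatch
rowwise = np.vstack([relu(row) for row in H])   # applied one row at a time

assert np.allclose(whole, rowwise)
```

So for elementwise nonlinearities the two orders of operation coincide; any difference would have to come from an activation that mixes entries across rows.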

1 reply
Jul '20 ▶ Kushagra_Chaturvedy

goldpiggy

Hey @Kushagra_Chaturvedy, from my understanding, the last question in the exercise asks what happens if the minibatch size is 1. In that case, the minibatch is too small for training to converge reliably.

Jul '20

sahu.vaibhav

How do we explain the 2nd question?

1 reply
Jul '20 ▶ sahu.vaibhav

goldpiggy

Hi @sahu.vaibhav! Think from here~

Aug '20

tinkuge

In 4.1.1.3,

For a one-hidden-layer MLP whose hidden layer has h hidden units, denote by H∈Rn×h the outputs of the hidden layer, which are hidden representations

What is the sentence trying to say?

1 reply
Sep '20 ▶ tinkuge

goldpiggy

Hi @tinkuge, that sentence defines the hidden representations, i.e., the outputs of the hidden layer. (A lot of deep learning terms refer to the same thing. :wink: )

Oct '20

asadalam

How do you write a PReLU function from scratch that can be recorded by autograd? I wrote the following

def prelu2(x,a=0.01):
    b = np.linspace(0,0,num=x.size)
    for i in np.arange(x.size):
        if(x[i] < 0):
            b[i]=a*x[i]
        else:
            b[i]=x[i]
    return b

But it doesn’t work and gives the error that in-place operations are not permitted when recording.

1 reply
Oct '20 ▶ asadalam

goldpiggy

Hey @asadalam, great try! One way to learn each operator is to check the source code. :wink:
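For reference, one way to avoid the in-place writes that autograd rejects (a sketch in plain NumPy; the same `np.where`-style expression is what I would try with `mxnet.np` under `autograd.record()`, since it builds a new array instead of assigning into an existing one):

```python
import numpy as np

def prelu(x, a=0.01):
    # np.where returns a new array rather than writing into x
    # element by element, so nothing is mutated while recording.
    return np.where(x < 0, a * x, x)

out = prelu(np.array([-2.0, -0.5, 0.0, 1.0, 3.0]))
```

The loop-and-assign version fails because `b[i] = ...` is exactly the kind of in-place operation the recorder forbids.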

1 reply
Oct '20 ▶ goldpiggy

asadalam

Thanks. So PReLU is defined in mxnet.gluon.nn.activations. How does one use it?

1 reply
Oct '20 ▶ asadalam

goldpiggy

Hi @asadalam, are you asking about the API or the underlying technique? If the latter, I recommend reading the paper, which provides the rigorous math. If the former: first define prelu = nn.PReLU(), then apply this prelu in your network. Check the API docs for more.

Feb '21

Reno

I don’t understand the last question in the exercise. How could the activation function be applied to the minibatch? Suppose we have 256 samples; how is this implemented, and what would be the outcome? Thanks!

Feb '21

MINTD_ARGAW

I have been watching many video tutorials and reading some books, and as far as I can tell this is the best. But while I have understood the mathematical part, I have trouble memorizing the programming part and writing it by myself, both with frameworks and from scratch. Any help on that, please…

Aug '21

VolodymyrGavrysh

1, 2, and partly 3.

Aug '21

Def255

I implemented pReLU as

def prelu(x, alpha):
    return np.maximum(0,x)+np.minimum(0,alpha*x)

But there is an interesting bug with the gradient of prelu (alpha = 0.1):

[gradient plot]
It can be fixed by shifting the 0 in the minimum:

def prelu(x, alpha):
    return np.maximum(0,x)+np.minimum(-1e-30,alpha*x)

[gradient plot]

But is there a simpler and safer solution?
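One candidate (an assumption on my part, not a verified MXNet fix): express PReLU with a single `where`, so each element takes exactly one branch and the point x = 0 cannot be claimed by both the `maximum` term and the `minimum` term at once:

```python
import numpy as np

def prelu_where(x, alpha=0.1):
    # each element goes through exactly one branch, so x == 0
    # is not double-counted as in maximum(0,x) + minimum(0,alpha*x)
    return np.where(x >= 0, x, alpha * x)

vals = prelu_where(np.array([-2.0, 0.0, 3.0]))
```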

Sep '21

Ismail_Moussaoui

That’s my answer for PReLU:

from d2l import mxnet as d2l
from mxnet import np, npx, autograd
from matplotlib.widgets import Slider, Button
import matplotlib.pyplot as plt
npx.set_np()

a_min = 0    # the minimal value of the parameter a
a_max = 10   # the maximal value of the parameter a
a_init = 1   # the value of a used initially, when the graph is created

x = np.linspace(-50, 50, 500)
x.attach_grad()
fig = plt.figure(figsize=(8, 3))

# First we create the general layout of the figure
# with two axes objects: one for the plot of the function
# and the other for the slider.
prelu_ax = plt.axes([0.1, 0.2, 0.8, 0.65])
slider_ax = plt.axes([0.1, 0.05, 0.8, 0.05])
plt.axes(prelu_ax)

with autograd.record():
    u = a_init * np.minimum(0, x) + np.maximum(0, x)
u.backward()
z, = plt.plot(x, x.grad, 'r')
plt.xlim(-50, 50)
plt.ylim(-20, 20)

# The final step is to specify that the slider needs to
# execute the update function when its value changes.
v = Slider(slider_ax, 'a', -10, 10, valinit=a_init)

def update(a):
    with autograd.record():
        u = a * np.minimum(0, x) + np.maximum(0, x)
    u.backward()
    z.set_ydata(x.grad)  # set new y-coordinates of the plotted points
    fig.canvas.draw_idle()

v.on_changed(update)
d2l.plt.show()