I believe that for linear regression the assumption is that the residuals have to be normally distributed; the noise ϵ in y = w⊤x + b + ϵ, where ϵ ∼ N(0, σ²), need not be.
May I ask a question related to the normalization:
For a categorical variable, once it is converted to dummy variables (via one-hot encoding), is it worth applying standard normalization to them? In which scenarios is it worth doing, and in which is it not?
Thanks in advance!
Sorry for the ambiguity. I mean normalizing the numerical features, e.g., using sklearn.preprocessing.MinMaxScaler or StandardScaler.
As a more concrete example, take preprocessing features such as (age, major, sex, height, weight) for students. After filling null values and converting categorical variables to dummy vectors, I would like to normalize all the features. For the dummy vectors from the categorical variables, is it worth applying standard normalization? In which scenarios is it worth doing, and in which is it not?
Hey @Angryrou, you are right about normalization! It is extremely important in the deep learning world. For a numerical feature, we just apply normalization to the scalar values. For a categorical feature, we represent its scalar value as a vector (via one-hot encoding). The feature will look like a list of zeros and ones, and we don’t normalize beyond that.
We combine the normalized scalar features (from numerical features) and the one-hot vectors (from categorical features), and feed the combined features to the neural nets.
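To make the recipe above concrete, here is a minimal sketch in plain Python. The student records and the helper names `standardize` and `one_hot` are made up for illustration. The scaling used here (zero mean, unit population standard deviation) is the same rescaling that StandardScaler applies; in practice you would likely use StandardScaler and a one-hot encoder from a library instead of hand-rolled helpers.

```python
from statistics import fmean, pstdev

# Hypothetical student records: (age, height_cm, major, sex)
students = [
    (19, 170.0, "math", "F"),
    (21, 182.0, "physics", "M"),
    (20, 165.0, "math", "M"),
    (22, 175.0, "biology", "F"),
]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns in sorted category order)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


ages, heights, majors, sexes = zip(*students)

# Numerical features are scaled; categorical features are only one-hot encoded.
age_z = standardize(ages)
height_z = standardize(heights)
major_1h = one_hot(majors)  # columns: biology, math, physics
sex_1h = one_hot(sexes)     # columns: F, M

# Concatenate per student: 2 scaled numeric values + 3 + 2 indicator values.
features = [
    [a, h, *m, *s]
    for a, h, m, s in zip(age_z, height_z, major_1h, sex_1h)
]

for row in features:
    print([round(v, 3) for v in row])
```

Each row has seven entries: the two scaled numeric columns followed by the untouched 0/1 indicator columns.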
How did we arrive at the equation for the likelihood of seeing a particular y for a given x (equation 3.1.13)?
Could someone explain? Thanks.
Hey @rammy_vadlamudi, I derived equation 3.1.13 by following the log rules.
Taking the log of a product of many terms turns it into a sum of the logs of those terms, so the negative log-likelihood = -sum(log(p(y(i)|x(i)))).
log(p(y(i)|x(i))) can similarly be expanded into log(1/sqrt(…)) + log(exp(…)), which simplifies further. Keep in mind that sqrt is the same as a 1/2 power, so you can apply another log rule there. I was able to derive that equation by applying the various log rules a few times.
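The log rules described above can be checked numerically. The sketch below is only an illustration: the values of `w`, `b`, `sigma`, `xs`, and `ys` are made up, and x is a scalar for simplicity. It compares the negative log of the product of Gaussian densities with the expanded sum, where each term is (1/2) log(2π σ²) + (y - w x - b)² / (2σ²).

```python
import math


def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2 * sigma**2)) / math.sqrt(
        2 * math.pi * sigma**2
    )


# Hypothetical parameters and data (scalar x for simplicity).
w, b, sigma = 2.0, 0.5, 1.5
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.7, 2.1, 4.9, 6.2]

# Direct route: multiply the per-example densities, then take -log.
likelihood = math.prod(gaussian_pdf(y, w * x + b, sigma) for x, y in zip(xs, ys))
nll_direct = -math.log(likelihood)

# Expanded route: log of a product = sum of logs, log(1/sqrt(a)) = -0.5*log(a),
# and log(exp(c)) = c.
nll_expanded = sum(
    0.5 * math.log(2 * math.pi * sigma**2) + (y - w * x - b) ** 2 / (2 * sigma**2)
    for x, y in zip(xs, ys)
)

print(nll_direct, nll_expanded)
print(math.isclose(nll_direct, nll_expanded))
```

The two values agree up to floating-point error, which is what the log rules predict.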
I am referring to equation 3.1.13
The part where (y-wTx-b)**2 is used in place of (x-mean)**2
@rammy_vadlamudi Ah! My apologies, I had misread your question
I’m not 100% sure on that derivation. My take is that if epsilon ~ N(0, sigma^2) and epsilon = y - wTx - b, then (y - wTx - b) ~ N(0, sigma^2). Since epsilon is the random variable here and x is given, the distribution of y is a linear transformation of the distribution of epsilon. With a mean of 0, I simply substitute epsilon, written as a function of y and x, in place of the random variable. So I took P(y|x) to actually mean P(y|x,e).
But, like I said, I’m not 100% sure whether that’s strictly correct reasoning mathematically (in fact, I feel like I’ve skipped some steps here, or gone completely down the wrong path). I’d love to see if anyone else has a definitive answer.
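One way to spell out the substitution discussed above, as a sketch rather than a definitive answer: assume the noise ϵ is independent of x, and treat w, b, and x as fixed. Then y = w⊤x + b + ϵ is just ϵ shifted by the constant w⊤x + b, and shifting a random variable by a constant shifts its density without rescaling it. So the density of y given x is the density of ϵ evaluated at y - w⊤x - b, which is where (y - w⊤x - b)² takes the place of (x - mean)² in the Gaussian formula.

```latex
\begin{align}
p_\epsilon(\epsilon) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right), \\
P(y \mid \mathbf{x}) &= p_\epsilon\left(y - \mathbf{w}^\top\mathbf{x} - b\right)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left(-\frac{\left(y - \mathbf{w}^\top\mathbf{x} - b\right)^2}{2\sigma^2}\right).
\end{align}
```

Equivalently, y given x is distributed as N(w⊤x + b, σ²): the mean is w⊤x + b and the variance is unchanged.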
Just a side note: the likelihood is a function of the parameters, here w and b. See, e.g., David MacKay’s book (https://www.inference.org.uk/itprnn/book.pdf p. 29):
“Never say ‘the likelihood of the data’. Always say ‘the likelihood of the parameters’. The likelihood function is not a probability distribution.”
This is not completely clear in the description.
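A small numerical illustration of the point in the quote, under simplifying assumptions that are mine and not from the book: a single observation, scalar x, no bias term, and made-up values for `x`, `sigma`, `w_fixed`, and `y_obs`. Integrating the Gaussian over y with w fixed gives about 1, as expected for a probability density. Integrating the same expression over w with y fixed gives 1/|x| instead, so as a function of the parameter it is not a probability distribution.

```python
import math


def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2 * sigma**2)) / math.sqrt(
        2 * math.pi * sigma**2
    )


# Hypothetical setup: model y ~ N(w * x, sigma^2) with scalar x and no bias.
x, sigma = 2.0, 1.0
step = 0.001
grid = [-20.0 + i * step for i in range(40001)]  # covers [-20, 20]

# Parameter fixed, data varies: a density in y, so the area is close to 1.
w_fixed = 0.5
area_over_y = sum(gaussian_pdf(y, w_fixed * x, sigma) for y in grid) * step

# Data fixed, parameter varies: the likelihood of w; the area is 1/|x| = 0.5 here.
y_obs = 1.0
area_over_w = sum(gaussian_pdf(y_obs, w * x, sigma) for w in grid) * step

print(round(area_over_y, 3), round(area_over_w, 3))
```

The second area depends on x, which is a quick way to see that the likelihood need not integrate to 1 over the parameters.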
Thanks for your effort on this book.
I would like to know whether I can find the answers to the exercises?
Hi @alaa-shubbak, thanks for your engagement! We are focusing on improving the content of the current chapter and are looking for community contributors for the solutions.