This may be a nitpick. Regarding the drain repair contractor example: it says "if some of the variance owes to a few factors besides your two features," but here we have only one feature, the hours worked by the contractor, and two parameters, the hourly rate and the bias, i.e. y = ax + b. Since the hourly rate is $100 and the contractor charges $50 just to show up, a = 100 and b = 50.

A first-principles model specifies an entire solution irrespective of data; examples are physical laws.

A parametric model assumes a form of the solution and fits it to the data (e.g., a linear regression model assumes y = b0 + b1x, and then we fit it to the data to find b0 and b1).

A nonparametric model makes no assumption about the form of the solution; it is entirely data-driven.
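To make the parametric case concrete, here's a minimal sketch using made-up data for the contractor example above (cost = 50 + 100 × hours, plus a little noise), so the fitted parameters come out near a = 100 and b = 50:

```python
import numpy as np

# Hypothetical data for the contractor example: cost = 50 + 100 * hours,
# plus a little noise so the fit is not exact.
rng = np.random.default_rng(0)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cost = 50 + 100 * hours + rng.normal(0, 1, size=hours.shape)

# Parametric: assume y = b0 + b1 * x, then fit b0 and b1 by least squares.
b1, b0 = np.polyfit(hours, cost, deg=1)
print(b0, b1)  # roughly 50 and 100
```

The "assume a form, then fit" step is exactly the `polyfit` call: the model family (a line) is fixed in advance, and only the two parameters are learned from data.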

It is evident that random-access memory has not kept pace with the growth in data. At the same time, the increase in computational power has outpaced that of the data available. This means that statistical models need to become more memory efficient (this is typically achieved by adding nonlinearities) while simultaneously being able to spend more time on optimizing these parameters, due to an increased computational budget.

Can you kindly explain how statistical models can become memory efficient by “adding non-linearities”? Is memory efficiency the reason for adding non-linearities or a good side effect of adding non-linearity?

The error rate is the probability your model makes a mistake and is used to measure how good your model is on a classification problem. Here’s a concrete example:

Suppose you trained a model to predict if an image is a cat or a dog. Suppose you take 100 images with known labels of cat and dog that your model never trained on. Let’s take the first image (suppose it’s a cat) and your model predicts dog. The error would be 1. Suppose the next image is a dog and your model predicts a dog. The error would be 0. You keep doing this and sum up the errors. Finally, divide by the total number of examples you predicted on (100 in this case). That is the error rate. It’s a number between 0 and 1. Suppose the error rate is 0.25 or 25%. It means there’s a 25% chance your model will make a mistake if you were to randomly select an image from your test dataset.
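The counting described above is only a few lines of code. Here's a sketch with a made-up set of 10 labels and predictions (1 = cat, 0 = dog):

```python
# Hypothetical labels and predictions for 10 test images: 1 = cat, 0 = dog.
labels      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]

# Error is 1 when the prediction disagrees with the label, else 0.
errors = [int(p != y) for p, y in zip(predictions, labels)]
error_rate = sum(errors) / len(errors)
print(error_rate)  # 0.2 here: 2 mistakes out of 10
```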

A few more things to know. For your model to make a decision, it means you had to pick a threshold. Neural networks for classification problems will produce a probability that the model thinks the image is a cat (say 30%). But this is just a probability! Next, you need to pick a threshold. So you might pick a threshold of, say, 50% and say that if the model emits a probability of 50% or greater, I will call the image a cat. At this point you might wonder, is there a way to measure how good my model is without having to pick a threshold? In fact there is! That's what ROC AUC is all about. Which you can read more about here: Understanding ROC AUC (part 1/2). Introduction | by matt johnson | Building Ibotta | Medium
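The thresholding step itself is just a comparison. A tiny sketch with hypothetical probabilities:

```python
# Hypothetical predicted probabilities that each image is a cat.
probs = [0.30, 0.80, 0.55, 0.10, 0.65]
threshold = 0.5

# Predict "cat" (1) when the probability is at or above the threshold.
preds = [int(p >= threshold) for p in probs]
print(preds)  # [0, 1, 1, 0, 1]
```

Sliding the threshold up or down trades false positives for false negatives, which is why ROC AUC, computed over all thresholds at once, is a useful threshold-free summary.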

Hey folks, just wondering if anyone’s come across a resource for exercise solutions for each chapter in ‘Dive into Deep Learning’? Any leads would be awesome!

Are we going to learn high-level tools?
Will reading this book enable me to do research and become a scientist in the field? Will I be able to read research papers?

I recently created a Discord bot to prevent spam using simple heuristics – Levenshtein distance from suspicious words, frequent messaging, and the like. Learning from examples of manually annotated spam could be used for more intelligent content-based prevention.
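For anyone curious, a minimal sketch of that kind of heuristic (the word list and distance cutoff here are hypothetical, not what my bot actually uses):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

SUSPICIOUS = ["free nitro", "giveaway"]  # hypothetical word list

def looks_spammy(message: str, max_dist: int = 2) -> bool:
    # Flag the message if it contains, or sits within edit distance
    # max_dist of, a suspicious phrase (a deliberately simple heuristic).
    msg = message.lower()
    return any(s in msg or levenshtein(msg, s) <= max_dist for s in SUSPICIOUS)

print(looks_spammy("FREE N1TRO"))  # True: distance 1 from "free nitro"
```

A learned classifier trained on the manually annotated spam could then replace or augment `looks_spammy` as the scoring function.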

Implementing software based on a high-level description. This may take a few years, but seems possible.

Computation is used to apply algorithms to data. As compute becomes cheaper, simpler algorithms which can more effectively leverage greater amounts of compute will outperform other approaches (as described by Richard Sutton).

Possibly in spaced repetition systems – Anki uses the SuperMemo 2 algorithm, but leveraging e2e training might allow for repetition which more appropriately adapts to a given user and the content they’re studying.
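For reference, here's a sketch of the SM-2 update rule as I understand it from the published description; treat the exact constants as my reading of the original, not a verified reimplementation of Anki:

```python
def sm2_update(quality: int, reps: int, interval: int, ef: float):
    """One review step of (my understanding of) the SuperMemo 2 algorithm.

    quality: recall grade 0-5; reps: successful repetitions so far;
    interval: current interval in days; ef: easiness factor.
    """
    if quality < 3:
        # Failed recall: restart the repetition sequence.
        return 0, 1, ef
    # Update the easiness factor, clamped below at 1.3.
    ef = max(1.3, ef + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    reps += 1
    if reps == 1:
        interval = 1
    elif reps == 2:
        interval = 6
    else:
        interval = round(interval * ef)
    return reps, interval, ef

# A card answered perfectly three times in a row:
state = (0, 0, 2.5)
for q in (5, 5, 5):
    state = sm2_update(q, *state)
print(state)  # reps, next interval in days, easiness factor
```

The appeal of end-to-end training is that these hand-picked constants (0.1, 0.08, 1.3, the fixed 1- and 6-day intervals) could instead be learned per user and per item.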