This line of code, all_items = set([i for i in range(**num_users**)]), should be:

all_items = set([i for i in range(**num_items**)])

For each user, the items that the user has not interacted with are candidate items (unobserved entries). The following function takes user identities and candidate items as input, and samples negative items randomly for each user from that user's candidate set.

In this context, the words "candidate items" refer to the unobserved items for a user. However, this contradicts the code, where the variable "candidates" holds the items the user has rated.
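To make the distinction concrete, here is a minimal sketch of negative sampling in which the candidate set really is the set of unobserved items. The function names `build_unobserved` and `sample_negative` and the toy `interacted` dictionary are made up for illustration; this is not the book's implementation.

```python
import random


def build_unobserved(interacted, num_items):
    """Map each user to the sorted list of items they have NOT interacted with."""
    all_items = set(range(num_items))  # range over items, not users
    return {u: sorted(all_items - set(items)) for u, items in interacted.items()}


def sample_negative(user, unobserved, rng=random):
    """Draw one negative item uniformly from the user's unobserved items."""
    return rng.choice(unobserved[user])


# Toy data (hypothetical): user id -> item ids the user has interacted with.
interacted = {0: [1, 3], 1: [0, 2, 4]}
unobserved = build_unobserved(interacted, num_items=5)
neg = sample_negative(0, unobserved, random.Random(0))
```

Because `unobserved` is built once up front, sampling a negative is a single lookup plus a random choice.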

auc = 1.0 * (max - hits_all[0][0]) / max if len(hits_all) > 0 else 0

How does this formula correspond to the AUC definition mentioned above?

Maybe you can check how the author loads **train_iter/test_iter**, which uses the "seq-aware" mode. When we loop over users, each time we query the user's most recent behaviour in test_iter (a key-value dictionary), and the resulting list (which has length one and contains the item_id) is passed to the evaluate function. Moreover, in the **hit_and_auc** function we check whether that held-out item is in the top-k list, and what its rank is in the hits_all list (which is sorted by the score given by the net). This corresponds to the way we calculate the AUC: find out how many false recommendations rank before the ground truth.
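A small sketch may help show why the rank-based formula matches the pairwise AUC definition when there is exactly one held-out item per user. The helper names `auc_from_rank` and `auc_pairwise` and the toy scores are assumptions for illustration, not the book's code.

```python
def auc_from_rank(ranked_items, ground_truth):
    """Rank-based AUC for a single held-out item, mirroring (max - rank) / max."""
    max_rank = len(ranked_items) - 1  # number of negatives in the list
    if ground_truth not in ranked_items or max_rank == 0:
        return 0.0
    rank = ranked_items.index(ground_truth)  # 0-based position, best item first
    return (max_rank - rank) / max_rank


def auc_pairwise(scores, ground_truth):
    """Fraction of negatives scored strictly below the held-out item."""
    pos = scores[ground_truth]
    negatives = [s for item, s in scores.items() if item != ground_truth]
    return sum(pos > s for s in negatives) / len(negatives)


# Toy scores (hypothetical); item "b" plays the role of the held-out item.
scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1}
ranked = sorted(scores, key=scores.get, reverse=True)
auc_rank = auc_from_rank(ranked, "b")
auc_pair = auc_pairwise(scores, "b")
```

With one positive, the rank counts how many negatives are placed above it, so `max_rank - rank` is the number of correctly ordered (positive, negative) pairs out of `max_rank` pairs in total.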

Why is the Hadamard product used in the GMF?

From what I recall, matrix factorization is defined as multiplying two matrices, usually of lower rank than the target matrix, using ordinary matrix multiplication, i.e. trying to reconstruct X as X' = A · B with some reconstruction error. Is there some additional information available on this?

In the original paper, the authors wanted to add nonlinear layers on top of the dot product, so they used element-wise multiplication in order to get a vector rather than a scalar.

X' = A · B is seen from the whole matrix level. If we look at it from each recommendation level, we need a score for each user-item pair. So that is vector-vector multiplication instead of matrix multiplication. Hope this helps.
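As a rough illustration of the two answers above: the Hadamard product keeps one value per latent dimension, and summing those values gives back the ordinary dot product used in matrix factorization. The sketch below uses plain Python lists and made-up names (`hadamard`, `gmf_score`, `p_u`, `q_i`, `h`); it assumes an identity output activation and is not the book's model code.

```python
def hadamard(p, q):
    """Element-wise product of two latent vectors (still a vector)."""
    return [pi * qi for pi, qi in zip(p, q)]


def dot(p, q):
    """Ordinary dot product: the per-pair score in plain matrix factorization."""
    return sum(hadamard(p, q))


def gmf_score(p, q, h):
    """GMF-style score h^T (p * q), assuming an identity output activation."""
    return sum(hi * xi for hi, xi in zip(h, hadamard(p, q)))


p_u = [1.0, 2.0, 3.0]   # hypothetical user embedding
q_i = [0.5, -1.0, 2.0]  # hypothetical item embedding
ones = [1.0, 1.0, 1.0]  # all-ones output weights recover the dot product
learned = [2.0, 0.0, 1.0]  # other weights re-weight each latent dimension

mf_score = dot(p_u, q_i)
gmf_with_ones = gmf_score(p_u, q_i, ones)
gmf_with_learned = gmf_score(p_u, q_i, learned)
```

With all-ones weights the GMF score equals the matrix factorization score, and learned weights let the model emphasize or ignore individual latent dimensions.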

Thank you, this is a good answer.

Is it correct that there is a sigmoid activation in the last layer given that the BPR loss then applies a sigmoid again?

Hi, the definition of *PRDataset* could be improved: *neg_items* can be pre-calculated once during initialization.

Why are the AUC and hit rate curves horizontal lines during training? It doesn't make sense to me.

**Note:** The code I implemented is in PyTorch

First: What is the purpose of a sigmoid function at the prediction layer? I got better results by removing it. The reason I removed Sigmoid was that it did not allow the predictive distance between the positive and negative samples to increase, which caused the loss function to not decrease as expected.
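One way to see the effect described here is to compare the BPR loss on raw scores with the loss on scores that have already been passed through a sigmoid. The sketch below is a plain-Python illustration with made-up score values; it assumes the per-pair BPR loss is -log(sigmoid(pos - neg)) and is not the book's or the poster's actual code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bpr_loss(pos_score, neg_score):
    """Per-pair BPR loss: -log(sigmoid(pos - neg))."""
    return -math.log(sigmoid(pos_score - neg_score))


# Hypothetical raw model outputs for one (positive, negative) pair.
raw_pos, raw_neg = 8.0, -8.0

# Without a sigmoid in the prediction layer the gap can grow freely,
# so the loss can get arbitrarily close to zero.
loss_raw = bpr_loss(raw_pos, raw_neg)

# With a sigmoid in the prediction layer both scores lie in (0, 1),
# so the gap is below 1 and the loss stays above -log(sigmoid(1)).
loss_squashed = bpr_loss(sigmoid(raw_pos), sigmoid(raw_neg))
floor = bpr_loss(1.0, 0.0)  # roughly 0.313
```

Under this assumption, a sigmoid at the output puts a floor of about 0.313 on the per-pair loss, no matter how well the model separates positives from negatives.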

Second: How come your model isn't improving the HIT and AUC metrics with each epoch?

Third (Note: I am not familiar with MXNet syntax): Why does your BPRLoss seem so low? Here are my thoughts. According to your code, you are not implementing batch optimization. This means you are computing the loss, gradient, weight update, etc., for each sample in each batch. Finally, in the epoch reports, you average the losses, which is inconsistent with the book's definition of BPRLoss.

My final epoch report without sigmoid in prediction layer:

```
----- epoch number 9/10 -----
training loss is: 337.0008469414465
hit rate is: 0.1951219512195122
AUC is: 0.7946674890472771
----- epoch number 10/10 -----
training loss is: 335.34528846347456
hit rate is: 0.19618239660657477
AUC is: 0.7952683387070411
```

I'm getting an AUC of 0.531, which is a very low score. I think the neural net isn't converging properly. I'll do some fiddling with the parameters to get it to converge.