fuentes_info.txt

# Libros:
Anderson, Chris - The Long Tail
Leskovec, Rajaraman, Ullman - Mining of Massive Datasets
Ricci, Rokach, Shapira, Kanto - Recommender systems handbook
Segaran, Toby - Programming Collective Intelligence
Paterek - Predicting movie ratings and recommender systems


# Papers:
Yehuda Koren, Robert Bell and Chris Volinsky - Matrix factorization for recommender systems
Collaborative Filtering with Temporal Dynamics
Statistical foundations of recommender systems
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
The BellKor Solution to the Netflix Grand Prize
Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity
Recommender Systems Handbook Chapter 5: Advances in Collaborative Filtering, by Yehuda Koren and Robert Bell
Recommender Systems Handbook Chapter 8: Evaluating Recommendation Systems
50 years of data science
The Two Cultures
Do We Need Hundreds of Classifiers


Factorization Meets the Neighborhood:
Our goal is to find the relative place of these “interesting movies” within the total order of movies sorted by predicted ratings for a specific user. To this end, for each such movie i, rated 5-stars by user u, we select 1000 additional random movies and predict the ratings by u for i and for the other 1000 movies. Finally, we order the 1001 movies based on their predicted rating, in a decreasing order. This simulates a situation where the system needs to recommend movies out of 1001 available ones. Thus, those movies with the highest predictions will be recommended to user u. Notice that the 1000 movies are random, some of which may be of interest to user u, but most of them are probably of no interest to u. Hence, the best hoped result is that i (for which we know u gave the highest rating of 5) will precede the rest 1000 random movies, thereby improving the appeal of a top-K recommender. There are 1001 different possible ranks for i, ranging from the best case where none (0%) of the random movies appears before i, to the worst case where all (100%) of the random movies appear before i in the sorted order. Overall, the validation set contains 384,573 5-star ratings. For each of them (separately) we draw 1000 random movies, predict associated ratings, and derive a rank ing between 0% to 100%. Then, the distribution of the 384,573 ranks is analyzed. (Remark: since the number 1000 is arbitrary, reported results are in percentiles (0%–100%), rather than in absolute ranks (0–1000).) ... It is evident that small improvements in RMSE translate into significant improvements in quality of the top K products.


(Hablando de MAE y RMSE)
Compared to MAE, RMSE disproportionately penalizes large errors, so that, given a test set with four hidden items RMSE would prefer a system that makes an error of 2 on three ratings and 0 on the fourth to one hat makes an error of 3 on one rating and 0 on all three others, while MAE would prefer the second system.
(Hablando de precision, recall, etc)
Typically we can expect a trade off between these quantities — while allowing longer recommendation lists typically improves recall, it is also likely to reduce the precision. In applications where the number of recommendations that can be presented to the user is preordained, the most useful measure of interest is Precision at N.
(Hablando de ROC Curve)
Measures that summarize the precision recall of ROC curve such as F-measure [58] and the Area Under the ROC Curve (AUC) [1] are useful for comparing algorithms independently of application, but when selecting an algorithm for use in a particular task, it is preferable to make the choice based on a measure that reflects the specific needs at hand.
In applications where a fixed number of recommendations is made to each user (e.g. when a fixed number of headlines are shown to a user visiting a news portal), we can compute the precision and recall (or true positive rate and false positive rate) at each recommendation list length N for each user, and then compute the average precision and recall (or true positive rate and false positive rate) at each N [51]. The
resulting curves are particularly valuable because they prescribe a value of N for each achievable precision and recall (or true positive rate and false positive rate), and conversely, can be used to estimate performance at a given N. An ROC curve obtained in this manner is termed a Customer ROC (CROC) curve [52].
When different numbers of recommendations can be shown to each user (e.g. when presenting the set of all recommended movies to each user), we can compute ROC or precision-recall curves by aggregating the hidden ratings from the test set into a set of reference user-item pairs, using the recommender system to generate a single ranked list of user-item pairs, picking the top recommendations from the list, and scoring them against the reference set. An ROC curve calculated in this way is termed a Global ROC (GROC) curve [52]. Picking an operating point on the resulting curve can result in a different number of recommendations being made to each user.
- Recommender Systems Handbook Chapter 8: Evaluating Recommendation Systems.


In fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that:
"Of the numerous new algorithmic contributions, I would like to highlight one -- those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as signficant as coming up with modeling breakthroughs."
https://www.quora.com/Is-there-any-summary-of-top-models-for-the-Netflix-prize

Hablar de "No Free Lunch Theorem". En el capítulo 5 de Deep Learning Bengio viene resumido. También viene un poco de SGD.


Matrix Factorization: A Simple Tutorial and Implementation in Python
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/

How useful is a lower RMSE?
http://www.netflixprize.com/community/viewtopic.php?id=828

Recommender Systems — It’s Not All About the Accuracy
https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff#.ihj8in7al

Introduction to Restricted Boltzmann Machines
http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/

Movie Recommendations and More via MapReduce and Scalding
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

Paterek - Predicting movie ratings and recommender systems
Sección 3.1

SGD uses a small randomly-selected subset of the training samples to approximate the gradient of the objective function given by Equation 2. The number of training samples used for this approximation is called the batch size. When the batch size is N, the SGD training simply translates into gradient descent (hence is very slow to converge). By using a small batch size, one can update the parameters more frequently than gradient descent and speed up the convergence. The extreme case is a batch size of 1, and it gives the maximum frequency of updates and leads to a very simple perceptron-like algorithm, which we adopt in this work.
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
http://www.aclweb.org/anthology/P09-1054

Large-scale Parallel Collaborative Filtering for the Netflix Prize
Explica bien el problema de minimmización que se quiere resolver.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.173.2797&rep=rep1&type=pdf

Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
http://www.aclweb.org/anthology/P09-1054

http://stats.stackexchange.com/questions/146689/how-to-divide-dataset-into-training-and-test-set-in-recommender-systems
http://www.contentwise.tv/files/An-evaluation-Methodology-for-Collaborative-Recommender-Systems.pdf

http://sifter.org/~simon/journal/20061102.1.html
http://sifter.org/~simon/journal/20061211.html
http://sifter.org/~simon/journal/20070815.html
http://sifter.org/~simon/journal/20070817.html

Stephen Wright - Optimization Algorithms in Machine Learning
http://videolectures.net/nips2010_wright_oaml/

Hogwild
https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf

Leer la parte de actualizar los pesos de nuevos usuarios y/o nuevos artículos
Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.8010&rep=rep1&type=pdf

Leer
Yehuda Koren - Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model
http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf


http://alex.smola.org/teaching/berkeley2012/slides/8_Recommender.pdf
https://classes.soe.ucsc.edu/cmps242/Fall09/proj/mpercy_svd_talk.pdf
http://www.decibelsanddecimals.com/dbdblog/2016/6/13/spotify-related-artists


RMSE penalizes larger errors stronger than MAE and thus is suitable for situations where small prediction errors are not very important.
https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf


Comparación de métodos de SVD y fact de matrices, o comparar con regularización y sin regularización
Factorizar distintas mmatrices utilizando distintos valores iniciales y predecir utilizando el promedio.
Ver diferencia de ajuste con media ajustada por n y sin media ajustada por n

Adjusted cosine similarity
http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/itembased.html

Índice:
Cap 1. Introducción. 
	Hablar sobre la long tail, la cantidad de información del mundo actual, cómo las cosas en línea eliminan barreras en la demanda. Qué es un sistema de recomendación, cómo surgieron, cómo ayudaron a Netflix, etc. Cómo se usan en la actualidad y quiénes usan
Cap 2. Revisión de literatura. 
	Hablar de filtrado colaborativo y content-based RS. Métodos de neighborhood, user-user, item-item, factorización de matrices, etc. Cold-start problem.
Cap 3. Metodología.
	Aprendizaje estadístico (funciones de pérdida, conjuntos de entrenamiento y de prueba, validación cruzada, etc). 
	Modelo base. Modelo de factores latentes, explicar qué son los factores latentes y qué significan. Modelar los residuales del modelo base usando factorización de matrices. Regularización. Hablar de regularización.
	Descenso en gradiente. Descenso en gradiente estocástico, en general y en particular para el problema de minimizar el error de predicción de los residuales. Explicar por qué descenso en gradiente estocástico sirve para estos casos. Mencionar mínimos cuadrados alternados. Orden computacional del descenso en grad y desc en grad est.
	RMSE y qué significa, por qué esa métrica.
Cap 4. Análisis y resultados.
	Análisis exploratorio de los datos. Ver el RMSE de un modelo aleatorio, de un modelo de popularidad, de modelo base, y luego del modelo. Ver los RMSE de distintos modelos (distintas k's y distintos valores de regularización). Hacer perfiles de rendimiento del número de iteraciones y tiempo de cada modelo.
	Ver serendipity y coverage de cada modelo. Ver algunas recomendaciones del modelo ganador.
	Correlación entre valores altos de factores y popularidad de item.
Cap 5. Discusión.
Cap 6. Conclusiones.
Bibliografía y Apéndices.

Accurately Measuring Model Prediction Error:
http://scott.fortmann-roe.com/docs/MeasuringError.html

Understanding the Bias-Variance Tradeoff:
http://scott.fortmann-roe.com/docs/BiasVariance.html	

Competing in a data science contest without reading the data
http://blog.mrtz.org/2015/03/09/competition.html

The zen of gradient descent
http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html

Lessons from the Netflix Prize Challenge - Bell y Koren

Why Netflix Never Implemented The Algorithm That Won The Netflix $1 Million Challenge
https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml


\bibitem{pareto} Bunkley, Nick (March 3, 2008), \href{http://www.nytimes.com/2008/03/03/business/03juran.html}{\emph{Joseph Juran, 103, Pioneer in Quality Control, Dies}}, New York Times.
\bibitem{longtail} Anderson, Chris. \emph{The Long Tail. Why the Future of Business Is Selling Less of More}, 2008 (pp 125-146).
\bibitem{bloombergnetflixsales} Mullaney, Timothy. \href{http://www.bloomberg.com/bw/stories/2006-05-24/netflix}{\emph{Netflix}}, Bloomberg Business, May 24, 2006.

\bibitem{handbook} Lops, Lops; de Gemmis, Marco \& Semeraro, Giovanni. \emph{Chapter 3: Content-based Recommender Systems: State of the Art and Trends}, del libro \emph{Recommender Systems Handbook}, Springer 2011.

\bibitem{} Leskovec, Jure;  Rajaraman,  Anand \& Ullman, Jeffrey. \emph{Mining of Massive Datasets}, 2014.


Finding Similar Music using Matrix Factorization: 
http://www.benfrederickson.com/matrix-factorization/

SVD: 
https://jeremykun.com/2016/04/18/singular-value-decomposition-part-1-perspectives-on-linear-algebra/

ANN:
https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1/
https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces/
https://erikbern.com/2015/10/20/nearest-neighbors-and-vector-models-epilogue-curse-of-dimensionality/


Reccommending Music on Spotify with Deep Learning:
http://benanne.github.io/2014/08/05/spotify-cnns.html

Deep neural networks for YouTube recommendations:
https://blog.acolyer.org/2016/09/19/deep-neural-networks-for-youtube-recommendations/
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf


Pedro Domingos - A Few Useful Things to Know about Machine Learning
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft - When Is “Nearest Neighbor” Meaningful?

Collaborative Filtering at Spotify
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

ANN:
http://www.cs.umd.edu/~mount/ANN/Files/1.1.2/ANNmanual_1.1.pdf