Collaborative filtering


Collaborative filtering is, besides content-based filtering, one of two major techniques used by recommender systems. Collaborative filtering has two senses, a narrow one and a more general one.
In the newer, narrower sense, collaborative filtering is a method of making automatic predictions about a user's interests by utilizing preferences or taste information collected from many users. This approach assumes that if persons A and B share similar opinions on one issue, they are more likely to agree on other issues compared to a random pairing of A with another person. For instance, a collaborative filtering system for television programming could predict which shows a user might enjoy based on a limited list of the user's tastes. These predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average score for each item of interest, for example based on its number of votes.
In the more general sense, collaborative filtering is the process of filtering information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many kinds of data including: sensing and monitoring data, such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data, such as financial service institutions that integrate many financial sources; and user data from electronic commerce and web applications.
This article focuses on collaborative filtering for user data, but some of the methods also apply to other major applications.

Overview

The growth of the Internet has made it much more difficult to effectively extract useful information from all the available online information. The overwhelming amount of data necessitates mechanisms for efficient information filtering. Collaborative filtering is one of the techniques used for dealing with this problem.
The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with tastes similar to themselves. Collaborative filtering encompasses techniques for matching people with similar interests and making recommendations on this basis.
Collaborative filtering algorithms often require users' active participation, an easy way to represent users' interests, and algorithms that are able to match people with similar interests.
Typically, the workflow of a collaborative filtering system is:
  1. A user expresses his or her preferences by rating items of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.
  2. The system matches this user's ratings against other users' and finds the people with most "similar" tastes.
  3. The system recommends items that those similar users have rated highly but that this user has not yet rated.
A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items. As a result, the system gains an increasingly accurate representation of user preferences over time.

Methodology

Collaborative filtering systems have many forms, but many common systems can be reduced to two steps:
  1. Look for users who share the same rating patterns with the active user.
  2. Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user.
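These two steps can be sketched in a few lines of Python. The data and names below are illustrative, and the similarity function is a deliberately simple one (inverse mean absolute difference over co-rated items); a real system would use a measure such as Pearson correlation:

```python
# Toy user-item ratings (user -> {item: rating}); values are illustrative.
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 4.0, "b": 2.0, "c": 5.0, "d": 4.0},
    "carol": {"a": 1.0, "b": 5.0, "d": 2.0},
}

def similarity(u, v):
    """Step 1: agreement on co-rated items, as 1 / (1 + mean absolute difference)."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mad = sum(abs(ratings[u][i] - ratings[v][i]) for i in common) / len(common)
    return 1.0 / (1.0 + mad)

def predict(user, item):
    """Step 2: similarity-weighted average of the like-minded users' ratings."""
    others = [(similarity(user, v), r[item]) for v, r in ratings.items()
              if v != user and item in r]
    num = sum(s * r for s, r in others)
    den = sum(s for s, _ in others)
    return num / den if den else None

print(round(predict("alice", "d"), 2))  # 3.33: bob agrees with alice more than carol does
```

Note how the prediction for item "d" is pulled toward bob's rating of 4 rather than carol's rating of 2, because bob's past ratings agree more closely with alice's.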
This falls under the category of user-based collaborative filtering. A specific application of this is the user-based Nearest Neighbor algorithm.
Alternatively, item-based collaborative filtering proceeds in an item-centric manner:
  1. Build an item-item matrix determining relationships between pairs of items
  2. Infer the tastes of the current user by examining the matrix and matching that user's data
See, for example, the Slope One item-based collaborative filtering family.
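The basic (unweighted) Slope One scheme can be sketched as follows; the data and names are illustrative:

```python
from collections import defaultdict

# Toy user-item ratings; values are illustrative.
ratings = {
    "alice": {"a": 5.0, "b": 3.0},
    "bob":   {"a": 3.0, "b": 4.0, "c": 3.0},
    "carol": {"a": 4.0, "c": 5.0},
}

# Step 1: item-item matrix of average rating differences between item pairs.
diff_sum = defaultdict(float)
count = defaultdict(int)
for user_ratings in ratings.values():
    for i in user_ratings:
        for j in user_ratings:
            if i != j:
                diff_sum[(i, j)] += user_ratings[i] - user_ratings[j]
                count[(i, j)] += 1
dev = {pair: diff_sum[pair] / count[pair] for pair in diff_sum}

def predict(user, item):
    """Step 2: average of (user's rating of j + average difference item - j)."""
    preds = [r + dev[(item, j)] for j, r in ratings[user].items() if (item, j) in dev]
    return sum(preds) / len(preds) if preds else None

print(predict("alice", "c"))  # 3.75
```

The item-item deviation matrix is built once from all users' data (step 1); each prediction then only looks up the current user's own ratings against it (step 2).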
Another form of collaborative filtering can be based on implicit observations of normal user behavior. These systems observe what a user has done together with what all users have done and use that data to predict the user's behavior in the future, or to predict how a user might like to behave given the chance. These predictions then have to be filtered through business logic to determine how they might affect the actions of a business system. For example, it is not useful to offer to sell somebody a particular album of music if they have already demonstrated that they own that music.
Relying on a scoring or rating system which is averaged across all users ignores specific demands of a user, and is particularly poor in tasks where there is large variation in interest. However, there are other methods to combat information explosion, such as web search and data clustering.

Types

Memory-based

The memory-based approach uses user rating data to compute the similarity between users or items. Typical examples of this approach are neighbourhood-based CF and item-based/user-based top-N recommendations. For example, in user-based approaches, the rating that user u gives to item i is calculated as an aggregation of some similar users' ratings of the item:

    r_{u,i} = \operatorname{aggr}_{u' \in U} r_{u',i}

where U denotes the set of top N users that are most similar to user u who rated item i. Some examples of the aggregation function include:

    r_{u,i} = \frac{1}{N} \sum_{u' \in U} r_{u',i}

    r_{u,i} = k \sum_{u' \in U} \operatorname{simil}(u, u') \, r_{u',i}

where k is a normalizing factor defined as k = 1 / \sum_{u' \in U} |\operatorname{simil}(u, u')|, and

    r_{u,i} = \bar{r}_u + k \sum_{u' \in U} \operatorname{simil}(u, u') (r_{u',i} - \bar{r}_{u'})

where \bar{r}_u is the average rating of user u over all the items rated by u.
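Common aggregation schemes, a simple average, a similarity-weighted average, and a mean-centered weighted average, can be sketched as plain functions. The similarity function, mean-rating function, and toy data below are hypothetical stand-ins:

```python
def simple_average(item, neighbours):
    """Plain average of the similar users' ratings of the item."""
    vals = [r[item] for r in neighbours.values() if item in r]
    return sum(vals) / len(vals)

def weighted_average(user, item, neighbours, simil):
    """Weighted sum of ratings, normalized by k = 1 / sum of |similarities|."""
    pairs = [(simil(user, v), r[item]) for v, r in neighbours.items() if item in r]
    k = 1.0 / sum(abs(s) for s, _ in pairs)
    return k * sum(s * r for s, r in pairs)

def mean_centered(user, item, neighbours, simil, mean_rating):
    """User's mean rating plus the weighted sum of neighbours' mean-centered ratings."""
    pairs = [(simil(user, v), r[item] - mean_rating(v))
             for v, r in neighbours.items() if item in r]
    k = 1.0 / sum(abs(s) for s, _ in pairs)
    return mean_rating(user) + k * sum(s * d for s, d in pairs)

# Hypothetical toy data: two neighbours of "alice", stub similarity and means.
neighbours = {"bob": {"d": 4.0}, "carol": {"d": 2.0}}
sim = lambda u, v: {"bob": 0.75, "carol": 0.25}[v]
mean = lambda u: {"alice": 4.0, "bob": 3.0, "carol": 3.0}[u]
print(weighted_average("alice", "d", neighbours, sim))  # (0.75*4 + 0.25*2) / 1.0 = 3.5
```

The mean-centered variant compensates for the fact that different users use the rating scale differently: a generous rater's 4 and a harsh rater's 4 carry different information.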
The neighborhood-based algorithm calculates the similarity between two users or items, and produces a prediction for the user by taking the weighted average of all the ratings. Similarity computation between items or users is an important part of this approach. Multiple measures, such as Pearson correlation and vector cosine based similarity are used for this.
The Pearson correlation similarity of two users x, y is defined as

    \operatorname{simil}(x, y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)^2} \sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r}_y)^2}}

where I_{xy} is the set of items rated by both user x and user y.
The cosine-based approach defines the cosine similarity between two users x and y as:

    \operatorname{simil}(x, y) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \times \|\vec{y}\|} = \frac{\sum_{i \in I_{xy}} r_{x,i} \, r_{y,i}}{\sqrt{\sum_{i \in I_x} r_{x,i}^2} \sqrt{\sum_{i \in I_y} r_{y,i}^2}}
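Both similarity measures can be sketched directly from their definitions; the toy ratings are illustrative:

```python
import math

def pearson(x, y, ratings):
    """Pearson correlation over the items rated by both users."""
    common = set(ratings[x]) & set(ratings[y])
    mx = sum(ratings[x][i] for i in common) / len(common)
    my = sum(ratings[y][i] for i in common) / len(common)
    num = sum((ratings[x][i] - mx) * (ratings[y][i] - my) for i in common)
    den = (math.sqrt(sum((ratings[x][i] - mx) ** 2 for i in common))
           * math.sqrt(sum((ratings[y][i] - my) ** 2 for i in common)))
    return num / den if den else 0.0

def cosine(x, y, ratings):
    """Cosine of the angle between the two users' full rating vectors."""
    common = set(ratings[x]) & set(ratings[y])
    num = sum(ratings[x][i] * ratings[y][i] for i in common)
    den = (math.sqrt(sum(r * r for r in ratings[x].values()))
           * math.sqrt(sum(r * r for r in ratings[y].values())))
    return num / den if den else 0.0

# Two perfectly correlated raters (y always rates twice x's score).
ratings = {"x": {"a": 1.0, "b": 2.0, "c": 3.0}, "y": {"a": 2.0, "b": 4.0, "c": 6.0}}
print(pearson("x", "y", ratings), cosine("x", "y", ratings))  # both ≈ 1.0
```

Pearson centers each user's ratings around that user's mean, so it is invariant to a user's overall rating offset; plain cosine is not.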
The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k most similar users to an active user. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. A popular method for finding similar users is locality-sensitive hashing, which implements an approximate nearest-neighbour search in linear time.
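A brute-force sketch of user-based top-N recommendation is shown below; at scale, the exhaustive neighbour search would be replaced by an approximate method such as locality-sensitive hashing. The overlap-count similarity and the data are deliberately crude stand-ins:

```python
import heapq

def top_n_recommend(user, ratings, simil, k=2, n=3):
    """Find the k most similar users, then rank the items they rated
    (and the active user has not) by similarity-weighted score."""
    neighbours = heapq.nlargest(
        k, (v for v in ratings if v != user), key=lambda v: simil(user, v))
    scores = {}
    for v in neighbours:
        for item, r in ratings[v].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + simil(user, v) * r
    return heapq.nlargest(n, scores, key=scores.get)

# Illustrative data and a crude similarity: number of co-rated items.
ratings = {
    "alice": {"a": 5.0, "b": 3.0},
    "bob":   {"a": 4.0, "b": 2.0, "c": 5.0},
    "carol": {"b": 5.0, "d": 4.0},
    "dave":  {"e": 1.0},
}
sim = lambda u, v: len(set(ratings[u]) & set(ratings[v]))
print(top_n_recommend("alice", ratings, sim, k=2, n=2))  # ['c', 'd']
```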
The advantages with this approach include: the explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy facilitation of new data; content-independence of the items being recommended; good scaling with co-rated items.
There are also several disadvantages of this approach. Its performance decreases when data is sparse, which is common for web-related items. This hinders the scalability of this approach and creates problems with large datasets. Although it can efficiently handle new users because it relies on a data structure, adding new items becomes more complicated because that representation usually relies on a specific vector space. Adding new items requires inclusion of the new item and the re-insertion of all the elements in the structure.

Model-based

An alternative to memory-based methods is to learn models to predict users' rating of unrated items. Model-based CF algorithms include Bayesian networks, clustering models, latent semantic models such as singular value decomposition, probabilistic latent semantic analysis, multiple multiplicative factor, latent Dirichlet allocation and Markov decision process-based models.
Through this approach, dimensionality-reduction methods are mostly used to improve the robustness and accuracy of memory-based methods. Specifically, methods such as singular value decomposition and principal component analysis, known as latent factor models, compress a user-item matrix into a low-dimensional representation in terms of latent factors. This transforms a large matrix containing many missing values into a much smaller one. A compressed matrix can be used to find neighbours of a user or item as in the previous section. Compression has two advantages in large, sparse data: it is more accurate and it scales better.
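A latent-factor compression can be sketched with plain gradient descent over per-user and per-item vectors. This is an SVD-style model in spirit rather than a literal singular value decomposition, and the data and hyperparameters are illustrative:

```python
import random

random.seed(0)
# Sparse observed ratings: (user, item) -> rating; values are illustrative.
ratings = {("alice", "a"): 5.0, ("alice", "b"): 3.0,
           ("bob", "a"): 4.0, ("bob", "b"): 2.0, ("bob", "c"): 5.0,
           ("carol", "b"): 5.0, ("carol", "c"): 4.0}
users = {u for u, _ in ratings}
items = {i for _, i in ratings}

F = 2  # number of latent factors (the compressed dimensionality)
p = {u: [random.uniform(0, 1) for _ in range(F)] for u in users}  # user vectors
q = {i: [random.uniform(0, 1) for _ in range(F)] for i in items}  # item vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stochastic gradient descent on squared error with L2 regularization.
lr, reg = 0.02, 0.02
for _ in range(1000):
    for (u, i), r in ratings.items():
        err = r - dot(p[u], q[i])
        for f in range(F):
            pu, qi = p[u][f], q[i][f]
            p[u][f] += lr * (err * qi - reg * pu)
            q[i][f] += lr * (err * pu - reg * qi)

# A missing cell of the matrix is predicted from the compressed representation.
print(round(dot(p["alice"], q["c"]), 2))
```

The 3x3 matrix with two missing cells is represented by five length-2 vectors; every cell, observed or not, is approximated by an inner product of the corresponding user and item vectors.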

Hybrid

A number of applications combine the memory-based and the model-based CF algorithms. These hybrids overcome the limitations of native CF approaches and improve prediction performance. Importantly, they overcome CF problems such as sparsity and loss of information. However, they have increased complexity and are expensive to implement. Most commercial recommender systems are hybrid, for example the Google News recommender system.

Deep-learning

In recent years, many neural and deep-learning techniques have been proposed for collaborative filtering. Some generalize traditional matrix factorization algorithms via a non-linear neural architecture, or leverage new model types such as variational autoencoders. Deep learning has been applied to many recommendation scenarios.
However, the effectiveness of deep learning for collaborative recommendation has been questioned. A systematic analysis of publications applying deep-learning or neural methods to the top-k recommendation problem, published in top conferences, found that, on average, less than 40% of the articles were reproducible, and as few as 14% in some conferences. Overall, the study identified 18 relevant articles; only 7 of them could be reproduced, and 6 of those could be outperformed by older, simpler, properly tuned baselines. The article highlights potential problems in today's research scholarship and calls for improved scientific practices. Similar issues have been identified by others, including in sequence-aware recommender systems.