In a survey on the problem of concept drift, Tsymbal19
argues that three approaches can be distinguished in the literature. The instance selection approach discards instances
that are less relevant to the current state of the system. A
common variant is the time-window approach, where only
recent instances are considered. A possible disadvantage of
this simple model is that it gives the same significance
to all instances within the considered time window, while
completely discarding all other instances. Equal significance might be reasonable when the time shift is abrupt,
but less so when the shift is gradual. Thus, a refinement
is instance weighting, where instances are weighted based on
their estimated relevance. Frequently, a time decay function
is used, underweighting instances as they occur deeper into
the past. The third approach is based on ensemble learning, which maintains a family of predictors that together
produce the final outcome. Those predictors are weighted
by their perceived relevance to the present time point, e.g.,
predictors that were more successful on recent instances get higher weights.
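As a concrete illustration of such a weighting scheme, here is a minimal sketch of our own (the function name, arguments, and numbers are illustrative assumptions, not taken from the surveyed work) that computes exponential time-decay weights:

```python
import numpy as np

def decay_weights(timestamps, t_now, decay_rate):
    """Exponential time-decay weights: w = exp(-decay_rate * age).

    Older instances receive smaller weights; decay_rate = 0 recovers
    uniform weighting, i.e., no decay at all.
    """
    ages = t_now - np.asarray(timestamps, dtype=float)  # age of each instance
    return np.exp(-decay_rate * ages)

# Three ratings observed 1,000, 100, and 0 days before "now":
print(decay_weights([0.0, 900.0, 1000.0], t_now=1000.0, decay_rate=0.01))
# ≈ [4.5e-05, 0.37, 1.0]
```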
We performed extensive experiments with instance
weighting schemes, trying different exponential time
decay rates on both neighborhood and factor models. The
consistent finding was that prediction quality improves as
we moderate that time decay, reaching best quality when
there is no decay at all. This finding is despite the fact that
users do change their taste and rating scale over the years,
as we show later. However, many of the old preferences still
persist or, more importantly, help in establishing useful
cross-user or cross-product patterns in the data. Thus, just
underweighting past actions loses too much signal along
with the discarded noise, which is detrimental given the scarcity
of data per user.
As for ensemble learning, having multiple models, each
of which considers only a fraction of the total behavior,
may miss those global patterns that can be identified only
when considering the full scope of user behavior. What
makes them even less appealing in our case is the need to
keep track of the independent drifting behaviors of many
customers. This, in turn, would require building a separate
ensemble for each user. Such a separation will significantly
complicate our ability to integrate information across users
along multiple time points, which is the cornerstone of
collaborative filtering. For example, an interesting relation
between products can be established by related actions of
many users, each of them at a totally different point in time.
Capturing such a collective signal requires building a single
model encompassing all users and items together.
All those considerations led us to the following guidelines
we adopt for modeling drifting user preferences.

• We seek models that explain user behavior along the
full extent of the time period, not only the present
behavior (while subject to performance limitations).
Such modeling is key to being able to extract signal
from each time point, while neglecting only the noise.

• Multiple changing concepts should be captured. Some
are user-dependent and some are item-dependent.
Similarly, some are gradual while others are sudden.

• While we need to model separate drifting concepts
or preferences per user and/or item, it is essential to
combine all those concepts within a single framework. This combination allows modeling interactions
crossing users and items, thereby identifying higher
level patterns.

• We do not try to model future temporal
dynamics, e.g., estimating future changes in a user's
preferences. Extrapolation could be very helpful but is
seemingly too difficult, especially given a limited
amount of known data. Rather, our goal is to
capture past temporal patterns in order to isolate persistent signal from transient noise. The result, indeed,
helps in predicting future behavior.
Now we turn to how these desirable principles are
incorporated into two leading approaches to CF—matrix
factorization and neighborhood methods.
4. TIME-AWARE FACTOR MODEL
4.1. The anatomy of a factor model
Matrix factorization is a well-recognized approach to
CF.9,11,17 This approach lends itself well to an adequate modeling of temporal effects. Before we deal with those temporal effects, we would like to establish the foundations of
a static factor model.
In its basic form, matrix factorization characterizes both
items and users by vectors of factors inferred from patterns
of item ratings. High correspondence between item and user
factors leads to recommendation of an item to a user. More
specifically, both users and items are mapped to a joint latent
factor space of dimensionality f, such that ratings are modeled as inner products in that space. Accordingly, each user
u is associated with a vector p_u ∈ R^f and each item i is associated with a vector q_i ∈ R^f. A rating is predicted by the rule
r̂_ui = q_i^T p_u.    (1)
The major challenge is computing the mapping of each
item and user to factor vectors q_i, p_u ∈ R^f. After this mapping
is accomplished, we can easily compute the rating a user
will give to any item by using Equation 1.
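To make Equation 1 concrete, here is a minimal sketch with made-up factor vectors for f = 3 (in practice p_u and q_i are inferred from the rating data, as just discussed):

```python
import numpy as np

f = 3  # dimensionality of the latent factor space

# Hypothetical learned factors; real ones are inferred from rating patterns.
p_u = np.array([0.8, -0.2, 1.1])  # user-factor vector, p_u in R^f
q_i = np.array([1.0,  0.4, 0.7])  # item-factor vector, q_i in R^f

r_hat = q_i @ p_u  # Equation 1: r̂_ui = q_i^T p_u
print(r_hat)       # ≈ 1.49
```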
Such a model is closely related to singular value decomposition (SVD), which is a well-established technique for
identifying latent semantic factors in information
retrieval. Applying SVD in the CF domain would require
factoring the user–item rating matrix. Such a factorization
raises difficulties due to the high portion of missing values
caused by the sparseness of the user–item ratings matrix.
Conventional SVD is undefined when knowledge about the
matrix is incomplete. Moreover, carelessly addressing only
the relatively few known entries is highly prone to overfitting. Earlier works13 relied on imputation to fill in missing
ratings and make the rating matrix dense.
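For illustration only, here is a toy sketch of that imputation route (our own example, not the exact procedure of the cited works): fill each missing entry with its item's mean rating, then take a rank-f truncated SVD of the now-dense matrix.

```python
import numpy as np

# Toy user-item rating matrix; NaN marks an unknown rating.
R = np.array([[5.0, np.nan, 3.0],
              [4.0, 2.0, np.nan],
              [np.nan, 1.0, 4.0]])

# Conventional SVD is undefined on missing entries, so impute item means first.
item_means = np.nanmean(R, axis=0)
R_dense = np.where(np.isnan(R), item_means, R)

# Rank-f truncated SVD of the imputed, dense matrix.
f = 2
U, s, Vt = np.linalg.svd(R_dense, full_matrices=False)
R_approx = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]
print(R_approx.round(2))
```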
However, imputation can be very expensive as it significantly increases the
amount of data. In addition, the data may be considerably
distorted due to inaccurate imputation. Hence, more recent